How to Use Rerankers in FastEmbed
Rerankers
A reranker is a model that improves the ordering of search results. First, a subset of documents is retrieved with a fast, simple method such as BM25 or dense embeddings. Then a reranker, a more powerful and precise but slower and heavier model, re-evaluates this subset to refine how relevant each document is to the query.
Rerankers analyze token-level interactions between the query and each document in depth, which makes them more expensive to run but more precise at judging relevance. They trade speed for accuracy, so they are best applied to a limited set of candidates rather than the whole corpus.
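Conceptually, the retrieve-then-rerank pattern looks like the minimal sketch below. The helpers first_stage_search and cross_encoder_scores are hypothetical placeholders for whatever retriever and cross-encoder you use; the rest of this tutorial implements both stages with FastEmbed and Qdrant.
def rerank_pipeline(query, corpus, first_stage_search, cross_encoder_scores, limit=10):
    # Stage 1: cheap retrieval over the whole corpus (e.g. BM25 or dense embeddings)
    candidates = first_stage_search(query, corpus, limit=limit)
    # Stage 2: expensive but precise scoring, applied only to the small candidate set
    scores = cross_encoder_scores(query, candidates)
    # Reorder the candidates by the reranker's scores, highest first
    reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in reranked]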
The Goal of This Tutorial
Cross-encoder models are commonly used as rerankers. This tutorial uses Jina Reranker v2 Base Multilingual (licensed under CC-BY-NC-4.0), a cross-encoder reranker supported in FastEmbed.
We use the all-MiniLM-L6-v2 dense embedding model (also supported in FastEmbed) as the first-stage retriever and then refine its results with Jina Reranker v2.
Setup
Install qdrant-client with the fastembed extra.
pip install "qdrant-client[fastembed]>=1.14.1"
Import the cross-encoder class and the text embedding class used for first-stage retrieval.
from fastembed import TextEmbedding
from fastembed.rerank.cross_encoder import TextCrossEncoder
You can list the cross-encoder rerankers supported in FastEmbed with the following command.
TextCrossEncoder.list_supported_models()
This command displays the available models, including details such as the model description, size, sources, model file, and license.
Available models:
[{'model': 'Xenova/ms-marco-MiniLM-L-6-v2',
'size_in_GB': 0.08,
'sources': {'hf': 'Xenova/ms-marco-MiniLM-L-6-v2'},
'model_file': 'onnx/model.onnx',
'description': 'MiniLM-L-6-v2 model optimized for re-ranking tasks.',
'license': 'apache-2.0'},
{'model': 'Xenova/ms-marco-MiniLM-L-12-v2',
'size_in_GB': 0.12,
'sources': {'hf': 'Xenova/ms-marco-MiniLM-L-12-v2'},
'model_file': 'onnx/model.onnx',
'description': 'MiniLM-L-12-v2 model optimized for re-ranking tasks.',
'license': 'apache-2.0'},
{'model': 'BAAI/bge-reranker-base',
'size_in_GB': 1.04,
'sources': {'hf': 'BAAI/bge-reranker-base'},
'model_file': 'onnx/model.onnx',
'description': 'BGE reranker base model for cross-encoder re-ranking.',
'license': 'mit'},
{'model': 'jinaai/jina-reranker-v1-tiny-en',
'size_in_GB': 0.13,
'sources': {'hf': 'jinaai/jina-reranker-v1-tiny-en'},
'model_file': 'onnx/model.onnx',
'description': 'Designed for blazing-fast re-ranking with 8K context length and fewer parameters than jina-reranker-v1-turbo-en.',
'license': 'apache-2.0'},
{'model': 'jinaai/jina-reranker-v1-turbo-en',
'size_in_GB': 0.15,
'sources': {'hf': 'jinaai/jina-reranker-v1-turbo-en'},
'model_file': 'onnx/model.onnx',
'description': 'Designed for blazing-fast re-ranking with 8K context length.',
'license': 'apache-2.0'},
{'model': 'jinaai/jina-reranker-v2-base-multilingual',
'size_in_GB': 1.11,
'sources': {'hf': 'jinaai/jina-reranker-v2-base-multilingual'},
'model_file': 'onnx/model.onnx',
'description': 'A multi-lingual reranker model for cross-encoder re-ranking with 1K context length and sliding window',
'license': 'cc-by-nc-4.0'}] # some of the fields are omitted for brevity
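If you want to pick a model programmatically, you can filter this list by the fields shown above, for example by license and size. A minimal sketch that relies only on those fields:
supported = TextCrossEncoder.list_supported_models()
# Keep only Apache-2.0-licensed rerankers smaller than roughly 200 MB
small_permissive = [
    model["model"]
    for model in supported
    if model["license"] == "apache-2.0" and model["size_in_GB"] < 0.2
]
print(small_permissive)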
Now, load the first-stage retriever and the reranker.
encoder_name = "sentence-transformers/all-MiniLM-L6-v2"
dense_embedding_model = TextEmbedding(model_name=encoder_name)
reranker = TextCrossEncoder(model_name='jinaai/jina-reranker-v2-base-multilingual')
The model files will be fetched and downloaded, with a progress bar shown.
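If you want to control where the model files are stored, both constructors also accept a cache_dir argument; this is an assumption based on common FastEmbed constructor options, so check the version you have installed. A minimal sketch:
# Optional: keep the downloaded model files in a project-local cache directory.
# cache_dir is assumed to be supported by your installed FastEmbed version.
dense_embedding_model = TextEmbedding(model_name=encoder_name, cache_dir="models_cache")
reranker = TextCrossEncoder(
    model_name="jinaai/jina-reranker-v2-base-multilingual",
    cache_dir="models_cache",
)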
Embedding & Indexing Data for First-Stage Retrieval
We will vectorize a sample dataset of movie descriptions with the all-MiniLM-L6-v2 model and store the embeddings in Qdrant for first-stage retrieval.
Then we will use the cross-encoder reranking model to rerank the small subset of data retrieved in the first stage.
The movie description dataset:
descriptions = ["In 1431, Jeanne d'Arc is placed on trial on charges of heresy. The ecclesiastical jurists attempt to force Jeanne to recant her claims of holy visions.",
"A film projectionist longs to be a detective, and puts his meagre skills to work when he is framed by a rival for stealing his girlfriend's father's pocketwatch.",
"A group of high-end professional thieves start to feel the heat from the LAPD when they unknowingly leave a clue at their latest heist.",
"A petty thief with an utter resemblance to a samurai warlord is hired as the lord's double. When the warlord later dies the thief is forced to take up arms in his place.",
"A young boy named Kubo must locate a magical suit of armour worn by his late father in order to defeat a vengeful spirit from the past.",
"A biopic detailing the 2 decades that Punjabi Sikh revolutionary Udham Singh spent planning the assassination of the man responsible for the Jallianwala Bagh massacre.",
"When a machine that allows therapists to enter their patients' dreams is stolen, all hell breaks loose. Only a young female therapist, Paprika, can stop it.",
"An ordinary word processor has the worst night of his life after he agrees to visit a girl in Soho whom he met that evening at a coffee shop.",
"A story that revolves around drug abuse in the affluent north Indian State of Punjab and how the youth there have succumbed to it en-masse resulting in a socio-economic decline.",
"A world-weary political journalist picks up the story of a woman's search for her son, who was taken away from her decades ago after she became pregnant and was forced to live in a convent.",
"Concurrent theatrical ending of the TV series Neon Genesis Evangelion (1995).",
"During World War II, a rebellious U.S. Army Major is assigned a dozen convicted murderers to train and lead them into a mass assassination mission of German officers.",
"The toys are mistakenly delivered to a day-care center instead of the attic right before Andy leaves for college, and it's up to Woody to convince the other toys that they weren't abandoned and to return home.",
"A soldier fighting aliens gets to relive the same day over and over again, the day restarting every time he dies.",
"After two male musicians witness a mob hit, they flee the state in an all-female band disguised as women, but further complications set in.",
"Exiled into the dangerous forest by her wicked stepmother, a princess is rescued by seven dwarf miners who make her part of their household.",
"A renegade reporter trailing a young runaway heiress for a big story joins her on a bus heading from Florida to New York, and they end up stuck with each other when the bus leaves them behind at one of the stops.",
"Story of 40-man Turkish task force who must defend a relay station.",
"Spinal Tap, one of England's loudest bands, is chronicled by film director Marty DiBergi on what proves to be a fateful tour.",
"Oskar, an overlooked and bullied boy, finds love and revenge through Eli, a beautiful but peculiar girl."]
descriptions_embeddings = list(
dense_embedding_model.embed(descriptions)
)
Let's upload the embeddings to Qdrant.
The Qdrant Client provides a simple in-memory mode that lets you experiment locally with small amounts of data.
Alternatively, you can experiment with a free cluster in Qdrant Cloud; a connection sketch follows the snippet below.
from qdrant_client import QdrantClient, models
client = QdrantClient(":memory:") # Qdrant is running from RAM.
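If you prefer the Qdrant Cloud option, connect with your cluster URL and API key instead. The values below are placeholders; a minimal sketch:
# Hypothetical placeholders: replace with your own cluster URL and API key.
client = QdrantClient(
    url="https://your-cluster-url.cloud.qdrant.io",
    api_key="your-api-key",
)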
Let's create a collection for our movie data.
client.create_collection(
collection_name="movies",
vectors_config={
"embedding": models.VectorParams(
size=client.get_embedding_size("sentence-transformers/all-MiniLM-L6-v2"),
distance=models.Distance.COSINE
)
}
)
And upload the embeddings to it.
client.upload_points(
collection_name="movies",
points=[
models.PointStruct(
id=idx,
payload={"description": description},
vector={"embedding": vector}
)
for idx, (description, vector) in enumerate(
zip(descriptions, descriptions_embeddings)
)
],
)
Alternatively, you can upload the points with implicit embedding computation, letting the client embed the documents for you.
client.upload_points(
collection_name="movies",
points=[
models.PointStruct(
id=idx,
payload={"description": description},
vector={"embedding": models.Document(text=description, model=encoder_name)},
)
for idx, description in enumerate(descriptions)
],
)
First-Stage Retrieval
Let's see how relevant the results are when using only the dense retriever based on all-MiniLM-L6-v2.
query = "A story about a strong historically significant female figure."
query_embedded = list(dense_embedding_model.query_embed(query))[0]
initial_retrieval = client.query_points(
collection_name="movies",
using="embedding",
query=query_embedded,
with_payload=True,
limit=10
)
description_hits = []
for i, hit in enumerate(initial_retrieval.points):
    print(f'Result number {i + 1} is "{hit.payload["description"]}"')
    description_hits.append(hit.payload["description"])
Alternatively, you can query with implicit embedding computation.
query = "A story about a strong historically significant female figure."
initial_retrieval = client.query_points(
collection_name="movies",
using="embedding",
query=models.Document(text=query, model=encoder_name),
with_payload=True,
limit=10
)
The results are as follows:
Result number 1 is "A world-weary political journalist picks up the story of a woman's search for her son, who was taken away from her decades ago after she became pregnant and was forced to live in a convent."
Result number 2 is "Exiled into the dangerous forest by her wicked stepmother, a princess is rescued by seven dwarf miners who make her part of their household."
...
Result number 9 is "A biopic detailing the 2 decades that Punjabi Sikh revolutionary Udham Singh spent planning the assassination of the man responsible for the Jallianwala Bagh massacre."
Result number 10 is "In 1431, Jeanne d'Arc is placed on trial on charges of heresy. The ecclesiastical jurists attempt to force Jeanne to recant her claims of holy visions."
We can see that the best-matching description, the story of Jeanne d'Arc, appears only in 10th place in the results.
Let's try to refine the order of the retrieved subset with Jina Reranker v2. It takes the query and a set of documents (movie descriptions) as input and computes a relevance score for each document based on token-level interactions between the query and that document.
new_scores = list(
reranker.rerank(query, description_hits)
) # returns scores between query and each document
ranking = [
(i, score) for i, score in enumerate(new_scores)
] # saving document indices
ranking.sort(
key=lambda x: x[1], reverse=True
) # sorting them in order of relevance defined by reranker
for i, rank in enumerate(ranking):
    print(f'Reranked result number {i + 1} is "{description_hits[rank[0]]}"')
The reranker moves the desired movie to first place based on its judgment of relevance.
Reranked result number 1 is "In 1431, Jeanne d'Arc is placed on trial on charges of heresy. The ecclesiastical jurists attempt to force Jeanne to recant her claims of holy visions."
Reranked result number 2 is "Exiled into the dangerous forest by her wicked stepmother, a princess is rescued by seven dwarf miners who make her part of their household."
...
Reranked result number 9 is "An ordinary word processor has the worst night of his life after he agrees to visit a girl in Soho whom he met that evening at a coffee shop."
Reranked result number 10 is "A biopic detailing the 2 decades that Punjabi Sikh revolutionary Udham Singh spent planning the assassination of the man responsible for the Jallianwala Bagh massacre."
Conclusion
Rerankers refine search results by re-ordering the retrieved candidates using deeper semantic analysis. To keep them efficient, apply them only to a small subset of the retrieval results.
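As a recap, the two-stage search from this tutorial can be wrapped into a single helper. This is only a sketch that reuses the client, dense_embedding_model, and reranker objects defined above:
def search_with_reranking(query: str, first_stage_limit: int = 10, top_k: int = 3):
    # Stage 1: dense retrieval of a small candidate set from Qdrant
    query_embedded = list(dense_embedding_model.query_embed(query))[0]
    hits = client.query_points(
        collection_name="movies",
        using="embedding",
        query=query_embedded,
        with_payload=True,
        limit=first_stage_limit,
    ).points
    candidates = [hit.payload["description"] for hit in hits]
    # Stage 2: cross-encoder reranking of the retrieved candidates only
    scores = list(reranker.rerank(query, candidates))
    reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return reranked[:top_k]

print(search_with_reranking("A story about a strong historically significant female figure."))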
Harness the power of rerankers to balance speed and accuracy in your search!