How to Generate ColBERT Multivectors with FastEmbed
ColBERT
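ColBERT is an embedding model that produces a matrix (multivector) representation of input text, generating one vector per token (a token is a unit of text that is meaningful to a machine learning model). Compared with the many dense embedding models that represent an entire input with a single vector, this approach lets ColBERT capture more nuanced input semantics. This finer-grained representation makes ColBERT a strong retriever. However, the advantage comes at the cost of higher resource consumption than traditional dense embedding models, in both speed and memory.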
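Although ColBERT is a strong retriever, its speed limitations can make it less suitable for large-scale retrieval. We therefore generally recommend using ColBERT to rerank a small set of already retrieved examples rather than for first-stage retrieval. A simple dense retriever can first retrieve around 100-500 candidates, and ColBERT can then rerank them so that the most relevant results come first.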
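The retrieve-then-rerank pattern described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `retrieve_then_rerank`, `dense_scores`, and `colbert_scores` are hypothetical names, and the scores are made-up numbers standing in for a cheap dense similarity and a more expensive ColBERT score.

```python
from typing import Callable, List, Sequence


def retrieve_then_rerank(
    doc_ids: Sequence[str],
    cheap_score: Callable[[str], float],
    expensive_score: Callable[[str], float],
    first_stage_k: int,
    final_k: int,
) -> List[str]:
    """Keep the top `first_stage_k` documents by the cheap score, then
    reorder only those candidates by the expensive score."""
    candidates = sorted(doc_ids, key=cheap_score, reverse=True)[:first_stage_k]
    return sorted(candidates, key=expensive_score, reverse=True)[:final_k]


# Made-up scores for four hypothetical documents (not real model output).
dense_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
colbert_scores = {"a": 10.0, "b": 12.5, "c": 11.0, "d": 20.0}

top = retrieve_then_rerank(
    doc_ids=list(dense_scores),
    cheap_score=lambda doc_id: dense_scores[doc_id],
    expensive_score=lambda doc_id: colbert_scores[doc_id],
    first_stage_k=3,
    final_k=2,
)
print(top)  # ['b', 'c']
```

Document `d` has the highest reranking score but never reaches the second stage, because it did not survive the first one. The final ranking can only be as good as the recall of the first-stage retriever.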
ColBERT is a valuable alternative to cross-encoders, because its late interaction mechanism tends to make it faster at inference time.
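How does late interaction work? A cross-encoder processes the query and the document concatenated together as a single input. The cross-encoder model splits this input into parts that are meaningful to the model and checks how these parts relate to each other. As a result, all interactions between the query and the document happen "early", inside the model. Late interaction models such as ColBERT perform only the first part: they produce document and query parts that are suitable for comparison. All interactions between these parts are expected to happen "late", outside the model.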
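One way to picture the "late" step is the MaxSim operation that ColBERT-style models are commonly scored with: each query token vector is matched to its most similar document token vector, and those best-match similarities are summed. The sketch below uses tiny hand-written 2-dimensional vectors instead of real model output and assumes cosine similarity as the token-level metric; `cosine` and `max_sim` are illustrative helper names, not part of FastEmbed or Qdrant.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_sim(query_vectors: List[Vector], document_vectors: List[Vector]) -> float:
    """For every query vector take its best match among the document
    vectors, then sum those best-match similarities."""
    return sum(
        max(cosine(q, d) for d in document_vectors)
        for q in query_vectors
    )


# Toy "token" vectors; real ColBERT vectors would have 128 dimensions.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # matches both query tokens
doc_b = [[1.0, 0.0], [-1.0, 0.0]]             # matches only the first one

score_a = max_sim(query, doc_a)
score_b = max_sim(query, doc_b)
print(score_a, score_b)
```

Because the document vectors do not depend on the query, they can be computed once and stored; only the comparison itself has to run at query time.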
Using ColBERT in Qdrant
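Qdrant supports multivector representations out of the box, so you can use any late interaction model (such as ColBERT or ColPali) in Qdrant without any additional pre-processing or post-processing.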
This tutorial uses ColBERT as a first-stage retriever on a toy dataset. See our multi-stage queries documentation for how to use ColBERT as a reranker.
Setup
Install fastembed.
pip install fastembed
Import the late interaction model for text embedding.
from fastembed import LateInteractionTextEmbedding
You can list the late interaction models that FastEmbed supports.
LateInteractionTextEmbedding.list_supported_models()
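This command displays the available models. The output shows details for each model, including the output embedding dimension, model description, model size, model sources, and model file.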
[{'model': 'colbert-ir/colbertv2.0',
'dim': 128,
'description': 'Late interaction model',
'size_in_GB': 0.44,
'sources': {'hf': 'colbert-ir/colbertv2.0'},
'model_file': 'model.onnx'},
{'model': 'answerdotai/answerai-colbert-small-v1',
'dim': 96,
'description': 'Text embeddings, Unimodal (text), Multilingual (~100 languages), 512 input tokens truncation, 2024 year',
'size_in_GB': 0.13,
'sources': {'hf': 'answerdotai/answerai-colbert-small-v1'},
'model_file': 'vespa_colbert.onnx'}]
Now, load the model.
model_name = "colbert-ir/colbertv2.0"
embedding_model = LateInteractionTextEmbedding(model_name)
The model files will be fetched and downloaded, with progress shown.
Embedding data
We will vectorize a toy movie description dataset with ColBERT.
Movie description dataset
descriptions = ["In 1431, Jeanne d'Arc is placed on trial on charges of heresy. The ecclesiastical jurists attempt to force Jeanne to recant her claims of holy visions.",
"A film projectionist longs to be a detective, and puts his meagre skills to work when he is framed by a rival for stealing his girlfriend's father's pocketwatch.",
"A group of high-end professional thieves start to feel the heat from the LAPD when they unknowingly leave a clue at their latest heist.",
"A petty thief with an utter resemblance to a samurai warlord is hired as the lord's double. When the warlord later dies the thief is forced to take up arms in his place.",
"A young boy named Kubo must locate a magical suit of armour worn by his late father in order to defeat a vengeful spirit from the past.",
"A biopic detailing the 2 decades that Punjabi Sikh revolutionary Udham Singh spent planning the assassination of the man responsible for the Jallianwala Bagh massacre.",
"When a machine that allows therapists to enter their patients' dreams is stolen, all hell breaks loose. Only a young female therapist, Paprika, can stop it.",
"An ordinary word processor has the worst night of his life after he agrees to visit a girl in Soho whom he met that evening at a coffee shop.",
"A story that revolves around drug abuse in the affluent north Indian State of Punjab and how the youth there have succumbed to it en-masse resulting in a socio-economic decline.",
"A world-weary political journalist picks up the story of a woman's search for her son, who was taken away from her decades ago after she became pregnant and was forced to live in a convent.",
"Concurrent theatrical ending of the TV series Neon Genesis Evangelion (1995).",
"During World War II, a rebellious U.S. Army Major is assigned a dozen convicted murderers to train and lead them into a mass assassination mission of German officers.",
"The toys are mistakenly delivered to a day-care center instead of the attic right before Andy leaves for college, and it's up to Woody to convince the other toys that they weren't abandoned and to return home.",
"A soldier fighting aliens gets to relive the same day over and over again, the day restarting every time he dies.",
"After two male musicians witness a mob hit, they flee the state in an all-female band disguised as women, but further complications set in.",
"Exiled into the dangerous forest by her wicked stepmother, a princess is rescued by seven dwarf miners who make her part of their household.",
"A renegade reporter trailing a young runaway heiress for a big story joins her on a bus heading from Florida to New York, and they end up stuck with each other when the bus leaves them behind at one of the stops.",
"Story of 40-man Turkish task force who must defend a relay station.",
"Spinal Tap, one of England's loudest bands, is chronicled by film director Marty DiBergi on what proves to be a fateful tour.",
"Oskar, an overlooked and bullied boy, finds love and revenge through Eli, a beautiful but peculiar girl."]
Vectorization is done with the embed generator function.
descriptions_embeddings = list(
    embedding_model.embed(descriptions)
)
Let's check the size of one of the produced embeddings.
descriptions_embeddings[0].shape
We get the following result:
(48, 128)
This means that the first description is represented by 48 vectors, each of length 128.
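A quick back-of-the-envelope calculation shows why a multivector costs more memory than a single dense vector. The sketch assumes every value is stored as a 4-byte float32 and ignores any index or storage overhead; `multivector_bytes` is a hypothetical helper, not a FastEmbed or Qdrant function.

```python
BYTES_PER_FLOAT32 = 4  # assumption: values are stored as 32-bit floats


def multivector_bytes(num_vectors: int, dim: int,
                      bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Raw size of a (num_vectors x dim) matrix, without any overhead."""
    return num_vectors * dim * bytes_per_value


# Shape of the first description's embedding, as reported above: (48, 128).
first_description_bytes = multivector_bytes(48, 128)

# A single 128-dimensional vector, for comparison.
single_vector_bytes = multivector_bytes(1, 128)

print(first_description_bytes)                         # 24576
print(first_description_bytes // single_vector_bytes)  # 48
```

The raw footprint grows linearly with the number of tokens, so longer texts produce proportionally larger multivectors.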
Upload embeddings to Qdrant
Install qdrant-client
pip install "qdrant-client>=1.14.2"
Qdrant Client has a simple in-memory mode that lets you experiment locally on small amounts of data. Alternatively, you can experiment with a free cluster in Qdrant Cloud.
from qdrant_client import QdrantClient, models
qdrant_client = QdrantClient(":memory:") # Qdrant is running from RAM.
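Now, let's create a small collection with the movie data. For that, we will use the multivector functionality supported in Qdrant. To configure a multivector collection, we need to specify:
- the similarity metric between individual vectors;
- the size of each vector (128 for the colbert-ir/colbertv2.0 model used here);
- the similarity metric between multivectors (matrices), for example maximum similarity (MaxSim): for each vector in matrix A, we find the most similar vector in matrix B, and these best-match scores are summed to give the similarity between the matrices.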
qdrant_client.create_collection(
    collection_name="movies",
    vectors_config=models.VectorParams(
        size=128,  # size of each vector produced by ColBERT
        distance=models.Distance.COSINE,  # similarity metric between each vector
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM  # similarity metric between multivectors (matrices)
        ),
    ),
)
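To make this collection human-readable, let's save the movie metadata (name, text description, and running time) together with the embedded descriptions.
Movie metadata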
metadata = [{"movie_name": "The Passion of Joan of Arc", "movie_watch_time_min": 114, "movie_description": "In 1431, Jeanne d'Arc is placed on trial on charges of heresy. The ecclesiastical jurists attempt to force Jeanne to recant her claims of holy visions."},
{"movie_name": "Sherlock Jr.", "movie_watch_time_min": 45, "movie_description": "A film projectionist longs to be a detective, and puts his meagre skills to work when he is framed by a rival for stealing his girlfriend's father's pocketwatch."},
{"movie_name": "Heat", "movie_watch_time_min": 170, "movie_description": "A group of high-end professional thieves start to feel the heat from the LAPD when they unknowingly leave a clue at their latest heist."},
{"movie_name": "Kagemusha", "movie_watch_time_min": 162, "movie_description": "A petty thief with an utter resemblance to a samurai warlord is hired as the lord's double. When the warlord later dies the thief is forced to take up arms in his place."},
{"movie_name": "Kubo and the Two Strings", "movie_watch_time_min": 101, "movie_description": "A young boy named Kubo must locate a magical suit of armour worn by his late father in order to defeat a vengeful spirit from the past."},
{"movie_name": "Sardar Udham", "movie_watch_time_min": 164, "movie_description": "A biopic detailing the 2 decades that Punjabi Sikh revolutionary Udham Singh spent planning the assassination of the man responsible for the Jallianwala Bagh massacre."},
{"movie_name": "Paprika", "movie_watch_time_min": 90, "movie_description": "When a machine that allows therapists to enter their patients' dreams is stolen, all hell breaks loose. Only a young female therapist, Paprika, can stop it."},
{"movie_name": "After Hours", "movie_watch_time_min": 97, "movie_description": "An ordinary word processor has the worst night of his life after he agrees to visit a girl in Soho whom he met that evening at a coffee shop."},
{"movie_name": "Udta Punjab", "movie_watch_time_min": 148, "movie_description": "A story that revolves around drug abuse in the affluent north Indian State of Punjab and how the youth there have succumbed to it en-masse resulting in a socio-economic decline."},
{"movie_name": "Philomena", "movie_watch_time_min": 98, "movie_description": "A world-weary political journalist picks up the story of a woman's search for her son, who was taken away from her decades ago after she became pregnant and was forced to live in a convent."},
{"movie_name": "Neon Genesis Evangelion: The End of Evangelion", "movie_watch_time_min": 87, "movie_description": "Concurrent theatrical ending of the TV series Neon Genesis Evangelion (1995)."},
{"movie_name": "The Dirty Dozen", "movie_watch_time_min": 150, "movie_description": "During World War II, a rebellious U.S. Army Major is assigned a dozen convicted murderers to train and lead them into a mass assassination mission of German officers."},
{"movie_name": "Toy Story 3", "movie_watch_time_min": 103, "movie_description": "The toys are mistakenly delivered to a day-care center instead of the attic right before Andy leaves for college, and it's up to Woody to convince the other toys that they weren't abandoned and to return home."},
{"movie_name": "Edge of Tomorrow", "movie_watch_time_min": 113, "movie_description": "A soldier fighting aliens gets to relive the same day over and over again, the day restarting every time he dies."},
{"movie_name": "Some Like It Hot", "movie_watch_time_min": 121, "movie_description": "After two male musicians witness a mob hit, they flee the state in an all-female band disguised as women, but further complications set in."},
{"movie_name": "Snow White and the Seven Dwarfs", "movie_watch_time_min": 83, "movie_description": "Exiled into the dangerous forest by her wicked stepmother, a princess is rescued by seven dwarf miners who make her part of their household."},
{"movie_name": "It Happened One Night", "movie_watch_time_min": 105, "movie_description": "A renegade reporter trailing a young runaway heiress for a big story joins her on a bus heading from Florida to New York, and they end up stuck with each other when the bus leaves them behind at one of the stops."},
{"movie_name": "Nefes: Vatan Sagolsun", "movie_watch_time_min": 128, "movie_description": "Story of 40-man Turkish task force who must defend a relay station."},
{"movie_name": "This Is Spinal Tap", "movie_watch_time_min": 82, "movie_description": "Spinal Tap, one of England's loudest bands, is chronicled by film director Marty DiBergi on what proves to be a fateful tour."},
{"movie_name": "Let the Right One In", "movie_watch_time_min": 114, "movie_description": "Oskar, an overlooked and bullied boy, finds love and revenge through Eli, a beautiful but peculiar girl."}]
qdrant_client.upload_points(
    collection_name="movies",
    points=[
        models.PointStruct(
            id=idx,
            payload=metadata[idx],
            vector=vector,
        )
        for idx, vector in enumerate(descriptions_embeddings)
    ],
)
Upload with implicit embeddings computation
description_documents = [
    models.Document(text=description, model=model_name)
    for description in descriptions
]
qdrant_client.upload_points(
    collection_name="movies",
    points=[
        models.PointStruct(
            id=idx,
            payload=metadata[idx],
            vector=description_document,
        )
        for idx, description_document in enumerate(description_documents)
    ],
)
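Querying
ColBERT embeds documents and queries in two distinct ways, and FastEmbed does the same. However, we altered the query pre-processing used in ColBERT, so queries do not have to be truncated at 32 tokens; longer queries are ingested directly.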
qdrant_client.query_points(
    collection_name="movies",
    query=list(embedding_model.query_embed("A movie for kids with fantasy elements and wonders"))[0],  # convert the generator into a numpy.ndarray
    limit=1,  # number of movies closest to the query to return
    # with_vectors=True,  # if enabled, vectors are also returned
    with_payload=True,  # include the metadata in the output
)
Query points with implicit embeddings computation
query_document = models.Document(text="A movie for kids with fantasy elements and wonders", model=model_name)
qdrant_client.query_points(
    collection_name="movies",
    query=query_document,
    limit=1,
)
The result is the following:
QueryResponse(points=[ScoredPoint(id=4, version=0, score=12.063469,
payload={'movie_name': 'Kubo and the Two Strings', 'movie_watch_time_min': 101,
'movie_description': 'A young boy named Kubo must locate a magical suit of armour worn by his late father in order to defeat a vengeful spirit from the past.'},
vector=None, shard_key=None, order_value=None)])