How to Generate ColBERT Multivectors with FastEmbed

ColBERT

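ColBERT is an embedding model that produces a matrix (multivector) representation of the input text, with one vector per token (a token is a unit of text that is meaningful to a machine learning model). This lets ColBERT capture finer-grained input semantics than many dense embedding models, which represent the entire input with a single vector. These more granular representations make ColBERT a strong retriever. The advantage comes at a cost, however: compared to traditional dense embedding models, ColBERT consumes more resources, both in speed and in memory.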

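Although ColBERT is a powerful retriever, its speed limitations can make it unsuitable for large-scale retrieval. We therefore generally recommend using ColBERT to rerank a small set of already retrieved examples rather than for first-stage retrieval. A simple dense retriever can first fetch around 100-500 candidates, and ColBERT can then rerank them to bring the most relevant results to the top.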

As a reranking model, ColBERT is a notable alternative to cross-encoders, since it tends to be faster at inference time thanks to its late interaction mechanism.

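How does late interaction work? A cross-encoder glues the query and the document together into a single input. The cross-encoder model splits this input into parts that are meaningful to the model and checks how these parts relate to each other. All interactions between the query and the document therefore happen "early", inside the model. Late interaction models such as ColBERT do only the first step: they produce document and query parts that are suitable for comparison. All interactions between these parts are expected to happen "later", outside the model.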
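
To make the "outside the model" step concrete, here is a minimal sketch of a late interaction score in plain Python. It assumes the commonly used MaxSim rule (for every query token vector, take the best cosine match among the document token vectors, then sum those maxima). The function names `cosine` and `max_sim` and the tiny 2-dimensional vectors are illustrative assumptions, not part of FastEmbed or Qdrant.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_sim(query_vectors, document_vectors):
    """Sum, over query token vectors, of the best cosine match in the document."""
    return sum(
        max(cosine(q, d) for d in document_vectors)
        for q in query_vectors
    )


# Hypothetical token vectors; real ColBERT vectors have 128 dimensions.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # contains a perfect match for each query token
doc_b = [[1.0, 1.0]]                          # only partially matches each query token

score_a = max_sim(query, doc_a)
score_b = max_sim(query, doc_b)
print(score_a, score_b)  # doc_a scores higher than doc_b
```

Because the document vectors do not depend on the query, they can be computed once and stored; only this lightweight comparison has to run at query time.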

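Using ColBERT in Qdrant

Qdrant natively supports multivector representations, so you can use any late interaction model, such as ColBERT or ColPali, in Qdrant without any additional pre- or post-processing.

This tutorial uses ColBERT as a first-stage retriever on a toy dataset. To learn how to use ColBERT as a reranker, see our multi-stage queries documentation.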

Setup

Install fastembed.

pip install fastembed

Import the late interaction model class for text embedding.

from fastembed import LateInteractionTextEmbedding

You can list the late interaction models supported in FastEmbed.

LateInteractionTextEmbedding.list_supported_models()

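This command displays the available models. The output shows details about each model, including the output embedding dimension, model description, model size, model sources, and model file.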

[{'model': 'colbert-ir/colbertv2.0',
  'dim': 128,
  'description': 'Late interaction model',
  'size_in_GB': 0.44,
  'sources': {'hf': 'colbert-ir/colbertv2.0'},
  'model_file': 'model.onnx'},
 {'model': 'answerdotai/answerai-colbert-small-v1',
  'dim': 96,
  'description': 'Text embeddings, Unimodal (text), Multilingual (~100 languages), 512 input tokens truncation, 2024 year',
  'size_in_GB': 0.13,
  'sources': {'hf': 'answerdotai/answerai-colbert-small-v1'},
  'model_file': 'vespa_colbert.onnx'}]

Now, load the model.

model_name = "colbert-ir/colbertv2.0"
embedding_model = LateInteractionTextEmbedding(model_name)

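The model files will be fetched and downloaded, with progress displayed.

Embedding data

We will vectorize a toy dataset of movie descriptions with ColBERT.

Movie description dataset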
descriptions = ["In 1431, Jeanne d'Arc is placed on trial on charges of heresy. The ecclesiastical jurists attempt to force Jeanne to recant her claims of holy visions.",
 "A film projectionist longs to be a detective, and puts his meagre skills to work when he is framed by a rival for stealing his girlfriend's father's pocketwatch.",
 "A group of high-end professional thieves start to feel the heat from the LAPD when they unknowingly leave a clue at their latest heist.",
 "A petty thief with an utter resemblance to a samurai warlord is hired as the lord's double. When the warlord later dies the thief is forced to take up arms in his place.",
 "A young boy named Kubo must locate a magical suit of armour worn by his late father in order to defeat a vengeful spirit from the past.",
 "A biopic detailing the 2 decades that Punjabi Sikh revolutionary Udham Singh spent planning the assassination of the man responsible for the Jallianwala Bagh massacre.",
 "When a machine that allows therapists to enter their patients' dreams is stolen, all hell breaks loose. Only a young female therapist, Paprika, can stop it.",
 "An ordinary word processor has the worst night of his life after he agrees to visit a girl in Soho whom he met that evening at a coffee shop.",
 "A story that revolves around drug abuse in the affluent north Indian State of Punjab and how the youth there have succumbed to it en-masse resulting in a socio-economic decline.",
 "A world-weary political journalist picks up the story of a woman's search for her son, who was taken away from her decades ago after she became pregnant and was forced to live in a convent.",
 "Concurrent theatrical ending of the TV series Neon Genesis Evangelion (1995).",
 "During World War II, a rebellious U.S. Army Major is assigned a dozen convicted murderers to train and lead them into a mass assassination mission of German officers.",
 "The toys are mistakenly delivered to a day-care center instead of the attic right before Andy leaves for college, and it's up to Woody to convince the other toys that they weren't abandoned and to return home.",
 "A soldier fighting aliens gets to relive the same day over and over again, the day restarting every time he dies.",
 "After two male musicians witness a mob hit, they flee the state in an all-female band disguised as women, but further complications set in.",
 "Exiled into the dangerous forest by her wicked stepmother, a princess is rescued by seven dwarf miners who make her part of their household.",
 "A renegade reporter trailing a young runaway heiress for a big story joins her on a bus heading from Florida to New York, and they end up stuck with each other when the bus leaves them behind at one of the stops.",
 "Story of 40-man Turkish task force who must defend a relay station.",
 "Spinal Tap, one of England's loudest bands, is chronicled by film director Marty DiBergi on what proves to be a fateful tour.",
 "Oskar, an overlooked and bullied boy, finds love and revenge through Eli, a beautiful but peculiar girl."]

Vectorization is done with the embed generator function.

descriptions_embeddings = list(
    embedding_model.embed(descriptions)
)

Let's check the size of one of the generated embeddings.

descriptions_embeddings[0].shape

We get the following result:

(48, 128)

This means that the first description is represented by 48 vectors of length 128.
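
The shape above also gives a feel for the memory cost of multivectors. The following back-of-the-envelope sketch assumes each value is stored as a 4-byte float32 and ignores any index or storage overhead; the helper `embedding_bytes` is a hypothetical name used only for this illustration, and the single-vector comparison assumes a dense model with the same dimension.

```python
BYTES_PER_FLOAT32 = 4  # assumption: values are stored as 32-bit floats


def embedding_bytes(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size in bytes of an embedding made of num_vectors vectors of length dim."""
    return num_vectors * dim * bytes_per_value


# The first description: 48 token vectors of length 128.
multivector_bytes = embedding_bytes(48, 128)

# A hypothetical dense model producing a single 128-dimensional vector.
single_vector_bytes = embedding_bytes(1, 128)

ratio = multivector_bytes // single_vector_bytes
print(multivector_bytes, single_vector_bytes, ratio)  # 24576 512 48
```

Under these assumptions the multivector needs as many times more raw storage as the text has tokens, which is one reason ColBERT is usually applied to a small candidate set.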

Upload embeddings to Qdrant

Install qdrant-client.

pip install "qdrant-client>=1.14.2"

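The Qdrant client has a simple in-memory mode that lets you experiment locally on small data volumes. Alternatively, you can use a free cluster in Qdrant Cloud for experiments.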

from qdrant_client import QdrantClient, models

qdrant_client = QdrantClient(":memory:") # Qdrant is running from RAM.

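Now, let's create a small collection with our movie data. For that, we will use the multivector functionality supported in Qdrant. To configure a multivector collection, we need to specify:

  • the similarity metric between individual vectors;
  • the size of each vector (for ColBERT, it is 128);
  • the similarity metric between multivectors (matrices), for example MaxSim: for each vector in the query matrix, we find the most similar vector in the document matrix, and the sum of these maximum similarities is the matrix similarity.
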
qdrant_client.create_collection(
    collection_name="movies",
    vectors_config=models.VectorParams(
        size=128, #size of each vector produced by ColBERT
        distance=models.Distance.COSINE, #similarity metric between each vector
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM #similarity metric between multivectors (matrices)
        ),
    ),
)

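To make this collection human-readable, we save the movie metadata (name, description in text form, and movie length) together with the embedded descriptions.

Movie metadata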
metadata = [{"movie_name": "The Passion of Joan of Arc", "movie_watch_time_min": 114, "movie_description": "In 1431, Jeanne d'Arc is placed on trial on charges of heresy. The ecclesiastical jurists attempt to force Jeanne to recant her claims of holy visions."},
{"movie_name": "Sherlock Jr.", "movie_watch_time_min": 45, "movie_description": "A film projectionist longs to be a detective, and puts his meagre skills to work when he is framed by a rival for stealing his girlfriend's father's pocketwatch."},
{"movie_name": "Heat", "movie_watch_time_min": 170, "movie_description": "A group of high-end professional thieves start to feel the heat from the LAPD when they unknowingly leave a clue at their latest heist."},
{"movie_name": "Kagemusha", "movie_watch_time_min": 162, "movie_description": "A petty thief with an utter resemblance to a samurai warlord is hired as the lord's double. When the warlord later dies the thief is forced to take up arms in his place."},
{"movie_name": "Kubo and the Two Strings", "movie_watch_time_min": 101, "movie_description": "A young boy named Kubo must locate a magical suit of armour worn by his late father in order to defeat a vengeful spirit from the past."},
{"movie_name": "Sardar Udham", "movie_watch_time_min": 164, "movie_description": "A biopic detailing the 2 decades that Punjabi Sikh revolutionary Udham Singh spent planning the assassination of the man responsible for the Jallianwala Bagh massacre."},
{"movie_name": "Paprika", "movie_watch_time_min": 90, "movie_description": "When a machine that allows therapists to enter their patients' dreams is stolen, all hell breaks loose. Only a young female therapist, Paprika, can stop it."},
{"movie_name": "After Hours", "movie_watch_time_min": 97, "movie_description": "An ordinary word processor has the worst night of his life after he agrees to visit a girl in Soho whom he met that evening at a coffee shop."},
{"movie_name": "Udta Punjab", "movie_watch_time_min": 148, "movie_description": "A story that revolves around drug abuse in the affluent north Indian State of Punjab and how the youth there have succumbed to it en-masse resulting in a socio-economic decline."},
{"movie_name": "Philomena", "movie_watch_time_min": 98, "movie_description": "A world-weary political journalist picks up the story of a woman's search for her son, who was taken away from her decades ago after she became pregnant and was forced to live in a convent."},
{"movie_name": "Neon Genesis Evangelion: The End of Evangelion", "movie_watch_time_min": 87, "movie_description": "Concurrent theatrical ending of the TV series Neon Genesis Evangelion (1995)."},
{"movie_name": "The Dirty Dozen", "movie_watch_time_min": 150, "movie_description": "During World War II, a rebellious U.S. Army Major is assigned a dozen convicted murderers to train and lead them into a mass assassination mission of German officers."},
{"movie_name": "Toy Story 3", "movie_watch_time_min": 103, "movie_description": "The toys are mistakenly delivered to a day-care center instead of the attic right before Andy leaves for college, and it's up to Woody to convince the other toys that they weren't abandoned and to return home."},
{"movie_name": "Edge of Tomorrow", "movie_watch_time_min": 113, "movie_description": "A soldier fighting aliens gets to relive the same day over and over again, the day restarting every time he dies."},
{"movie_name": "Some Like It Hot", "movie_watch_time_min": 121, "movie_description": "After two male musicians witness a mob hit, they flee the state in an all-female band disguised as women, but further complications set in."},
{"movie_name": "Snow White and the Seven Dwarfs", "movie_watch_time_min": 83, "movie_description": "Exiled into the dangerous forest by her wicked stepmother, a princess is rescued by seven dwarf miners who make her part of their household."},
{"movie_name": "It Happened One Night", "movie_watch_time_min": 105, "movie_description": "A renegade reporter trailing a young runaway heiress for a big story joins her on a bus heading from Florida to New York, and they end up stuck with each other when the bus leaves them behind at one of the stops."},
{"movie_name": "Nefes: Vatan Sagolsun", "movie_watch_time_min": 128, "movie_description": "Story of 40-man Turkish task force who must defend a relay station."},
{"movie_name": "This Is Spinal Tap", "movie_watch_time_min": 82, "movie_description": "Spinal Tap, one of England's loudest bands, is chronicled by film director Marty DiBergi on what proves to be a fateful tour."},
{"movie_name": "Let the Right One In", "movie_watch_time_min": 114, "movie_description": "Oskar, an overlooked and bullied boy, finds love and revenge through Eli, a beautiful but peculiar girl."}]
qdrant_client.upload_points(
    collection_name="movies",
    points=[
        models.PointStruct(
            id=idx,
            payload=metadata[idx],
            vector=vector
        )
        for idx, vector in enumerate(descriptions_embeddings)
    ],
)

Upload with implicit embeddings computation (an alternative to the explicit upload above)
description_documents = [models.Document(text=description, model=model_name) for description in descriptions]
qdrant_client.upload_points(
    collection_name="movies",
    points=[
        models.PointStruct(
            id=idx,
            payload=metadata[idx],
            vector=description_document
        )
        for idx, description_document in enumerate(description_documents)
    ],
)

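Querying

ColBERT uses two distinct methods for embedding documents and queries, and so does FastEmbed. However, we altered the query pre-processing used in ColBERT, so we do not have to truncate all queries at 32 tokens and can ingest longer queries directly.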

qdrant_client.query_points(
    collection_name="movies",
    query=list(embedding_model.query_embed("A movie for kids with fantasy elements and wonders"))[0],  # convert the generator output into a numpy.ndarray
    limit=1,  # how many of the movies closest to the query to return
    # with_vectors=True,  # uncomment to also return the stored vectors
    with_payload=True,  # include the metadata in the output
)

Query points with implicit embeddings computation (an alternative to the explicit query above)
query_document = models.Document(text="A movie for kids with fantasy elements and wonders", model=model_name)
qdrant_client.query_points(
    collection_name="movies",
    query=query_document,
    limit=1,
)

The result is the following:

QueryResponse(points=[ScoredPoint(id=4, version=0, score=12.063469,
payload={'movie_name': 'Kubo and the Two Strings', 'movie_watch_time_min': 101, 
'movie_description': 'A young boy named Kubo must locate a magical suit of armour worn by his late father in order to defeat a vengeful spirit from the past.'},
vector=None, shard_key=None, order_value=None)])