How to Generate ColBERT Multivectors with FastEmbed

ColBERT

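ColBERT is an embedding model that produces a matrix (multivector) representation of input text, generating one vector per token (a token being a meaningful text unit for a machine learning model). This lets ColBERT capture more nuanced input semantics than many dense embedding models, which represent the entire input with a single vector. These finer-grained representations make ColBERT a strong retriever. The advantage comes at a cost, however: compared with traditional dense embedding models, ColBERT consumes more resources, in both speed and memory.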

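Although ColBERT is a powerful retriever, its speed limitations can make it less suitable for large-scale retrieval. We therefore generally recommend using ColBERT to rerank a small set of already retrieved candidates rather than for first-stage retrieval. A simple dense retriever can fetch roughly 100-500 candidates first, and ColBERT can then rerank them so that the most relevant results come out on top.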
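The retrieve-then-rerank flow described above can be sketched in a few lines. This is only an illustration of the pipeline shape, not Qdrant's or FastEmbed's API: `retrieve_then_rerank`, `shared_words`, and `shared_words_length_normalised` are hypothetical names, and the two toy scoring functions merely stand in for a cheap dense retriever and a more expensive ColBERT-style scorer.

```python
from typing import Callable, List, Sequence


def retrieve_then_rerank(
    query: str,
    documents: Sequence[str],
    first_stage_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    num_candidates: int = 100,
    limit: int = 10,
) -> List[str]:
    """A cheap scorer narrows the corpus; a costlier scorer orders the survivors."""
    # Stage 1: score every document with the cheap function and keep the best few.
    candidates = sorted(
        documents, key=lambda doc: first_stage_score(query, doc), reverse=True
    )[:num_candidates]
    # Stage 2: run the expensive scorer on the candidates only.
    reranked = sorted(
        candidates, key=lambda doc: rerank_score(query, doc), reverse=True
    )
    return reranked[:limit]


# Toy stand-ins for the two models (hypothetical, for illustration only).
def shared_words(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def shared_words_length_normalised(query: str, doc: str) -> float:
    return shared_words(query, doc) / len(doc.split())


docs = [
    "a boy and a magical suit of armour",
    "a soldier fighting aliens",
    "a long story about a boy who does many unrelated things",
]
top = retrieve_then_rerank(
    "a magical boy", docs, shared_words, shared_words_length_normalised,
    num_candidates=2, limit=1,
)
```

The point of the design is that the expensive scorer runs on `num_candidates` documents at most, no matter how large the corpus is.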

ColBERT is a strong alternative to cross-encoders as a reranking model: thanks to its late interaction mechanism, it is usually faster at inference time.

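How does late interaction work? A cross-encoder takes the query and the document glued together as a single input. The model splits this input into parts that are meaningful to it and checks how those parts relate to each other, so all interaction between query and document happens "early", inside the model. A late interaction model such as ColBERT does only the first part of that job: it produces document and query parts that are suitable for comparison. All interaction between these parts is expected to happen "later", outside the model.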
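A minimal sketch of what that "later" step can look like, assuming the maximum-similarity (MaxSim) scoring commonly used with ColBERT: each query token vector is matched to its best document token vector, and the best-match scores are summed. The function names and the 2-dimensional vectors are made up for illustration; real ColBERT vectors come from the model and have 128 dimensions.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vectors: Sequence[Vector], doc_vectors: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(
        max(dot(q, d) for d in doc_vectors)
        for q in query_vectors
    )


# Toy 2-d "token embeddings" (made-up numbers, not real model output).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
doc_b = [[0.7, 0.3], [0.6, 0.4]]

score_a = maxsim(query, doc_a)  # best matches: 0.9 and 0.8
score_b = maxsim(query, doc_b)  # best matches: 0.7 and 0.4
```

Because the document vectors do not depend on the query, they can be computed once and stored; only this lightweight comparison runs at query time.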

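Using ColBERT in Qdrant

Qdrant supports multivector representations out of the box, so you can use any late interaction model (such as ColBERT or ColPali) in Qdrant directly, without any additional pre- or post-processing.

This tutorial demonstrates how to use ColBERT as a first-stage retriever on a toy dataset. To see how to use ColBERT as a reranker, refer to our multistage queries documentation.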

Setup

Install fastembed.

pip install fastembed

Import the late interaction model class for text embedding.

from fastembed import LateInteractionTextEmbedding

You can list the late interaction models that FastEmbed supports.

LateInteractionTextEmbedding.list_supported_models()

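This command displays the available models. The output shows details about each model, including the output embedding dimension, model description, model size, model source, and model file.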

[{'model': 'colbert-ir/colbertv2.0',
  'dim': 128,
  'description': 'Late interaction model',
  'size_in_GB': 0.44,
  'sources': {'hf': 'colbert-ir/colbertv2.0'},
  'model_file': 'model.onnx'},
 {'model': 'answerdotai/answerai-colbert-small-v1',
  'dim': 96,
  'description': 'Text embeddings, Unimodal (text), Multilingual (~100 languages), 512 input tokens truncation, 2024 year',
  'size_in_GB': 0.13,
  'sources': {'hf': 'answerdotai/answerai-colbert-small-v1'},
  'model_file': 'vespa_colbert.onnx'}]

Now, load the model.

model_name = "colbert-ir/colbertv2.0"
embedding_model = LateInteractionTextEmbedding(model_name)

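The model files are fetched and downloaded, and the download progress is displayed.

Embedding data

We will vectorize a toy dataset of movie descriptions with ColBERT.

Movie description dataset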
descriptions = ["In 1431, Jeanne d'Arc is placed on trial on charges of heresy. The ecclesiastical jurists attempt to force Jeanne to recant her claims of holy visions.",
 "A film projectionist longs to be a detective, and puts his meagre skills to work when he is framed by a rival for stealing his girlfriend's father's pocketwatch.",
 "A group of high-end professional thieves start to feel the heat from the LAPD when they unknowingly leave a clue at their latest heist.",
 "A petty thief with an utter resemblance to a samurai warlord is hired as the lord's double. When the warlord later dies the thief is forced to take up arms in his place.",
 "A young boy named Kubo must locate a magical suit of armour worn by his late father in order to defeat a vengeful spirit from the past.",
 "A biopic detailing the 2 decades that Punjabi Sikh revolutionary Udham Singh spent planning the assassination of the man responsible for the Jallianwala Bagh massacre.",
 "When a machine that allows therapists to enter their patients' dreams is stolen, all hell breaks loose. Only a young female therapist, Paprika, can stop it.",
 "An ordinary word processor has the worst night of his life after he agrees to visit a girl in Soho whom he met that evening at a coffee shop.",
 "A story that revolves around drug abuse in the affluent north Indian State of Punjab and how the youth there have succumbed to it en-masse resulting in a socio-economic decline.",
 "A world-weary political journalist picks up the story of a woman's search for her son, who was taken away from her decades ago after she became pregnant and was forced to live in a convent.",
 "Concurrent theatrical ending of the TV series Neon Genesis Evangelion (1995).",
 "During World War II, a rebellious U.S. Army Major is assigned a dozen convicted murderers to train and lead them into a mass assassination mission of German officers.",
 "The toys are mistakenly delivered to a day-care center instead of the attic right before Andy leaves for college, and it's up to Woody to convince the other toys that they weren't abandoned and to return home.",
 "A soldier fighting aliens gets to relive the same day over and over again, the day restarting every time he dies.",
 "After two male musicians witness a mob hit, they flee the state in an all-female band disguised as women, but further complications set in.",
 "Exiled into the dangerous forest by her wicked stepmother, a princess is rescued by seven dwarf miners who make her part of their household.",
 "A renegade reporter trailing a young runaway heiress for a big story joins her on a bus heading from Florida to New York, and they end up stuck with each other when the bus leaves them behind at one of the stops.",
 "Story of 40-man Turkish task force who must defend a relay station.",
 "Spinal Tap, one of England's loudest bands, is chronicled by film director Marty DiBergi on what proves to be a fateful tour.",
 "Oskar, an overlooked and bullied boy, finds love and revenge through Eli, a beautiful but peculiar girl."]

Vectorization is done with the embed generator function.

descriptions_embeddings = list(
    embedding_model.embed(descriptions)
)

Let's check the size of one of the produced embeddings.

descriptions_embeddings[0].shape

We get the following result.

(48, 128)

This means the first description is represented by 48 vectors, one per token, each of dimensionality 128.
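To put that shape in terms of memory, here is a back-of-the-envelope calculation. It assumes values are stored as 4-byte floats (float32), which is an assumption of this sketch rather than something stated above; `embedding_bytes` is a hypothetical helper, and the single-vector figure is a hypothetical embedding of the same width, included only for comparison.

```python
def embedding_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of a (num_vectors, dim) matrix; 4 bytes per value assumes float32."""
    return num_vectors * dim * bytes_per_value


multivector_bytes = embedding_bytes(48, 128)   # the (48, 128) matrix shown above
single_vector_bytes = embedding_bytes(1, 128)  # hypothetical one-vector embedding of the same width
ratio = multivector_bytes // single_vector_bytes
```

Under these assumptions the multivector takes 24,576 bytes against 512 bytes for a single vector, so storage grows in proportion to the number of tokens in the text.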

Upload embeddings to Qdrant

Install qdrant-client.

pip install "qdrant-client>=1.14.2"

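The Qdrant client offers a simple in-memory mode that lets you experiment locally with small amounts of data. Alternatively, you can use a free cluster in Qdrant Cloud for your experiments.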

from qdrant_client import QdrantClient, models

qdrant_client = QdrantClient(":memory:") # Qdrant is running from RAM.

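Now, let's create a small collection with movie data. For that, we will use the multivectors functionality supported in Qdrant. To configure a multivector collection, we need to specify:

  • the similarity metric between individual vectors;
  • the size of each vector (128 for colbert-ir/colbertv2.0);
  • the similarity metric between multivectors (matrices), for example maximum similarity (MaxSim): for each vector in matrix A, we find its most similar vector in matrix B, and the sum of these best-match scores is the similarity between the matrices.

qdrant_client.create_collection(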
    collection_name="movies",
    vectors_config=models.VectorParams(
        size=128,  # size of each vector produced by ColBERT
        distance=models.Distance.COSINE,  # similarity metric between individual vectors
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM  # similarity metric between multivectors (matrices)
        ),
    ),
)

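To make this collection human-readable, we store the movie metadata (name, description in text form, and running time) together with the embedded descriptions.

Movie metadata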
metadata = [{"movie_name": "The Passion of Joan of Arc", "movie_watch_time_min": 114, "movie_description": "In 1431, Jeanne d'Arc is placed on trial on charges of heresy. The ecclesiastical jurists attempt to force Jeanne to recant her claims of holy visions."},
{"movie_name": "Sherlock Jr.", "movie_watch_time_min": 45, "movie_description": "A film projectionist longs to be a detective, and puts his meagre skills to work when he is framed by a rival for stealing his girlfriend's father's pocketwatch."},
{"movie_name": "Heat", "movie_watch_time_min": 170, "movie_description": "A group of high-end professional thieves start to feel the heat from the LAPD when they unknowingly leave a clue at their latest heist."},
{"movie_name": "Kagemusha", "movie_watch_time_min": 162, "movie_description": "A petty thief with an utter resemblance to a samurai warlord is hired as the lord's double. When the warlord later dies the thief is forced to take up arms in his place."},
{"movie_name": "Kubo and the Two Strings", "movie_watch_time_min": 101, "movie_description": "A young boy named Kubo must locate a magical suit of armour worn by his late father in order to defeat a vengeful spirit from the past."},
{"movie_name": "Sardar Udham", "movie_watch_time_min": 164, "movie_description": "A biopic detailing the 2 decades that Punjabi Sikh revolutionary Udham Singh spent planning the assassination of the man responsible for the Jallianwala Bagh massacre."},
{"movie_name": "Paprika", "movie_watch_time_min": 90, "movie_description": "When a machine that allows therapists to enter their patients' dreams is stolen, all hell breaks loose. Only a young female therapist, Paprika, can stop it."},
{"movie_name": "After Hours", "movie_watch_time_min": 97, "movie_description": "An ordinary word processor has the worst night of his life after he agrees to visit a girl in Soho whom he met that evening at a coffee shop."},
{"movie_name": "Udta Punjab", "movie_watch_time_min": 148, "movie_description": "A story that revolves around drug abuse in the affluent north Indian State of Punjab and how the youth there have succumbed to it en-masse resulting in a socio-economic decline."},
{"movie_name": "Philomena", "movie_watch_time_min": 98, "movie_description": "A world-weary political journalist picks up the story of a woman's search for her son, who was taken away from her decades ago after she became pregnant and was forced to live in a convent."},
{"movie_name": "Neon Genesis Evangelion: The End of Evangelion", "movie_watch_time_min": 87, "movie_description": "Concurrent theatrical ending of the TV series Neon Genesis Evangelion (1995)."},
{"movie_name": "The Dirty Dozen", "movie_watch_time_min": 150, "movie_description": "During World War II, a rebellious U.S. Army Major is assigned a dozen convicted murderers to train and lead them into a mass assassination mission of German officers."},
{"movie_name": "Toy Story 3", "movie_watch_time_min": 103, "movie_description": "The toys are mistakenly delivered to a day-care center instead of the attic right before Andy leaves for college, and it's up to Woody to convince the other toys that they weren't abandoned and to return home."},
{"movie_name": "Edge of Tomorrow", "movie_watch_time_min": 113, "movie_description": "A soldier fighting aliens gets to relive the same day over and over again, the day restarting every time he dies."},
{"movie_name": "Some Like It Hot", "movie_watch_time_min": 121, "movie_description": "After two male musicians witness a mob hit, they flee the state in an all-female band disguised as women, but further complications set in."},
{"movie_name": "Snow White and the Seven Dwarfs", "movie_watch_time_min": 83, "movie_description": "Exiled into the dangerous forest by her wicked stepmother, a princess is rescued by seven dwarf miners who make her part of their household."},
{"movie_name": "It Happened One Night", "movie_watch_time_min": 105, "movie_description": "A renegade reporter trailing a young runaway heiress for a big story joins her on a bus heading from Florida to New York, and they end up stuck with each other when the bus leaves them behind at one of the stops."},
{"movie_name": "Nefes: Vatan Sagolsun", "movie_watch_time_min": 128, "movie_description": "Story of 40-man Turkish task force who must defend a relay station."},
{"movie_name": "This Is Spinal Tap", "movie_watch_time_min": 82, "movie_description": "Spinal Tap, one of England's loudest bands, is chronicled by film director Marty DiBergi on what proves to be a fateful tour."},
{"movie_name": "Let the Right One In", "movie_watch_time_min": 114, "movie_description": "Oskar, an overlooked and bullied boy, finds love and revenge through Eli, a beautiful but peculiar girl."}]
# Upload each precomputed multivector together with its metadata.
qdrant_client.upload_points(
    collection_name="movies",
    points=[
        models.PointStruct(
            id=idx,
            payload=metadata[idx],
            vector=vector
        )
        for idx, vector in enumerate(descriptions_embeddings)
    ],
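)

Upload with implicit embeddings computation (an alternative to the explicit upload above: the embeddings are computed from the models.Document objects instead of being passed in precomputed)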
description_documents = [models.Document(text=description, model=model_name) for description in descriptions]
qdrant_client.upload_points(
    collection_name="movies",
    points=[
        models.PointStruct(
            id=idx,
            payload=metadata[idx],
            vector=description_document
        )
        for idx, description_document in enumerate(description_documents)
    ],
)

Querying

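ColBERT uses two distinct methods for embedding documents and queries, and so does FastEmbed. However, we changed the query preprocessing used in ColBERT so that queries no longer have to be cut off at 32 tokens; longer queries can be passed in directly.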
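The difference can be illustrated with a toy sketch. It assumes the fixed-length scheme pads short queries with `[MASK]` tokens and cuts long ones at 32, and that the adjusted scheme pads in the same way but keeps longer queries whole; the padding behaviour of the adjusted scheme is an assumption of this sketch, since the text above only states that truncation at 32 tokens is no longer required. Both helper functions are hypothetical and ignore special tokens; they are not FastEmbed's implementation.

```python
from typing import List

MASK = "[MASK]"


def pad_and_truncate(tokens: List[str], length: int = 32) -> List[str]:
    """Fixed-length queries: pad short ones with [MASK], cut long ones."""
    return (tokens + [MASK] * length)[:length]


def pad_only(tokens: List[str], min_length: int = 32) -> List[str]:
    """Pad short queries up to min_length, but leave longer ones untouched."""
    return tokens + [MASK] * max(0, min_length - len(tokens))


short_query = ["a", "movie", "for", "kids"]
long_query = ["tok"] * 40

fixed_short = pad_and_truncate(short_query)  # padded to 32 tokens
fixed_long = pad_and_truncate(long_query)    # cut down to 32 tokens
kept_long = pad_only(long_query)             # all 40 tokens survive
```

In the second scheme every token of a long query still gets its own vector, so none of the query's content is dropped before scoring.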

qdrant_client.query_points(
    collection_name="movies",
    # query_embed returns a generator; take its single item, a numpy.ndarray of token vectors
    query=list(embedding_model.query_embed("A movie for kids with fantasy elements and wonders"))[0],
    limit=1,  # how many of the movies closest to the query to return
    # with_vectors=True,  # uncomment to also return the stored vectors
    with_payload=True,  # include the metadata in the output
)

Query points with implicit embeddings computation (an alternative to the explicit query above)
query_document = models.Document(text="A movie for kids with fantasy elements and wonders", model=model_name)
qdrant_client.query_points(
    collection_name="movies",
    query=query_document,
    limit=1,
)

The result is as follows.

QueryResponse(points=[ScoredPoint(id=4, version=0, score=12.063469,
payload={'movie_name': 'Kubo and the Two Strings', 'movie_watch_time_min': 101, 
'movie_description': 'A young boy named Kubo must locate a magical suit of armour worn by his late father in order to defeat a vengeful spirit from the past.'},
vector=None, shard_key=None, order_value=None)])