多向量后处理

FastEmbed 的后处理模块提供了一系列用于在生成向量后进行转换和优化的技术。这些后处理方法可以提高搜索性能、降低存储需求,或使向量适应特定的应用场景。

目前,后处理模块包含用于加速多向量检索的 MUVERA(多向量检索算法)。更多后处理技术计划在未来版本中加入。

MUVERA

MUVERA 将变长向量序列转换为固定维度的单向量表示。这些近似值可用于使用 HNSW 等传统向量搜索方法进行快速的初始检索。一旦快速检索到一小部分候选集,您就可以使用原始的多向量表示对它们进行重排序,从而获得最大的准确性。

这种混合方法结合了单向量搜索的速度和多向量检索的准确性。MUVERA 无需使用昂贵的多向量相似度计算来比较集合中的所有文档,而是让您可以

  1. 快速检索:使用 MUVERA 向量并通过带有过采样的传统向量搜索找到顶级候选对象
  2. 精确重排序:仅针对此小型候选集应用带有 MaxSim 的多向量相似度计算

这种方式的代价是增加了存储需求,因为您需要同时存储 MUVERA 向量和原始的多向量表示。然而,性能的提升使得该方案对于拥有大型文档集合的生产系统非常实用,并且其他技术(如卸载至磁盘)可能有助于降低成本。

有关 MUVERA 工作原理的详细技术说明,请参阅我们的文章:MUVERA:让多向量性能更强

在 FastEmbed 中使用 MUVERA

本教程演示如何在玩具数据集上使用 MUVERA 进行快速检索以及 ColBERT 重排序。如果您不熟悉多向量嵌入,我们建议先阅读如何使用 FastEmbed 生成 ColBERT 多向量

设置

安装 FastEmbed 0.7.2 或更高版本以使用 MUVERA 后处理功能。

pip install "fastembed>=0.7.2"

导入用于晚期交互(late interaction)向量和 MUVERA 后处理的必要模块。

from fastembed import LateInteractionTextEmbedding
from fastembed.postprocess import Muvera

加载模型并创建 MUVERA 处理器

加载 ColBERT 模型并将其包装在 MUVERA 处理器中。

model_name = "colbert-ir/colbertv2.0"
model = LateInteractionTextEmbedding(model_name=model_name)

# Create MUVERA processor with recommended parameters
muvera = Muvera.from_multivector_model(
    model=model,
    k_sim=6,
    dim_proj=32,
    r_reps=20
)

MUVERA 参数控制生成向量的大小和质量。这些推荐值平衡了速度和准确性。k_sim 参数决定簇的数量 (2^6 = 64),dim_proj 设置投影维度,r_reps 指定为了鲁棒性而进行的重复次数。

使用 ColBERT 嵌入数据

我们将使用一个电影描述玩具数据集来演示 MUVERA。

电影描述数据集
descriptions = [
    "In 1431, Jeanne d'Arc is placed on trial on charges of heresy. The ecclesiastical jurists attempt to force Jeanne to recant her claims of holy visions.",
    "A film projectionist longs to be a detective, and puts his meagre skills to work when he is framed by a rival for stealing his girlfriend's father's pocketwatch.",
    "A group of high-end professional thieves start to feel the heat from the LAPD when they unknowingly leave a clue at their latest heist.",
    "A petty thief with an utter resemblance to a samurai warlord is hired as the lord's double. When the warlord later dies the thief is forced to take up arms in his place.",
    "A young boy named Kubo must locate a magical suit of armour worn by his late father in order to defeat a vengeful spirit from the past.",
    "A biopic detailing the 2 decades that Punjabi Sikh revolutionary Udham Singh spent planning the assassination of the man responsible for the Jallianwala Bagh massacre.",
    "When a machine that allows therapists to enter their patients' dreams is stolen, all hell breaks loose. Only a young female therapist, Paprika, can stop it.",
    "An ordinary word processor has the worst night of his life after he agrees to visit a girl in Soho whom he met that evening at a coffee shop.",
    "A story that revolves around drug abuse in the affluent north Indian State of Punjab and how the youth there have succumbed to it en-masse resulting in a socio-economic decline.",
    "A world-weary political journalist picks up the story of a woman's search for her son, who was taken away from her decades ago after she became pregnant and was forced to live in a convent.",
    "Concurrent theatrical ending of the TV series Neon Genesis Evangelion (1995).",
    "During World War II, a rebellious U.S. Army Major is assigned a dozen convicted murderers to train and lead them into a mass assassination mission of German officers.",
    "The toys are mistakenly delivered to a day-care center instead of the attic right before Andy leaves for college, and it's up to Woody to convince the other toys that they weren't abandoned and to return home.",
    "A soldier fighting aliens gets to relive the same day over and over again, the day restarting every time he dies.",
    "After two male musicians witness a mob hit, they flee the state in an all-female band disguised as women, but further complications set in.",
    "Exiled into the dangerous forest by her wicked stepmother, a princess is rescued by seven dwarf miners who make her part of their household.",
    "A renegade reporter trailing a young runaway heiress for a big story joins her on a bus heading from Florida to New York, and they end up stuck with each other when the bus leaves them behind at one of the stops.",
    "Story of 40-man Turkish task force who must defend a relay station.",
    "Spinal Tap, one of England's loudest bands, is chronicled by film director Marty DiBergi on what proves to be a fateful tour.",
    "Oskar, an overlooked and bullied boy, finds love and revenge through Eli, a beautiful but peculiar girl."
]

为电影描述生成多向量嵌入。

descriptions_embeddings = list(model.embed(descriptions))

让我们检查其中一个多向量嵌入的形状。

descriptions_embeddings[0].shape

第一个文档由 33 个 token 组成,每个 token 由一个 128 维的向量表示。

(33, 128)

使用 MUVERA 处理

现在,将多向量嵌入转换为 MUVERA 的单向量表示。

muvera_embeddings = [
    muvera.process_document(emb) for emb in descriptions_embeddings
]

让我们检查一个 MUVERA 向量的形状。

muvera_embeddings[0].shape

MUVERA 的维度取决于方法的配置,在我们的例子中,它会非常高。

(40960,)

MUVERA 嵌入是一个 40,960 维的单向量。虽然这比典型的稠密向量(通常为几百到几千维)要大,但它的搜索速度比原始的多向量表示要快得多,因为 HNSW 等传统向量搜索索引是针对单向量相似度优化的。

上传至 Qdrant

安装 qdrant-client

pip install "qdrant-client>=1.14.2"

在此示例中,我们将使用在 Docker 容器中本地运行的 Qdrant。或者,您也可以使用 Qdrant Cloud 中的免费集群

from qdrant_client import QdrantClient, models

client = QdrantClient("https://:6333")

创建一个集合,使用命名向量同时存储 MUVERA 向量和原始多向量表示。

client.create_collection(
    collection_name="movies-muvera",
    vectors_config={
        "muvera": models.VectorParams(
            size=muvera.embedding_size,  # Depends on the MUVERA configuration
            distance=models.Distance.COSINE
        ),
        "colbert": models.VectorParams(
            size=model.embedding_size,  # Model-specific
            distance=models.Distance.COSINE,
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM
            )
        )
    }
)

将两种表示都上传到集合中。

client.upload_points(
    collection_name="movies-muvera",
    points=[
        models.PointStruct(
            id=idx,
            payload={"description": description},
            vector={
                "muvera": muvera_emb,
                "colbert": colbert_emb
            }
        )
        for idx, (description, muvera_emb, colbert_emb) in enumerate(
            zip(descriptions, muvera_embeddings, descriptions_embeddings)
        )
    ]
)

混合搜索:MUVERA 检索 + ColBERT 重排序

现在让我们使用混合方法进行搜索。Qdrant 通过 prefetch 参数支持多阶段查询,这使我们能够在单个查询中结合 MUVERA 的快速检索和 ColBERT 的精确重评分。

首先,以两种格式创建查询向量。

query = "A movie for kids with fantasy elements and wonders"
query_multivec = list(model.query_embed(query))[0]
query_muvera = muvera.process_query(query_multivec)

现在使用 Qdrant 的原生多阶段搜索执行两阶段查询

results = client.query_points(
    collection_name="movies-muvera",
    prefetch=models.Prefetch(
        query=query_muvera,
        using="muvera",
        limit=20,  # Stage 1: Retrieve 20 candidates with MUVERA (fast)
    ),
    query=query_multivec,  # Stage 2: Rescore with ColBERT multi-vector (accurate)
    using="colbert",
    limit=3,  # Return top 3 results after rescoring
    with_payload=True
)

prefetch 参数使用 MUVERA 检索候选对象,然后主 query 使用 ColBERT 的多向量表示对这些候选对象进行重评分。Qdrant 会自动处理多向量相似度的 MaxSim 计算。

显示结果。

for i, point in enumerate(results.points, 1):
    print(f'Result {i}: "{point.payload["description"]}"')
    print(f"Score: {point.score:.2f}\n")

这是在我们使用的玩具数据集中它们应该呈现的样子

Result 1: "A young boy named Kubo must locate a magical suit of armour worn by his late father in order to defeat a vengeful spirit from the past."
Score: 12.06

Result 2: "Oskar, an overlooked and bullied boy, finds love and revenge through Eli, a beautiful but peculiar girl."
Score: 10.75

Result 3: "When a machine that allows therapists to enter their patients' dreams is stolen, all hell breaks loose. Only a young female therapist, Paprika, can stop it."
Score: 10.04

这种混合方法在保持与整个集合的全多向量搜索几乎相同的准确性的同时,速度要快得多。昂贵的多向量相似度计算仅应用于由 MUVERA 检索出的 20 个候选对象,而不是集合中的所有文档。

结论

MUVERA 后处理通过创建易于搜索的多向量近似值,实现了实用的超大规模多向量搜索。推荐的方法是将 MUVERA 与 Qdrant 的原生多阶段查询能力相结合

  1. 快速检索:使用带有 prefetch 的 MUVERA 向量来检索候选文档
  2. 精确重排序:Qdrant 自动使用 ColBERT 多向量对候选对象进行重评分

通过将昂贵的多向量计算限制在仅由 MUVERA 检索到的候选集上,这种混合模式可以高效地扩展到大型集合,同时保持与纯多向量搜索几乎相同的搜索质量。代价是增加了存储空间,因为您需要在集合中维护两种表示。

MUVERA 对于拥有大型文档集合的生产系统特别有价值,否则多向量搜索对于第一阶段检索来说会太慢。FastEmbed 的 MUVERA 后处理与 Qdrant 的多阶段查询的结合,提供了一个无缝且高性能的解决方案。

通过 pip install --upgrade fastembed 升级到 FastEmbed 0.7.2 或更高版本,立即开始使用 MUVERA 后处理。

此页面有用吗?

感谢您的反馈!🙏

很遗憾听到这个消息。😔 您可以在 GitHub 上编辑此页面,或创建一个 GitHub issue。