Calendar Day 5

Demo: Universal Queries for Hybrid Retrieval

In this hands-on demo, we'll build a research paper discovery system on an arXiv dataset, showcasing the full power of Qdrant's Universal Query API. You'll see how combining dense semantic search, sparse keyword matching, and ColBERT reranking helps researchers find exactly the papers they need - all in a single query.

Follow along in Colab: Open in Colab

The Challenge: Intelligent Research Discovery

Imagine you're a machine learning researcher looking for "transformer architectures for multimodal learning with attention mechanisms". You need to:

  1. Cast a wide net using semantic understanding of research concepts (dense vectors)
  2. Match technical terms like "transformer" and "attention" exactly (sparse vectors)
  3. Rerank intelligently using fine-grained text understanding (ColBERT)
  4. Apply research filters such as publication date, citation count, and research area

Traditionally, this required multiple searches across multiple systems, manual merging of results, and complex ranking logic. With the Universal Query API, it's a single declarative request.

Step 1: Create the Research Paper Collection

Initialize the Collection with Vector Configurations

First, let's set up a collection with three vector types - each serves a different purpose in our research discovery pipeline.

from datetime import datetime, timedelta

from qdrant_client import QdrantClient, models
import os

client = QdrantClient(url=os.getenv("QDRANT_URL"), api_key=os.getenv("QDRANT_API_KEY"))

# For Colab:
# from google.colab import userdata
# client = QdrantClient(url=userdata.get("QDRANT_URL"), api_key=userdata.get("QDRANT_API_KEY"))

collection_name = "research-papers"

# Clean state
if client.collection_exists(collection_name=collection_name):
    client.delete_collection(collection_name=collection_name)

# Create collection with three vector types
client.create_collection(
    collection_name=collection_name,
    vectors_config={
        # Dense vectors for semantic understanding of research concepts
        "dense": models.VectorParams(size=384, distance=models.Distance.COSINE),
        # ColBERT multivectors for fine-grained text understanding
        "colbert": models.VectorParams(
            size=128,
            distance=models.Distance.COSINE,
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM
            ),
        ),
    },
    sparse_vectors_config={
        # Sparse vectors for exact technical term matching
        "sparse": models.SparseVectorParams(
            index=models.SparseIndexParams(on_disk=False)
        )
    },
)
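
As a quick, optional sanity check you can read the collection info back and confirm that all three named vector configurations are registered (the attribute names below follow the qdrant-client response model):

# Optional: confirm the dense, colbert and sparse configurations are registered
info = client.get_collection(collection_name)
print(list(info.config.params.vectors.keys()))         # ['dense', 'colbert']
print(list(info.config.params.sparse_vectors.keys()))  # ['sparse']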

Create Payload Indexes

Before ingesting any data, we create payload indexes for the fields we'll filter on. Qdrant's flavor of HNSW integrates payload filtering directly into the search process for greater efficiency.

# Index fields that will be used for filtering
client.create_payload_index(
    collection_name=collection_name,
    field_name="research_area",
    field_schema="keyword",  # For filtering by domain (ML, CV, NLP)
)
client.create_payload_index(
    collection_name=collection_name,
    field_name="open_access",
    field_schema="bool",  # For filtering open access papers
)
client.create_payload_index(
    collection_name=collection_name,
    field_name="published_date",
    field_schema="datetime",
)
client.create_payload_index(
    collection_name=collection_name,
    field_name="impact_score",
    field_schema="float",
)
client.create_payload_index(
    collection_name=collection_name,
    field_name="citation_count",
    field_schema="integer",
)

Step 2: Prepare and Ingest Research Paper Data

With the collection configured with vectors and payload indexes, let's bring in some sample research papers.

sample_data = [
    {
        "title": "Zero-Shot Retrieval for Scalable Visual Search in a Two-Sided Marketplace",
        "authors": ["Andre Rusli", "Shoma Ishimoto", "Sho Akiyama", "Aman Kumar Singh"],
        "abstract": "Visual search offers an intuitive way for customers to explore diverse product catalogs, particularly in consumer-to-consumer (C2C) marketplaces where listings are often unstructured and visually driven. This paper presents a scalable visual search system deployed in Mercari's C2C marketplace...",
        "research_area": "computer_vision",
        "published_date": "2025-07-31",
        "impact_score": 0.78,
        "citation_count": 12,
        "open_access": True,
    },
    {
        "title": "TALI: Towards A Lightweight Information Retrieval Framework for Neural Search",
        "authors": ["Chaoqun Liu", "Yuanming Zhang", "Jianmin Zhang", "Jiawei Han"],
        "abstract": "Neural search systems have emerged as a promising approach to enhance user engagement in information retrieval. However, their high computational costs and memory usage have limited their widespread adoption. In this paper, we present TALI, a lightweight information retrieval framework for neural search that efficiently addresses these challenges...",
        "research_area": "machine_learning",
        "published_date": "2025-07-31",
        "impact_score": 0.78,
        "citation_count": 12,
        "open_access": True,
    },
]

texts = [it["abstract"] for it in sample_data]

We'll use FastEmbed to generate dense, sparse, and ColBERT embeddings for the abstracts, then upload everything to Qdrant.

from fastembed import TextEmbedding, SparseTextEmbedding, LateInteractionTextEmbedding

DENSE_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"  # 384-dim
SPARSE_MODEL_ID = "prithivida/Splade_PP_en_v1"  # SPLADE sparse
COLBERT_MODEL_ID = "colbert-ir/colbertv2.0"  # 128-dim multivector

dense_model = TextEmbedding(DENSE_MODEL_ID)
sparse_model = SparseTextEmbedding(SPARSE_MODEL_ID)
colbert_model = LateInteractionTextEmbedding(COLBERT_MODEL_ID)

dense_embeds = list(dense_model.embed(texts, parallel=0))
sparse_embeds = list(sparse_model.embed(texts, parallel=0))
colbert_multivectors = list(colbert_model.embed(texts, parallel=0))

points = []
for i, text in enumerate(texts):
    sparse_embed = sparse_embeds[i].as_object()
    dense_embed = dense_embeds[i]
    colbert_embed = colbert_multivectors[i]

    points.append(
        models.PointStruct(
            id=i,
            vector={
                "dense": dense_embed,
                "sparse": sparse_embed,
                "colbert": colbert_embed,
            },
            payload=sample_data[i],
        )
    )

client.upload_points(
    collection_name=collection_name,
    points=points,
)
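
A quick optional check that the sample papers landed in the collection - ask Qdrant for an exact point count:

# Optional: verify the ingest with an exact point count
print(client.count(collection_name=collection_name, exact=True).count)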

Step 3: Universal Queries in Action

Let's build a sophisticated research discovery query step by step. We'll orchestrate dense search, sparse search, RRF fusion, and ColBERT reranking in a single API call.

Prepare the Query Embeddings

First, we encode the research query with all three embedding models.

research_query = "transformer architectures for multimodal learning"

research_query_dense = next(dense_model.query_embed(research_query))
research_query_sparse = next(sparse_model.query_embed(research_query)).as_object()
research_query_colbert = next(colbert_model.query_embed(research_query))

We generate three different representations of the same query - each optimized for a different stage of our retrieval pipeline.
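
If you want to see what these representations actually look like, here is a small inspection sketch (it assumes the sparse embedding's as_object() dict exposes indices/values arrays, as FastEmbed's sparse embeddings do):

import numpy as np

# Shapes of the three query representations
print("dense:", np.asarray(research_query_dense).shape)        # (384,) - one vector per query
print("sparse terms:", len(research_query_sparse["indices"]))  # number of active vocabulary terms
print("colbert:", np.asarray(research_query_colbert).shape)    # (num_query_tokens, 128)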

Define a Global Filter with Automatic Propagation

Now we define the quality constraints that will apply across the whole search pipeline.

# Define global filter - this will be propagated to ALL prefetch stages
global_filter = models.Filter(
    must=[
        # Research domain filtering
        models.FieldCondition(
            key="research_area",
            match=models.MatchAny(any=[
                "machine_learning",
                "computer_vision",
                "nlp",
            ]),
        ),
        # Open access only
        models.FieldCondition(
            key="open_access",
            match=models.MatchValue(value=True)
        ),
        # Recent research only (last 6 years)
        models.FieldCondition(
            key="published_date",
            range=models.DatetimeRange(
                gte=(datetime.now() - timedelta(days=365 * 6)).isoformat()
            ),
        ),
        # High-impact papers
        models.FieldCondition(key="impact_score", range=models.Range(gte=0.6)),
        # Well-cited work
        models.FieldCondition(key="citation_count", range=models.Range(gte=5)),
    ]
)

Key insight: this filter is automatically propagated to all prefetch stages. Qdrant does no "late filtering" or "post-filtering" - the filter is applied at the HNSW search level for maximum efficiency, made possible by the payload indexes we created in Step 1.

Set Up Parallel Prefetch Queries

Next, we configure the hybrid retrieval with two concurrent searches.

# Prefetch queries - global filter will be automatically applied to both
hybrid_query = [
    # Dense retrieval: semantic understanding
    models.Prefetch(query=research_query_dense, using="dense", limit=100),
    # Sparse retrieval: exact technical term matching
    models.Prefetch(query=research_query_sparse, using="sparse", limit=100),
]

The two prefetch queries run in parallel:

  • Dense search: finds semantically similar papers based on research concepts - but only among papers that satisfy our global filter (ML/CV/NLP areas, open access, recent, high-impact, well-cited)
  • Sparse search: matches technical terms like "transformer", "multimodal", and "attention" exactly - but only within the same filtered subset
  • Both searches execute simultaneously for maximum speed
  • Filter propagation happens automatically - no manual coordination required

In total we can retrieve up to 200 candidates (100 per search), though many of them will overlap.
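If you are curious how much the two candidate sets overlap, here is an optional diagnostic sketch that runs each arm as its own filtered query and compares the returned IDs (the combined query below does not need this step; the query inputs are passed exactly as in the prefetch definitions above):

# Optional diagnostic: run each retrieval arm separately and compare the IDs
dense_ids = {p.id for p in client.query_points(
    collection_name=collection_name,
    query=research_query_dense,
    using="dense",
    query_filter=global_filter,
    limit=100,
).points}
sparse_ids = {p.id for p in client.query_points(
    collection_name=collection_name,
    query=research_query_sparse,
    using="sparse",
    query_filter=global_filter,
    limit=100,
).points}
print(f"dense only: {len(dense_ids - sparse_ids)}, "
      f"sparse only: {len(sparse_ids - dense_ids)}, "
      f"overlap: {len(dense_ids & sparse_ids)}")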

Add the Fusion Stage

Now we merge the two filtered candidate lists with Reciprocal Rank Fusion.

# Fusion stage - combines dense and sparse results
fusion_query = models.Prefetch(
    prefetch=hybrid_query,
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=100,
)

The RRF algorithm (see the sketch after this list):

  • Intelligently merges the two filtered candidate lists
  • Papers that rank highly in both lists receive higher scores
  • Produces a unified ranking that balances semantic similarity against technical-term relevance
  • All results still satisfy the global filter constraints
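
Qdrant computes the fusion server-side, but the idea fits in a few lines. A minimal illustrative sketch, using k=60 from the original RRF paper (an assumed constant - Qdrant's internal value may differ):

# Illustrative only - Qdrant performs RRF inside the fusion stage
def rrf(ranked_lists, k=60):
    scores = {}
    for ranking in ranked_lists:
        for rank, point_id in enumerate(ranking, start=1):
            scores[point_id] = scores.get(point_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Point 2 ranks well in both lists, so it wins after fusion
print(rrf([[0, 2, 1], [2, 1, 0]]))  # -> [2, 0, 1]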

Execute the Universal Query with ColBERT Reranking

Finally, we send the full query that ties everything together.

# The Universal Query: Global filter propagates through all stages
response = client.query_points(
    collection_name=collection_name,
    prefetch=fusion_query,
    query=research_query_colbert,
    using="colbert",
    query_filter=global_filter,  # Propagates to all prefetch stages
    limit=10,
    with_payload=True,
)

This final stage applies ColBERT reranking to the fused results (a short MaxSim sketch follows the list):

  • Token-level late-interaction scoring examines fine-grained text alignment
  • Every query token is compared against every abstract token
  • MaxSim aggregation finds the best concept alignments
  • The top 10 papers are returned with precise relevance scores
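
For intuition, MaxSim can be sketched in a few lines of NumPy. Qdrant performs this internally because we configured MultiVectorComparator.MAX_SIM; the sketch assumes the token vectors are L2-normalized, so the dot product equals cosine similarity:

import numpy as np

# Illustrative MaxSim: for each query token, take the best-matching document
# token similarity, then sum those maxima to get the document's score
def max_sim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    similarities = query_tokens @ doc_tokens.T    # (query_tokens, doc_tokens)
    return float(similarities.max(axis=1).sum())  # best doc token per query token

# e.g. score the first sample abstract against the research query
print(max_sim(np.asarray(research_query_colbert), np.asarray(colbert_multivectors[0])))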

Why this matters: by applying the filter at every stage (rather than after retrieval), Qdrant keeps precision high while avoiding wasted computation on papers that would be filtered out anyway.

Display the Results

# Display results
print("Top Research Papers:")
for i, hit in enumerate(response.points or [], 1):
    paper = hit.payload
    print(f"{i}. {paper['title']}")
    print(f"   Authors: {', '.join(paper['authors'][:3])}{'...' if len(paper['authors']) > 3 else ''}")
    print(f"   Published: {paper['published_date']} | Citations: {paper['citation_count']}")
    print(f"   Research Area: {paper['research_area']}")
    print(f"   Open Access: {paper['open_access']}")
    print(f"   Score: {hit.score:.4f}\n")

Voilà - a sophisticated, multi-stage research discovery system in a single declarative query!

Real arXiv Dataset Integration

Here's how you can populate the collection with real data (provided the arXiv endpoint isn't broken).

# ! pip install arxiv
import arxiv

arxiv_client = arxiv.Client()

search = arxiv.Search(
    query="transformer AND multimodal",
    max_results=2,
    sort_by=arxiv.SortCriterion.SubmittedDate,
)

points = []
for i, paper in enumerate(arxiv_client.results(search)):
    print(paper.title)
    abstract = paper.summary

    # Dense embedding of the abstract (semantic understanding)
    dense_vector = next(dense_model.embed(abstract))

    # Sparse SPLADE embedding for exact technical term matching
    sparse_vector = next(sparse_model.embed(abstract)).as_object()

    # ColBERT multivector for late-interaction reranking
    colbert_vector = next(colbert_model.embed(abstract))

    point = models.PointStruct(
        id=i,  # Qdrant point IDs must be integers or UUIDs; keep the arXiv ID in the payload
        payload={
            "arxiv_id": paper.entry_id,
            "title": paper.title,
            "authors": [author.name for author in paper.authors],
            "abstract": abstract,
            "published_date": paper.published.isoformat(),
            "citation_count": 0,  # Would need an external API (e.g. Semantic Scholar)
            "venue": "arXiv",
            "research_area": paper.primary_category,  # e.g. "cs.CV", "cs.CL"
            "impact_score": 0.0,  # Placeholder - arXiv does not provide one
            "open_access": True,
        },
        vector={
            "dense": dense_vector,
            "sparse": sparse_vector,
            "colbert": colbert_vector,
        },
    )
    points.append(point)

# Upload to Qdrant
client.upsert(collection_name=collection_name, points=points)
print(f"Uploaded {len(points)} research papers to collection")

Key Takeaways

  • Single request: sophisticated multi-stage research discovery in one API call
  • Parallel execution: dense and sparse searches run concurrently
  • Smart filtering: research quality filters are applied at the optimal stage
  • Real data: works with an actual arXiv dataset and research metadata
  • Production ready: scales to millions of papers with sub-second latency

The Universal Query API removes the complexity of building multi-stage retrieval systems. What used to require coordination between a semantic search engine, a keyword system, and a reranking model now happens in a single optimized request - a great fit for academic search, literature review, and research recommendation systems.

What's Next

In the next lesson, you'll build on this foundation to create a complete recommendation service, including real data ingestion and user profiles.