Demo: Universal Query for Hybrid Search
In this hands-on demo, we'll build a research paper discovery system on the arXiv dataset to showcase the full power of Qdrant's Universal Query API. You'll see how dense semantic search, sparse keyword matching, and ColBERT reranking combine to help researchers find exactly the papers they need - all in a single query.
The Challenge: Intelligent Research Discovery
Imagine you're a machine learning researcher looking for "transformer architectures for multimodal learning with attention mechanisms". You need to:
- Cast a wide net using semantic understanding of research concepts (dense vectors)
- Match technical terms such as "transformer" and "attention" exactly (sparse vectors)
- Rerank intelligently using fine-grained text understanding (ColBERT)
- Apply research filters such as publication date, citation count, and research area
Traditionally, this would mean multiple searches across several systems, manual result merging, and complex ranking logic. With the Universal Query API, it's a single declarative request.
Step 1: Create the Research Paper Collection
Initialize the Collection with Vector Configurations
First, let's set up a collection with three vector types - each serving a distinct purpose in our research discovery pipeline
from datetime import datetime, timedelta
from qdrant_client import QdrantClient, models
import os
client = QdrantClient(url=os.getenv("QDRANT_URL"), api_key=os.getenv("QDRANT_API_KEY"))
# For Colab:
# from google.colab import userdata
# client = QdrantClient(url=userdata.get("QDRANT_URL"), api_key=userdata.get("QDRANT_API_KEY"))
collection_name = "research-papers"
# Clean state
if client.collection_exists(collection_name=collection_name):
    client.delete_collection(collection_name=collection_name)
# Create collection with three vector types
client.create_collection(
    collection_name=collection_name,
    vectors_config={
        # Dense vectors for semantic understanding of research concepts
        "dense": models.VectorParams(size=384, distance=models.Distance.COSINE),
        # ColBERT multivectors for fine-grained text understanding
        "colbert": models.VectorParams(
            size=128,
            distance=models.Distance.COSINE,
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM
            ),
        ),
    },
    sparse_vectors_config={
        # Sparse vectors for exact technical term matching
        "sparse": models.SparseVectorParams(
            index=models.SparseIndexParams(on_disk=False)
        )
    },
)
Create Payload Indexes
Before ingesting any data, we create payload indexes for the fields we'll filter on. Qdrant's filterable HNSW integrates payload filtering directly into the search process for greater efficiency.
# Index fields that will be used for filtering
client.create_payload_index(
    collection_name=collection_name,
    field_name="research_area",
    field_schema="keyword",  # For filtering by domain (ML, CV, NLP)
)
client.create_payload_index(
    collection_name=collection_name,
    field_name="open_access",
    field_schema="bool",  # For filtering open access papers
)
client.create_payload_index(
    collection_name=collection_name,
    field_name="published_date",
    field_schema="datetime",
)
client.create_payload_index(
    collection_name=collection_name,
    field_name="impact_score",
    field_schema="float",
)
client.create_payload_index(
    collection_name=collection_name,
    field_name="citation_count",
    field_schema="integer",
)
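You can confirm the indexes registered with a quick check (the field names are the ones created above):
schema = client.get_collection(collection_name).payload_schema
print(sorted(schema.keys()))
# -> ['citation_count', 'impact_score', 'open_access', 'published_date', 'research_area']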
Step 2: Prepare and Ingest Research Paper Data
Now that the collection is configured with vectors and payload indexes, let's bring in some sample research papers
sample_data = [
    {
        "title": "Zero-Shot Retrieval for Scalable Visual Search in a Two-Sided Marketplace",
        "authors": ["Andre Rusli", "Shoma Ishimoto", "Sho Akiyama", "Aman Kumar Singh"],
        "abstract": "Visual search offers an intuitive way for customers to explore diverse product catalogs, particularly in consumer-to-consumer (C2C) marketplaces where listings are often unstructured and visually driven. This paper presents a scalable visual search system deployed in Mercari's C2C marketplace...",
        "research_area": "computer_vision",
        "published_date": "2025-07-31",
        "impact_score": 0.78,
        "citation_count": 12,
        "open_access": True,
    },
    {
        "title": "TALI: Towards A Lightweight Information Retrieval Framework for Neural Search",
        "authors": ["Chaoqun Liu", "Yuanming Zhang", "Jianmin Zhang", "Jiawei Han"],
        "abstract": "Neural search systems have emerged as a promising approach to enhance user engagement in information retrieval. However, their high computational costs and memory usage have limited their widespread adoption. In this paper, we present TALI, a lightweight information retrieval framework for neural search that efficiently addresses these challenges...",
        "research_area": "machine_learning",
        "published_date": "2025-07-31",
        "impact_score": 0.78,
        "citation_count": 12,
        "open_access": True,
    },
]
texts = [it["abstract"] for it in sample_data]
We'll use FastEmbed to generate dense, sparse, and ColBERT embeddings for the abstracts, then upload everything to Qdrant
from fastembed import TextEmbedding, SparseTextEmbedding, LateInteractionTextEmbedding
DENSE_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2" # 384-dim
SPARSE_MODEL_ID = "prithivida/Splade_PP_en_v1" # SPLADE sparse
COLBERT_MODEL_ID = "colbert-ir/colbertv2.0" # 128-dim multivector
dense_model = TextEmbedding(DENSE_MODEL_ID)
sparse_model = SparseTextEmbedding(SPARSE_MODEL_ID)
colbert_model = LateInteractionTextEmbedding(COLBERT_MODEL_ID)
dense_embeds = list(dense_model.embed(texts, parallel=0))
sparse_embeds = list(sparse_model.embed(texts, parallel=0))
colbert_multivectors = list(colbert_model.embed(texts, parallel=0))
points = []
for i, text in enumerate(texts):
    sparse_embed = sparse_embeds[i].as_object()
    dense_embed = dense_embeds[i]
    colbert_embed = colbert_multivectors[i]
    points.append(
        models.PointStruct(
            id=i,
            vector={
                "dense": dense_embed,
                "sparse": sparse_embed,
                "colbert": colbert_embed,
            },
            payload=sample_data[i],
        )
    )
client.upload_points(
    collection_name=collection_name,
    points=points,
)
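As a quick sanity check, we can confirm the points landed (note that upload_points batches in the background, so the count may lag briefly):
print(client.count(collection_name=collection_name).count)  # -> 2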
Step 3: Universal Query in Action
Let's build a sophisticated research discovery query step by step. We'll orchestrate dense search, sparse search, RRF fusion, and ColBERT reranking in a single API call.
Prepare the Query Embeddings
First, we encode the research query with all three embedding models
research_query = "transformer architectures for multimodal learning"
research_query_dense = next(dense_model.query_embed(research_query))
research_query_sparse = next(sparse_model.query_embed(research_query)).as_object()
research_query_colbert = next(colbert_model.query_embed(research_query))
We generate three different representations of the same query - each optimized for a different stage of the retrieval pipeline.
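A quick look at how the three representations differ (a minimal sketch; the dimensions follow from the models chosen above):
print(len(research_query_dense))              # 384 dense dimensions
print(len(research_query_sparse["indices"]))  # number of active SPLADE terms
print(research_query_colbert.shape)           # (num_query_tokens, 128)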
Define a Global Filter with Automatic Propagation
Now we define the quality constraints that will apply across the entire search pipeline
# Define global filter - this will be propagated to ALL prefetch stages
global_filter = models.Filter(
    must=[
        # Research domain filtering
        models.FieldCondition(
            key="research_area",
            match=models.MatchAny(any=[
                "machine_learning",
                "computer_vision",
                "nlp",
            ]),
        ),
        # Open access only
        models.FieldCondition(
            key="open_access",
            match=models.MatchValue(value=True)
        ),
        # Recent research only (last 6 years)
        models.FieldCondition(
            key="published_date",
            range=models.DatetimeRange(
                gte=(datetime.now() - timedelta(days=365 * 6)).isoformat()
            ),
        ),
        # High-impact papers
        models.FieldCondition(key="impact_score", range=models.Range(gte=0.6)),
        # Well-cited work
        models.FieldCondition(key="citation_count", range=models.Range(gte=5)),
    ]
)
Key insight: this filter is automatically propagated to every prefetch stage. Qdrant does no "late filtering" or "post-filtering" - the filter is applied at the HNSW search level for maximum efficiency, enabled by the payload indexes we created in Step 1.
Set Up the Parallel Prefetch Queries
Next, we configure hybrid retrieval with two concurrent searches
# Prefetch queries - global filter will be automatically applied to both
hybrid_query = [
    # Dense retrieval: semantic understanding
    models.Prefetch(query=research_query_dense, using="dense", limit=100),
    # Sparse retrieval: exact technical term matching
    models.Prefetch(query=research_query_sparse, using="sparse", limit=100),
]
These two prefetch queries run in parallel:
- Dense search finds semantically similar papers based on research concepts - but only among papers that satisfy our global filter (ML/CV/NLP domains, open access, recent, high-impact, well-cited)
- Sparse search matches technical terms such as "transformer", "multimodal", and "attention" exactly - restricted to the same filtered subset
- Both searches execute concurrently for maximum speed
- Filter propagation happens automatically - no manual coordination needed (see the sketch below)
In total we retrieve up to 200 candidates (100 per search), though many will overlap.
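For contrast, here is roughly what you would write without propagation, attaching the same filter to each branch by hand (a sketch - Qdrant does this for you when the filter is passed at the top level of the query):
explicit_prefetch = [
    models.Prefetch(query=research_query_dense, using="dense",
                    filter=global_filter, limit=100),
    models.Prefetch(query=research_query_sparse, using="sparse",
                    filter=global_filter, limit=100),
]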
Add the Fusion Stage
Now we merge the two filtered candidate lists with Reciprocal Rank Fusion (RRF)
# Fusion stage - combines dense and sparse results
fusion_query = models.Prefetch(
    prefetch=hybrid_query,
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=100,
)
The RRF algorithm:
- Merges the two filtered candidate lists intelligently (a minimal sketch follows this list)
- Gives higher scores to papers ranked highly in both lists
- Produces a unified ranking that balances semantic similarity against technical term relevance
- Keeps all results within the global filter constraints
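To make the scoring concrete, here is a pure-Python sketch of RRF (illustrative only - Qdrant computes this server-side when you request models.Fusion.RRF; the smoothing constant k=60 is the commonly used default, assumed here):
def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    # Each candidate scores sum(1 / (k + rank)) over every list it
    # appears in, so agreement between lists is rewarded.
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, point_id in enumerate(ranking, start=1):
            scores[point_id] = scores.get(point_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

# Point 3 (ranked #2 and #1) beats point 7 (ranked #1 and #3):
print(rrf([[7, 3, 9], [3, 9, 7]]))  # -> [3, 7, 9]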
Execute the Universal Query with ColBERT Reranking
Finally, we send the complete query that ties everything together
# The Universal Query: Global filter propagates through all stages
response = client.query_points(
    collection_name=collection_name,
    prefetch=fusion_query,
    query=research_query_colbert,
    using="colbert",
    query_filter=global_filter,  # Propagates to all prefetch stages
    limit=10,
    with_payload=True,
)
This final stage applies ColBERT reranking to the fused results (a MaxSim sketch follows this list):
- Token-level late-interaction scoring captures fine-grained text alignment
- Every query token is compared against every abstract token
- MaxSim aggregation finds the best conceptual alignments
- The top 10 papers are returned with precise relevance scores
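Here is a minimal NumPy sketch of the MaxSim operation (illustrative only - Qdrant applies it server-side via MultiVectorComparator.MAX_SIM; the token counts below are made up):
import numpy as np

def max_sim(query_tokens, doc_tokens):
    # Cosine similarity reduces to a dot product on unit-normalized rows
    sim = query_tokens @ doc_tokens.T  # shape: (q_tokens, d_tokens)
    # Keep each query token's best-matching document token, then sum
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 128))
q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(40, 128))
d /= np.linalg.norm(d, axis=1, keepdims=True)
print(max_sim(q, d))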
Why this matters: by applying the filter at every stage rather than after retrieval, Qdrant maintains high precision while avoiding wasted computation on papers that would be filtered out anyway.
Display the Results
# Display results
print("Top Research Papers:")
for i, hit in enumerate(response.points or [], 1):
    paper = hit.payload
    print(f"{i}. {paper['title']}")
    print(f"   Authors: {', '.join(paper['authors'][:3])}{'...' if len(paper['authors']) > 3 else ''}")
    print(f"   Published: {paper['published_date']} | Citations: {paper['citation_count']}")
    print(f"   Research Area: {paper['research_area']}")
    print(f"   Open Access: {paper['open_access']}")
    print(f"   Score: {hit.score:.4f}\n")
Voilà - a sophisticated multi-stage research discovery system in a single declarative query!
Real arXiv Dataset Integration
Here's how you could populate the collection with real data (assuming the arXiv API endpoint is reachable)
# ! pip install arxiv
import arxiv

arxiv_client = arxiv.Client()
search = arxiv.Search(
    query="transformer AND multimodal",
    max_results=2,
    sort_by=arxiv.SortCriterion.SubmittedDate,
)
points = []
for i, paper in enumerate(arxiv_client.results(search)):
    print(paper.title)
    # Dense embedding from the abstract
    dense_vector = next(dense_model.embed(paper.summary))
    # Sparse (SPLADE) embedding from the abstract
    sparse_vector = next(sparse_model.embed(paper.summary)).as_object()
    # ColBERT multivector from the abstract
    colbert_vector = next(colbert_model.embed(paper.summary))
    point = models.PointStruct(
        id=i,  # Point IDs must be integers or UUIDs; the arXiv ID lives in paper.entry_id
        payload={
            "title": paper.title,
            "authors": [author.name for author in paper.authors],
            "abstract": paper.summary,
            "published_date": paper.published.isoformat(),
            "citation_count": 0,  # Would need an external API (e.g. Semantic Scholar)
            "venue": "arXiv",
            "research_area": paper.primary_category,  # arXiv category (e.g. "cs.CV") - differs from the sample labels above
            "impact_score": 0.0,  # Placeholder - not provided by arXiv
            "open_access": True,
        },
        vector={
            "dense": dense_vector,
            "sparse": sparse_vector,
            "colbert": colbert_vector,
        },
    )
    points.append(point)

# Upload to Qdrant
client.upsert(collection_name=collection_name, points=points)
print(f"Uploaded {len(points)} research papers to collection")
Key Takeaways
- Single request: sophisticated multi-stage research discovery in one API call
- Parallel execution: dense and sparse searches run concurrently
- Smart filtering: research quality filters are applied at the optimal stage
- Real data: works with actual arXiv data and research metadata
- Production-ready: scales to millions of papers with sub-second latency
The Universal Query API removes the complexity of building multi-stage retrieval systems. What once required coordinating a semantic search engine, a keyword system, and a reranking model now happens in one optimized request - a great fit for academic search, literature review, and research recommendation systems.
Next Steps
In the next lesson, you'll build on this foundation to create a complete recommendation service, including real data ingestion and user profiles.