Project: Build a Hybrid Search Engine
Build a hybrid system that combines dense and sparse vectors using Reciprocal Rank Fusion (RRF), demonstrating how to pair semantic understanding with keyword precision for the best results.
Your Task
Create a production-ready hybrid search system that leverages both dense and sparse vectors to deliver superior search results. You will implement the complete hybrid pipeline and compare its performance against single-vector approaches.
Estimated time: 75 minutes
What You Will Build
A hybrid search system that demonstrates:
- Dense vector search for semantic understanding
- Sparse vector search for exact keyword matching
- Reciprocal Rank Fusion to combine results intelligently
- Performance comparisons between hybrid and single-vector approaches
- Domain-specific tuning for your particular use case
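Before wiring RRF into Qdrant, it helps to see what the fusion step does on its own: each ranked list contributes 1 / (k + rank) for every document it ranks, and the fused score is the sum across lists. A minimal sketch (the document IDs are illustrative, and k=60 is just a common default, not tied to Qdrant's internals):

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked lists of doc IDs into one RRF-ordered list."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1/(k + rank); k damps the top-rank bonus
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["doc_a", "doc_b", "doc_c"]   # e.g., from dense search
sparse_ranking = ["doc_b", "doc_d", "doc_a"]  # e.g., from sparse search
print(rrf_fuse([dense_ranking, sparse_ranking]))
# → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Note that doc_b wins by appearing near the top of both lists, even though neither list ranked it first overall; this is exactly the behavior you will get from Qdrant's built-in RRF in Step 3.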
Setup
Prerequisites
- A Qdrant Cloud cluster (URL + API key)
- Python 3.9+ (or Google Colab)
- Packages: qdrant-client, sentence-transformers
Models
- Dense encoder: sentence-transformers/all-MiniLM-L6-v2 (384-dim)
Dataset
A small domain dataset (e.g., 100-500 items) with at least:
- title (string)
- description (string) — used by both the dense and sparse encoders
- optional metadata fields for later filtering
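The code in the following steps reads from a `your_dataset` variable; a minimal illustrative sketch of its expected shape (the records here are invented placeholders to be replaced with your own 100-500 items):

```python
# Illustrative placeholder for `your_dataset` used in the steps below.
# Field names match the dataset requirements above; the contents are made up.
your_dataset = [
    {
        "title": "Wireless noise-cancelling headphones",
        "description": "Over-ear Bluetooth headphones with active noise cancellation.",
        "category": "audio",        # optional metadata for later filtering
    },
    {
        "title": "Mechanical keyboard",
        "description": "Compact 75% keyboard with hot-swappable switches.",
        "category": "peripherals",
    },
]
```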
Build Steps
Step 1: Set Up the Hybrid Collection
Building on your earlier work, create a collection that holds both dense and sparse vectors:
from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer
import time

client = QdrantClient(
    "https://your-cluster-url.cloud.qdrant.io",
    api_key="your-api-key",
)
collection_name = "day3_hybrid_search"
# Create hybrid collection with a named dense vector and a named sparse vector
client.create_collection(
    collection_name=collection_name,
    vectors_config={
        "dense": models.VectorParams(size=384, distance=models.Distance.COSINE)
    },
    sparse_vectors_config={
        "sparse": models.SparseVectorParams(
            index=models.SparseIndexParams(on_disk=False)
        )
    },
)
Step 2: Implement Dense and Sparse Encoding
# Dense embeddings
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Global vocabulary - automatically extends as new texts are processed
global_vocabulary = {}

# Simple sparse encoding (raw term-frequency counts; a lightweight stand-in for BM25)
def create_sparse_vector(text):
    """Create sparse vector from text using term frequency"""
    from collections import Counter
    import re

    # Simple tokenization
    words = re.findall(r"\b\w+\b", text.lower())
    word_counts = Counter(words)

    # Convert to sparse vector format, extending vocabulary as needed
    indices = []
    values = []
    for word, count in word_counts.items():
        if word not in global_vocabulary:
            # Add new word to vocabulary with next available index
            global_vocabulary[word] = len(global_vocabulary)
        indices.append(global_vocabulary[word])
        values.append(float(count))
    return models.SparseVector(indices=indices, values=values)
# Upload hybrid data
points = []
for i, item in enumerate(your_dataset):
    dense_vector = encoder.encode(item["description"]).tolist()
    sparse_vector = create_sparse_vector(item["description"])
    points.append(models.PointStruct(
        id=i,
        # Named dense and sparse vectors both go in the same `vector` mapping
        vector={"dense": dense_vector, "sparse": sparse_vector},
        payload=item,
    ))
client.upload_points(collection_name=collection_name, points=points)
Step 3: Implement Hybrid Search with RRF
def hybrid_search_with_rrf(query_text, limit=10):
    """Perform hybrid search using Reciprocal Rank Fusion"""
    # Encode query for both dense and sparse
    query_dense = encoder.encode(query_text).tolist()
    query_sparse = create_sparse_vector(query_text)

    # Use Qdrant's built-in RRF fusion over two prefetch branches
    response = client.query_points(
        collection_name=collection_name,
        prefetch=[
            models.Prefetch(
                query=query_dense,
                using="dense",
                limit=20,
            ),
            models.Prefetch(
                query=query_sparse,
                using="sparse",
                limit=20,
            ),
        ],
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        limit=limit,
    )
    return response.points

# Test hybrid search
results = hybrid_search_with_rrf("your test query")
for i, point in enumerate(results, 1):
    print(f"{i}. {point.payload.get('title', 'No title')} (Score: {point.score:.3f})")
Step 4: Compare Search Methods
def compare_search_methods(query_text):
    """Compare dense, sparse, and hybrid search results"""
    print(f"Query: '{query_text}'\n")

    # Dense-only search
    dense_results = client.query_points(
        collection_name=collection_name,
        query=encoder.encode(query_text).tolist(),
        using="dense",
        limit=5,
    )

    # Sparse-only search
    sparse_results = client.query_points(
        collection_name=collection_name,
        query=create_sparse_vector(query_text),
        using="sparse",
        limit=5,
    )

    # Hybrid search
    hybrid_results = hybrid_search_with_rrf(query_text, limit=5)

    print("DENSE SEARCH:")
    for i, point in enumerate(dense_results.points, 1):
        print(f"  {i}. {point.payload.get('title', 'No title')} ({point.score:.3f})")

    print("\nSPARSE SEARCH:")
    for i, point in enumerate(sparse_results.points, 1):
        print(f"  {i}. {point.payload.get('title', 'No title')} ({point.score:.3f})")

    print("\nHYBRID SEARCH (RRF):")
    for i, point in enumerate(hybrid_results, 1):
        print(f"  {i}. {point.payload.get('title', 'No title')} ({point.score:.3f})")
    print("-" * 50)

# Test with different query types
test_queries = [
    "exact keyword match query",
    "semantic concept query",
    "mixed keyword and concept query",
]
for query in test_queries:
    compare_search_methods(query)
Step 5: Analyze Your Results
Use the output from Step 4 to evaluate how the hybrid approach compares to each single-vector method. Focus on where hybrid corrects dense failures (rare keywords, exact identifiers) and where sparse fails (synonyms / semantic paraphrases). Optionally, time each method (dense/sparse/hybrid) over a few queries and record the average latency.
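To report latency as an average plus a P95, a minimal sketch of a timing helper is shown below; `search_fn` is any zero-argument callable that runs one search (wrap your dense, sparse, or hybrid call in a lambda), and the helper name itself is illustrative:

```python
import time

def measure_latency(search_fn, iterations=20):
    """Run search_fn repeatedly and return (average, P95) latency in ms."""
    times_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        search_fn()
        times_ms.append((time.perf_counter() - start) * 1000)
    times_ms.sort()
    avg = sum(times_ms) / len(times_ms)
    # Nearest-rank P95 on the sorted sample
    p95 = times_ms[int(0.95 * (len(times_ms) - 1))]
    return avg, p95
```

For example, `measure_latency(lambda: hybrid_search_with_rrf("your test query"))` gives the two numbers requested in the sharing template further down.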
Success Criteria
You will know you have succeeded when:
- Your hybrid collection contains both dense and sparse vectors
- You can run searches with the dense, sparse, and hybrid approaches
- RRF fusion effectively combines results from both vector types
- You can demonstrate cases where hybrid search outperforms single-vector approaches
- You understand the trade-offs between the different search approaches for your domain
Share Your Findings
Step 1: Reflect on your findings
- When did hybrid search beat pure dense or pure sparse search (give concrete query types)?
- How did RRF change the rankings compared to each method on its own?
- How did latency compare across dense, sparse, and hybrid (average + P95)?
- How did your sparse encoding choice (e.g., TF-IDF/BM25/SPLADE) affect the results?
Step 2: Post your results
**[Day 3] Building a Hybrid Search Engine**
**High-Level Summary**
- **Domain:** "I built hybrid search for [your domain]"
- **Winner:** "Hybrid/Dense/Sparse worked best because [one clear reason]"
**Reproducibility**
- **Collection:** day3_hybrid_search
- **Models:** dense=[id, dim], sparse=[method]
- **Dataset:** [N items] (snapshot: YYYY-MM-DD)
**Settings (today)**
- **Fusion:** RRF, k_dense=[..], k_sparse=[..]
- **Search:** hnsw_ef=[..] (if used)
- **Sparse encoding:** [TF-IDF/BM25/SPLADE], notes: [e.g., stopwords/stemming]
**Head-to-Head (demo query: "[your query]")**
- **Dense top-3:** 1) …, 2) …, 3) …
- **Sparse top-3:** 1) …, 2) …, 3) …
- **Hybrid top-3:** 1) …, 2) …, 3) …
**Latency**
- **Dense:** avg=[..] ms (P95=[..] ms)
- **Sparse:** avg=[..] ms (P95=[..] ms)
- **Hybrid (RRF):** avg=[..] ms (P95=[..] ms)
**Why these won**
- [one line on synonyms vs exact IDs/keywords, etc.]
**Surprise**
- "[one unexpected finding]"
**Next step**
- "[one concrete action for tomorrow]"
Optional: Go Further
Advanced Fusion Strategies
Test Distribution-Based Score Fusion (DBSF) as an alternative to RRF:
# Compare RRF vs DBSF
dbsf_results = client.query_points(
    collection_name=collection_name,
    prefetch=[
        models.Prefetch(query=query_dense, using="dense", limit=20),
        models.Prefetch(query=query_sparse, using="sparse", limit=20),
    ],
    query=models.FusionQuery(fusion=models.Fusion.DBSF),
    limit=10,
)
Performance Benchmarking
Measure and compare search latency:
def benchmark_search_methods(query_text, iterations=10):
    """Benchmark different search approaches"""
    methods = {
        "dense": lambda: client.query_points(
            collection_name=collection_name,
            query=encoder.encode(query_text).tolist(),
            using="dense", limit=10,
        ),
        "sparse": lambda: client.query_points(
            collection_name=collection_name,
            query=create_sparse_vector(query_text),
            using="sparse", limit=10,
        ),
        "hybrid": lambda: hybrid_search_with_rrf(query_text),
    }
    for method_name, method_func in methods.items():
        times = []
        for _ in range(iterations):
            start = time.time()
            method_func()
            times.append((time.time() - start) * 1000)
        avg_time = sum(times) / len(times)
        print(f"{method_name}: {avg_time:.2f}ms average")
Custom Sparse Encoding
Implement TF-IDF sparse encoding fitted on your own dataset:
from sklearn.feature_extraction.text import TfidfVectorizer

def create_tfidf_sparse_vector(text, vectorizer):
    """Create sparse vector using TF-IDF (vectorizer must already be fitted)"""
    tfidf_matrix = vectorizer.transform([text])
    coo_matrix = tfidf_matrix.tocoo()
    return models.SparseVector(
        indices=coo_matrix.col.tolist(),
        values=coo_matrix.data.tolist(),
    )