Day 3

Project: Build a Hybrid Search Engine

Build a hybrid system that combines dense and sparse vectors with Reciprocal Rank Fusion (RRF), demonstrating how to pair semantic understanding with keyword precision for the best results.

Your Task

Create a production-ready hybrid search system that uses both dense and sparse vectors to deliver superior search results. You will implement the complete hybrid pipeline and compare its performance against single-vector approaches.

Estimated time: 75 minutes

What You'll Build

A hybrid search system that demonstrates:

  • Dense vector search for semantic understanding
  • Sparse vector search for exact keyword matching
  • Reciprocal Rank Fusion to combine results intelligently
  • Performance comparisons between hybrid and single-vector approaches
  • Domain-specific optimizations for your particular use case
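
Under the hood, RRF scores each document by summing 1/(k + rank) over every result list it appears in, where k is a smoothing constant (commonly 60). A minimal standalone sketch of the idea (the `rrf_fuse` helper and its inputs are illustrative, not part of Qdrant's API):

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of ids with Reciprocal Rank Fusion.

    rankings: list of ranked id lists, best result first.
    Each id scores sum(1 / (k + rank)) across the lists it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks first: it appears near the top of both lists
print(rrf_fuse([["a", "b", "c"], ["b", "c", "d"]]))
```

Qdrant performs this fusion server-side; the snippet only illustrates why items that appear in both result lists rise to the top even when neither method ranked them first.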

Setup

Prerequisites

  • Qdrant Cloud cluster (URL + API key)
  • Python 3.9+ (or Google Colab)
  • Packages: qdrant-client, sentence-transformers

Model

  • Dense encoder: sentence-transformers/all-MiniLM-L6-v2 (384-dim)

Dataset

  • A small domain dataset (e.g., 100-500 items) with at least:

    • title (string)
    • description (string) — used by both the dense and sparse encoders
    • optional metadata fields for later filtering
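
If you don't have data at hand yet, a hypothetical `your_dataset` in the expected shape looks like this (the items below are made up for illustration; substitute your own domain data):

```python
# Hypothetical items; replace with your own domain data
your_dataset = [
    {
        "title": "Wireless Noise-Cancelling Headphones",
        "description": "Over-ear Bluetooth headphones with active noise cancellation and 30-hour battery life.",
        "category": "audio",  # optional metadata for later filtering
    },
    {
        "title": "Compact Mechanical Keyboard",
        "description": "75% mechanical keyboard with hot-swappable switches and RGB backlighting.",
        "category": "peripherals",
    },
]
```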

Build Steps

Step 1: Set Up the Hybrid Collection

Building on your previous work, create a collection that holds both dense and sparse vectors:

from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer
import time

client = QdrantClient(
    "https://your-cluster-url.cloud.qdrant.io", 
    api_key="your-api-key"
)

collection_name = "day3_hybrid_search"

# Create hybrid collection
client.create_collection(
    collection_name=collection_name,
    vectors_config={
        "dense": models.VectorParams(size=384, distance=models.Distance.COSINE)
    },
    sparse_vectors_config={
        "sparse": models.SparseVectorParams(
            index=models.SparseIndexParams(on_disk=False)
        )
    }
)

Step 2: Implement Dense and Sparse Encoding

# Dense embeddings
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Global vocabulary - automatically extends as new texts are processed
global_vocabulary = {}

# Simple sparse encoding (raw term frequency; see the TF-IDF variant below for a stronger option)
def create_sparse_vector(text):
    """Create sparse vector from text using term frequency"""
    from collections import Counter
    import re
    
    # Simple tokenization
    words = re.findall(r"\b\w+\b", text.lower())
    word_counts = Counter(words)
    
    # Convert to sparse vector format, extending vocabulary as needed
    indices = []
    values = []
    
    for word, count in word_counts.items():
        if word not in global_vocabulary:
            # Add new word to vocabulary with next available index
            global_vocabulary[word] = len(global_vocabulary)
        
        indices.append(global_vocabulary[word])
        values.append(float(count))
    
    return models.SparseVector(indices=indices, values=values)

# Upload hybrid data
points = []
for i, item in enumerate(your_dataset):
    dense_vector = encoder.encode(item["description"]).tolist()
    sparse_vector = create_sparse_vector(item["description"])
    
    points.append(models.PointStruct(
        id=i,
        # Named dense and sparse vectors both go in the same vector dict
        vector={"dense": dense_vector, "sparse": sparse_vector},
        payload=item
    ))

client.upload_points(collection_name=collection_name, points=points)

Step 3: Implement Hybrid Search with RRF

def hybrid_search_with_rrf(query_text, limit=10):
    """Perform hybrid search using Reciprocal Rank Fusion"""
    
    # Encode query for both dense and sparse
    query_dense = encoder.encode(query_text).tolist()
    query_sparse = create_sparse_vector(query_text)
    
    # Use Qdrant's built-in RRF
    response = client.query_points(
        collection_name=collection_name,
        prefetch=[
            models.Prefetch(
                query=query_dense,
                using="dense",
                limit=20
            ),
            models.Prefetch(
                query=query_sparse,
                using="sparse",
                limit=20
            )
        ],
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        limit=limit
    )
    
    return response.points

# Test hybrid search
results = hybrid_search_with_rrf("your test query")
for i, point in enumerate(results, 1):
    print(f"{i}. {point.payload.get('title', 'No title')} (Score: {point.score:.3f})")

Step 4: Compare Search Methods

def compare_search_methods(query_text):
    """Compare dense, sparse, and hybrid search results"""
    
    print(f"Query: '{query_text}'\n")
    
    # Dense-only search
    dense_results = client.query_points(
        collection_name=collection_name,
        query=encoder.encode(query_text).tolist(),
        using="dense",
        limit=5
    )
    
    # Sparse-only search  
    sparse_results = client.query_points(
        collection_name=collection_name,
        query=create_sparse_vector(query_text),
        using="sparse",
        limit=5
    )
    
    # Hybrid search
    hybrid_results = hybrid_search_with_rrf(query_text, limit=5)
    
    print("DENSE SEARCH:")
    for i, point in enumerate(dense_results.points, 1):
        print(f"  {i}. {point.payload.get('title', 'No title')} ({point.score:.3f})")
    
    print("\nSPARSE SEARCH:")
    for i, point in enumerate(sparse_results.points, 1):
        print(f"  {i}. {point.payload.get('title', 'No title')} ({point.score:.3f})")
    
    print("\nHYBRID SEARCH (RRF):")
    for i, point in enumerate(hybrid_results, 1):
        print(f"  {i}. {point.payload.get('title', 'No title')} ({point.score:.3f})")
    
    print("-" * 50)

# Test with different query types
test_queries = [
    "exact keyword match query",
    "semantic concept query", 
    "mixed keyword and concept query"
]

for query in test_queries:
    compare_search_methods(query)

Step 5: Analyze Your Results

Use the output from Step 4 to evaluate how the hybrid approach compares with the single-vector methods. Focus on where hybrid corrects dense failures (rare keywords, exact identifiers) and where sparse fails (synonyms, semantic paraphrases). Optionally, time each method (dense/sparse/hybrid) over a few queries and record the average latency.

Success Criteria

You will know you have succeeded when:

  • Your hybrid collection contains both dense and sparse vectors
  • You can run searches with the dense, sparse, and hybrid methods
  • RRF fusion effectively combines results from both vector types
  • You can demonstrate cases where hybrid search beats single-vector approaches
  • You understand the trade-offs between the different search methods for your domain

Share Your Findings

Step 1: Reflect on Your Findings

  1. When did hybrid search beat pure dense or pure sparse search (give specific query types)?
  2. How did RRF change the rankings compared with each method on its own?
  3. How did the latency of the dense, sparse, and hybrid methods compare (average + P95)?
  4. How did your sparse-encoding choice (e.g., TF-IDF/BM25/SPLADE) affect the results?

Step 2: Post Your Results

Post your results in Discord using this template:

**[Day 3] Building a Hybrid Search Engine**

**High-Level Summary**
- **Domain:** "I built hybrid search for [your domain]"
- **Winner:** "Hybrid/Dense/Sparse worked best because [one clear reason]"

**Reproducibility**
- **Collection:** day3_hybrid_search
- **Models:** dense=[id, dim], sparse=[method]
- **Dataset:** [N items] (snapshot: YYYY-MM-DD)

**Settings (today)**
- **Fusion:** RRF, k_dense=[..], k_sparse=[..]
- **Search:** hnsw_ef=[..] (if used)
- **Sparse encoding:** [TF-IDF/BM25/SPLADE], notes: [e.g., stopwords/stemming]

**Head-to-Head (demo query: "[your query]")**
- **Dense top-3:** 1) …, 2) …, 3) …
- **Sparse top-3:** 1) …, 2) …, 3) …
- **Hybrid top-3:** 1) …, 2) …, 3) …

**Latency**
- **Dense:** avg=[..] ms (P95=[..] ms)
- **Sparse:** avg=[..] ms (P95=[..] ms)
- **Hybrid (RRF):** avg=[..] ms (P95=[..] ms)

**Why these won**
- [one line on synonyms vs exact IDs/keywords, etc.]

**Surprise**
- "[one unexpected finding]"

**Next step**
- "[one concrete action for tomorrow]"

Optional: Go Further

Advanced Fusion Strategies

Try Distribution-Based Score Fusion (DBSF) as an alternative to RRF:

# Compare RRF vs DBSF (query_dense and query_sparse as computed in Step 3)
dbsf_results = client.query_points(
    collection_name=collection_name,
    prefetch=[
        models.Prefetch(query=query_dense, using="dense", limit=20),
        models.Prefetch(query=query_sparse, using="sparse", limit=20)
    ],
    query=models.FusionQuery(fusion=models.Fusion.DBSF),
    limit=10
)

Performance Benchmarking

Measure and compare search latency:

def benchmark_search_methods(query_text, iterations=10):
    """Benchmark different search approaches"""
    methods = {
        "dense": lambda: client.query_points(
            collection_name=collection_name,
            query=encoder.encode(query_text).tolist(),
            using="dense", limit=10
        ),
        "sparse": lambda: client.query_points(
            collection_name=collection_name, 
            query=create_sparse_vector(query_text),
            using="sparse", limit=10
        ),
        "hybrid": lambda: hybrid_search_with_rrf(query_text)
    }
    
    for method_name, method_func in methods.items():
        times = []
        for _ in range(iterations):
            start = time.time()
            method_func()
            times.append((time.time() - start) * 1000)
        
        avg_time = sum(times) / len(times)
        print(f"{method_name}: {avg_time:.2f}ms average")
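
The reflection questions ask for P95 as well as average latency. One way to summarize the collected timings using only the standard library (`summarize_latencies` is a helper sketched here, not part of the script above):

```python
import statistics

def summarize_latencies(times_ms):
    """Return (average, approximate P95) for a list of timings in milliseconds."""
    avg = sum(times_ms) / len(times_ms)
    # quantiles(n=20) yields 19 cut points; the last one approximates the 95th percentile
    p95 = statistics.quantiles(times_ms, n=20)[-1]
    return avg, p95

avg, p95 = summarize_latencies([float(t) for t in range(1, 101)])
print(f"avg={avg:.2f}ms P95={p95:.2f}ms")
```

With only 10 iterations the P95 estimate is noisy; raise `iterations` (e.g., to 50+) if you plan to report it.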

Custom Sparse Encoding

Implement TF-IDF sparse encoding fitted on your own dataset:

from sklearn.feature_extraction.text import TfidfVectorizer

def create_tfidf_sparse_vector(text, vectorizer):
    """Create sparse vector using TF-IDF"""
    tfidf_matrix = vectorizer.transform([text])
    coo_matrix = tfidf_matrix.tocoo()
    
    return models.SparseVector(
        indices=coo_matrix.col.tolist(),
        values=coo_matrix.data.tolist()
    )
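
The vectorizer must be fitted on your corpus before `create_tfidf_sparse_vector` can transform anything. A minimal sketch of the fit-then-transform flow (the two-item corpus is made up; in practice, fit on all of your dataset's descriptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for your dataset's descriptions
corpus = [
    "bluetooth headphones with active noise cancellation",
    "mechanical keyboard with hot swappable switches",
]

# Fit once on the corpus, then reuse the same vectorizer for documents and queries
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

# Same extraction as create_tfidf_sparse_vector, shown without the Qdrant wrapper
coo = vectorizer.transform(["bluetooth headphones"]).tocoo()
indices, values = coo.col.tolist(), coo.data.tolist()
print(indices, values)
```

Unlike the growing `global_vocabulary` from Step 2, a fitted vectorizer keeps term indices stable across runs, so stored documents and later queries always agree on which index means which word.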