Day 1

Project: Build a Semantic Search Engine

Now that you've seen how semantic search works on movies, it's time to build your own. Pick a domain you care about and create a search engine that understands meaning, not just keywords.

Your Task

Build a semantic search engine for a topic of your choice. Along the way you'll discover how chunking strategies affect search quality in your particular domain.

Estimated time: 120 minutes

What You'll Build

A semantic search engine that demonstrates:

  • Domain expertise: choose content you understand, so you can judge search quality.
  • Chunking comparison: test different strategies and see which works best for your content type.
  • Real semantic understanding: search by concepts, themes, or meaning, not just exact keywords.
  • Practical insights: discover what makes chunking effective in your particular domain.

Setup

Prerequisites

  • A Qdrant Cloud cluster (URL + API key)
  • Python 3.9+ (or Colab)
  • Packages: qdrant-client, sentence-transformers, google.colab (if using Colab); see the install command below
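
If you're starting from a clean environment, the Python packages can be installed with pip; google.colab is only needed (and already available) inside Colab:

pip install qdrant-client sentence-transformers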

Model

  • SentenceTransformer: all-MiniLM-L6-v2 (384-dim)
    (You can try other models in "Optional: Going Further"; a quick dimension check follows below.)
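
If you want to confirm that the model really produces 384-dimensional vectors (the size the collection in Step 4 expects), a quick sanity check looks like this:

from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
print(encoder.get_sentence_embedding_dimension())      # 384
print(encoder.encode("a quick test sentence").shape)   # (384,)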

Dataset

Choose content with rich descriptive text, where semantic search adds real value:

  • Books/literature: search a collection of book summaries, reviews, or excerpts. Find books by theme, mood, or literary style. Example queries: "coming-of-age story with an unreliable narrator", "dystopian fiction with environmental themes"
  • Recipes/cooking: index recipe descriptions and instructions. Search by cooking technique, flavor profile, or dietary needs. Example queries: "comfort food for cold weather", "quick weeknight meals with Asian flavors"
  • News/articles: collect articles from a field you follow. Search by topic, perspective, or reporting approach. Example queries: "analysis of remote-work trends", "climate change solutions in urban planning"
  • Research papers: academic abstracts or papers from your field. Search by methodology, findings, or theoretical approach. Example queries: "machine learning applications in healthcare", "qualitative studies of user behavior"
  • Product reviews: customer reviews for products you know well. Search by user sentiment, use case, or product feature. Example queries: "laptop for video editing on a budget", "winter skincare routine for sensitive skin"

Build Steps

Step 1: Initialize the Client

Standard initialization (local)

from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient, models
import os

client = QdrantClient(url=os.getenv("QDRANT_URL"), api_key=os.getenv("QDRANT_API_KEY"))

# For Colab:
# from google.colab import userdata
# client = QdrantClient(url=userdata.get("QDRANT_URL"), api_key=userdata.get("QDRANT_API_KEY"))

encoder = SentenceTransformer("all-MiniLM-L6-v2")

Step 2: Prepare Your Dataset

Create a collection of 8-15 items with rich descriptions:

# Example: Recipe collection
my_dataset = [
    {
        "title": "Classic Beef Bourguignon",
        "description": """A rich, wine-braised beef stew from Burgundy, France. 
        Tender chunks of beef are slowly simmered with pearl onions, mushrooms, 
        and bacon in a deep red wine sauce. The long, slow cooking process 
        develops complex flavors and creates a luxurious, velvety texture. 
        Perfect for cold winter evenings when you want something hearty and 
        comforting. Traditionally served with crusty bread or creamy mashed 
        potatoes to soak up the incredible sauce.""",
        "cuisine": "French",
        "difficulty": "Intermediate",
        "time": "3 hours"
    },
    # Add 7-14 more items with similarly rich descriptions
]

Step 3: Implement Three Chunking Strategies

def fixed_size_chunks(text, chunk_size=100, overlap=20):
    """Split text into fixed-size chunks with overlap"""
    words = text.split()
    chunks = []
    
    for i in range(0, len(words), chunk_size - overlap):
        chunk_words = words[i:i + chunk_size]
        if chunk_words:  # Only add non-empty chunks
            chunks.append(' '.join(chunk_words))
    
    return chunks

def sentence_chunks(text, max_sentences=3):
    """Group sentences into chunks"""
    import re
    sentences = re.split(r'[.!?]+', text)
    sentences = [s.strip() for s in sentences if s.strip()]
    
    chunks = []
    for i in range(0, len(sentences), max_sentences):
        chunk_sentences = sentences[i:i + max_sentences]
        if chunk_sentences:
            chunks.append('. '.join(chunk_sentences) + '.')
    
    return chunks

def paragraph_chunks(text):
    """Split by paragraphs or double line breaks"""
    chunks = [chunk.strip() for chunk in text.split('\n\n') if chunk.strip()]
    return chunks if chunks else [text]  # Fallback to full text
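
Before indexing anything, it helps to run all three chunkers on a single item and eyeball the output. A small sanity check, reusing the my_dataset example from Step 2:

# Compare how each strategy splits the same description
sample = my_dataset[0]["description"]

for name, chunker in [("fixed", fixed_size_chunks),
                      ("sentence", sentence_chunks),
                      ("paragraph", paragraph_chunks)]:
    chunks = chunker(sample)
    print(f"{name}: {len(chunks)} chunk(s)")
    for chunk in chunks:
        print(f"  - {chunk[:60]}...")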

Step 4: Create the Collection and Process the Data

Note: If you're already familiar with Qdrant's filterable HNSW, you'll know that efficient filtering and grouping usually rely on payload indexes being created before the HNSW index is built. To keep this tutorial simple, we create just one payload index here (for the chunk_strategy field we filter on below) and cover the proper use of payload indexes on Day 2 of this course.

collection_name = "day1_semantic_search"

if client.collection_exists(collection_name=collection_name):
    client.delete_collection(collection_name=collection_name)

# Create a collection with three named vectors
client.create_collection(
    collection_name=collection_name,
    vectors_config={
        "fixed": models.VectorParams(size=384, distance=models.Distance.COSINE),
        "sentence": models.VectorParams(size=384, distance=models.Distance.COSINE),
        "paragraph": models.VectorParams(size=384, distance=models.Distance.COSINE),
    },
)

# Index fields for filtering (more on this on day 2)
client.create_payload_index(
    collection_name=collection_name,
    field_name="chunk_strategy",
    field_schema=models.PayloadSchemaType.KEYWORD,
)

# Process and upload data
points = []
point_id = 0

for item in my_dataset:
    description = item["description"]

    # Process with each chunking strategy
    strategies = {
        "fixed": fixed_size_chunks(description),
        "sentence": sentence_chunks(description),
        "paragraph": paragraph_chunks(description),
    }

    for strategy_name, chunks in strategies.items():
        for chunk_idx, chunk in enumerate(chunks):
            # Create vectors for this chunk
            vectors = {strategy_name: encoder.encode(chunk).tolist()}

            points.append(
                models.PointStruct(
                    id=point_id,
                    vector=vectors,
                    payload={
                        **item,  # Include all original metadata
                        "chunk": chunk,
                        "chunk_strategy": strategy_name,
                        "chunk_index": chunk_idx,
                    },
                )
            )
            point_id += 1

client.upload_points(collection_name=collection_name, points=points)
print(f"Uploaded {len(points)} chunks across three strategies")

Step 5: Test and Compare

def compare_search_results(query):
    """Compare search results across all chunking strategies"""
    print(f"Query: '{query}'\n")

    for strategy in ["fixed", "sentence", "paragraph"]:
        results = client.query_points(
            collection_name=collection_name,
            query=encoder.encode(query).tolist(),
            using=strategy,
            limit=3,
        )

        print(f"--- {strategy.upper()} CHUNKING ---")
        for i, point in enumerate(results.points, 1):
            print(f"{i}. {point.payload['title']}")
            print(f"   Score: {point.score:.3f}")
            print(f"   Chunk: {point.payload['chunk'][:80]}...")
        print()


# Test with domain-specific queries
test_queries = [
    "comfort food for winter",  # Adapt these to your domain
    "quick and easy weeknight dinner",
    "elegant dish for special occasions",
]

for query in test_queries:
    compare_search_results(query)

Step 6: Analyze Your Results

After running your tests, analyze what you found:

def analyze_chunking_effectiveness():
    """Analyze which chunking strategy works best for your domain"""

    print("CHUNKING STRATEGY ANALYSIS")
    print("=" * 40)

    # Get chunk statistics for each strategy
    for strategy in ["fixed", "sentence", "paragraph"]:
        # Count chunks per strategy
        results = client.scroll(
            collection_name=collection_name,
            scroll_filter=models.Filter(
                must=[
                    models.FieldCondition(
                        key="chunk_strategy", match=models.MatchValue(value=strategy)
                    )
                ]
            ),
            limit=100,
        )

        chunks = results[0]
        chunk_sizes = [len(chunk.payload["chunk"]) for chunk in chunks]

        print(f"\n{strategy.upper()} STRATEGY:")
        print(f"  Total chunks: {len(chunks)}")
        print(f"  Avg chunk size: {sum(chunk_sizes)/len(chunk_sizes):.0f} chars")
        print(f"  Size range: {min(chunk_sizes)}-{max(chunk_sizes)} chars")


analyze_chunking_effectiveness()

Success Criteria

You'll know you've succeeded when:

  • Your search engine finds relevant results by meaning, not just by keyword
  • You can clearly explain which chunking strategy works best for your domain
  • You discovered something surprising about how chunking affects search
  • You can articulate the trade-offs between the different approaches

Share Your Findings

Now it's time to analyze your results and share what you've learned. Follow these steps to document your findings and get ready to share them.

Step 1: Reflect on Your Findings

  • Domain and dataset: what content you chose and why; dataset size/complexity.
  • Chunking comparison: your observations on fixed, sentence, and paragraph chunking.
  • Winner: which strategy worked best and why (one clear reason).
  • Example query: one query where the winner beat the other strategies.

Step 2: Post Your Results

Post your results in Discord using the short template below: copy it, fill it in, and send it.

**[Day 1] Building a Semantic Search Engine**

**High-Level Summary**
- **Domain:** "I built a semantic search for [recipes/books/articles/etc.]"
- **Winner:** "Best chunking strategy was [fixed/sentence/paragraph] because [one reason]"

**Project-Specific Details**
- **Collection:** day1_semantic_search (Cosine) with vectors: fixed/sentence/paragraph
- **Dataset:** [N items] (snapshot: YYYY-MM-DD)
- **Chunks:** fixed=[count]/[avg chars], sentence=[count]/[avg chars], paragraph=[count]/[avg chars]
- **Demo query:** "Try '[your example query]'" — it found [what was interesting]

**Surprise**
- "[Most unexpected finding was …]"

**Next step**
- "[What you’ll try tomorrow]"

Optional: Going Further

Add Metadata Filtering

Enhance your search with filters, as in the movie demo:

# Example: Find Italian recipes that are quick to make
# Tip: create payload indexes (create_payload_index) for any new fields you want to
# filter on, ideally before uploading the data (see the note in Step 4 and Day 2).

results = client.query_points(
    collection_name=collection_name,
    query=encoder.encode("comfort food").tolist(),
    using="sentence",
    query_filter=models.Filter(
        must=[
            models.FieldCondition(key="cuisine", match=models.MatchValue(value="Italian")),
            models.FieldCondition(key="time", match=models.MatchValue(value="30 minutes"))
        ]
    ),
    limit=3
)
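
A minimal sketch of the index calls the tip above refers to, assuming the cuisine and time payload fields from the Step 2 example (create them before uploading data, as described in the note in Step 4):

# Hypothetical example: index the payload fields used in the filter above
for field in ["cuisine", "time"]:
    client.create_payload_index(
        collection_name=collection_name,
        field_name=field,
        field_schema=models.PayloadSchemaType.KEYWORD,
    )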

Try Different Embedding Models

Try other models and see how they affect your results:

# Compare with a different model
encoder_large = SentenceTransformer("all-mpnet-base-v2")  # Larger, potentially better
encoder_fast = SentenceTransformer("all-MiniLM-L12-v2")   # Different size/speed tradeoff
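
Keep in mind that switching models can change the vector size: all-mpnet-base-v2 produces 768-dimensional embeddings, so it won't fit the 384-dim collection created above. A minimal sketch, using a separate (hypothetical) collection sized to the larger model:

# The larger model outputs 768-dim vectors, so it needs its own collection
dim = encoder_large.get_sentence_embedding_dimension()  # 768 for all-mpnet-base-v2

client.create_collection(
    collection_name="day1_semantic_search_mpnet",
    vectors_config={
        "sentence": models.VectorParams(size=dim, distance=models.Distance.COSINE),
    },
)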

Ready for Day 2? Tomorrow you'll learn how Qdrant makes vector search lightning fast with HNSW indexing, and how to optimize it for production workloads.