Project: Build a Semantic Search Engine
Now that you've seen how semantic search works for movies, it's time to build your own. Pick a domain you care about and create a search engine that understands meaning, not just keywords.
Your Task
Build a semantic search engine for a topic of your choice. Along the way you'll discover how chunking strategy affects search quality in your particular domain.
Estimated time: 120 minutes
What You'll Build
A semantic search engine that demonstrates:
- Domain expertise: choose content you understand, so you can judge search quality.
- Chunking comparison: test different strategies to see which works best for your content type.
- Real semantic understanding: search by concept, theme, or meaning, not just exact keywords.
- Practical insight: discover what makes chunking effective in your particular domain.
Setup
Prerequisites
- Qdrant Cloud cluster (URL + API key)
- Python 3.9+ (or Colab)
- Packages: qdrant-client, sentence-transformers, google.colab (if using Colab); a typical install command is sketched below
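If the packages aren't already available in your environment, the install step usually looks like this (package names taken from the list above; google.colab comes pre-installed on Colab):
# In a notebook (including Colab), install the two pip packages like this;
# in a terminal, drop the leading "%".
%pip install qdrant-client sentence-transformers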
Model
- SentenceTransformer: all-MiniLM-L6-v2 (384-dim)
(You can try other models in "Optional: Going Further".)
Dataset
Choose content with rich descriptive text where semantic search adds value; a sketch of a non-recipe item follows this list.
- Books/literature: a collection of book summaries, reviews, or excerpts to search. Find books by theme, mood, or literary style. Example queries: "coming-of-age story with an unreliable narrator", "dystopian fiction with environmental themes"
- Recipes/cooking: index recipe descriptions and instructions. Search by cooking technique, flavor profile, or dietary needs. Example queries: "comfort food for cold weather", "quick weeknight meals with Asian flavors"
- News/articles: collect articles from a field you follow. Search by topic, perspective, or reporting approach. Example queries: "analysis of remote work trends", "climate change solutions in urban planning"
- Research papers: academic abstracts or papers from your field. Search by methodology, findings, or theoretical approach. Example queries: "machine learning applications in healthcare", "qualitative studies of user behavior"
- Product reviews: customer reviews for products you know well. Search by user sentiment, use case, or product feature. Example queries: "budget laptop for video editing", "winter skincare routine for sensitive skin"
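Whichever domain you pick, keep the same title/description shape; anything beyond those two fields is optional metadata you define yourself. A hypothetical item for a books dataset, for illustration only:
# Hypothetical example item for a books dataset (metadata fields are up to you)
book_item = {
    "title": "The Remains of the Day",
    "description": """An aging English butler reflects on decades of loyal service at a grand
    country house, slowly confronting questions of duty, dignity, and the personal life he
    set aside. A quiet, melancholic novel told by a famously unreliable narrator.""",
    "author": "Kazuo Ishiguro",
    "themes": "memory, regret, duty",
}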
Build Steps
Step 1: Initialize the Client
Standard initialization (local)
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient, models
import os
client = QdrantClient(url=os.getenv("QDRANT_URL"), api_key=os.getenv("QDRANT_API_KEY"))
# For Colab:
# from google.colab import userdata
# client = QdrantClient(url=userdata.get("QDRANT_URL"), api_key=userdata.get("QDRANT_API_KEY"))
encoder = SentenceTransformer("all-MiniLM-L6-v2")
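Optionally, run a quick sanity check that the connection works; get_collections simply lists the collections already in your cluster.
# Optional: confirm the connection by listing existing collections
print(client.get_collections())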
Step 2: Prepare Your Dataset
Create a collection of 8-15 items with rich descriptions
# Example: Recipe collection
my_dataset = [
{
"title": "Classic Beef Bourguignon",
"description": """A rich, wine-braised beef stew from Burgundy, France.
Tender chunks of beef are slowly simmered with pearl onions, mushrooms,
and bacon in a deep red wine sauce. The long, slow cooking process
develops complex flavors and creates a luxurious, velvety texture.
Perfect for cold winter evenings when you want something hearty and
comforting. Traditionally served with crusty bread or creamy mashed
potatoes to soak up the incredible sauce.""",
"cuisine": "French",
"difficulty": "Intermediate",
"time": "3 hours"
},
# Add 7-14 more items with similarly rich descriptions
]
Step 3: Implement Three Chunking Strategies
def fixed_size_chunks(text, chunk_size=100, overlap=20):
"""Split text into fixed-size chunks with overlap"""
words = text.split()
chunks = []
for i in range(0, len(words), chunk_size - overlap):
chunk_words = words[i:i + chunk_size]
if chunk_words: # Only add non-empty chunks
chunks.append(' '.join(chunk_words))
return chunks
def sentence_chunks(text, max_sentences=3):
"""Group sentences into chunks"""
import re
sentences = re.split(r'[.!?]+', text)
sentences = [s.strip() for s in sentences if s.strip()]
chunks = []
for i in range(0, len(sentences), max_sentences):
chunk_sentences = sentences[i:i + max_sentences]
if chunk_sentences:
chunks.append('. '.join(chunk_sentences) + '.')
return chunks
def paragraph_chunks(text):
"""Split by paragraphs or double line breaks"""
chunks = [chunk.strip() for chunk in text.split('\n\n') if chunk.strip()]
return chunks if chunks else [text] # Fallback to full text
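Before indexing anything, it can help to preview what each strategy produces. A minimal check, assuming my_dataset from Step 2 is already defined:
# Preview how each strategy splits the first item's description
sample_text = my_dataset[0]["description"]
for name, chunker in [
    ("fixed", fixed_size_chunks),
    ("sentence", sentence_chunks),
    ("paragraph", paragraph_chunks),
]:
    chunks = chunker(sample_text)
    print(f"{name}: {len(chunks)} chunk(s); first chunk starts: {chunks[0][:60]}...")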
Step 4: Create the Collection and Process the Data
Note: If you're already familiar with Qdrant's filterable HNSW, you'll know that efficient filtering and grouping usually rely on creating payload indexes before the HNSW index is built. To keep this tutorial simple, we index only the single chunk_strategy field used for the comparisons below and stick to basic filtered search; proper payload index usage is covered on Day 2 of this course.
collection_name = "day1_semantic_search"
if client.collection_exists(collection_name=collection_name):
client.delete_collection(collection_name=collection_name)
# Create a collection with three named vectors
client.create_collection(
collection_name=collection_name,
vectors_config={
"fixed": models.VectorParams(size=384, distance=models.Distance.COSINE),
"sentence": models.VectorParams(size=384, distance=models.Distance.COSINE),
"paragraph": models.VectorParams(size=384, distance=models.Distance.COSINE),
},
)
# Index fields for filtering (more on this on day 2)
client.create_payload_index(
collection_name=collection_name,
field_name="chunk_strategy",
field_schema=models.PayloadSchemaType.KEYWORD,
)
# Process and upload data
points = []
point_id = 0
for item in my_dataset:
description = item["description"]
# Process with each chunking strategy
strategies = {
"fixed": fixed_size_chunks(description),
"sentence": sentence_chunks(description),
"paragraph": paragraph_chunks(description),
}
for strategy_name, chunks in strategies.items():
for chunk_idx, chunk in enumerate(chunks):
# Create vectors for this chunk
vectors = {strategy_name: encoder.encode(chunk).tolist()}
points.append(
models.PointStruct(
id=point_id,
vector=vectors,
payload={
**item, # Include all original metadata
"chunk": chunk,
"chunk_strategy": strategy_name,
"chunk_index": chunk_idx,
},
)
)
point_id += 1
client.upload_points(collection_name=collection_name, points=points)
print(f"Uploaded {len(points)} chunks across three strategies")
Step 5: Test and Compare
def compare_search_results(query):
"""Compare search results across all chunking strategies"""
print(f"Query: '{query}'\n")
for strategy in ["fixed", "sentence", "paragraph"]:
results = client.query_points(
collection_name=collection_name,
query=encoder.encode(query).tolist(),
using=strategy,
limit=3,
)
print(f"--- {strategy.upper()} CHUNKING ---")
for i, point in enumerate(results.points, 1):
print(f"{i}. {point.payload['title']}")
print(f" Score: {point.score:.3f}")
print(f" Chunk: {point.payload['chunk'][:80]}...")
print()
# Test with domain-specific queries
test_queries = [
"comfort food for winter", # Adapt these to your domain
"quick and easy weeknight dinner",
"elegant dish for special occasions",
]
for query in test_queries:
compare_search_results(query)
Step 6: Analyze Your Results
After running your tests, analyze what you found
def analyze_chunking_effectiveness():
"""Analyze which chunking strategy works best for your domain"""
print("CHUNKING STRATEGY ANALYSIS")
print("=" * 40)
# Get chunk statistics for each strategy
for strategy in ["fixed", "sentence", "paragraph"]:
# Count chunks per strategy
results = client.scroll(
collection_name=collection_name,
scroll_filter=models.Filter(
must=[
models.FieldCondition(
key="chunk_strategy", match=models.MatchValue(value=strategy)
)
]
),
limit=100,
)
chunks = results[0]
chunk_sizes = [len(chunk.payload["chunk"]) for chunk in chunks]
print(f"\n{strategy.upper()} STRATEGY:")
print(f" Total chunks: {len(chunks)}")
print(f" Avg chunk size: {sum(chunk_sizes)/len(chunk_sizes):.0f} chars")
print(f" Size range: {min(chunk_sizes)}-{max(chunk_sizes)} chars")
analyze_chunking_effectiveness()
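The function above only looks at chunk counts and sizes. For a rough quality signal, one option is to average each strategy's top-result score across your test queries; higher isn't automatically better, but large gaps are worth a closer look. A sketch reusing test_queries from Step 5:
# Rough quality signal: average top-1 similarity score per strategy across the test queries
def average_top_score(strategy, queries):
    scores = []
    for q in queries:
        res = client.query_points(
            collection_name=collection_name,
            query=encoder.encode(q).tolist(),
            using=strategy,
            limit=1,
        )
        if res.points:
            scores.append(res.points[0].score)
    return sum(scores) / len(scores) if scores else 0.0

for strategy in ["fixed", "sentence", "paragraph"]:
    print(f"{strategy}: avg top-1 score = {average_top_score(strategy, test_queries):.3f}")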
Success Criteria
You'll know you've succeeded when:
- Your search engine finds relevant results by meaning, not just by keyword
- You can clearly explain which chunking strategy works best for your domain
- You discovered something surprising about how chunking affects search
- You can articulate the trade-offs between the different approaches
Share Your Findings
Now it's time to analyze your results and share what you learned. Follow these steps to document your findings and prepare them for sharing.
Step 1: Reflect on Your Findings
- Domain and dataset: what content you chose and why; dataset size/complexity.
- Chunking comparison: what you observed for fixed, sentence, and paragraph chunking.
- Winner: which strategy worked best and why (one clear reason).
- Example query: a query where the winner beat the other strategies.
Step 2: Post Your Results
**[Day 1] Building a Semantic Search Engine**
**High-Level Summary**
- **Domain:** "I built a semantic search for [recipes/books/articles/etc.]"
- **Winner:** "Best chunking strategy was [fixed/sentence/paragraph] because [one reason]"
**Project-Specific Details**
- **Collection:** day1_semantic_search (Cosine) with vectors: fixed/sentence/paragraph
- **Dataset:** [N items] (snapshot: YYYY-MM-DD)
- **Chunks:** fixed=[count]/[avg chars], sentence=[count]/[avg chars], paragraph=[count]/[avg chars]
- **Demo query:** "Try '[your example query]'" — it found [what was interesting]
**Surprise**
- "[Most unexpected finding was …]"
**Next step**
- "[What you’ll try tomorrow]"
Optional: Going Further
Add Metadata Filtering
Enhance your search with filters, like the ones we saw in the movie demo
# Example: Find Italian recipes that are quick to make
# Tip: to filter on additional fields, recreate the collection, create payload indexes
# for those fields, and re-upload the data (see the sketch after this block).
results = client.query_points(
collection_name=collection_name,
query=encoder.encode("comfort food").tolist(),
using="sentence",
query_filter=models.Filter(
must=[
models.FieldCondition(key="cuisine", match=models.MatchValue(value="Italian")),
models.FieldCondition(key="time", match=models.MatchValue(value="30 minutes"))
]
),
limit=3
)
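The tip above mentions indexing the fields you filter on. A sketch of that indexing step for the two fields used in this example (run it after creating the collection and before relying on filtered search):
# Index the payload fields used in the filter above (cuisine and time from the example dataset)
for field in ["cuisine", "time"]:
    client.create_payload_index(
        collection_name=collection_name,
        field_name=field,
        field_schema=models.PayloadSchemaType.KEYWORD,
    )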
Try Different Embedding Models
Experiment with other models to see how they affect results
# Compare with a different model
encoder_large = SentenceTransformer("all-mpnet-base-v2") # Larger, potentially better
encoder_fast = SentenceTransformer("all-MiniLM-L12-v2") # Different size/speed tradeoff
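Note that a different model may produce a different embedding size; all-mpnet-base-v2, for example, outputs 768-dimensional vectors, so the collection's vector size has to match. A quick way to check before creating a collection:
# Check each model's embedding dimension; it must match the vector size in create_collection
print(encoder_large.get_sentence_embedding_dimension())  # 768 for all-mpnet-base-v2
print(encoder_fast.get_sentence_embedding_dimension())   # 384 for all-MiniLM-L12-v2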
Ready for Day 2? Tomorrow you'll learn how Qdrant makes vector search lightning fast with HNSW indexing, and how to optimize for production workloads.