Chonkie
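Chonkie is a simple, ultra-lightweight, and very fast chunking library designed for RAG (retrieval-augmented generation) applications.
Chonkie integrates with Qdrant through the QdrantHandshake class, so you can chunk, embed, and store text data without leaving the Chonkie SDK.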
Setup
Install Chonkie with Qdrant support:
pip install "chonkie[qdrant]"
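To confirm the install before going further, a quick check along these lines can help. This is only a sketch: it assumes the qdrant extra pulls in the qdrant-client package (importable as qdrant_client), and the helper name missing_modules is ours, not part of Chonkie.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumption: the qdrant extra installs qdrant-client, imported as qdrant_client.
    missing = missing_modules(["chonkie", "qdrant_client"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("Chonkie and the Qdrant client are importable.")
```

If anything is reported missing, re-running the pip command above in the active environment is the usual first step.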
Basic Usage
QdrantHandshake provides a simple interface for storing and searching chunks:
from chonkie import QdrantHandshake, SemanticChunker

# Initialize handshake with custom embedding model
handshake = QdrantHandshake(
    url="https://:6333",  # host is elided here; point this at your Qdrant instance
    collection_name="my_documents",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2"
)

# Create and write chunks
chunker = SemanticChunker()
chunks = chunker.chunk("Your text content here...")
handshake.write(chunks)

# Search using natural language
results = handshake.search(query="your search query", limit=5)
for result in results:
    print(f"{result['score']}: {result['text']}")
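To make the search step less of a black box, here is a toy sketch of similarity ranking: the query is turned into a vector, compared with the stored chunk vectors, and the best matches are returned with a score. It assumes cosine similarity and hand-written 3-dimensional vectors; the metric Qdrant actually uses depends on how the collection is configured, and rank_chunks is a hypothetical helper, not a Chonkie or Qdrant function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_chunks(query_vector, stored, limit=5):
    """Score every (text, vector) pair against the query and keep the top `limit`."""
    scored = [
        {"score": cosine_similarity(query_vector, vector), "text": text}
        for text, vector in stored
    ]
    scored.sort(key=lambda item: item["score"], reverse=True)
    return scored[:limit]


# Toy 3-dimensional "embeddings"; real models produce hundreds of dimensions.
stored = [
    ("Qdrant stores vectors", [1.0, 0.0, 0.0]),
    ("Chonkie splits text", [0.0, 1.0, 0.0]),
    ("Vector search in Qdrant", [0.9, 0.1, 0.0]),
]
results = rank_chunks([1.0, 0.0, 0.0], stored, limit=2)
for result in results:
    print(f"{result['score']:.3f}: {result['text']}")
```

The result dictionaries reuse the score and text keys from the loop above so the two snippets read the same way.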
Qdrant Cloud
handshake = QdrantHandshake(
    url="https://your-cluster.qdrant.io",
    api_key="your-api-key",
    collection_name="my_collection",
    embedding_model="BAAI/bge-small-en-v1.5"  # Change to your preferred model
)
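In practice you will probably not want the API key hard-coded in source. One possible approach, sketched below, reads the connection settings from environment variables. The variable names QDRANT_URL, QDRANT_API_KEY, and QDRANT_COLLECTION and the helper qdrant_cloud_settings are illustrative assumptions, not names that Chonkie or Qdrant define.

```python
import os


def qdrant_cloud_settings(env=None):
    """Build connection keyword arguments from environment-style settings."""
    env = os.environ if env is None else env
    missing = [name for name in ("QDRANT_URL", "QDRANT_API_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "url": env["QDRANT_URL"],
        "api_key": env["QDRANT_API_KEY"],
        # Hypothetical optional variable; falls back to the name used above.
        "collection_name": env.get("QDRANT_COLLECTION", "my_collection"),
    }


# An explicit mapping is passed here so the snippet runs without real credentials.
settings = qdrant_cloud_settings(
    {"QDRANT_URL": "https://your-cluster.qdrant.io", "QDRANT_API_KEY": "your-api-key"}
)
print(sorted(settings))
```

The resulting dictionary carries the same url, api_key, and collection_name arguments shown above, so it could be unpacked into the QdrantHandshake constructor alongside embedding_model.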
Complete RAG Pipeline
Build an end-to-end RAG pipeline with Chonkie's fluent Pipeline API:
from chonkie import Pipeline

# Process documents and store in Qdrant with custom embedding model
docs = (Pipeline()
    .fetch_from("file", dir="./knowledge_base", ext=[".txt", ".md"])
    .process_with("text")
    .chunk_with("semantic", chunk_size=512)
    .store_in("qdrant",
              collection_name="knowledge",
              url="https://:6333",  # host is elided here; point this at your Qdrant instance
              embedding_model="sentence-transformers/all-MiniLM-L6-v2")
    .run())

print(f"Ingested {len(docs)} documents into Qdrant")
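Before running an ingestion like this, it can be useful to preview which files the dir and ext filters would plausibly pick up. The sketch below is a standalone approximation using pathlib. It only scans the top level of the directory; whether Chonkie's file fetcher also descends into subdirectories is not covered here, so treat matching_files as a hypothetical helper rather than a mirror of the library's behavior.

```python
from pathlib import Path
import tempfile


def matching_files(directory, extensions):
    """Return sorted files directly inside `directory` whose suffix is in `extensions`."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in wanted
    )


# Demonstrate on a throwaway directory so the snippet runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("guide.md", "notes.txt", "logo.png"):
        (Path(tmp) / name).write_text("sample")
    names = [path.name for path in matching_files(tmp, [".txt", ".md"])]
    print(names)
```

Pointing matching_files at "./knowledge_base" with the same extension list gives a rough idea of what the fetch step has to work with.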
Pipeline with Refinements
from chonkie import Pipeline

# Advanced pipeline with overlapping context and custom embeddings
docs = (Pipeline()
    .fetch_from("file", dir="./docs")
    .process_with("text")
    .chunk_with("semantic", threshold=0.8)
    .refine_with("overlap", context_size=100)
    .store_in("qdrant",
              url="https://your-cluster.qdrant.io",
              api_key="your-api-key",
              collection_name="knowledge_base",
              embedding_model="BAAI/bge-small-en-v1.5")
    .run())
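The idea behind the overlap refinement is that each chunk carries a little context from its neighbor, so a retrieved chunk is easier to interpret on its own. The sketch below illustrates that idea in a deliberately simplified form: it counts characters and prepends the tail of the previous chunk. Chonkie's own refinery may measure context_size in tokens and may attach context differently, so add_overlap is a hypothetical stand-in, not the library's implementation.

```python
def add_overlap(chunks, context_size):
    """Prefix each chunk after the first with the tail of the previous chunk."""
    if context_size < 0:
        raise ValueError("context_size must be non-negative")
    refined = []
    for index, chunk in enumerate(chunks):
        if index == 0 or context_size == 0:
            # Nothing to borrow from: keep the chunk unchanged.
            refined.append(chunk)
        else:
            # Last `context_size` characters of the original previous chunk.
            refined.append(chunks[index - 1][-context_size:] + chunk)
    return refined


chunks = [
    "Qdrant is a vector database. ",
    "Chonkie splits text into chunks. ",
    "Together they power RAG.",
]
refined = add_overlap(chunks, context_size=10)
for text in refined:
    print(repr(text))
```

Larger context sizes improve continuity between neighboring chunks at the cost of storing and embedding more duplicated text.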