Unstructured
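Unstructured is a library designed to help preprocess and structure unstructured text documents for downstream machine learning tasks.
Qdrant can be used as an ingestion destination in Unstructured.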
Setup
Install Unstructured with the qdrant extra.
pip install "unstructured[qdrant]"
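The examples on this page write to a collection named test. The following is a minimal sketch, assuming that collection does not yet exist in your Qdrant instance and has to be created before ingestion. It uses Qdrant's REST endpoint for creating a collection and only the Python standard library. The helper names are hypothetical, and the vector size of 384 and the Cosine distance in the usage note are assumptions that must match the embedding model you actually use.

```python
import json
import urllib.request


def build_create_collection_request(
    base_url: str,
    collection_name: str,
    vector_size: int,
    distance: str = "Cosine",
) -> urllib.request.Request:
    """Build (but do not send) a PUT request that creates a Qdrant collection."""
    # vector_size must equal the dimension of the embeddings you will ingest.
    body = {"vectors": {"size": vector_size, "distance": distance}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/collections/{collection_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def create_collection(base_url: str, collection_name: str, vector_size: int) -> str:
    """Send the request to a running Qdrant instance and return the raw response body."""
    request = build_create_collection_request(base_url, collection_name, vector_size)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")
```

With a Qdrant instance listening locally, calling create_collection("http://localhost:6333", "test", 384) would create the collection; the value 384 is an assumed embedding dimension, not something stated on this page.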
Usage
Depending on your use case, you can run ingestion from the command line or invoke it from within your application.
CLI
EMBEDDING_PROVIDER=${EMBEDDING_PROVIDER:-"langchain-huggingface"}

unstructured-ingest \
  local \
  --input-path example-docs/book-war-and-peace-1225p.txt \
  --output-dir local-output-to-qdrant \
  --strategy fast \
  --chunk-elements \
  --embedding-provider "$EMBEDDING_PROVIDER" \
  --num-processes 2 \
  --verbose \
  qdrant \
  --collection-name "test" \
  --url "http://localhost:6333" \
  --batch-size 80
For a full list of the options the CLI accepts, run unstructured-ingest <upstream connector> qdrant --help
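If you want to drive the CLI from a script rather than a shell, the command above can be assembled as an argument list. This is a minimal sketch: the function names build_ingest_command and run_ingest are hypothetical, and the flags are only those shown in the CLI example on this page.

```python
import subprocess
from typing import List


def build_ingest_command(
    input_path: str,
    output_dir: str,
    collection_name: str,
    url: str,
    embedding_provider: str = "langchain-huggingface",
    num_processes: int = 2,
    batch_size: int = 80,
) -> List[str]:
    """Return the unstructured-ingest invocation as an argv list (no shell quoting needed)."""
    return [
        "unstructured-ingest",
        # Upstream (source) connector and its options.
        "local",
        "--input-path", input_path,
        "--output-dir", output_dir,
        "--strategy", "fast",
        "--chunk-elements",
        "--embedding-provider", embedding_provider,
        "--num-processes", str(num_processes),
        "--verbose",
        # Destination connector and its options.
        "qdrant",
        "--collection-name", collection_name,
        "--url", url,
        "--batch-size", str(batch_size),
    ]


def run_ingest(command: List[str]) -> int:
    """Run the command; raises CalledProcessError if the CLI exits with a non-zero status."""
    return subprocess.run(command, check=True).returncode
```

Passing the list to subprocess.run without shell=True avoids quoting problems in paths and URLs. run_ingest assumes unstructured-ingest is installed and on your PATH.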
Programmatic usage
from unstructured.ingest.connector.local import SimpleLocalConfig
from unstructured.ingest.connector.qdrant import (
    QdrantWriteConfig,
    SimpleQdrantConfig,
)
from unstructured.ingest.interfaces import (
    ChunkingConfig,
    EmbeddingConfig,
    PartitionConfig,
    ProcessorConfig,
    ReadConfig,
)
from unstructured.ingest.runner import LocalRunner
from unstructured.ingest.runner.writers.base_writer import Writer
from unstructured.ingest.runner.writers.qdrant import QdrantWriter


def get_writer() -> Writer:
    return QdrantWriter(
        connector_config=SimpleQdrantConfig(
            url="http://localhost:6333",
            collection_name="test",
        ),
        write_config=QdrantWriteConfig(batch_size=80),
    )


if __name__ == "__main__":
    writer = get_writer()
    runner = LocalRunner(
        processor_config=ProcessorConfig(
            verbose=True,
            output_dir="local-output-to-qdrant",
            num_processes=2,
        ),
        connector_config=SimpleLocalConfig(
            input_path="example-docs/book-war-and-peace-1225p.txt",
        ),
        read_config=ReadConfig(),
        partition_config=PartitionConfig(),
        chunking_config=ChunkingConfig(chunk_elements=True),
        embedding_config=EmbeddingConfig(provider="langchain-huggingface"),
        writer=writer,
        writer_kwargs={},
    )
    runner.run()
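After a run completes, you may want to confirm that points actually reached the collection. The following is a minimal sketch that queries Qdrant's REST endpoint for counting points, using only the standard library. The helper names are hypothetical and are not part of Unstructured; the URL and collection name you pass should be the ones used in your own run.

```python
import json
import urllib.request


def build_count_request(
    base_url: str, collection_name: str, exact: bool = True
) -> urllib.request.Request:
    """Build (but do not send) a POST request asking Qdrant how many points a collection holds."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/collections/{collection_name}/points/count",
        data=json.dumps({"exact": exact}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_count(response_body: bytes) -> int:
    """Extract the point count from the JSON body of a count response."""
    return int(json.loads(response_body)["result"]["count"])


def count_points(base_url: str, collection_name: str) -> int:
    """Query a running Qdrant instance and return the number of stored points."""
    request = build_count_request(base_url, collection_name)
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_count(response.read())
```

For the configuration above, count_points("http://localhost:6333", "test") should return a non-zero value once ingestion has finished. The exact number depends on how the document was chunked.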