Unstructured
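Unstructured is a library designed to help preprocess and structure unstructured text documents for downstream machine learning tasks.
Qdrant can be used as an ingestion destination in Unstructured.
Setup
Install Unstructured with the qdrant extra.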
pip install "unstructured-ingest[qdrant]"
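To confirm the installation worked, a quick check such as the following may help. It is a minimal sketch: the helper name installed_version is hypothetical, and the assumption that the qdrant extra pulls in the qdrant-client distribution is not stated in this guide.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# "qdrant-client" is an assumed dependency of the qdrant extra.
for name in ("unstructured-ingest", "qdrant-client"):
    version = installed_version(name)
    print(f"{name}: {version or 'not installed'}")
```

If either line prints "not installed", rerun the pip command above in the same environment.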
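Usage
Depending on your use case, you can run the ingestion from the command line or call it from within your application.
Command-line interface (CLI)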
unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --chunking-strategy by_title \
    --embedding-provider huggingface \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
  qdrant-cloud \
    --url $QDRANT_URL \
    --api-key $QDRANT_API_KEY \
    --collection-name $QDRANT_COLLECTION \
    --batch-size 50 \
    --num-processes 1
For the full list of options the CLI accepts, run unstructured-ingest <upstream connector> qdrant --help
Programmatic usage
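When the CLI is launched from a script, building the argument list in Python avoids the shell escaping needed for the JSON value of --additional-partition-args. The sketch below assembles the same command shown above; build_ingest_command is a hypothetical helper, not part of unstructured-ingest, and it assumes the six environment variables used in the command are present in the mapping it receives.

```python
import json
import os
import shlex


def build_ingest_command(env):
    """Build the argv list for the CLI command above from an environment mapping.

    Raises KeyError if one of the referenced variables is missing.
    """
    partition_args = json.dumps(
        {
            "split_pdf_page": "true",
            "split_pdf_allow_failed": "true",
            "split_pdf_concurrency_level": 15,
        }
    )
    return [
        "unstructured-ingest",
        "local",
        "--input-path", env["LOCAL_FILE_INPUT_DIR"],
        "--chunking-strategy", "by_title",
        "--embedding-provider", "huggingface",
        "--partition-by-api",
        "--api-key", env["UNSTRUCTURED_API_KEY"],
        "--partition-endpoint", env["UNSTRUCTURED_API_URL"],
        f"--additional-partition-args={partition_args}",
        "qdrant-cloud",
        "--url", env["QDRANT_URL"],
        "--api-key", env["QDRANT_API_KEY"],
        "--collection-name", env["QDRANT_COLLECTION"],
        "--batch-size", "50",
        "--num-processes", "1",
    ]


# To run it for real (requires the CLI to be installed):
#   import subprocess
#   subprocess.run(build_ingest_command(os.environ), check=True)
# To preview the command as a shell string:
#   print(shlex.join(build_ingest_command(os.environ)))
```

Passing a list to subprocess.run bypasses the shell, so the JSON string reaches the CLI as a single argument without backslash escapes.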
import os
from unstructured_ingest.pipeline.pipeline import Pipeline
from unstructured_ingest.interfaces import ProcessorConfig
from unstructured_ingest.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig
)
from unstructured_ingest.processes.partitioner import PartitionerConfig
from unstructured_ingest.processes.chunker import ChunkerConfig
from unstructured_ingest.processes.embedder import EmbedderConfig
from unstructured_ingest.processes.connectors.qdrant.cloud import (
    CloudQdrantConnectionConfig,
    CloudQdrantAccessConfig,
    CloudQdrantUploadStagerConfig,
    CloudQdrantUploaderConfig
)
if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        # Source: read files from a local directory.
        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        # Partition the documents through the Unstructured API.
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            additional_partition_args={
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
        # Chunk by title, then embed each chunk.
        chunker_config=ChunkerConfig(chunking_strategy="by_title"),
        embedder_config=EmbedderConfig(embedding_provider="huggingface"),
        # Destination: a Qdrant Cloud collection.
        destination_connection_config=CloudQdrantConnectionConfig(
            access_config=CloudQdrantAccessConfig(
                api_key=os.getenv("QDRANT_API_KEY")
            ),
            url=os.getenv("QDRANT_URL")
        ),
        stager_config=CloudQdrantUploadStagerConfig(),
        uploader_config=CloudQdrantUploaderConfig(
            collection_name=os.getenv("QDRANT_COLLECTION"),
            batch_size=50,
            num_processes=1
        )
    ).run()
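The script above reads all of its settings with os.getenv, which returns None for an unset variable instead of raising an error, so a missing value may only surface later inside the pipeline. One possible guard, sketched below, checks the variables up front. The names missing_env_vars and require_env are hypothetical helpers, not part of unstructured-ingest.

```python
import os

# Environment variables read by the pipeline script above.
REQUIRED_ENV_VARS = (
    "LOCAL_FILE_INPUT_DIR",
    "UNSTRUCTURED_API_KEY",
    "UNSTRUCTURED_API_URL",
    "QDRANT_API_KEY",
    "QDRANT_URL",
    "QDRANT_COLLECTION",
)


def missing_env_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


def require_env(env=None):
    """Raise RuntimeError listing every missing variable, if any."""
    missing = missing_env_vars(env)
    if missing:
        raise RuntimeError(
            "Set these environment variables first: " + ", ".join(missing)
        )
```

Calling require_env() as the first statement under the __main__ guard would stop the run before any file is read or any API call is made.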