Private RAG Information Extraction Engine
Time: 90 min | Level: Advanced
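Handling private documents is a common task in many industries. Enterprises of all kinds hold large volumes of unstructured data, stored as large files, that must be processed and analyzed. Industry reports, financial analyses, legal documents, and many other documents are stored in PDF, Word, and other formats. A conversational chatbot built on top of a RAG pipeline is one viable way to find relevant answers in such documents. However, if we want to extract structured information from the documents and pass it to downstream systems, we need a different approach.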
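Information extraction is the process of structuring unstructured data into a format that machines can process easily. In this tutorial, we will show you how to use DSPy to perform it on a set of documents. Assuming we cannot send our data to an external service, we will use Ollama to run our own LLM on infrastructure we control, with Vultr as the cloud provider. In this setup, Qdrant serves as the knowledge base that supplies the document chunks relevant to a given query, and it is also deployed in Hybrid Cloud mode on Vultr. The last missing piece, the DSPy application, runs in the same environment. If you work in a regulated industry, or simply need to keep your data private, this tutorial is for you.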
Deploying Qdrant Hybrid Cloud on Vultr
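All the services used in this tutorial run on Vultr Kubernetes Engine, which gives us a lot of flexibility in scaling and managing resources. Vultr manages the control plane and worker nodes and provides integration with other managed services such as load balancers, block storage, and DNS.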
- To get started with managed Kubernetes on Vultr, follow the platform-specific documentation.
- Once your Kubernetes cluster is up, you can begin deploying Qdrant Hybrid Cloud.
Installing the necessary packages
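We will need a couple of Python packages to run our application. The dspy-ai package provides the framework, and the dspy-qdrant package adds the Qdrant integration. Both can be installed with a single command: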
pip install dspy-ai dspy-qdrant
Qdrant Hybrid Cloud
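Our documentation contains a detailed guide on setting up Qdrant in Hybrid Cloud mode on Vultr. Follow it carefully to get your Qdrant instance up and running. Once that is done, store the Qdrant URL and API key in environment variables, either by exporting them in the shell or by setting them from Python: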
export QDRANT_URL="https://qdrant.example.com"
export QDRANT_API_KEY="your-api-key"
import os
os.environ["QDRANT_URL"] = "https://qdrant.example.com"
os.environ["QDRANT_API_KEY"] = "your-api-key"
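Every later step reads these variables, so it can help to fail early when one of them is missing. The following is a minimal sketch using only the standard library. The helper names require_env and load_qdrant_settings are hypothetical and are not part of DSPy or qdrant-client.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def load_qdrant_settings() -> dict:
    # QDRANT_URL and QDRANT_API_KEY are the variable names used in this tutorial.
    # The function itself is an illustrative helper, not a library API.
    return {
        "url": require_env("QDRANT_URL"),
        "api_key": require_env("QDRANT_API_KEY"),
    }
```

Calling load_qdrant_settings() once at startup surfaces a missing variable immediately, instead of as a connection error deep inside the pipeline.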
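DSPy is the framework we will use. It already integrates with Qdrant, but it assumes that you use FastEmbed to create the embeddings. DSPy does not provide a way to index data and leaves this task to the user. We will create a collection ourselves and fill it with the embeddings of our document chunks.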
Data indexing
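FastEmbed uses BAAI/bge-small-en as its default embedding model, and we will use it too. Calling the .add method on an existing QdrantClient instance creates our collection automatically. We will not focus much on document parsing in this tutorial, as there are plenty of tools that can help with it. The unstructured library is one option you can deploy on your own infrastructure. In our simplified example, the documents are a list of strings describing fictional technical events. Each description should contain the event name along with its location, start date, and end date.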
documents = [
"Taking place in San Francisco, USA, from the 10th to the 12th of June, 2024, the Global Developers Conference is the annual gathering spot for developers worldwide, offering insights into software engineering, web development, and mobile applications.",
"The AI Innovations Summit, scheduled for 15-17 September 2024 in London, UK, aims at professionals and researchers advancing artificial intelligence and machine learning.",
"Berlin, Germany will host the CyberSecurity World Conference between November 5th and 7th, 2024, serving as a key forum for cybersecurity professionals to exchange strategies and research on threat detection and mitigation.",
"Data Science Connect in New York City, USA, occurring from August 22nd to 24th, 2024, connects data scientists, analysts, and engineers to discuss data science's innovative methodologies, tools, and applications.",
"Set for July 14-16, 2024, in Tokyo, Japan, the Frontend Developers Fest invites developers to delve into the future of UI/UX design, web performance, and modern JavaScript frameworks.",
"The Blockchain Expo Global, happening May 20-22, 2024, in Dubai, UAE, focuses on blockchain technology's applications, opportunities, and challenges for entrepreneurs, developers, and investors.",
"Singapore's Cloud Computing Summit, scheduled for October 3-5, 2024, is where IT professionals and cloud experts will convene to discuss strategies, architectures, and cloud solutions.",
"The IoT World Forum, taking place in Barcelona, Spain from December 1st to 3rd, 2024, is the premier conference for those focused on the Internet of Things, from smart cities to IoT security.",
"Los Angeles, USA, will become the hub for game developers, designers, and enthusiasts at the Game Developers Arcade, running from April 18th to 20th, 2024, to showcase new games and discuss development tools.",
"The TechWomen Summit in Sydney, Australia, from March 8-10, 2024, aims to empower women in tech with workshops, keynotes, and networking opportunities.",
"Seoul, South Korea's Mobile Tech Conference, happening from September 29th to October 1st, 2024, will explore the future of mobile technology, including 5G networks and app development trends.",
"The Open Source Summit, to be held in Helsinki, Finland from August 11th to 13th, 2024, celebrates open source technologies and communities, offering insights into the latest software and collaboration techniques.",
"Vancouver, Canada will play host to the VR/AR Innovation Conference from June 20th to 22nd, 2024, focusing on the latest in virtual and augmented reality technologies.",
"Scheduled for May 5-7, 2024, in London, UK, the Fintech Leaders Forum brings together experts to discuss the future of finance, including innovations in blockchain, digital currencies, and payment technologies.",
"The Digital Marketing Summit, set for April 25-27, 2024, in New York City, USA, is designed for marketing professionals and strategists to discuss digital marketing and social media trends.",
"EcoTech Symposium in Paris, France, unfolds over 2024-10-09 to 2024-10-11, spotlighting sustainable technologies and green innovations for environmental scientists, tech entrepreneurs, and policy makers.",
"Set in Tokyo, Japan, from 16th to 18th May '24, the Robotic Innovations Conference showcases automation, robotics, and AI-driven solutions, appealing to enthusiasts and engineers.",
"The Software Architecture World Forum in Dublin, Ireland, occurring 22-24 Sept 2024, gathers software architects and IT managers to discuss modern architecture patterns.",
"Quantum Computing Summit, convening in Silicon Valley, USA from 2024/11/12 to 2024/11/14, is a rendezvous for exploring quantum computing advancements with physicists and technologists.",
"From March 3 to 5, 2024, the Global EdTech Conference in London, UK, discusses the intersection of education and technology, featuring e-learning and digital classrooms.",
"Bangalore, India's NextGen DevOps Days, from 28 to 30 August 2024, is a hotspot for IT professionals keen on the latest DevOps tools and innovations.",
"The UX/UI Design Conference, slated for April 21-23, 2024, in New York City, USA, invites discussions on the latest in user experience and interface design among designers and developers.",
"Big Data Analytics Summit, taking place 2024 July 10-12 in Amsterdam, Netherlands, brings together data professionals to delve into big data analysis and insights.",
"Toronto, Canada, will see the HealthTech Innovation Forum from June 8 to 10, '24, focusing on technology's impact on healthcare with professionals and innovators.",
"Blockchain for Business Summit, happening in Singapore from 2024-05-02 to 2024-05-04, focuses on blockchain's business applications, from finance to supply chain.",
"Las Vegas, USA hosts the Global Gaming Expo from October 18th to 20th, 2024, a premiere event for game developers, publishers, and enthusiasts.",
"The Renewable Energy Tech Conference in Copenhagen, Denmark, from 2024/09/05 to 2024/09/07, discusses renewable energy innovations and policies.",
"Set for 2024 Apr 9-11 in Boston, USA, the Artificial Intelligence in Healthcare Summit gathers healthcare professionals to discuss AI's healthcare applications.",
"Nordic Software Engineers Conference, happening in Stockholm, Sweden from June 15 to 17, 2024, focuses on software development in the Nordic region.",
"The International Space Exploration Symposium, scheduled in Houston, USA from 2024-08-05 to 2024-08-07, invites discussions on space exploration technologies and missions."
]
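We will be able to ask general questions, for example about events on a topic we care about or in a particular location, while expecting the results to come back in a structured format.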
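Once the documents are defined, indexing them in Qdrant takes a single call. The snippet assumes that client is a QdrantClient instance connected to your cluster: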
client.add(
    collection_name="document-parts",
    documents=documents,
    metadata=[{"document": document} for document in documents],
)
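Running the indexing call twice would store every document twice unless the point IDs are stable. One possible way to avoid that, sketched below under assumptions, is to derive a deterministic UUID from each document's text and pass the list as ids. The helpers stable_point_id and build_index_payload are hypothetical, and the sketch assumes that the add method of your installed qdrant-client version accepts an ids argument.

```python
import uuid


def stable_point_id(text: str) -> str:
    """Derive a deterministic UUID string from the document text."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, text))


def build_index_payload(documents: list[str]) -> dict:
    # Drop exact duplicates while keeping the original order.
    unique_docs = list(dict.fromkeys(documents))
    return {
        "documents": unique_docs,
        "metadata": [{"document": doc} for doc in unique_docs],
        "ids": [stable_point_id(doc) for doc in unique_docs],
    }


payload = build_index_payload(
    ["Event A in Paris", "Event B in Rome", "Event A in Paris"]
)
# With a connected client (assumption: add() accepts ids in your version):
# client.add(collection_name="document-parts", **payload)
```

Because the same text always maps to the same ID, re-indexing overwrites existing points rather than appending new ones.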
Our collection is ready to be queried. We can now move on to the next step, setting up the Ollama model.
Deploying Ollama on Vultr
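Ollama is a tool for running LLMs on your own infrastructure. It is designed to be lightweight and easy to use, and an official Docker image is available. We can use that image to run Ollama on our Vultr Kubernetes cluster. LLMs may have special requirements such as a GPU, and Vultr offers Vultr Kubernetes Engine for Cloud GPU so that the model can run on a dedicated machine. Refer to the official documentation to get Ollama up and running in your environment. Once that is done, store the Ollama URL in an environment variable, from the shell or from Python: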
export OLLAMA_URL="https://ollama.example.com"
os.environ["OLLAMA_URL"] = "https://ollama.example.com"
We will refer to this URL later on, when configuring the Ollama model in our application.
Setting up the Large Language Model
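We are going to use one of the lightweight LLMs available in Ollama, the gemma:2b model. It was developed by the Google DeepMind team and has roughly 2.5 billion parameters (sometimes rounded up to 3B). The Ollama version uses 4-bit quantization. To install the model, run the following command on the machine running Ollama: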
ollama run gemma:2b
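Before wiring the model into the application, it can be useful to confirm that the server is reachable and that the model has actually been pulled. The sketch below queries Ollama's REST endpoint for listing local models (GET /api/tags) with the standard library only. The function names are hypothetical, and the response shape noted in the comment is an assumption to verify against your Ollama version.

```python
import json
import urllib.request


def model_names(payload: dict) -> list[str]:
    # Expected response shape (assumption):
    # {"models": [{"name": "gemma:2b", ...}, ...]}
    return [entry["name"] for entry in payload.get("models", [])]


def list_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Ask the Ollama server which models are available locally."""
    url = base_url.rstrip("/") + "/api/tags"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return model_names(json.load(response))


def has_model(names: list[str], wanted: str) -> bool:
    return wanted in names
```

A typical check would be has_model(list_models(os.environ["OLLAMA_URL"]), "gemma:2b"), run once at startup.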
Ollama models are also integrated with DSPy, so we can use them directly in the application.
Implementing the information extraction pipeline
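DSPy differs a bit from other LLM frameworks. It is designed to optimize the prompts and weights of the LMs in a pipeline. It works somewhat like a compiler for LMs: you write the pipeline in a high-level language, and DSPy generates the prompts and weights for you. This means you can build complex systems without worrying about the details of how to prompt the LM, because DSPy handles that part. In that sense it resembles PyTorch, applied to LLMs.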
First of all, we define the language model we are going to use:
import dspy
gemma_model = dspy.OllamaLocal(
    model="gemma:2b",
    base_url=os.environ.get("OLLAMA_URL"),
    max_tokens=500,
)
Similarly, we have to define the connection to our Qdrant Hybrid Cloud cluster:
from dspy_qdrant import QdrantRM
from qdrant_client import QdrantClient, models

client = QdrantClient(
    os.environ.get("QDRANT_URL"),
    api_key=os.environ.get("QDRANT_API_KEY"),
)
qdrant_retriever = QdrantRM(
    qdrant_collection_name="document-parts",
    qdrant_client=client,
)
Finally, both components are registered with DSPy in a single call:
dspy.configure(lm=gemma_model, rm=qdrant_retriever)
Application logic
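DSPy has the concept of a signature, which defines the input and output format of a pipeline step. We will define a simple signature for an event: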
class Event(dspy.Signature):
    description = dspy.InputField(
        desc="Textual description of the event, including name, location and dates"
    )
    event_name = dspy.OutputField(desc="Name of the event")
    location = dspy.OutputField(desc="Location of the event")
    start_date = dspy.OutputField(desc="Start date of the event, YYYY-MM-DD")
    end_date = dspy.OutputField(desc="End date of the event, YYYY-MM-DD")
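The signature is meant to extract structured information from a textual description of an event. Now we can build a module that uses it together with Qdrant and the Ollama model. Let's call it EventExtractor: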
class EventExtractor(dspy.Module):
    def __init__(self):
        super().__init__()
        # Retrieve module to get relevant documents
        self.retriever = dspy.Retrieve(k=3)
        # Predict module for the created signature
        self.predict = dspy.Predict(Event)

    def forward(self, query: str):
        # Retrieve the most relevant documents
        results = self.retriever.forward(query)
        # Try to extract events from the retrieved documents
        events = []
        for document in results.passages:
            event = self.predict(description=document)
            events.append(event)
        return events
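The logic is simple: we retrieve the most relevant documents from Qdrant, and then try to extract structured information from each of them using the Event signature. We can call the module directly and inspect the results: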
extractor = EventExtractor()
extractor.forward("Blockchain events close to Europe")
Output:
[
    Prediction(
        event_name='Event Name: Blockchain Expo Global',
        location='Dubai, UAE',
        start_date='2024-05-20',
        end_date='2024-05-22'
    ),
    Prediction(
        event_name='Event Name: Blockchain for Business Summit',
        location='Singapore',
        start_date='2024-05-02',
        end_date='2024-05-04'
    ),
    Prediction(
        event_name='Event Name: Open Source Summit',
        location='Helsinki, Finland',
        start_date='2024-08-11',
        end_date='2024-08-13'
    )
]
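The task was solved successfully even without any optimization. However, each event name carries an "Event Name: " prefix that we might want to remove. DSPy allows modules to be optimized, which can improve the results. Optimization can be done in several ways and is covered in detail in the DSPy documentation.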
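We will not go through the optimization process in this tutorial. Still, we encourage you to experiment with it, as it may significantly improve the performance of your pipeline.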
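Short of optimizing the module, the echoed label can also be removed in a post-processing step before the records reach a downstream system. The sketch below is one possible approach, not part of DSPy: EventRecord, strip_label, parse_iso_date, and to_record are hypothetical names, and a SimpleNamespace stands in for a prediction object so that the example runs without a model.

```python
from dataclasses import dataclass
from datetime import date
from types import SimpleNamespace
from typing import Optional


@dataclass
class EventRecord:
    event_name: str
    location: str
    start_date: Optional[date]
    end_date: Optional[date]


def strip_label(value: str, label: str) -> str:
    """Remove a leading 'Label:' echo such as 'Event Name: ' if present."""
    prefix = f"{label}:"
    text = value.strip()
    if text.lower().startswith(prefix.lower()):
        text = text[len(prefix):].strip()
    return text


def parse_iso_date(value: str) -> Optional[date]:
    # Return None instead of raising when the model did not produce YYYY-MM-DD.
    try:
        return date.fromisoformat(value.strip())
    except ValueError:
        return None


def to_record(prediction) -> EventRecord:
    # Works with any object exposing the four output fields as attributes.
    return EventRecord(
        event_name=strip_label(prediction.event_name, "Event Name"),
        location=strip_label(prediction.location, "Location"),
        start_date=parse_iso_date(prediction.start_date),
        end_date=parse_iso_date(prediction.end_date),
    )


# SimpleNamespace is a stand-in for a prediction returned by the extractor.
raw = SimpleNamespace(
    event_name="Event Name: Blockchain Expo Global",
    location="Dubai, UAE",
    start_date="2024-05-20",
    end_date="2024-05-22",
)
record = to_record(raw)
```

Parsing the dates also acts as a cheap validation step: a None value flags a prediction that ignored the requested YYYY-MM-DD format.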
The created module can easily be stored under a given path and loaded later:
extractor.save("event_extractor")
To load it, create an instance of the module and call its load method:
second_extractor = EventExtractor()
second_extractor.load("event_extractor")
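This is especially useful once you optimize the module, since the optimized version can be stored and loaded later without repeating the optimization process every time the application runs.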
Deploying the extraction pipeline
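Vultr gives us a lot of flexibility when it comes to deploying applications. Ideally, we would run the pipeline on the Kubernetes cluster we set up earlier. Deploying it is as simple as running any other Python application. This time we do not need a GPU, because Ollama already runs on a separate machine and DSPy only has to interact with it.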
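To make the extractor reachable by other services in the cluster, it can be wrapped in a small HTTP endpoint. The following is a minimal sketch built on the standard library's http.server, not a production setup. The names stub_extractor, events_to_json, make_handler, and serve are hypothetical, as are the q query parameter and port 8000. In a real deployment the stub would be replaced by an EventExtractor instance.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from types import SimpleNamespace
from urllib.parse import parse_qs, urlparse

FIELDS = ("event_name", "location", "start_date", "end_date")


def stub_extractor(query: str):
    # Hypothetical stand-in for EventExtractor().forward(query).
    return [
        SimpleNamespace(
            event_name="Demo Event",
            location="Paris, France",
            start_date="2024-10-09",
            end_date="2024-10-11",
        )
    ]


def events_to_json(events) -> str:
    """Serialize prediction-like objects to a JSON array of plain dicts."""
    return json.dumps(
        [{name: getattr(event, name, None) for name in FIELDS} for event in events]
    )


def make_handler(extractor):
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Read the search query from the ?q=... parameter.
            query = parse_qs(urlparse(self.path).query).get("q", [""])[0]
            if not query:
                self.send_error(400, "Missing 'q' parameter")
                return
            body = events_to_json(extractor(query)).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return Handler


def serve(extractor, port: int = 8000):
    # Blocks forever; call this from the container's entry point.
    HTTPServer(("0.0.0.0", port), make_handler(extractor)).serve_forever()
```

Calling serve(stub_extractor) starts the server, after which a request such as GET /?q=blockchain returns the extracted events as JSON.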
Summary
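In this tutorial, we showed how to set up a private information extraction environment using DSPy, Ollama, and Qdrant. All the components can be hosted securely on the Vultr cloud, giving you full control over your data.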