Automating Filtering with LLMs

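Our complete guide to filtering in vector search explains why filtering matters and how to implement it with Qdrant. Applying filters is easier, though, when your application has a traditional interface. Your UI may contain a form with checkboxes, sliders, and other elements that users can use to set their criteria. But what if you want to build a RAG application with only a conversational interface, or even voice commands? In that case, you need to automate the filtering process!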

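Large language models (LLMs) appear to be particularly good at this task. They can understand natural language and generate structured output based on it. In this tutorial, we will show you how to use LLMs to automate filtering in your vector search application.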

A few notes on Qdrant filters

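The Qdrant Python SDK defines its models using Pydantic. This library is the de facto standard for data validation and serialization in Python. It lets you define the structure of your data using Python type hints. For example, our Filter model is defined as follows: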

class Filter(BaseModel, extra="forbid"):
    should: Optional[Union[List["Condition"], "Condition"]] = Field(
        default=None, description="At least one of those conditions should match"
    )
    min_should: Optional["MinShould"] = Field(
        default=None, description="At least minimum amount of given conditions should match"
    )
    must: Optional[Union[List["Condition"], "Condition"]] = Field(default=None, description="All conditions must match")
    must_not: Optional[Union[List["Condition"], "Condition"]] = Field(
        default=None, description="All conditions must NOT match"
    )

Qdrant filters can be nested, so you can express even very complex conditions using the must, should, and must_not notation.
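To make the nesting concrete, here is a minimal sketch of a nested filter written as a plain dictionary in the JSON form that Qdrant accepts. The field names (color, price, fabric, city.name) and the values are illustrative assumptions, not part of any real collection.

```python
import json

# A nested filter in JSON form. All field names and values below are
# illustrative assumptions.
nested_filter = {
    "must": [
        # Every point has to be red...
        {"key": "color", "match": {"value": "red"}},
        {
            # ...and satisfy a nested filter used as a single condition:
            # at least one of the inner conditions has to match.
            "should": [
                {"key": "price", "range": {"lte": 20.0}},
                {"key": "fabric", "match": {"value": "cotton"}},
            ]
        },
    ],
    "must_not": [
        # Points from this city are excluded.
        {"key": "city.name", "match": {"value": "London"}},
    ],
}

print(json.dumps(nested_filter, indent=2))
```

Read as a boolean expression, this is: color is red AND (price <= 20 OR fabric is cotton) AND NOT (city.name is London).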

Structured output from LLMs

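It is not uncommon to use LLMs to generate structured output. This is especially useful when the output is meant to be processed further by another application. For example, you can use an LLM to generate SQL queries, JSON objects, and, most importantly, Qdrant filters. Pydantic is widely adopted in the LLM ecosystem, so plenty of libraries use Pydantic models to define the structure of a language model's output.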

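One interesting project in this area is Instructor, which lets you work with different LLM providers and constrain their output to a specific structure. Let's install the library and pick the provider we will use in this tutorial: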

pip install "instructor[anthropic]"

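Anthropic is not the only option, as Instructor supports many other providers, including OpenAI, Ollama, Llama, Gemini, Vertex AI, Groq, LiteLLM, and others. You can choose the one that best fits your needs, or the one you already use in your RAG pipeline.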

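Using Instructor to generate Qdrant filters

Instructor provides helper methods that wrap LLM APIs, so you can interact with them as if you were using their regular SDKs. For Anthropic, you just pass an instance of the Anthropic class to the from_anthropic function: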

import instructor
from anthropic import Anthropic

anthropic_client = instructor.from_anthropic(
    client=Anthropic(
        api_key="YOUR_API_KEY",
    )
)

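The wrapped client slightly modifies the original API, so you can pass a response_model parameter to the .messages.create method. This parameter should be a Pydantic model that defines the structure of the output. For Qdrant filters, that is the Filter model: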

from qdrant_client import models

qdrant_filter = anthropic_client.messages.create(
    model="claude-3-5-sonnet-latest",
    response_model=models.Filter,
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "red T-shirt"
        }
    ],
)

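The output of this code is a Pydantic model that represents a Qdrant filter. Surprisingly, no additional instructions were needed for the model to work out that the user wants to filter by the color and the type of the product. The output looks like this: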

Filter(
    should=None, 
    min_should=None, 
    must=[
        FieldCondition(
            key="color", 
            match=MatchValue(value="red"), 
            range=None, 
            geo_bounding_box=None, 
            geo_radius=None, 
            geo_polygon=None, 
            values_count=None
        ), 
        FieldCondition(
            key="type", 
            match=MatchValue(value="t-shirt"), 
            range=None, 
            geo_bounding_box=None, 
            geo_radius=None, 
            geo_polygon=None, 
            values_count=None
        )
    ], 
    must_not=None
)

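Obviously, giving the model complete freedom to generate the filter may lead to unexpected results, or to no results at all. Your collection probably has payloads with a specific structure, so filtering on anything else makes no sense. In addition, filtering only on indexed fields is considered good practice. That is why it makes sense to determine the indexed fields automatically and restrict the output to them.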
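One way to enforce such a restriction after generation is to collect every field name the filter refers to and compare the result with the list of indexed fields. The sketch below is a simplified assumption: it works on the filter as a plain dictionary (for instance, what a Pydantic v2 model would return from model_dump(exclude_none=True)), handles only must, should, and must_not, and ignores other condition kinds. The helper names collect_keys and unknown_keys and the indexed_fields list are hypothetical.

```python
from typing import Any, Dict, Iterable, Set

CLAUSES = ("must", "should", "must_not")


def collect_keys(filter_dict: Dict[str, Any]) -> Set[str]:
    """Return every payload key referenced by a filter given as a plain dict."""
    keys: Set[str] = set()
    for clause in CLAUSES:
        conditions = filter_dict.get(clause) or []
        # A clause may hold a single condition instead of a list.
        if isinstance(conditions, dict):
            conditions = [conditions]
        for condition in conditions:
            if "key" in condition:
                keys.add(condition["key"])
            else:
                # No "key": treat it as a nested filter and recurse.
                keys |= collect_keys(condition)
    return keys


def unknown_keys(filter_dict: Dict[str, Any], allowed: Iterable[str]) -> Set[str]:
    """Return the keys used by the filter that are not in the allowed set."""
    return collect_keys(filter_dict) - set(allowed)


# The filter generated for "red T-shirt", reduced to its non-empty parts.
generated = {
    "must": [
        {"key": "color", "match": {"value": "red"}},
        {"key": "type", "match": {"value": "t-shirt"}},
    ]
}
# Hypothetical list of indexed fields.
indexed_fields = ["color", "fabric", "price", "city.name", "city.location"]

print(unknown_keys(generated, indexed_fields))
```

A non-empty result means the model invented a field, so the application can reject the filter or retry the request.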

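Restricting the available fields

Qdrant collection info contains the list of indexes created on a particular collection. You can use this information to determine automatically which fields can be used for filtering. Here is how you can do it: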

from qdrant_client import QdrantClient

client = QdrantClient("https://:6333")
collection_info = client.get_collection(collection_name="test_filter")
indexes = collection_info.payload_schema
print(indexes)

Output:

{
    "city.location": PayloadIndexInfo(
        data_type=PayloadSchemaType.GEO,
        ...
    ),
    "city.name": PayloadIndexInfo(
        data_type=PayloadSchemaType.KEYWORD,
        ...
    ),
    "color": PayloadIndexInfo(
        data_type=PayloadSchemaType.KEYWORD,
        ...
    ),
    "fabric": PayloadIndexInfo(
        data_type=PayloadSchemaType.KEYWORD,
        ...
    ),
    "price": PayloadIndexInfo(
        data_type=PayloadSchemaType.FLOAT,
        ...
    ),
}

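Our LLM should know the names of the fields it can use, and also their types. For example, range filtering only makes sense for numerical fields, and geo filtering on non-geo fields will not yield anything meaningful. You can pass this information to the LLM as part of the prompt, so let's encode it as a string: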

formatted_indexes = "\n".join([
    f"- {index_name} - {index.data_type.name}"
    for index_name, index in indexes.items()
])
print(formatted_indexes)

Output:

- fabric - KEYWORD
- city.name - KEYWORD
- color - KEYWORD
- price - FLOAT
- city.location - GEO
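The formatted list only needs to be rebuilt when the indexes change, so it does not have to be fetched from Qdrant for every user query. Below is a minimal sketch of a time-based cache. The CachedValue class, the five-minute lifetime, and fake_loader are illustrative assumptions. In a real application the loader would call get_collection and format the result as shown above.

```python
import time
from typing import Callable, List, Optional


class CachedValue:
    """Cache the string returned by `loader` for `ttl_seconds`."""

    def __init__(self, loader: Callable[[], str], ttl_seconds: float = 300.0):
        self._loader = loader
        self._ttl = ttl_seconds
        self._value: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = time.monotonic()
        # Reload on first use or once the cached value has expired.
        if self._value is None or now >= self._expires_at:
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value


calls: List[int] = []


def fake_loader() -> str:
    # Stand-in for the Qdrant call; records how often it is invoked.
    calls.append(1)
    return "- color - KEYWORD\n- price - FLOAT"


cached_indexes = CachedValue(fake_loader, ttl_seconds=300.0)
first = cached_indexes.get()
second = cached_indexes.get()
print(len(calls))  # the loader ran only once
```

A short lifetime keeps the prompt reasonably fresh after an index is added or removed, at the cost of one extra collection lookup per interval.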

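It is a good idea to cache the list of available fields and their types, as they are not supposed to change often. Our interaction with the LLM should now look slightly different: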

qdrant_filter = anthropic_client.messages.create(
    model="claude-3-5-sonnet-latest",
    response_model=models.Filter,
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": (
                "<query>color is red</query>"
                f"<indexes>\n{formatted_indexes}\n</indexes>"
            )
        }
    ],
)

Output:

Filter(
    should=None, 
    min_should=None, 
    must=FieldCondition(
        key="color", 
        match=MatchValue(value="red"), 
        range=None, 
        geo_bounding_box=None, 
        geo_radius=None, 
        geo_polygon=None, 
        values_count=None
    ), 
    must_not=None
)

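The query, restricted to the available fields, now produces better conditions, because the model no longer tries to filter on fields that do not exist in the collection.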
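Since each index has a type, a generated condition can also be checked against it before the filter is used. The sketch below assumes conditions are available as plain dictionaries and uses a deliberately simplified mapping from index type to condition kind. The ALLOWED_KINDS table and the check_condition helper are hypothetical, and values_count is not checked.

```python
from typing import Any, Dict, List, Set

# Condition kinds that make sense for each index type (simplified assumption).
ALLOWED_KINDS: Dict[str, Set[str]] = {
    "KEYWORD": {"match"},
    "INTEGER": {"match", "range"},
    "FLOAT": {"range"},
    "GEO": {"geo_bounding_box", "geo_radius", "geo_polygon"},
}
CHECKED_KINDS = ("match", "range", "geo_bounding_box", "geo_radius", "geo_polygon")


def check_condition(condition: Dict[str, Any], index_types: Dict[str, str]) -> List[str]:
    """Return a list of problems found in one field condition (empty if none)."""
    key = condition.get("key")
    if key not in index_types:
        return [f"field '{key}' is not indexed"]
    index_type = index_types[key]
    allowed = ALLOWED_KINDS.get(index_type, set())
    problems = []
    for kind in CHECKED_KINDS:
        if condition.get(kind) is not None and kind not in allowed:
            problems.append(f"'{kind}' is not suitable for {index_type} field '{key}'")
    return problems


# Index types as they appear in the formatted list passed to the LLM.
index_types = {"color": "KEYWORD", "price": "FLOAT", "city.location": "GEO"}

print(check_condition({"key": "price", "range": {"lte": 100.0}}, index_types))
print(check_condition({"key": "color", "range": {"lte": 100.0}}, index_types))
print(check_condition({"key": "type", "match": {"value": "t-shirt"}}, index_types))
```

The first call returns an empty list, while the other two report a range condition on a keyword field and a field that is not indexed.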

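Testing the LLM output

Although LLMs are quite powerful, they are not perfect. If you plan to automate filtering, it makes sense to run some tests and see how well they perform, especially on edge cases such as queries that cannot be expressed as filters. Let's see how the LLM handles the following query: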

qdrant_filter = anthropic_client.messages.create(
    model="claude-3-5-sonnet-latest",
    response_model=models.Filter,
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": (
                "<query>fruit salad with no more than 100 calories</query>"
                f"<indexes>\n{formatted_indexes}\n</indexes>"
            )
        }
    ],
)

Output:

Filter(
    should=None, 
    min_should=None, 
    must=FieldCondition(
        key="price", 
        match=None, 
        range=Range(lt=None, gt=None, gte=None, lte=100.0), 
        geo_bounding_box=None, 
        geo_radius=None, 
        geo_polygon=None, 
        values_count=None
    ), 
    must_not=None
)

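Surprisingly, the LLM took the calorie figure from the query and generated a filter on the price field. It appears to extract any numerical information from the query and try to match it to one of the available fields.

In general, giving the model more guidance on how to interpret the query may lead to better results. Adding a system prompt that defines the rules for interpreting the query can help the model do a better job. Here is how you can do it: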

SYSTEM_PROMPT = """
You are extracting filters from a text query. Please follow the following rules:
1. Query is provided in the form of a text enclosed in <query> tags.
2. Available indexes are put at the end of the text in the form of a list enclosed in <indexes> tags.
3. You cannot use any field that is not available in the indexes.
4. Generate a filter only if you are certain that user's intent matches the field name.
5. Prices are always in USD.
6. It's better not to generate a filter than to generate an incorrect one.
"""

qdrant_filter = anthropic_client.messages.create(
    model="claude-3-5-sonnet-latest",
    response_model=models.Filter,
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": SYSTEM_PROMPT.strip(),
        },
        {
            "role": "assistant",
            "content": "Okay, I will follow all the rules."
        },
        {
            "role": "user",
            "content": (
                "<query>fruit salad with no more than 100 calories</query>"
                f"<indexes>\n{formatted_indexes}\n</indexes>"
            )
        }
    ],
)

The output is now:

Filter(
    should=None, 
    min_should=None, 
    must=None, 
    must_not=None
)
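A filter with no conditions, like the one above, is a valid outcome that the application should handle explicitly, for example by running the vector search without any filter. A minimal sketch, assuming the filter is available as a plain dictionary, follows. The non_empty_or_none helper is hypothetical.

```python
from typing import Any, Dict, Optional

CLAUSES = ("must", "should", "must_not", "min_should")


def non_empty_or_none(filter_dict: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the filter if it contains at least one condition, else None."""
    if not filter_dict:
        return None
    for clause in CLAUSES:
        # None, an empty list, and an empty dict all count as "no conditions".
        if filter_dict.get(clause):
            return filter_dict
    return None


empty = {"should": None, "min_should": None, "must": None, "must_not": None}
red_only = {"must": [{"key": "color", "match": {"value": "red"}}]}

print(non_empty_or_none(empty))
print(non_empty_or_none(red_only))
```

Returning None for an empty filter gives the calling code a single value to check before deciding whether to attach a filter to the search request.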

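Handling complex queries

We have created several indexes on the collection, so it is interesting to see how the LLM handles more complex queries. For example, let's see how it handles the following query: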

qdrant_filter = anthropic_client.messages.create(
    model="claude-3-5-sonnet-latest",
    response_model=models.Filter,
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": SYSTEM_PROMPT.strip(),
        },
        {
            "role": "assistant",
            "content": "Okay, I will follow all the rules."
        },
        {
            "role": "user",
            "content": (
                "<query>"
                "white T-shirt available no more than 30 miles from London, "
                "but not in the city itself, below $15.70, not made from polyester"
                "</query>\n"
                "<indexes>\n"
                f"{formatted_indexes}\n"
                "</indexes>"
            )
        },
    ],
)

It might be surprising, but Anthropic Claude is able to generate even a filter as complex as this one. The output is:

Filter(
    should=None, 
    min_should=None, 
    must=[
        FieldCondition(
            key="color", 
            match=MatchValue(value="white"), 
            range=None, 
            geo_bounding_box=None, 
            geo_radius=None, 
            geo_polygon=None, 
            values_count=None
        ), 
        FieldCondition(
            key="city.location", 
            match=None, 
            range=None, 
            geo_bounding_box=None, 
            geo_radius=GeoRadius(
                center=GeoPoint(lon=-0.1276, lat=51.5074), 
                radius=48280.0
            ), 
            geo_polygon=None, 
            values_count=None
        ), 
        FieldCondition(
            key="price", 
            match=None, 
            range=Range(lt=15.7, gt=None, gte=None, lte=None), 
            geo_bounding_box=None,
            geo_radius=None, 
            geo_polygon=None, 
            values_count=None
        )
    ], must_not=[
        FieldCondition(
            key="city.name", 
            match=MatchValue(value="London"), 
            range=None, 
            geo_bounding_box=None, 
            geo_radius=None, 
            geo_polygon=None, 
            values_count=None
        ), 
        FieldCondition(
            key="fabric", 
            match=MatchValue(value="polyester"),
            range=None, 
            geo_bounding_box=None, 
            geo_radius=None,
            geo_polygon=None, 
            values_count=None
        )
    ]
)

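The model even knows the coordinates of London and uses them to generate the geo filter. Relying on the model to generate filters this complex is not the best idea, but it is impressive that it can do so.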
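Values that the model derives on its own, such as the radius in meters and the coordinates, are worth a sanity check before the filter is used. The sketch below verifies the unit conversion from the example above and checks that a point lies within valid longitude and latitude ranges. The helper names and the one-meter tolerance are illustrative assumptions.

```python
METERS_PER_MILE = 1609.344  # international mile


def miles_to_meters(miles: float) -> float:
    """Convert a distance in miles to meters."""
    return miles * METERS_PER_MILE


def is_valid_point(lon: float, lat: float) -> bool:
    """Check that longitude and latitude are within their valid ranges."""
    return -180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0


# Values taken from the generated filter above.
generated_radius = 48280.0
generated_center = {"lon": -0.1276, "lat": 51.5074}

expected_radius = miles_to_meters(30)
# Accept small rounding differences (here: less than one meter).
radius_ok = abs(expected_radius - generated_radius) < 1.0
center_ok = is_valid_point(generated_center["lon"], generated_center["lat"])

print(radius_ok, center_ok)
```

Such checks only catch malformed values. Confirming that the coordinates actually point to London would require a trusted geocoding source.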

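Next steps

A real production system would require more testing and validation of the LLM output. Building a ground-truth dataset of queries and their expected filters is a good idea. You can use such a dataset to evaluate model performance and see how it behaves in different scenarios.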
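A minimal sketch of such an evaluation is shown below. It assumes that both expected and generated filters are available as plain dictionaries without None-valued condition fields, and it uses exact match as the only metric. The helper names and the two sample pairs are illustrative assumptions. Conditions inside nested filters are compared in the order given.

```python
import json
from typing import Any, Dict, List, Tuple

CLAUSES = ("must", "should", "must_not")


def normalize(filter_dict: Dict[str, Any]) -> Dict[str, List[str]]:
    """Canonical form: no empty clauses, single conditions wrapped in a list,
    top-level condition order ignored."""
    result: Dict[str, List[str]] = {}
    for clause in CLAUSES:
        conditions = filter_dict.get(clause)
        if not conditions:
            continue
        if isinstance(conditions, dict):
            conditions = [conditions]
        result[clause] = sorted(json.dumps(c, sort_keys=True) for c in conditions)
    return result


def exact_match_accuracy(pairs: List[Tuple[Dict[str, Any], Dict[str, Any]]]) -> float:
    """Share of (expected, generated) pairs whose normalized forms are equal."""
    if not pairs:
        return 0.0
    hits = sum(normalize(expected) == normalize(generated) for expected, generated in pairs)
    return hits / len(pairs)


pairs = [
    # "color is red": a single condition versus a one-element list still matches.
    (
        {"must": [{"key": "color", "match": {"value": "red"}}]},
        {"must": {"key": "color", "match": {"value": "red"}}, "should": None},
    ),
    # "fruit salad with no more than 100 calories": no filter is expected.
    (
        {},
        {"must": [{"key": "price", "range": {"lte": 100.0}}]},
    ),
]

print(exact_match_accuracy(pairs))
```

Exact match is strict. Depending on the application, a per-condition precision and recall may give a more useful picture of partial successes.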