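Build Your First Semantic Search Engine in 5 Minutes

Time: 5 - 15 min | Level: Beginner

Overview

If you are new to vector databases, this tutorial is for you. In 5 minutes you will build a semantic search engine for science fiction books. Once it is set up, you will ask the engine about an impending alien threat. The engine will recommend books as preparation for a potential space attack.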

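Before you begin, you need a recent version of Python installed. If you don't know how to run this code in a virtual environment, first follow the Creating Virtual Environments section of the Python documentation.

This tutorial assumes you are using the bash shell. Following the Python documentation, activate your virtual environment with a command like this: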

source tutorial-env/bin/activate

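1. Installation

You need to process your data so that the search engine can work with it. The Sentence Transformers framework gives you access to commonly used pretrained language models that turn raw data into embeddings.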

pip install -U sentence-transformers

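Once encoded, this data needs to be stored somewhere. Qdrant lets you store data as embeddings and run search queries against it. This means you can ask the engine for relevant answers that go well beyond keyword matching.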

pip install -U qdrant-client

Import the models

With the two main frameworks installed, you need to specify the exact models this engine will use.

from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer

The Sentence Transformers framework contains many embedding models. This tutorial uses all-MiniLM-L6-v2 because it strikes a good balance between speed and embedding quality.

encoder = SentenceTransformer("all-MiniLM-L6-v2")

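2. Add the dataset

all-MiniLM-L6-v2 will encode the data you provide. Here you will list all the science fiction books in your library. Each book has metadata: a name, an author, a publication year, and a short description.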

documents = [
    {
        "name": "The Time Machine",
        "description": "A man travels through time and witnesses the evolution of humanity.",
        "author": "H.G. Wells",
        "year": 1895,
    },
    {
        "name": "Ender's Game",
        "description": "A young boy is trained to become a military leader in a war against an alien race.",
        "author": "Orson Scott Card",
        "year": 1985,
    },
    {
        "name": "Brave New World",
        "description": "A dystopian society where people are genetically engineered and conditioned to conform to a strict social hierarchy.",
        "author": "Aldous Huxley",
        "year": 1932,
    },
    {
        "name": "The Hitchhiker's Guide to the Galaxy",
        "description": "A comedic science fiction series following the misadventures of an unwitting human and his alien friend.",
        "author": "Douglas Adams",
        "year": 1979,
    },
    {
        "name": "Dune",
        "description": "A desert planet is the site of political intrigue and power struggles.",
        "author": "Frank Herbert",
        "year": 1965,
    },
    {
        "name": "Foundation",
        "description": "A mathematician develops a science to predict the future of humanity and works to save civilization from collapse.",
        "author": "Isaac Asimov",
        "year": 1951,
    },
    {
        "name": "Snow Crash",
        "description": "A futuristic world where the internet has evolved into a virtual reality metaverse.",
        "author": "Neal Stephenson",
        "year": 1992,
    },
    {
        "name": "Neuromancer",
        "description": "A hacker is hired to pull off a near-impossible hack and gets pulled into a web of intrigue.",
        "author": "William Gibson",
        "year": 1984,
    },
    {
        "name": "The War of the Worlds",
        "description": "A Martian invasion of Earth throws humanity into chaos.",
        "author": "H.G. Wells",
        "year": 1898,
    },
    {
        "name": "The Hunger Games",
        "description": "A dystopian society where teenagers are forced to fight to the death in a televised spectacle.",
        "author": "Suzanne Collins",
        "year": 2008,
    },
    {
        "name": "The Andromeda Strain",
        "description": "A deadly virus from outer space threatens to wipe out humanity.",
        "author": "Michael Crichton",
        "year": 1969,
    },
    {
        "name": "The Left Hand of Darkness",
        "description": "A human ambassador is sent to a planet where the inhabitants are genderless and can change gender at will.",
        "author": "Ursula K. Le Guin",
        "year": 1969,
    },
    {
        "name": "The Three-Body Problem",
        "description": "Humans encounter an alien civilization that lives in a dying system.",
        "author": "Liu Cixin",
        "year": 2008,
    },
]

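3. Define the storage location

You need to tell Qdrant where to store the embeddings. This is a basic demo, so your local computer will use its memory as temporary storage.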

client = QdrantClient(":memory:")

4. Create a collection

All data in Qdrant is organized into collections. In this case you are storing books, so we name the collection my_books.

client.create_collection(
    collection_name="my_books",
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(),  # Vector size is determined by the model in use
        distance=models.Distance.COSINE,
    ),
)
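  • The size parameter defines the size of the vectors in a given collection. If vectors differ in size, the distance between them cannot be computed. 384 is the output dimensionality of the encoder used here. You can also call encoder.get_sentence_embedding_dimension() to get the dimensionality of the model you are using.

  • The distance parameter lets you specify the function used to measure the distance between two points.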
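
To make the distance parameter more concrete, here is a minimal pure-Python sketch of cosine similarity, the measure selected by models.Distance.COSINE. The function name cosine_similarity and the toy vectors are made up for illustration; Qdrant computes this internally, and this is not its implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    # Vectors of different sizes cannot be compared, which is why
    # a collection fixes a single vector size up front.
    if len(a) != len(b):
        raise ValueError("vectors must have the same size")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (illustrative only, not real embeddings).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

print("same direction:", round(same_direction, 6))
print("orthogonal:", orthogonal)
```

Vectors pointing the same way score close to 1 regardless of their length, while unrelated (orthogonal) vectors score 0. That is why a higher score in the search results means a closer match.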

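5. Upload data to the collection

Tell the database to upload documents to the my_books collection. Each record gets an id, a vector encoded from its description, and a payload. The payload is simply the metadata from the dataset.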

client.upload_points(
    collection_name="my_books",
    points=[
        models.PointStruct(
            id=idx, vector=encoder.encode(doc["description"]).tolist(), payload=doc
        )
        for idx, doc in enumerate(documents)
    ],
)
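
If the shape of a point is unclear, the following sketch builds the same id / vector / payload structure with plain dictionaries. toy_encode is a hypothetical stand-in for encoder.encode() so the example runs without any model; its output is not a meaningful embedding.

```python
documents = [
    {
        "name": "Dune",
        "description": "A desert planet is the site of political intrigue and power struggles.",
        "author": "Frank Herbert",
        "year": 1965,
    },
    {
        "name": "Snow Crash",
        "description": "A futuristic world where the internet has evolved into a virtual reality metaverse.",
        "author": "Neal Stephenson",
        "year": 1992,
    },
]


def toy_encode(text, size=4):
    """Hypothetical stand-in for encoder.encode(); NOT a real embedding."""
    vector = [0.0] * size
    for position, char in enumerate(text):
        vector[position % size] += ord(char)
    return vector


# Same pattern as the upload step: the list index becomes the id,
# the description becomes the vector, the whole record becomes the payload.
points = [
    {"id": idx, "vector": toy_encode(doc["description"]), "payload": doc}
    for idx, doc in enumerate(documents)
]

for point in points:
    print(point["id"], len(point["vector"]), point["payload"]["name"])
```

Because the whole record is stored as the payload, fields such as year stay available for filtering later, even though only the description was encoded.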

6. Ask the engine a question

Now that the data is stored in Qdrant, you can ask it questions and receive semantically relevant results.

hits = client.query_points(
    collection_name="my_books",
    query=encoder.encode("alien invasion").tolist(),
    limit=3,
).points

for hit in hits:
    print(hit.payload, "score:", hit.score)

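Response

The search engine returns the three results most relevant to the alien invasion query. Each result is assigned a score that shows how close it is to the original query.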

{'name': 'The War of the Worlds', 'description': 'A Martian invasion of Earth throws humanity into chaos.', 'author': 'H.G. Wells', 'year': 1898} score: 0.570093257022374
{'name': "The Hitchhiker's Guide to the Galaxy", 'description': 'A comedic science fiction series following the misadventures of an unwitting human and his alien friend.', 'author': 'Douglas Adams', 'year': 1979} score: 0.5040468703143637
{'name': 'The Three-Body Problem', 'description': 'Humans encounter an alien civilization that lives in a dying system.', 'author': 'Liu Cixin', 'year': 2008} score: 0.45902943411768216
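
Conceptually, a query with limit=3 scores the stored vectors against the query vector and keeps the three best. The sketch below shows that idea with brute force in plain Python. The names book_a to book_d, the 2-dimensional unit vectors, and the top_k helper are all made up for illustration; a real engine may use an index instead of scanning every point.

```python
import heapq

# Hypothetical unit-length vectors standing in for real embeddings.
stored = {
    "book_a": [1.0, 0.0],
    "book_b": [0.6, 0.8],
    "book_c": [0.0, 1.0],
    "book_d": [0.8, 0.6],
}
query = [1.0, 0.0]


def dot(a, b):
    """Dot product; equals cosine similarity when both vectors have length 1."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vector, vectors, limit):
    """Score every stored vector and keep the `limit` highest scores."""
    scored = ((dot(query_vector, vec), name) for name, vec in vectors.items())
    return heapq.nlargest(limit, scored)


ranked = top_k(query, stored, limit=3)
for score, name in ranked:
    print(name, "score:", score)
```

The lowest-scoring vector is dropped and the rest come back ordered from best to worst match, just like the three hits in the response.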

Narrow down the query

How about the most relevant book published in 2000 or later?

hits = client.query_points(
    collection_name="my_books",
    query=encoder.encode("alien invasion").tolist(),
    query_filter=models.Filter(
        must=[models.FieldCondition(key="year", range=models.Range(gte=2000))]
    ),
    limit=1,
).points

for hit in hits:
    print(hit.payload, "score:", hit.score)

Response

The query has been narrowed down to a single result, from 2008.

{'name': 'The Three-Body Problem', 'description': 'Humans encounter an alien civilization that lives in a dying system.', 'author': 'Liu Cixin', 'year': 2008} score: 0.45902943411768216
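
The range condition in the filter behaves like an ordinary comparison on the payload. As a rough sketch of that logic in plain Python (matches_range is a hypothetical helper, not part of qdrant-client, and the list below is a subset of the dataset):

```python
books = [
    {"name": "The War of the Worlds", "year": 1898},
    {"name": "The Hunger Games", "year": 2008},
    {"name": "Dune", "year": 1965},
    {"name": "The Three-Body Problem", "year": 2008},
]


def matches_range(payload, key, gte=None, lte=None):
    """Return True if payload[key] exists and lies within the given bounds."""
    value = payload.get(key)
    if value is None:
        return False
    if gte is not None and value < gte:
        return False
    if lte is not None and value > lte:
        return False
    return True


# Equivalent in spirit to Range(gte=2000) on the "year" field.
candidates = [b["name"] for b in books if matches_range(b, "year", gte=2000)]
print(candidates)
```

Only books that pass the condition are eligible for ranking. With limit=1, the engine returns the best-scoring book among them, which is why The War of the Worlds no longer appears.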

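Next Steps

Congratulations, you have just created your very first search engine! Trust us, the rest of Qdrant is not that complicated either. For your next tutorial, try building a real neural search service with a complete API and a dataset.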
