Jina Embeddings

Qdrant is compatible with Jina AI embeddings. You can get a free trial key from Jina Embeddings to generate embeddings.

Qdrant users can receive a 10% discount on the Jina AI APIs with the code QDRANT.

Technical Summary

| Model | Dimension | Language | MRL (Matryoshka Representation Learning) | Context |
|---|---|---|---|---|
| jina-clip-v2 | 1024 | Multilingual (100+ languages, with a focus on 30) | Yes | Text/Image |
| jina-embeddings-v3 | 1024 | Multilingual (89 languages) | Yes | 8192 |
| jina-embeddings-v2-base-en | 768 | English | No | 8192 |
| jina-embeddings-v2-base-de | 768 | German and English | No | 8192 |
| jina-embeddings-v2-base-es | 768 | Spanish and English | No | 8192 |
| jina-embeddings-v2-base-zh | 768 | Chinese and English | No | 8192 |

Jina recommends jina-embeddings-v3 for text-only tasks, and jina-clip-v2 for multimodal tasks or when enhanced visual retrieval is required.

On top of the base model, jina-embeddings-v3 has been trained with 5 task-specific adapters for different embedding uses. Include the task parameter in your request to optimize your downstream application (a short sketch follows the list):

  • retrieval.query: Used to encode user queries or questions in retrieval tasks.
  • retrieval.passage: Used to encode large documents in retrieval tasks at indexing time.
  • classification: Used to encode text for text classification tasks.
  • text-matching: Used to encode text for similarity matching, such as measuring the similarity between two sentences.
  • separation: Used for clustering or reranking tasks.

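For illustration, here is a minimal sketch of selecting an adapter per request. The input sentences are invented, and the payload shape follows the Jina AI Embeddings API used in the examples below:

import requests

JINA_API_KEY = "jina_xxxxxxxxxxx"  # your Jina API key

# Pick the adapter that matches the downstream use; here we measure
# sentence-to-sentence similarity, so the text-matching task is used.
response = requests.post(
    "https://api.jina.ai/v1/embeddings",
    headers={"Authorization": f"Bearer {JINA_API_KEY}"},
    json={
        "model": "jina-embeddings-v3",
        "task": "text-matching",  # or retrieval.query / retrieval.passage / classification / separation
        "input": ["A blue cat", "A purple cat"],  # example sentences (assumption)
    },
)
embeddings = [d["embedding"] for d in response.json()["data"]]
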
jina-embeddings-v3 and jina-clip-v2 support Matryoshka Representation Learning, which lets users control the embedding dimension with minimal loss in performance.
Include dimensions in your request to select the desired dimension.
By default, dimensions is set to 1024, and a number between 256 and 1024 is recommended.
You can refer to the table below for the dimension vs. performance trade-off; a short sketch after the table shows how to request smaller vectors:

| Dimension | 32 | 64 | 128 | 256 | 512 | 768 | 1024 |
|---|---|---|---|---|---|---|---|
| Average Retrieval Performance (nDCG@10) | 52.54 | 58.54 | 61.64 | 62.72 | 63.16 | 63.3 | 63.35 |

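For example, a hedged sketch of requesting smaller vectors and sizing the Qdrant collection to match; the 512-dimensional choice and the collection name are illustrative assumptions:

import requests
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

JINA_API_KEY = "jina_xxxxxxxxxxx"

# Ask the API for 512-dimensional vectors instead of the default 1024.
response = requests.post(
    "https://api.jina.ai/v1/embeddings",
    headers={"Authorization": f"Bearer {JINA_API_KEY}"},
    json={
        "model": "jina-embeddings-v3",
        "task": "retrieval.passage",
        "dimensions": 512,
        "input": ["Your text string goes here"],
    },
)
vector = response.json()["data"][0]["embedding"]  # len(vector) == 512

# The Qdrant collection must be created with the same vector size.
client = QdrantClient(":memory:")
client.create_collection(
    collection_name="MyCollection512",  # hypothetical collection name
    vectors_config=VectorParams(size=512, distance=Distance.DOT),
)
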
jina-embeddings-v3 supports Late Chunking, a technique that leverages the model's long-context capabilities to generate contextual chunk embeddings. Include late_chunking=True in your request to enable contextual chunked representation. When set to true, the Jina AI API concatenates all sentences in the input field and feeds them to the model as a single string. Internally, the model embeds this long concatenated string and then performs late chunking, returning a list of embeddings that matches the size of the input list.
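
Below is a minimal sketch of that behaviour, assuming a document that has already been split into chunks (the text and splitting are illustrative): each returned embedding corresponds to one input chunk, but is computed with the full concatenated document as context.

import requests

JINA_API_KEY = "jina_xxxxxxxxxxx"
url = "https://api.jina.ai/v1/embeddings"
headers = {"Authorization": f"Bearer {JINA_API_KEY}"}

# Chunks of a single document, sent together so each chunk embedding
# is conditioned on the surrounding document context.
chunks = [
    "Berlin is the capital of Germany.",
    "It has a population of about 3.7 million.",
    "The city is known for its vibrant art scene.",
]

data = {
    "model": "jina-embeddings-v3",
    "task": "retrieval.passage",
    "late_chunking": True,
    "input": chunks,
}

response = requests.post(url, headers=headers, json=data)
chunk_embeddings = [d["embedding"] for d in response.json()["data"]]
assert len(chunk_embeddings) == len(chunks)  # one embedding per input chunk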

Examples

Jina Embeddings v3

The following code demonstrates how to use jina-embeddings-v3 with Qdrant.

import requests

import qdrant_client
from qdrant_client.models import Distance, VectorParams, Batch

# Provide Jina API key and choose one of the available models.
JINA_API_KEY = "jina_xxxxxxxxxxx"
MODEL = "jina-embeddings-v3"
DIMENSIONS = 1024 # Or choose your desired output vector dimensionality.
TASK = 'retrieval.passage' # For indexing, or set to retrieval.query for querying

# Get embeddings from the API
url = "https://api.jina.ai/v1/embeddings"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {JINA_API_KEY}",
}

data = {
    "input": ["Your text string goes here", "You can send multiple texts"],
    "model": MODEL,
    "dimensions": DIMENSIONS,
    "task": TASK,
    "late_chunking": True,
}

response = requests.post(url, headers=headers, json=data)
embeddings = [d["embedding"] for d in response.json()["data"]]


# Index the embeddings into Qdrant
client = qdrant_client.QdrantClient(":memory:")
client.create_collection(
    collection_name="MyCollection",
    vectors_config=VectorParams(size=DIMENSIONS, distance=Distance.DOT),
)


client.upsert(
    collection_name="MyCollection",
    points=Batch(
        ids=list(range(len(embeddings))),
        vectors=embeddings,
    ),
)
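
To search the collection created above, the query text is embedded with the same model. The following is a sketch assuming it runs in the same script (so client, url, headers, MODEL, and DIMENSIONS are still in scope) and that the query string is an invented example; note the retrieval.query task on the query side.

# Embed the search query with the query-side adapter.
query_data = {
    "input": ["Which texts did we send earlier?"],  # hypothetical query
    "model": MODEL,
    "dimensions": DIMENSIONS,
    "task": "retrieval.query",
}

query_response = requests.post(url, headers=headers, json=query_data)
query_embedding = query_response.json()["data"][0]["embedding"]

# Search the collection with the query vector.
hits = client.query_points(
    collection_name="MyCollection",
    query=query_embedding,
    limit=3,
).points

for hit in hits:
    print(f"ID: {hit.id}, Score: {hit.score}")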

Jina CLIP v2

The following code demonstrates how to use jina-clip-v2 with Qdrant.

import requests
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# Provide your Jina API key and choose the model.
JINA_API_KEY = "jina_xxxxxxxxxxx"
MODEL = "jina-clip-v2"
DIMENSIONS = 1024  # Set the desired output vector dimensionality.

# Define the inputs
text_input = "A blue cat"
image_url = "https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg"

# Get embeddings from the Jina API
url = "https://api.jina.ai/v1/embeddings"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {JINA_API_KEY}",
}
data = {
    "input": [
        {"text": text_input},
        {"image": image_url},
    ],
    "model": MODEL,
    "dimensions": DIMENSIONS,
}

response = requests.post(url, headers=headers, json=data)
response_data = response.json()["data"]

# The model doesn't differentiate between images and text, so we extract output based on the input order.
text_embedding = response_data[0]["embedding"]
image_embedding = response_data[1]["embedding"]

# Initialize Qdrant client
client = QdrantClient(url="http://localhost:6333/")

# Create a collection with named vectors
collection_name = "MyCollection"
client.recreate_collection(
    collection_name=collection_name,
    vectors_config={
        "text_vector": VectorParams(size=DIMENSIONS, distance=Distance.DOT),
        "image_vector": VectorParams(size=DIMENSIONS, distance=Distance.DOT),
    },
)

client.upsert(
    collection_name=collection_name,
    points=[
        PointStruct(
            id=0,
            vector={
                "text_vector": text_embedding,
                "image_vector": image_embedding,
            }
        )
    ],
)

# Now let's query the collection
search_query = "A purple cat"

# Get the embedding for the search query from the Jina API
url = "https://api.jina.ai/v1/embeddings"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {JINA_API_KEY}",
}
data = {
    "input": [{"text": search_query}],
    "model": MODEL,
    "dimensions": DIMENSIONS,
    # "task": "retrieval.query" # Uncomment this line for text-to-text retrieval tasks
}

response = requests.post(url, headers=headers, json=data)
query_embedding = response.json()["data"][0]["embedding"]

search_results = client.query_points(
    collection_name=collection_name,
    query=query_embedding,
    using="image_vector",
    limit=5
).points

for result in search_results:
    print(f"ID: {result.id}, Score: {result.score}")