How to Generate Sparse Vectors with SPLADE
SPLADE is a novel method for learning sparse text representation vectors that outperforms BM25 on tasks such as information retrieval and document classification. Its main advantage is its ability to generate efficient and interpretable sparse vectors, which makes it well suited to processing large-scale text data.
Setup
First, install FastEmbed.
pip install -q fastembed
Next, import the classes needed for working with sparse embeddings.
from fastembed import SparseTextEmbedding, SparseEmbedding
You can always review the list of all supported sparse embedding models.
SparseTextEmbedding.list_supported_models()
This returns a list of models, each with its details, such as the model name, vocabulary size, description, and sources.
[
    {
        'model': 'prithivida/Splade_PP_en_v1',
        'sources': {'hf': 'Qdrant/Splade_PP_en_v1', ...},
        'model_file': 'model.onnx',
        'description': 'Independent Implementation of SPLADE++ Model for English.',
        'license': 'apache-2.0',
        'size_in_GB': 0.532,
        'vocab_size': 30522,
        ...
    },
    ...
]  # part of the output was omitted
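If you only need the details of a single model, ordinary list filtering over this output is enough; a minimal sketch (the lookup itself is plain Python, not a FastEmbed API):
supported_models = SparseTextEmbedding.list_supported_models()

# Pull out the entry for the model used in this tutorial
splade_info = next(
    m for m in supported_models if m["model"] == "prithivida/Splade_PP_en_v1"
)
print(splade_info["vocab_size"])  # 30522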
Now, load the model.
model_name = "prithivida/Splade_PP_en_v1"
# This triggers the model download
model = SparseTextEmbedding(model_name=model_name)
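If you want to control where the downloaded weights are stored, the constructor should also accept an optional cache_dir argument, as FastEmbed's other embedding classes do. Treat this as an assumption and verify it against your installed version:
# Assumption: SparseTextEmbedding accepts cache_dir like FastEmbed's dense
# TextEmbedding class; "local_models" is a made-up directory name
model = SparseTextEmbedding(model_name=model_name, cache_dir="local_models")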
Embedding Data
You need to define a list of documents to embed.
documents: list[str] = [
    "Chandrayaan-3 is India's third lunar mission",
    "It aimed to land a rover on the Moon's surface - joining the US, China and Russia",
    "The mission is a follow-up to Chandrayaan-2, which had partial success",
    "Chandrayaan-3 will be launched by the Indian Space Research Organisation (ISRO)",
    "The estimated cost of the mission is around $35 million",
    "It will carry instruments to study the lunar surface and atmosphere",
    "Chandrayaan-3 landed on the Moon's surface on 23rd August 2023",
    "It consists of a lander named Vikram and a rover named Pragyan similar to Chandrayaan-2. Its propulsion module would act like an orbiter.",
    "The propulsion module carries the lander and rover configuration until the spacecraft is in a 100-kilometre (62 mi) lunar orbit",
    "The mission used GSLV Mk III rocket for its launch",
    "Chandrayaan-3 was launched from the Satish Dhawan Space Centre in Sriharikota",
    "Chandrayaan-3 was launched earlier in the year 2023",
]
Then, generate sparse embeddings for each document. The batch_size argument is optional; it helps to process the documents in batches.
sparse_embeddings_list: list[SparseEmbedding] = list(
    model.embed(documents, batch_size=6)
)
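Wrapping the call in list() materializes all embeddings at once. Since that list() call implies embed returns its results lazily, you can also stream the vectors one by one, which is handy for large corpora; a minimal sketch:
# embed() yields one SparseEmbedding per document, so a large corpus
# can be processed without holding every vector in memory at once
for doc, emb in zip(documents, model.embed(documents, batch_size=6)):
    print(f"{doc[:40]!r} -> {len(emb.indices)} non-zero dimensions")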
Retrieving the Embeddings
sparse_embeddings_list contains the sparse embeddings for the documents provided earlier. Each element of this list is a SparseEmbedding object that holds the sparse vector representation of a document.
index = 0
sparse_embeddings_list[index]
This output is the SparseEmbedding object for the first document in the list. It contains two arrays: values and indices.
- The values array represents the weights of the features (tokens) in the document.
- The indices array represents the indices of those features in the model's vocabulary.
Each corresponding pair of values and indices represents a token in the document and its weight.
SparseEmbedding(values=array([0.05297208, 0.01963477, 0.36459631, 1.38508618, 0.71776593,
       0.12667948, 0.46230844, 0.446771  , 0.26897505, 1.01519883,
       1.5655334 , 0.29412213, 1.53102326, 0.59785569, 1.1001817 ,
       0.02079751, 0.09955651, 0.44249091, 0.09747757, 1.53519952,
       1.36765671, 0.15740395, 0.49882549, 0.38629025, 0.76612782,
       1.25805044, 0.39058095, 0.27236196, 0.45152301, 0.48262018,
       0.26085234, 1.35912788, 0.70710695, 1.71639752]), indices=array([ 1010,  1011,  1016,  1017,  2001,  2018,  2034,  2093,  2117,
        2319,  2353,  2509,  2634,  2686,  2796,  2817,  2922,  2959,
        3003,  3148,  3260,  3390,  3462,  3523,  3822,  4231,  4316,
        4774,  5590,  5871,  6416, 11926, 12076, 16469]))
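To get a feel for just how sparse this representation is, you can compare the number of non-zero entries with the vocabulary size of 30522 reported in the model listing above:
embedding = sparse_embeddings_list[0]
vocab_size = 30522  # from SparseTextEmbedding.list_supported_models()

# Only a few dozen of the ~30k vocabulary dimensions carry any weight
non_zero = len(embedding.indices)
print(f"{non_zero} of {vocab_size} dimensions are non-zero ({non_zero / vocab_size:.2%})")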
Examining the Weights
Now, print the first 5 features and their weights for a better understanding.
for i in range(5):
    print(f"Token at index {sparse_embeddings_list[0].indices[i]} has weight {sparse_embeddings_list[0].values[i]}")
The output shows the token indices of the first document along with their corresponding weights.
Token at index 1010 has weight 0.05297207832336426
Token at index 1011 has weight 0.01963476650416851
Token at index 1016 has weight 0.36459630727767944
Token at index 1017 has weight 1.385086178779602
Token at index 2001 has weight 0.7177659273147583
Analyzing the Results
Let's use the tokenizer vocabulary to make sense of these indices.
import json
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("Qdrant/Splade_PP_en_v1")
The get_tokens_and_weights function takes a SparseEmbedding object and the tokenizer as input. It builds a dictionary in which the keys are the decoded tokens and the values are their corresponding weights.
def get_tokens_and_weights(sparse_embedding, tokenizer):
    token_weight_dict = {}
    for i in range(len(sparse_embedding.indices)):
        token = tokenizer.decode([sparse_embedding.indices[i]])
        weight = sparse_embedding.values[i]
        token_weight_dict[token] = weight

    # Sort the dictionary by weights
    token_weight_dict = dict(sorted(token_weight_dict.items(), key=lambda item: item[1], reverse=True))
    return token_weight_dict
# Test the function with the first SparseEmbedding
print(json.dumps(get_tokens_and_weights(sparse_embeddings_list[index], tokenizer), indent=4))
Dictionary Output
The dictionary is then sorted by weight in descending order.
{
    "chandra": 1.7163975238800049,
    "third": 1.5655333995819092,
    "##ya": 1.535199522972107,
    "india": 1.5310232639312744,
    "3": 1.385086178779602,
    "mission": 1.3676567077636719,
    "lunar": 1.3591278791427612,
    "moon": 1.2580504417419434,
    "indian": 1.1001816987991333,
    "##an": 1.015198826789856,
    "3rd": 0.7661278247833252,
    "was": 0.7177659273147583,
    "spacecraft": 0.7071069478988647,
    "space": 0.5978556871414185,
    "flight": 0.4988254904747009,
    "satellite": 0.4826201796531677,
    "first": 0.46230843663215637,
    "expedition": 0.4515230059623718,
    "three": 0.4467709958553314,
    "fourth": 0.44249090552330017,
    "vehicle": 0.390580952167511,
    "iii": 0.3862902522087097,
    "2": 0.36459630727767944,
    "##3": 0.2941221296787262,
    "planet": 0.27236196398735046,
    "second": 0.26897504925727844,
    "missions": 0.2608523368835449,
    "launched": 0.15740394592285156,
    "had": 0.12667948007583618,
    "largest": 0.09955651313066483,
    "leader": 0.09747757017612457,
    ",": 0.05297207832336426,
    "study": 0.02079751156270504,
    "-": 0.01963476650416851
}
Observations
- The relative order of importance is very useful: the most important tokens in the sentence carry the highest weights.
- Term expansion: the model can expand the terms in a document. This means the model can generate weights for tokens that do not appear in the document but are related to tokens that do, a powerful feature that lets the model capture the context of the document. Here, you can see that the model has added the token "3" from "third" and the token "moon" from "lunar" to the sparse vector; the scoring sketch right after this list shows these weights being put to use.
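These per-token weights are exactly what a sparse retrieval engine multiplies together when it scores a match. As a rough illustration (plain Python over the SparseEmbedding fields, not a FastEmbed or Qdrant API, and with a made-up query string), you can rank the documents against a query by taking the dot product over shared indices:
# Embed the query the same way as the documents, then score each document
# by the dot product over the token indices the two sparse vectors share
query = "Which rocket launched Chandrayaan-3?"
query_emb = list(model.embed([query]))[0]
query_weights = dict(zip(query_emb.indices, query_emb.values))

scores = []
for doc, emb in zip(documents, sparse_embeddings_list):
    score = sum(
        weight * query_weights[idx]
        for idx, weight in zip(emb.indices, emb.values)
        if idx in query_weights
    )
    scores.append((score, doc))

# Documents that mention the launch vehicle should rank near the top
print(max(scores))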
Design Choices
- The weights are not normalized, so they do not sum to 1 or to 100. This is common practice with sparse embeddings, as it lets the model capture the importance of each individual token in the document.
- Tokens are included in the sparse vector only if they exist in the model's vocabulary, so the model will not generate weights for tokens it never saw during training.
- Tokens do not map directly to words, which lets you handle misspellings and out-of-vocabulary tokens gracefully, as the tokenizer check below illustrates.
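You can observe this subword behavior directly with the tokenizer loaded earlier. The exact pieces depend on the vocabulary, but the point is that unfamiliar spellings still decompose into known tokens:
# "Chandrayan" and "lunr" are misspelled on purpose: they still break
# down into known subword pieces instead of mapping to nothing
encoding = tokenizer.encode("Chandrayan lunr mission")
print(encoding.tokens)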