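How to deploy natural language processing (NLP): Text embeddings and vector search
As part of our natural language processing (NLP) blog series, this post walks through an example of using a text embedding model to generate vector representations of textual content, and demonstrates vector similarity search on the generated vectors. We will deploy a publicly available model on Elasticsearch and use it in an ingest pipeline to generate embeddings from text documents. We will then show how to use those embeddings in vector similarity search to find documents that are semantically similar to a given query.
Vector similarity search, often also called semantic search, goes beyond traditional keyword-based search and lets users find documents that are semantically similar but may share no keywords, which provides a wider range of results. Vector similarity search operates on dense vectors and uses k-nearest neighbor search to find similar vectors. For this, content in textual form first needs to be converted into its numeric vector representation using a text embedding model.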
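To make the idea concrete, here is a minimal Python sketch of exact, brute-force k-nearest neighbor search with cosine similarity over toy 3-dimensional vectors. The vectors, document names, and helper functions are illustrative assumptions and not part of Elasticsearch, which uses approximate search over indexed vectors rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query, vectors, k):
    """Return the ids of the k vectors most similar to the query (exact scan)."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy "embeddings"; a real text embedding model emits hundreds of dimensions.
docs = {
    "weather": [0.9, 0.1, 0.0],
    "climate": [0.8, 0.2, 0.1],
    "cooking": [0.0, 0.1, 0.9],
}
nearest = knn([1.0, 0.0, 0.0], docs, k=2)
print(nearest)
```

The two documents pointing in roughly the same direction as the query come back first, even though nothing here compares keywords.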
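We will use the public dataset from the MS MARCO Passage Ranking Task for the demonstration. The dataset consists of real questions from the Microsoft Bing search engine and human-generated answers to them. It is a good resource for testing vector similarity search for two reasons: first, question answering is one of the most common use cases for vector search; second, the top-ranked papers on the MS MARCO leaderboard all use vector search in some form.
In our example, we will work with a sample of this dataset, use a model to produce text embeddings, and then run vector search on them. We also want to do a quick verification of the quality of the results that vector search produces.
1. Deploy a text embedding model
The first step is to install a text embedding model. For our example, we use msmarco-MiniLM-L-12-v3 from Hugging Face. This is a sentence-transformer model that takes a sentence or a paragraph and maps it to a 384-dimensional dense vector. The model is optimized for semantic search and was trained specifically on the MS MARCO Passage dataset, which makes it suitable for our task. Besides this model, Elasticsearch supports a number of other text embedding models. The full list of supported models is available in the Elasticsearch documentation.
We install the model with the Eland Docker agent that we built in the NER example. Running the script below imports our model into the local cluster and deploys it: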
eland_import_hub_model \
--url https://<user>:<password>@localhost:9200/ \
--hub-model-id sentence-transformers/msmarco-MiniLM-L-12-v3 \
--task-type text_embedding \
--start
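This time, --task-type is set to text_embedding and the --start option is passed to the Eland script, so the model is deployed automatically without having to start it in the Model Management UI. To speed up inference, you can increase the number of inference threads with the inference_threads parameter.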
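The model ID used in the requests throughout this post, sentence-transformers__msmarco-minilm-l-12-v3, looks like the Hugging Face hub ID in lowercase with the / replaced by a double underscore. The helper below is a hypothetical sketch of that observed pattern only. It is not taken from Eland, so check the actual ID in your cluster before relying on it.

```python
def expected_model_id(hub_model_id):
    """Guess the Elasticsearch model ID for a hub ID (pattern observed in this post)."""
    return hub_model_id.replace("/", "__").lower()


model_id = expected_model_id("sentence-transformers/msmarco-MiniLM-L-12-v3")
print(model_id)
```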
We can check that the model was deployed successfully by running this example in the Kibana Console:
POST /_ml/trained_models/sentence-transformers__msmarco-minilm-l-12-v3/deployment/_infer
{
  "docs": {
    "text_field": "how is the weather in jamaica"
  }
}
We should see the predicted dense vector as the result:
{
  "predicted_value" : [
    0.3345310091972351,
    -0.305600643157959,
    0.2592800557613373,
    …
  ]
}
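2. Load initial data
As mentioned in the introduction, we use the MS MARCO Passage Ranking dataset. The dataset is quite large, with more than 8 million passages. For our example, we use a subset of it that was used in the testing stage of the 2019 TREC Deep Learning Track. The dataset used for the re-ranking task, msmarco-passagetest2019-top1000.tsv, contains 200 queries and, for each query, a list of relevant text passages extracted by a simple IR system. From that dataset, we extracted all unique passages with their IDs and put them into a separate tsv file, 182,469 passages in total. We use this file as our dataset.
We use Kibana's file upload feature to upload this dataset. Kibana file upload lets us provide custom names for fields. In our case, we call them id, with type long, for the passage IDs, and text, with type text, for the passage contents. The index name is collection. After the upload, we can see an index named collection with 182,469 documents.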
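The extraction of unique passages could look roughly like the Python sketch below. It assumes that each line of msmarco-passagetest2019-top1000.tsv holds four tab-separated columns in the order query ID, passage ID, query text, passage text. That column order and the function names are assumptions for illustration, so verify them against the file before use.

```python
def unique_passages(lines):
    """Collect passage text by passage ID, keeping the first occurrence of each ID."""
    passages = {}
    for line in lines:
        # Assumed column order: query id, passage id, query text, passage text.
        qid, pid, query, passage = line.rstrip("\n").split("\t", 3)
        passages.setdefault(int(pid), passage)
    return passages


def to_tsv(passages):
    """Render the passages as two-column TSV rows: id, text."""
    return "".join(f"{pid}\t{text}\n" for pid, text in passages.items())


# Tiny in-memory sample; in practice, pass an open file handle instead.
sample = [
    "1\t10\tquery one\tfirst passage\n",
    "1\t11\tquery one\tsecond passage\n",
    "2\t10\tquery two\tfirst passage\n",
]
passages = unique_passages(sample)
print(to_tsv(passages), end="")
```

The same passage often appears under several queries, which is why the sketch keys on the passage ID.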
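3. Create pipeline
We want to process the initial data with an inference processor that adds an embedding to each passage. For this, we create a text embedding ingest pipeline and then reindex our initial data through it.
In the Kibana Console, we create an ingest pipeline (see the previous blog post for how to do this), this time for text embeddings, and call it text-embeddings. The passages are in a field named text. As before, we define a field_map to map text to the field text_field that the model expects. Likewise, the on_failure handler is set to index failures into a different index: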
PUT _ingest/pipeline/text-embeddings
{
  "description": "Text embedding pipeline",
  "processors": [
    {
      "inference": {
        "model_id": "sentence-transformers__msmarco-minilm-l-12-v3",
        "target_field": "text_embedding",
        "field_map": {
          "text": "text_field"
        }
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "description": "Index document to 'failed-<index>'",
        "field": "_index",
        "value": "failed-{{{_index}}}"
      }
    },
    {
      "set": {
        "description": "Set error message",
        "field": "ingest.failure",
        "value": "{{_ingest.on_failure_message}}"
      }
    }
  ]
}
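To illustrate what the inference processor does to each document, here is a small Python mock. It is a conceptual sketch, not Elasticsearch code: apply_inference and fake_embed are hypothetical names, and fake_embed stands in for the real model with a meaningless 2-dimensional output. The shape of the result (predicted_value and model_id nested under the target field) follows the examples shown in this post.

```python
def apply_inference(doc, embed, model_id, target_field="text_embedding", field_map=None):
    """Mimic the inference processor: map fields, embed, store under target_field."""
    field_map = field_map or {"text": "text_field"}
    # Rename document fields to the input names the model expects.
    model_input = {model_field: doc[doc_field] for doc_field, model_field in field_map.items()}
    enriched = dict(doc)  # leave the input document untouched
    enriched[target_field] = {
        "predicted_value": embed(model_input["text_field"]),
        "model_id": model_id,
    }
    return enriched


def fake_embed(text):
    """Stand-in for the model: two simple text statistics, not a real embedding."""
    return [float(len(text)), float(text.count(" "))]


doc = {"id": 1, "text": "RNA Definition"}
enriched = apply_inference(doc, fake_embed, "sentence-transformers__msmarco-minilm-l-12-v3")
print(enriched["text_embedding"])
```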
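4. Reindex data
We want to reindex documents from the collection index into a new collection-with-embeddings index by pushing them through the text-embeddings pipeline, so that documents in the collection-with-embeddings index have an additional field for the passage embeddings. Before we do that, we need to create the destination index and define a mapping for it, in particular for the text_embedding.predicted_value field, where the ingest processor will store the embeddings. Without this step, the embeddings would be indexed into regular float fields and could not be used for vector similarity search. The model we use produces embeddings as 384-dimensional vectors, so we use an indexed dense_vector field type with 384 dims, as follows: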
PUT collection-with-embeddings
{
  "mappings": {
    "properties": {
      "text_embedding.predicted_value": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine"
      },
      "text": {
        "type": "text"
      }
    }
  }
}
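Finally, we are ready to reindex. Because reindexing takes some time to process all documents and run inference on them, we reindex in the background by calling the API with the wait_for_completion=false flag.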
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "collection"
  },
  "dest": {
    "index": "collection-with-embeddings",
    "pipeline": "text-embeddings"
  }
}
The command above returns a task ID. We can monitor the progress of the task with:
GET _tasks/<task_id>
Alternatively, track progress by watching the Inference count increase in the model stats API or the model stats UI.
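As a rough aid, the task response can be turned into a completion fraction. The Python sketch below assumes the response carries a task.status object with total, created, updated, and deleted counters, which is how we understand the tasks API output for reindex. Treat those field names and the helper name as assumptions and adjust them to what your cluster returns.

```python
def reindex_progress(task_response):
    """Estimate the fraction of documents processed by a reindex task."""
    status = task_response["task"]["status"]
    total = status["total"]
    if total == 0:
        return 0.0  # nothing counted yet
    processed = status["created"] + status["updated"] + status["deleted"]
    return processed / total


# Hand-written sample in the assumed response shape, not real cluster output.
sample_response = {
    "completed": False,
    "task": {"status": {"total": 182469, "created": 60823, "updated": 0, "deleted": 0}},
}
progress = reindex_progress(sample_response)
print(f"{progress:.0%} done")
```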
The reindexed documents now contain the inference results, the vector embeddings. As an example, one of the documents looks like this:
{
  "id": "G7PPtn8BjSkJO8zzChzT",
  "text": "This is the definition of RNA along with examples of types of RNA molecules. This is the definition of RNA along with examples of types of RNA molecules. RNA Definition",
  "text_embedding": {
    "predicted_value": [
      0.057356324046850204,
      0.1602816879749298,
      -0.18122544884681702,
      0.022277727723121643,
      ....
    ],
    "model_id": "sentence-transformers__msmarco-minilm-l-12-v3"
  }
}
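5. Vector similarity search
Currently, we don't support implicit generation of embeddings from query terms during a search request, so our semantic search is organized as a two-step process:
- Obtain a text embedding from the textual query. For this, we use the _infer API of our model.
- Use vector search to find documents that are semantically similar to the query text. In Elasticsearch v8.0, we introduced a new _knn_search endpoint that allows efficient approximate nearest neighbor search on indexed dense_vector fields. We use the _knn_search API to find the closest documents.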
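In client code, the two steps can be glued together. The Python sketch below only builds the _knn_search request body from an _infer response; sending the two requests is left out. It assumes the response shape and field names used in this post. The function name, the EMBEDDING_DIMS constant, and the dimension check are illustrative additions.

```python
EMBEDDING_DIMS = 384  # dimensionality of the model used in this post


def build_knn_search(infer_response, k=10, num_candidates=100):
    """Turn an _infer response into a _knn_search request body."""
    query_vector = infer_response["predicted_value"]
    if len(query_vector) != EMBEDDING_DIMS:
        raise ValueError(f"expected {EMBEDDING_DIMS} dims, got {len(query_vector)}")
    return {
        "knn": {
            "field": "text_embedding.predicted_value",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
        },
        "_source": ["id", "text"],
    }


# Placeholder vector standing in for a real model response.
fake_response = {"predicted_value": [0.1] * EMBEDDING_DIMS}
body = build_knn_search(fake_response)
print(body["knn"]["k"], len(body["knn"]["query_vector"]))
```

Checking the vector length early gives a clearer error than a rejected search request when the wrong model is queried by mistake.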
For example, given the textual query "how is the weather in jamaica", we first run the _infer API to get its embedding as a dense vector:
POST /_ml/trained_models/sentence-transformers__msmarco-minilm-l-12-v3/deployment/_infer
{
  "docs": {
    "text_field": "how is the weather in jamaica"
  }
}
After that, we plug the resulting dense vector into _knn_search as follows:
GET collection-with-embeddings/_knn_search
{
  "knn": {
    "field": "text_embedding.predicted_value",
    "query_vector": [
      0.3345310091972351,
      -0.305600643157959,
      0.2592800557613373,
      …
    ],
    "k": 10,
    "num_candidates": 100
  },
  "_source": [
    "id",
    "text"
  ]
}
As a result, we get the top 10 documents closest to the query, sorted by their proximity to the query:
"hits" : [
{
"_index" : "collection-with-embeddings",
"_id" : "47TPtn8BjSkJO8zzKq_o",
"_score" : 0.94591534,
"_source" : {
"id" : 434125,
"text" : "The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year. Continue Reading."
}
},
{
"_index" : "collection-with-embeddings",
"_id" : "3LTPtn8BjSkJO8zzKJO1",
"_score" : 0.94536424,
"_source" : {
"id" : 4498474,
"text" : "The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year"
}
},
{
"_index" : "collection-with-embeddings",
"_id" : "KrXPtn8BjSkJO8zzPbDW",
"_score" : 0.9432083,
"_source" : {
"id" : 190804,
"text" : "Quick Answer. The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year. Continue Reading"
}
},
...
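The _score values above are not raw cosine similarities. According to the Elasticsearch documentation for kNN search, as we understand it, a field with cosine similarity is scored as (1 + cosine) / 2, which keeps scores non-negative. Treat that formula as an assumption to verify for your version. Under it, the Python sketch below converts a score back to the underlying cosine similarity.

```python
def score_to_cosine(score):
    """Invert the assumed scoring formula: score = (1 + cosine) / 2."""
    return 2.0 * score - 1.0


top_score = 0.94591534  # _score of the first hit shown above
cosine = score_to_cosine(top_score)
print(round(cosine, 4))
```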
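6. Quick verification
Since we used only a subset of the MS MARCO dataset, we cannot do a full evaluation. What we can do instead is a simple verification on a few queries, to get a sense that we are indeed getting relevant results and not random ones. From the TREC 2019 Deep Learning Track judgments for the Passage Ranking Task, we take the last 3 queries, submit them to our vector similarity search, get the top 10 results, and consult the TREC judgments to see how relevant the results are. For the Passage Ranking Task, passages are judged on a four-point scale: Irrelevant (0), Related (the passage is on-topic but does not answer the question) (1), Highly relevant (2), and Perfectly relevant (3).
Please note that our verification is not a rigorous evaluation; it is used only for a quick demonstration. Since we indexed only passages that are known to be related to the queries, this is a much easier task than the original passage retrieval task. We intend to run a rigorous evaluation on the MS MARCO dataset in the future.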
Query #1124210 "tracheids are part of _____", submitted to our vector search, returned the following results:
Passage ID | Relevance rating | Passage |
---|---|---|
2258591 | 2 - Highly relevant | Tracheid of oak shows pits along the walls.It is longer than a vessel element and has no perforation plates.Tracheids are elongated cells in the xylem of vascular plants that serve in the transport of water and mineral salts.Tracheids are one of two types of tracheary elements, vessel elements being the other.Tracheids, unlike vessel elements, do not have perforation plates.racheids provide most of the structural support in softwoods, where they are the major cell type.Because tracheids have a much higher surface to volume ratio compared to vessel elements, they serve to hold water against gravity (by adhesion) when transpiration is not occurring. |
2258592 | 3 - Perfectly relevant | Tracheid. a dead lignified plant cell that functions in water conduction.Tracheids are found in the xylem of all higher plants except certain angiosperms, such as cereals and sedges, in which the water-conducting function is performed by vessels, or tracheae.Tracheids are usually polygonal in cross section; their walls have annular, spiral, or scalene thickenings or rimmed pores.racheids are found in the xylem of all higher plants except certain angiosperms, such as cereals and sedges, in which the water-conducting function is performed by vessels, or tracheae.Tracheids are usually polygonal in cross section; their walls have annular, spiral, or scalene thickenings or rimmed pores. |
2258596 | 2 - Highly relevant | Woody angiosperms have also vessels.The mature tracheids form a column of superposed, cylindrical dead cells whose end walls have been perforated, resulting in a continuous tube called vessel (trachea).Tracheids are found in all vascular plants and are the only conducting elements in gymnosperms and ferns.Tracheids have Pits on their end walls.Pits are not nearly as efficient for water translocation as Perforation Plates found in vessel elements.Woody angiosperms have also vessels.The mature tracheids form a column of superposed, cylindrical dead cells whose end walls have been perforated, resulting in a continuous tube called vessel (trachea).Tracheids are found in all vascular plants and are the only conducting elements in gymnosperms and ferns |
2258595 | 2 - Highly relevant | Summary:Vessels have perforations at the end plates while tracheids do not have end plates.Tracheids are derived from single individual cells while vessels are derived from a pile of cells.Tracheids are present in all vascular plants whereas vessels are confined to angiosperms.Tracheids are thin whereas vessel elements are wide.Tracheids have a much higher surface-to-volume ratio as compared to vessel elements.Vessels are broader than tracheids with which they are associated.Morphology of the perforation plate is different from that in tracheids.Tracheids are thin whereas vessel elements are wide.Tracheids have a much higher surface-to-volume ratio as compared to vessel elements.Vessels are broader than tracheids with which they are associated.Morphology of the perforation plate is different from that in tracheids. |
131190 | 3 - Perfectly relevant | Xylem tracheids are pointed, elongated xylem cells, the simplest of which have continuous primary cell walls and lignified secondary wall thickenings in the form of rings, hoops, or reticulate networks. |
7443586 | 2 - Highly relevant | 1 The xylem tracheary elements consist of cells known as tracheids and vessel members, both of which are typically narrow, hollow, and elongated.Tracheids are less specialized than the vessel members and are the only type of water-conducting cells in most gymnosperms and seedless vascular plants. |
181177 | 2 - Highly relevant | In most plants, pitted tracheids function as the primary transport cells.The other type of tracheary element, besides the tracheid, is the vessel element.Vessel elements are joined by perforations into vessels.In vessels, water travels by bulk flow, as in a pipe, rather than by diffusion through cell membranes. |
2947055 | 0 - Irrelevant | Cholesterol belongs to the groups of lipids called _______.holesterol belongs to the groups of lipids called _______. |
6541866 | 2 - Highly relevant | In most plants, pitted tracheids function as the primary transport cells.The other type of tracheary element, besides the tracheid, is the vessel element.Vessel elements are joined by perforations into vessels.In vessels, water travels by bulk flow, as in a pipe, rather than by diffusion through cell membranes.In most plants, pitted tracheids function as the primary transport cells.The other type of tracheary element, besides the tracheid, is the vessel element.Vessel elements are joined by perforations into vessels.In vessels, water travels by bulk flow, as in a pipe, rather than by diffusion through cell membranes. |
Query #1129237 "hydrogen is a liquid below what temperature" returned the following results:
Passage ID | Relevance rating | Passage |
---|---|---|
8588222 | 0 - Irrelevant | Answer: Hydrogen is a liquid below what temperature?By signing up, you'll get thousands of step-by-step solutions to your homework questions.... for Teachers for Schools for Companies |
128984 | 3 - Perfectly relevant | Hydrogen gas has the molecular formula H 2.At room temperature and under standard pressure conditions, hydrogen is a gas that is tasteless, odorless and colorless.Hydrogen can exist as a liquid under high pressure and an extremely low temperature of 20.28 kelvin (−252.87°C, −423.17 °F).Hydrogen is often stored in this way as liquid hydrogen takes up less space than hydrogen in its normal gas form.Liquid hydrogen is also used as a rocket fuel. |
8588219 | 3 - Perfectly relevant | User:Hydrogen is a liquid below what temperature? a.100°C;c. -183°C;b. -253°C;d.0°C Weegy:0 degrees C Weegy:Hydrogen is a liquid below 253 degrees C. User:What is the boiling point of oxygen? a.100 degrees C c. -57 degrees C b.8 degrees C d. -183 degrees C Weegy:The boiling point of oxygen is -183 degrees C. |
3905057 | 3 - Perfectly relevant | Hydrogen is a colorless, odorless, tasteless gas.Its density is the lowest of any chemical element, 0.08999 grams per liter.By comparison, a liter of air weighs 1.29 grams, 14 times as much as a liter of hydrogen.Hydrogen changes from a gas to a liquid at a temperature of -252.77°C (-422.99°F) and from a liquid to a solid at a temperature of -259.2°C (-434.6°F).It is slightly soluble in water, alcohol, and a few other common liquids. |
4254811 | 3 - Perfectly relevant | At STP (standard temperature and pressure) hydrogen is a gas.It cools to a liquid at -423 °F, which is only about 37 degrees above absolute zero.Eleven degrees cooler, at … -434 °F, it starts to solidify. |
2697752 | 2 - Highly relevant | Hydrogen's state of matter is gas at standard conditions of temperature and pressure.Hydrogen condenses into a liquid or freezes solid at extremely cold...Hydrogen's state of matter is gas at standard conditions of temperature and pressure.Hydrogen condenses into a liquid or freezes solid at extremely cold temperatures.Hydrogen's state of matter can change when the temperature changes, becoming a liquid at temperatures between minus 423.18 and minus 434.49 degrees Fahrenheit.It becomes a solid at temperatures below minus 434.49 F.Due to its high flammability, hydrogen gas is commonly used in combustion reactions, such as in rocket and automobile fuels. |
6080460 | 3 - Perfectly relevant | Hydrogen can exist as a liquid under high pressure and an extremely low temperature of 20.28 kelvin (−252.87°C, −423.17 °F).Hydrogen is often stored in this way as liquid hydrogen takes up less space than hydrogen in its normal gas form.Liquid hydrogen is also used as a rocket fuel.Hydrogen is found in large amounts in giant gas planets and stars, it plays a key role in powering stars through fusion reactions.Hydrogen is one of two important elements found in water (H 2 O).Each molecule of water is made up of two hydrogen atoms bonded to one oxygen atom. |
128989 | 3 - Perfectly relevant | Confidence votes 11.4K.At STP (standard temperature and pressure) hydrogen is a gas.It cools to a liquid at -423 °F, which is only about 37 degrees above absolute zero.Eleven degrees cooler, at -434 °F, it starts to solidify. |
1959030 | 0 - Irrelevant | While below 4 °C the breakage of hydrogen bonds due to heating allows water molecules to pack closer despite the increase in the thermal motion (which tends to expand a liquid), above 4 °C water expands as the temperature increases.Water near the boiling point is about 4% less dense than water at 4 °C (39 °F) |
3905800 | 0 - Irrelevant | Hydrogen is the lightest of the elements with an atomic weight of 1.0.Liquid hydrogen has a density of 0.07 grams per cubic centimeter, whereas water has a density of 1.0 g/cc and gasoline about 0.75 g/cc.These facts give hydrogen both advantages and disadvantages. |
Query #1133167 "how is the weather in jamaica" returned the following results:
Passage ID | Relevance rating | Passage |
---|---|---|
434125 | 3 - Perfectly relevant | The climate in Jamaica is tropical and humid with warm to hot temperatures all year round.The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit.Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year. |
4498474 | 3 - Perfectly relevant | The climate in Jamaica is tropical and humid with warm to hot temperatures all year round.The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit.Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year. |
190804 | 3 - Perfectly relevant | Quick Answer.The climate in Jamaica is tropical and humid with warm to hot temperatures all year round.The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit.Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year.Continue Reading. |
1824479 | 3 - Perfectly relevant | A:The climate in Jamaica is tropical and humid with warm to hot temperatures all year round.The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit.Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year. |
1824480 | 3 - Perfectly relevant | Quick Answer.The climate in Jamaica is tropical and humid with warm to hot temperatures all year round.The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit.Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year. |
1824488 | 2 - Highly relevant | Learn About the Weather of Jamaica The weather patterns you'll encounter in Jamaica can vary dramatically around the island Regardless of when you visit, the tropical climate and warm temperatures of Jamaica essentially guarantee beautiful weather during your vacation.Average temperatures in Jamaica range between 80 degrees Fahrenheit and 90 degrees Fahrenheit, with July and August being the hottest months and February the coolest. |
4922619 | 2 - Highly relevant | Weather.Jamaica averages about 80 degrees year-round, so climate is less a factor in booking travel than other destinations.The days are warm and the nights are cool.Rain usually falls for short periods in the late afternoon, with sunshine the rest of the day. |
190806 | 2 - Highly relevant | It is always important to know what the weather in Jamaica will be like before you plan and take your vacation.For the most part, the average temperature in Jamaica is between 80 °F and 90 °F (27 °FCelsius-29 °Celsius).Luckily, the weather in Jamaica is always vacation friendly.You will hardly experience long periods of rain fall, and you will become accustomed to weeks upon weeks of sunny weather. |
2613296 | 2 - Highly relevant | Average temperatures in Jamaica range between 80 degrees Fahrenheit and 90 degrees Fahrenheit, with July and August being the hottest months and February the coolest.Temperatures in Jamaica generally vary approximately 10 degrees from summer to winter |
1824486 | 2 - Highly relevant | The climate in Jamaica is tropical and humid with warm to hot temperatures all year round.The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit.Jamaican nights are considerably... |
As we can see, for all 3 queries Elasticsearch returned mostly relevant results, and the top-ranked results were predominantly either highly or perfectly relevant.
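The tables can be summarized with a small calculation. The Python sketch below copies the relevance grades from the three tables above, in the order listed, and computes the share of returned passages graded 2 or higher for each query. The helper name and the threshold of 2 are illustrative choices, not a TREC metric.

```python
# Relevance grades copied from the three result tables, in listed order.
judgments = {
    "1124210": [2, 3, 2, 2, 3, 2, 2, 0, 2],
    "1129237": [0, 3, 3, 3, 3, 2, 3, 3, 0, 0],
    "1133167": [3, 3, 3, 3, 3, 2, 2, 2, 2, 2],
}


def share_relevant(grades, threshold=2):
    """Fraction of results graded at or above the threshold."""
    return sum(1 for grade in grades if grade >= threshold) / len(grades)


shares = {query_id: share_relevant(grades) for query_id, grades in judgments.items()}
for query_id, share in shares.items():
    print(f"query {query_id}: {share:.0%} graded 2 or higher")
```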
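Try it out
NLP is a powerful feature in the Elastic Stack with an exciting roadmap. Discover new features and keep up with the latest developments by building your cluster in Elastic Cloud. Sign up for a free 14-day trial today and try the examples in this blog.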