news 2026/4/15 14:44:46

Elasticsearch: How to use an LLM to extract the information you need at ingest time

张小明

Front-end development engineer

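In many use cases, we can use an LLM to help us extract the structured data we need. That structured data can be a classification, a set of synonyms, and so on. In my earlier article "How to automate synonyms and upload them with our Synonyms API", we showed how to use an LLM to generate synonyms and upload them to Elasticsearch. In today's example, we move the LLM extraction step into an ingest pipeline, so that the required information is extracted automatically while the data is being ingested.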

Create the LLM chat completion endpoint

We can refer to the earlier article "Elasticsearch: Inference endpoints and a semantic search demo". We create a chat completion endpoint as follows:

PUT _inference/completion/azure_openai_completion
{
  "service": "azureopenai",
  "service_settings": {
    "api_key": "${AZURE_API_KEY}",
    "resource_name": "${AZURE_RESOURCE_NAME}",
    "deployment_id": "${AZURE_DEPLOYMENT_ID}",
    "api_version": "${AZURE_API_VERSION}"
  }
}
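The `${...}` placeholders in the request above are Kibana Console variables. If you prefer to prepare the same request body from a script, a minimal Python sketch might look like the following. It assumes the four settings live in environment variables with the same names as the Console variables; the helper name `build_completion_endpoint_body` is hypothetical, and the sketch only builds the body without sending it anywhere.

```python
import os

# Names mirror the Console variables used in the request above (an assumption:
# here they are read from environment variables instead of Kibana).
REQUIRED_SETTINGS = [
    "AZURE_API_KEY",
    "AZURE_RESOURCE_NAME",
    "AZURE_DEPLOYMENT_ID",
    "AZURE_API_VERSION",
]


def build_completion_endpoint_body(env=None):
    """Build the JSON body for PUT _inference/completion/<endpoint_id>."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise KeyError("missing settings: " + ", ".join(missing))
    return {
        "service": "azureopenai",
        "service_settings": {
            "api_key": env["AZURE_API_KEY"],
            "resource_name": env["AZURE_RESOURCE_NAME"],
            "deployment_id": env["AZURE_DEPLOYMENT_ID"],
            "api_version": env["AZURE_API_VERSION"],
        },
    }
```

Failing early on a missing setting avoids creating an endpoint with an empty credential, which would otherwise only surface later when the pipeline calls the model.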

Create an ingest pipeline

Before creating the pipeline, we can test it with the _simulate API. The test request relies on a Kibana variable named EXTRACTION_PROMPT, whose value is:

Extract audio product information from this description. Return raw JSON only. Do NOT use markdown, backticks, or code blocks. Fields: category (string, one of: Headphones/Earbuds/Speakers/Microphones/Accessories), features (array of strings from: wireless/noise_cancellation/long_battery/waterproof/voice_assistant/fast_charging/portable/surround_sound), use_case (string, one of: Travel/Office/Home/Fitness/Gaming/Studio). Description:

If you are not yet familiar with how to define this variable, see my earlier article "Kibana: How to set variables and apply them". With the variable defined, the test request is:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Use LLM to interpret messages to come out categories",
    "processors": [
      {
        "script": {
          "source": "ctx.prompt = params.EXTRACTION_PROMPT + ctx.description",
          "params": {
            "EXTRACTION_PROMPT": "${EXTRACTION_PROMPT}"
          }
        }
      },
      {
        "inference": {
          "model_id": "azure_openai_completion",
          "input_output": {
            "input_field": "prompt",
            "output_field": "ai_response"
          }
        }
      },
      {
        "json": {
          "field": "ai_response",
          "add_to_root": true
        }
      },
      {
        "remove": {
          "field": [
            "prompt",
            "ai_response"
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "name": "Wireless Noise-Canceling Headphones",
        "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.",
        "price": 299.99
      }
    }
  ]
}

Tip: you can use any LLM you like to create the endpoint above.

Running the command above returns:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "use_case": "Travel",
          "features": [
            "wireless",
            "noise_cancellation",
            "long_battery"
          ],
          "price": 299.99,
          "name": "Wireless Noise-Canceling Headphones",
          "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.",
          "model_id": "azure_openai_completion",
          "category": "Headphones"
        },
        "_ingest": {
          "timestamp": "2026-01-22T13:56:11.926494Z"
        }
      }
    }
  ]
}
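Because the extracted values come from an LLM, it can be worth checking them against the allowed values listed in the prompt. The following is a minimal Python sketch of such a check, not part of the pipeline itself. The allowed sets are copied from the EXTRACTION_PROMPT text; the function name `validate_extraction` is hypothetical.

```python
# Allowed values, copied from the EXTRACTION_PROMPT text.
ALLOWED_CATEGORIES = {"Headphones", "Earbuds", "Speakers", "Microphones", "Accessories"}
ALLOWED_FEATURES = {
    "wireless", "noise_cancellation", "long_battery", "waterproof",
    "voice_assistant", "fast_charging", "portable", "surround_sound",
}
ALLOWED_USE_CASES = {"Travel", "Office", "Home", "Fitness", "Gaming", "Studio"}


def validate_extraction(source):
    """Return a list of problems found in an enriched document (empty if none)."""
    problems = []

    category = source.get("category")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        problems.append(f"unexpected category: {category!r}")

    features = source.get("features")
    if not isinstance(features, list):
        problems.append("features is not a list")
    else:
        for feature in features:
            if not isinstance(feature, str) or feature not in ALLOWED_FEATURES:
                problems.append(f"unexpected feature: {feature!r}")

    use_case = source.get("use_case")
    if not isinstance(use_case, str) or use_case not in ALLOWED_USE_CASES:
        problems.append(f"unexpected use_case: {use_case!r}")

    return problems


# The fields returned by the simulate call above.
simulated = {
    "category": "Headphones",
    "features": ["wireless", "noise_cancellation", "long_battery"],
    "use_case": "Travel",
}
print(validate_extraction(simulated))
```

For the simulated document the list of problems is empty, since every value appears in the prompt's allowed lists.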

The test above works as expected. We can now go on to create the pipeline:

PUT _ingest/pipeline/product-enrichment-pipeline
{
  "processors": [
    {
      "script": {
        "source": "ctx.prompt = params.EXTRACTION_PROMPT + ctx.description",
        "params": {
          "EXTRACTION_PROMPT": "${EXTRACTION_PROMPT}"
        }
      }
    },
    {
      "inference": {
        "model_id": "azure_openai_completion",
        "input_output": {
          "input_field": "prompt",
          "output_field": "ai_response"
        }
      }
    },
    {
      "json": {
        "field": "ai_response",
        "add_to_root": true
      }
    },
    {
      "remove": {
        "field": [
          "prompt",
          "ai_response"
        ]
      }
    }
  ]
}
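To make the data flow through the processors easier to follow, here is a rough Python emulation of the chain: build the prompt, call the model, parse the JSON reply into the document root, then drop the temporary fields. This is only an illustration under assumptions. The `complete` argument is a hypothetical callable standing in for the inference endpoint, and the sketch does not add the `model_id` field that appears in the real output.

```python
import json


def run_pipeline(doc, extraction_prompt, complete):
    """Emulate the script -> inference -> json -> remove processors locally."""
    ctx = dict(doc)  # work on a copy, like the pipeline works on ctx

    # script processor: concatenate the prompt and the product description
    ctx["prompt"] = extraction_prompt + ctx["description"]

    # inference processor: input_field "prompt", output_field "ai_response"
    ctx["ai_response"] = complete(ctx["prompt"])

    # json processor with add_to_root: merge the parsed object into the root
    parsed = json.loads(ctx["ai_response"])
    if not isinstance(parsed, dict):
        raise ValueError("the model reply must be a JSON object")
    ctx.update(parsed)

    # remove processor: drop the temporary fields
    for field in ("prompt", "ai_response"):
        del ctx[field]
    return ctx


def fake_complete(prompt):
    # Stub reply for illustration; a real call would go to the LLM.
    return '{"category": "Headphones", "features": ["wireless"], "use_case": "Travel"}'


enriched = run_pipeline(
    {"name": "Wireless Noise-Canceling Headphones", "description": "Premium wireless headphones.", "price": 299.99},
    "Extract audio product information. Description: ",
    fake_complete,
)
print(enriched)
```

The emulation also shows why the prompt asks for raw JSON only: if the reply were wrapped in markdown backticks, the parsing step would fail.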

Create the index and write data

Next, we create an index named products:

PUT products
{
  "settings": {
    "default_pipeline": "product-enrichment-pipeline"
  }
}

As shown above, we set default_pipeline, the index's default pipeline, to product-enrichment-pipeline. This way, the pipeline is invoked automatically whenever we write data in the usual way:

POST _bulk
{ "index": { "_index": "products", "_id": "1" } }
{ "name": "Wireless Noise-Canceling Headphones", "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.", "price": 299.99 }
{ "index": { "_index": "products", "_id": "2" } }
{ "name": "Portable Bluetooth Speaker", "description": "Compact waterproof speaker with 360-degree surround sound. 20-hour battery life, perfect for outdoor adventures and pool parties.", "price": 149.99 }
{ "index": { "_index": "products", "_id": "3" } }
{ "name": "Studio Condenser Microphone", "description": "Professional USB microphone with noise cancellation and voice assistant compatibility. Ideal for podcasting, streaming, and home studio recording.", "price": 199.99 }

Note: depending on the speed of the LLM, the call above may take a while to complete!
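The _bulk body is newline-delimited JSON: one action line, then one document line, each on its own line, with a newline after the last line. When the products come from a program rather than the Console, the body can be assembled along these lines. This is a sketch only; `build_bulk_body` is a hypothetical helper and the example does not send anything to Elasticsearch.

```python
import json


def build_bulk_body(index, docs):
    """Serialize (doc_id, source) pairs into an NDJSON body for the _bulk API."""
    lines = []
    for doc_id, source in docs:
        # action line: which index and _id the next line belongs to
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        # document line: the source itself, on a single line
        lines.append(json.dumps(source))
    # the bulk body must end with a newline
    return "\n".join(lines) + "\n"


body = build_bulk_body(
    "products",
    [
        ("1", {"name": "Wireless Noise-Canceling Headphones", "price": 299.99}),
        ("2", {"name": "Portable Bluetooth Speaker", "price": 149.99}),
    ],
)
print(body)
```

Because the index sets default_pipeline, no pipeline parameter is needed in the action lines; the enrichment runs for every document written this way.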

With the data written, we can use the following command to inspect it:

GET products/_search?filter_path=**.hits

{
  "hits": {
    "hits": [
      {
        "_index": "products",
        "_id": "1",
        "_score": 1,
        "_source": {
          "use_case": "Travel",
          "features": [
            "wireless",
            "noise_cancellation",
            "long_battery"
          ],
          "price": 299.99,
          "name": "Wireless Noise-Canceling Headphones",
          "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.",
          "model_id": "azure_openai_completion",
          "category": "Headphones"
        }
      },
      {
        "_index": "products",
        "_id": "2",
        "_score": 1,
        "_source": {
          "use_case": "Travel",
          "features": [
            "waterproof",
            "surround_sound"
          ],
          "price": 149.99,
          "name": "Portable Bluetooth Speaker",
          "description": "Compact waterproof speaker with 360-degree surround sound. 20-hour battery life, perfect for outdoor adventures and pool parties.",
          "model_id": "azure_openai_completion",
          "category": "Speakers"
        }
      },
      {
        "_index": "products",
        "_id": "3",
        "_score": 1,
        "_source": {
          "use_case": "Studio",
          "features": [
            "noise_cancellation",
            "voice_assistant"
          ],
          "price": 199.99,
          "name": "Studio Condenser Microphone",
          "description": "Professional USB microphone with noise cancellation and voice assistant compatibility. Ideal for podcasting, streaming, and home studio recording.",
          "model_id": "azure_openai_completion",
          "category": "Microphones"
        }
      }
    ]
  }
}

With structured data like this, we can now search our data or compute statistics over it.

Happy learning!
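As a small illustration of what the structured fields make possible, the sketch below filters and counts the three enriched documents in plain Python. It is a stand-in for the kind of filtering and counting you would normally run inside Elasticsearch, not a replacement for it. The field values are copied from the search response; the helper names are hypothetical.

```python
from collections import Counter

# Enriched fields copied from the search response.
products = [
    {"name": "Wireless Noise-Canceling Headphones", "category": "Headphones",
     "features": ["wireless", "noise_cancellation", "long_battery"], "use_case": "Travel"},
    {"name": "Portable Bluetooth Speaker", "category": "Speakers",
     "features": ["waterproof", "surround_sound"], "use_case": "Travel"},
    {"name": "Studio Condenser Microphone", "category": "Microphones",
     "features": ["noise_cancellation", "voice_assistant"], "use_case": "Studio"},
]


def names_with_feature(items, feature):
    """Return the names of the products that list the given feature."""
    return [p["name"] for p in items if feature in p["features"]]


def count_by(items, field):
    """Count documents per value of a field; list fields count each element."""
    counter = Counter()
    for p in items:
        value = p[field]
        counter.update(value if isinstance(value, list) else [value])
    return counter


print(names_with_feature(products, "noise_cancellation"))
print(count_by(products, "use_case"))
print(count_by(products, "features"))
```

Neither question could be answered reliably from the free-text description alone; the fields extracted at ingest time are what make exact filtering and counting possible.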
