Influence of the "_id" field on Elasticsearch search results?

皮亚拉德

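I'm having some trouble with Elasticsearch. I managed to build a reproducible example on my machine; the code is at the end of this post.

I simply create 6 users, "Roger Sand", "Roger Gilbert", "Cindy Sand", "Cindy Gilbert", "Jean-Roger Sands" and "Sand Roger", and index their names.

Then I run a query matching "Roger Sand" and display the associated scores.

Here are two executions of the same script with two different sets of ids: 84046 to 84051 and 84047 to 84052 (shifted by just 1).

The results come back in a different order, and the scores differ as well:

Execution with ids 84046..84051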

Sand Roger => 0.8838835
Roger Sand => 0.2712221
Cindy Sand => 0.22097087
Jean-Roger Sands => 0.17677669
Roger Gilbert => 0.028130025

Execution with ids 84047..84052

Roger Sand => 0.2712221
Sand Roger => 0.2712221
Cindy Sand => 0.22097087
Jean-Roger Sands => 0.17677669
Roger Gilbert => 0.15891947

My question is: why does the "id" influence a search on "full_name"?

Here is the complete Ruby code of the reproducible script.

require 'elasticsearch'

first_id = 84046 # or 84047
client = Elasticsearch::Client.new(:log => true)
client.transport.reload_connections!
client.indices.delete({:index => 'test'})
client.indices.create({ :index => 'test' })
client.perform_request('POST', 'test/_refresh')

["Roger Sand", "Roger Gilbert", "Cindy Sand", "Cindy Gilbert", "Jean-Roger Sands", "Sand  Roger" ].each_with_index do |name, i|
  i2 = first_id + i
  client.create({
    :index => 'test', :type => 'user',
    :id => i2,
    :body => { :full_name => name }
  })
end

query_options = {
  :type => 'user', :index => 'test',
  :body => {
    :query => { :match => { :full_name => "Roger Sand" } } 
  }
}

client.perform_request('POST', 'test/_refresh')

client.search(query_options)["hits"]["hits"].each do |hit|
  $stderr.puts "#{hit["_source"]["full_name"]} => #{hit["_score"]}"
end

Here are the equivalent command lines:

curl -XDELETE 'http://localhost:9200/test' 
curl -XPUT 'http://localhost:9200/test' 
curl -XPOST 'http://localhost:9200/test/_refresh' 
curl -XPUT 'http://localhost:9200/test/user/84047?op_type=create' -d '{"full_name":"Roger Sand"}'
curl -XPUT 'http://localhost:9200/test/user/84048?op_type=create' -d '{"full_name":"Roger Gilbert"}'
curl -XPUT 'http://localhost:9200/test/user/84049?op_type=create' -d '{"full_name":"Cindy Sand"}'
curl -XPUT 'http://localhost:9200/test/user/84050?op_type=create' -d '{"full_name":"Cindy Gilbert"}'
curl -XPUT 'http://localhost:9200/test/user/84051?op_type=create' -d '{"full_name":"Jean-Roger Sands"}'
curl -XPUT 'http://localhost:9200/test/user/84052?op_type=create' -d '{"full_name":"Sand Roger"}'
curl -XPOST 'http://localhost:9200/test/_refresh' 
curl -XPOST 'http://localhost:9200/test/user/_search?pretty' -d '{"query":{"match":{"full_name":"Roger Sand"}}}'


curl -XDELETE 'http://localhost:9200/test'
curl -XPUT 'http://localhost:9200/test'
curl -XPOST 'http://localhost:9200/test/_refresh'
curl -XPUT 'http://localhost:9200/test/user/84046?op_type=create' -d '{"full_name":"Roger Sand"}'
curl -XPUT 'http://localhost:9200/test/user/84047?op_type=create' -d '{"full_name":"Roger Gilbert"}'
curl -XPUT 'http://localhost:9200/test/user/84048?op_type=create' -d '{"full_name":"Cindy Sand"}'
curl -XPUT 'http://localhost:9200/test/user/84049?op_type=create' -d '{"full_name":"Cindy Gilbert"}'
curl -XPUT 'http://localhost:9200/test/user/84050?op_type=create' -d '{"full_name":"Jean-Roger Sands"}'
curl -XPUT 'http://localhost:9200/test/user/84051?op_type=create' -d '{"full_name":"Sand Roger"}'
curl -XPOST 'http://localhost:9200/test/_refresh'
curl -XPOST 'http://localhost:9200/test/user/_search?pretty' -d '{"query":{"match":{"full_name":"Roger Sand"}}}'

纳特沃克

The problem lies in the distributed score calculation.

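You create a new index with the default settings, that is, 5 shards. Each shard is its own Lucene index. When you index a document, Elasticsearch has to decide which shard it goes to, and it does so by hashing the _id (in the absence of a routing parameter).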

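So, by shifting the ids, the documents end up distributed across different shards. As stated above, each shard is its own Lucene index, so each shard scores its documents using only its own term statistics. When you search across multiple shards, the scores from the individual shards have to be merged, and because the routing differs between the two runs, the individual scores differ too.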
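The routing step described above can be sketched in a few lines of Ruby. This is only an illustration: `shard_for` and `layout` are hypothetical helpers, and CRC32 is used as a stand-in hash, since the function Elasticsearch actually applies to the _id is not given here. The point is simply that the shard is derived from the id alone, so a different range of ids can group the same names differently.

```ruby
require 'zlib'

NUM_SHARDS = 5 # default shard count mentioned in the answer

NAMES = ["Roger Sand", "Roger Gilbert", "Cindy Sand",
         "Cindy Gilbert", "Jean-Roger Sands", "Sand Roger"]

# Stand-in for the routing hash. Assumption: CRC32 is NOT what
# Elasticsearch uses; it only illustrates "hash(_id) modulo shard count".
def shard_for(id, num_shards = NUM_SHARDS)
  Zlib.crc32(id.to_s) % num_shards
end

# Returns { shard_number => [names stored on that shard] } for ids
# first_id, first_id + 1, ... assigned in order.
def layout(names, first_id)
  names.each_with_index
       .group_by { |_, i| shard_for(first_id + i) }
       .transform_values { |pairs| pairs.map(&:first) }
end

[84046, 84047].each do |first_id|
  puts "first_id = #{first_id}"
  layout(NAMES, first_id).sort.each do |shard, docs|
    puts "  shard #{shard}: #{docs.inspect}"
  end
end
```

Comparing the two printed layouts shows which names share a shard in each run; any name that changes neighbours is scored against different per-shard statistics.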

You can verify this by adding explain to the query. For Sand Roger, the idf is computed as idf(docFreq=1, maxDocs=1) = 0.30685282 in one run and idf(docFreq=1, maxDocs=2) = 1 in the other, which leads to different scores.
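As a quick check of those two numbers, here is a minimal sketch assuming the classic Lucene TF-IDF formula, idf = 1 + ln(maxDocs / (docFreq + 1)). The helper name `idf` is made up for this example; the formula reproduces both values from the explain output.

```ruby
# Classic Lucene TF-IDF inverse document frequency (assumed formula):
#   idf = 1 + ln(maxDocs / (docFreq + 1))
def idf(doc_freq, max_docs)
  1.0 + Math.log(max_docs.to_f / (doc_freq + 1))
end

# The term appears in 1 document on a shard holding 1 document.
puts idf(1, 1).round(8) # => 0.30685282

# The term appears in 1 document on a shard holding 2 documents.
puts idf(1, 2)          # => 1.0
```

The same document therefore gets a different idf depending only on how many other documents happen to live on its shard.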

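You can either set the number of shards to 1 or change the search type to one of the dfs types. Searching against http://localhost:9200/test/user/_search?pretty&search_type=dfs_query_and_fetch will give you the correct scores, because, as the documentation puts it, this search type adds an

"initial scatter phase which goes and computes the distributed term frequencies for more accurate scoring"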

http://www.elasticsearch.org/guide/zh-CN/elasticsearch/reference/current/search-request-search-type.html#dfs-query-and-fetch
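To see why collecting term statistics across shards first removes the discrepancy, here is a small simulation. It is a sketch under assumptions: the shard layout in `SHARDS` is invented, `doc_freq`, `local_idfs` and `global_idf` are hypothetical helpers, tokenization is a crude split on non-letters, and the classic Lucene idf formula is assumed. It does not call Elasticsearch at all.

```ruby
# Classic Lucene TF-IDF idf (assumed): 1 + ln(maxDocs / (docFreq + 1))
def idf(doc_freq, max_docs)
  1.0 + Math.log(max_docs.to_f / (doc_freq + 1))
end

# Hypothetical layout: each inner array is one shard's documents.
SHARDS = [
  ["Sand Roger"],
  ["Roger Sand", "Cindy Gilbert"],
  ["Roger Gilbert", "Cindy Sand", "Jean-Roger Sands"]
]

# Number of documents in `docs` that contain `term` as a whole token.
def doc_freq(docs, term)
  docs.count { |d| d.downcase.split(/[^a-z]+/).include?(term) }
end

# Without a dfs phase: every shard uses only its own statistics.
def local_idfs(term)
  SHARDS.map { |docs| idf(doc_freq(docs, term), docs.size) }
end

# With a dfs-style phase: sum the statistics over all shards first,
# then compute a single idf that every shard uses.
def global_idf(term)
  total_df   = SHARDS.sum { |docs| doc_freq(docs, term) }
  total_docs = SHARDS.sum(&:size)
  idf(total_df, total_docs)
end

local_idfs("sand").each_with_index do |value, i|
  puts "shard #{i}: idf(sand) = #{value}"
end
puts "global:  idf(sand) = #{global_idf('sand')}"
```

With per-shard statistics the idf for the same term varies from shard to shard; with the summed statistics there is a single value that no longer depends on where the documents were routed.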


Edited on 2021-03-15
