Getting correct search relevance for a multi-valued field

Posted by morr in Dev

morr

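I have an entity that can have any number of titles. In some cases a single entity has dozens or even hundreds of titles.

These titles are stored as an array in a single field in Elasticsearch. The field uses a complex analyzer with a complex tokenizer.

The problem is that Elasticsearch treats an array field (a field holding a set of values) as a single "string", and the relevance of a search result is computed as the total relevance across that whole "string". What I need is the relevance of the one array element that actually matched.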
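To make the effect concrete, here is a rough Python sketch of how the values of an array end up in one token stream for the field. It is only an approximation: `edge_ngrams` and `field_term_frequencies` are hypothetical helpers, the splitting on whitespace stands in for the real tokenizer's `token_chars` handling, and the gram sizes are the ones from the example settings further down.

```python
from collections import Counter


def edge_ngrams(word, min_gram=3, max_gram=12):
    """Return the prefixes of `word` that are min_gram..max_gram characters long."""
    longest = min(len(word), max_gram)
    return [word[:n] for n in range(min_gram, longest + 1)]


def field_term_frequencies(values):
    """Count terms over all array values as if they were one token stream."""
    tokens = []
    for value in values:
        # Simplification: lowercase and split on whitespace only.
        for word in value.lower().split():
            tokens.extend(edge_ngrams(word))
    return Counter(tokens)


doc1 = field_term_frequencies(["text"])
doc2 = field_term_frequencies(["text", "text"])

print(doc1)  # every term occurs once
print(doc2)  # every term occurs twice, so the term frequency doubles
```

In this sketch the second document has the same terms as the first, only with twice the frequency, which is the kind of difference a frequency-based scorer rewards.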

Below is a greatly simplified example.

Create the index

curl -XDELETE 'http://localhost:9200/tests'
curl -XPUT 'http://localhost:9200/tests' -d'{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "edge_ngram_tokenizer",
          "filter": ["lowercase", "asciifolding"]
        }
      },
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "3",
          "max_gram": "12",
          "token_chars": ["letter", "digit"]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}'

Populate the index

curl -XPOST 'http://localhost:9200/tests/test' -d'{ "id": 1, "name": ["text"] }'
curl -XPOST 'http://localhost:9200/tests/test' -d'{ "id": 2, "name": ["text", "text"] }'

Search

curl -XGET 'http://localhost:9200/tests/test/_search' -d'{
  "query": {
    "match": {
      "name": "text"
    }
  }
}'

Result

{
  "took": 0,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 2,
    "max_score": 0.7911257,
    "hits": [{
      "_index": "tests",
      "_type": "test",
      "_id": "AWOtIL2gdpqdbX7hdDXg",
      "_score": 0.7911257,
      "_source": { "id": 2, "name": [ "text", "text" ] }
    }, {
      "_index": "tests",
      "_type": "test",
      "_id": "AWOtIL0ldpqdbX7hdDXf",
      "_score": 0.51623213,
      "_source": { "id": 1, "name": [ "text" ] }
    }]
  }
}

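As you can see, id: 2 has a relevance score of 0.7911257, while id: 1 has 0.51623213.

I need both results to have the same relevance.

Is there any way to achieve this?

I know of two workarounds, but neither suits me. Maybe there are other options?

a) When the number of titles is relatively small, each title can be stored in its own field: name_0, name_1, name_2, and so on. These fields can then be queried with a dis_max query and tie_breaker: 0, and the relevance comes out right.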

"query": {
  "dis_max": {
    "queries": [
      { "match": { "name_0": "text" } },
      { "match": { "name_1": "text" } },
      { "match": { "name_2": "text" } }
    ],
    "tie_breaker": 0,
    "boost": 1
  }
}
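Why tie_breaker: 0 helps can be sketched in a few lines of Python. This is only an illustration of how a dis_max score is combined from its sub-query scores, not Elasticsearch code; `dis_max_score` is a hypothetical helper and the per-field scores are made-up numbers (real per-field scores also depend on each field's own statistics).

```python
def dis_max_score(sub_scores, tie_breaker=0.0):
    """Combine sub-query scores the way a dis_max query does.

    `sub_scores` holds one score per sub-query, or None when it did not match.
    The result is the best score plus tie_breaker times the other scores.
    """
    matched = [s for s in sub_scores if s is not None]
    if not matched:
        return 0.0
    best = max(matched)
    return best + tie_breaker * (sum(matched) - best)


# One matching title vs. two matching titles, stored in name_0, name_1, name_2.
one_title = dis_max_score([0.5, None, None], tie_breaker=0)
two_titles = dis_max_score([0.5, 0.5, None], tie_breaker=0)
print(one_title, two_titles)
```

With tie_breaker at 0 only the best-matching field counts, so an entity with two matching titles scores the same as one with a single matching title. A non-zero tie_breaker would bring the repeated title back into the score.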

b) Each title can be stored in Elasticsearch as a separate document (row):

curl -XPOST 'http://localhost:9200/tests/test' -d'{ "product_id": 1, "name": "text" }'
curl -XPOST 'http://localhost:9200/tests/test' -d'{ "product_id": 2, "name": "text" }'
curl -XPOST 'http://localhost:9200/tests/test' -d'{ "product_id": 2, "name": "text" }'

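In this case the results have to be aggregated further by product_id, which causes problems with result pagination and with any further aggregation of the results.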
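The pagination problem can be illustrated with a small client-side sketch in Python. The hit shape (a dict with `product_id` and `score`) and the helper `collapse_by_product` are assumptions made for this illustration, not the actual Elasticsearch response format.

```python
def collapse_by_product(hits):
    """Keep only the best-scoring hit per product_id, ordered by score."""
    best = {}
    for hit in hits:
        pid = hit["product_id"]
        if pid not in best or hit["score"] > best[pid]["score"]:
            best[pid] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)


# A "page" of three raw hits, as the three documents above might come back.
page = [
    {"product_id": 2, "name": "text", "score": 0.5},
    {"product_id": 2, "name": "text", "score": 0.5},
    {"product_id": 1, "name": "text", "score": 0.5},
]

products = collapse_by_product(page)
print(len(page), "hits ->", len(products), "products")
```

A page that was requested with three hits shrinks to two products after grouping, so the page size and the total count can no longer be taken from the raw search result.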

Jordan

I think adding this to your name field:

"index_options": "docs"

will do the trick.
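For orientation, this is where the option would sit in the mapping from the question, written as a Python dict that can be serialized and sent as the index-creation body. `build_index_body` is a hypothetical helper, and the analysis settings from the question are left out here and would have to be merged in.

```python
import json


def build_index_body(index_options="docs"):
    """Build the mappings part of the index body with index_options on `name`."""
    return {
        "mappings": {
            "test": {
                "properties": {
                    "name": {
                        "type": "string",
                        "analyzer": "my_analyzer",
                        # "docs" indexes only which documents contain a term,
                        # without term frequencies or positions.
                        "index_options": index_options,
                    }
                }
            }
        }
    }


body = build_index_body()
print(json.dumps(body, indent=2))
```

Two caveats, as far as I know: the option most likely cannot be changed on an existing field, so the index would have to be recreated and repopulated as in the question, and because positions are no longer indexed, phrase queries on this field should be expected to stop working.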

This setting tells ES to ignore the term frequency (TF) for that field.
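A small Python sketch of the term-frequency part of a BM25 score shows why this matters. It assumes the BM25 similarity with its default parameters k1 = 1.2 and b = 0.75 (the default in Elasticsearch 5.0 and later; older versions use a different formula), and it deliberately keeps the field length equal to the average to isolate the TF effect. `bm25_tf` is a hypothetical helper, and length normalization may still differ between documents in a real index.

```python
def bm25_tf(freq, k1=1.2, b=0.75, field_len=1.0, avg_field_len=1.0):
    """Term-frequency component of BM25 for one term in one document."""
    length_norm = k1 * (1 - b + b * field_len / avg_field_len)
    return freq * (k1 + 1) / (freq + length_norm)


# With frequencies indexed: the term occurs once in id 1 and twice in id 2.
once = bm25_tf(1)
twice = bm25_tf(2)

# Without frequencies every matching document counts as one occurrence.
docs_only_id1 = bm25_tf(1)
docs_only_id2 = bm25_tf(1)

print(once, twice)
print(docs_only_id1, docs_only_id2)
```

Once frequencies are not recorded, both documents contribute the same TF component, which is the difference the question asks to get rid of.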

If you want to learn more, read up on the theory behind it.
