Getting an exact match after mapping with not_analyzed

Ramzan

I have an Elasticsearch type with the following mapping:

{
  "mappings": {
    "jardata": {
      "properties": {
        "groupID": {
          "index": "not_analyzed",
          "type": "string"
        },
        "artifactID": {
          "index": "not_analyzed",
          "type": "string"
        },
        "directory": {
          "type": "string"
        },
        "jarFileName": {
          "index": "not_analyzed",
          "type": "string"
        },
        "version": {
          "index": "not_analyzed",
          "type": "string"
        }
      }
    }
  }
}

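I index directory as an analyzed field because I want to be able to give only the last folder and get results. But when I want to search for a specific directory, I need to give the whole path, since the same folder name can appear in two different paths. The problem is that, because the field is analyzed, the search returns all of the data instead of the specific data I want.

What I want is for the field to behave as both analyzed and not analyzed. Is there a way to do that?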

Joanna

Suppose you have indexed the following document:

{
    "directory": "/home/docs/public"
}

In your case the standard analyzer is not enough, because at index time it creates the following terms:

[home, docs, public]

Note that the [/home/docs/public] token is lost: characters such as "/" act as separators here.
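To make the loss concrete, here is a rough Python stand-in for what the standard analyzer does to a path. It is only an approximation (lowercase, then split on non-word characters), not the real analyzer, and the second path is a made-up example.

```python
import re


def approx_standard_terms(text):
    """Rough stand-in for the standard analyzer:
    lowercase the text and split on non-word characters."""
    return re.findall(r"\w+", text.lower())


terms_a = approx_standard_terms("/home/docs/public")
terms_b = approx_standard_terms("/srv/www/public")  # hypothetical second path

print(terms_a)  # ['home', 'docs', 'public']
print(terms_b)  # ['srv', 'www', 'public']

# Both documents share the term 'public', and neither keeps the full path
# as a single term, so an exact-path lookup has nothing to match against.
print("/home/docs/public" in terms_a)  # False
```

This is why searching on the last folder alone works, while searching on the whole path cannot tell the two documents apart.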

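One solution is to use the NGram tokenizer with the punctuation character class in its token_chars list. Elasticsearch will then treat "/" like a letter or digit, that is, as part of a token rather than as a separator. This makes it possible to search with tokens such as: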

[/hom, /home, ..., /home/docs/publi, /home/docs/public, ..., /docs/public, etc...]
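The token list above can be reproduced with a small Python sketch. It only approximates the tokenizer's behaviour, on the assumption that the letter, digit and punctuation classes correspond to the Unicode categories L*, Nd and P*. The helper names are invented for illustration, and the defaults 4 and 18 match the mapping in this answer.

```python
import unicodedata


def keeps(ch):
    """True if ch falls in the letter, digit or punctuation class (assumed mapping)."""
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat == "Nd" or cat.startswith("P")


def ngram_tokens(text, min_gram=4, max_gram=18):
    """Split text on characters outside token_chars, then emit every
    substring of length min_gram..max_gram from each remaining chunk."""
    chunks, current = [], []
    for ch in text:
        if keeps(ch):
            current.append(ch)
        elif current:
            chunks.append("".join(current))
            current = []
    if current:
        chunks.append("".join(current))

    tokens = []
    for chunk in chunks:
        for start in range(len(chunk)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(chunk):
                    tokens.append(chunk[start:start + size])
    return tokens


tokens = ngram_tokens("/home/docs/public")
print("/docs/public" in tokens)       # True
print("/home/docs/public" in tokens)  # True
print("public" in tokens)             # True
print("doc" in tokens)                # False: shorter than min_gram
```

Because "/" is kept inside tokens, both the full path and any trailing part of it (at least four characters long) end up as indexed terms.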

Index mapping:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 4,
          "max_gram": 18,
          "token_chars": [
            "letter",
            "digit",
            "punctuation"
          ]
        }
      }
    }
  },
  "mappings": {
    "jardata": {
      "properties": {
        "directory": {
          "type": "string",
          "analyzer": "ngram_analyzer"
        }
      }
    }
  }
}

Now both of these search queries:

{
  "query": {
    "bool": {
      "must": {
        "term": {
          "directory": "/docs/public"
        }
      }
    }
  }
}

and

{
  "query": {
    "bool": {
      "must": {
        "term": {
          "directory": "/home/docs/public"
        }
      }
    }
  }
}

will both return the indexed document.

One thing you must consider is the maximum token length, which is set by "max_gram". For directory paths it may need to be larger.
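A quick way to see why "max_gram" matters is to check whether a full path is one of its own n-grams. The sketch below assumes the path is made up only of letters, digits and punctuation, so it forms a single chunk. The 26-character path and the helper name are hypothetical.

```python
def path_grams(path, min_gram, max_gram):
    """All substrings of path whose length lies in [min_gram, max_gram].
    Assumes the whole path is one chunk (no whitespace or symbols)."""
    return {
        path[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(path) - n + 1)
    }


long_path = "/home/docs/public/releases"  # hypothetical 26-character path

# With max_gram = 18 the full path is never indexed as a term,
# so a term query on the whole path cannot match.
print(long_path in path_grams(long_path, 4, 18))  # False

# Raising max_gram above the path length makes the full path a term again.
print(long_path in path_grams(long_path, 4, 30))  # True
```

The trade-off is that a larger "max_gram" produces more terms per document, so the index grows accordingly.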

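Another solution is to use the Whitespace tokenizer, which breaks a phrase into terms on whitespace only, together with an NGram filter, using the following mapping: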

{
  "settings": {
    "analysis": {
      "filter": {
        "ngram_filter": {
          "type": "ngram",
          "min_gram": 4,
          "max_gram": 20
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "ngram_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "jardata": {
      "properties": {
        "directory": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
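The chain in this second mapping can also be approximated in Python to see which terms it would index. This is a simplified sketch that applies the three steps in the order the mapping lists them. The function name and the mixed-case input are made up for illustration.

```python
def analyze(text, min_gram=4, max_gram=20):
    """Approximate my_analyzer: whitespace tokenizer, lowercase filter,
    then an ngram filter applied to each token."""
    terms = []
    for token in text.split():        # whitespace tokenizer
        token = token.lower()         # lowercase filter
        for i in range(len(token)):   # ngram filter
            for n in range(min_gram, max_gram + 1):
                if i + n <= len(token):
                    terms.append(token[i:i + n])
    return terms


terms = analyze("/Home/Docs/Public")
print("/home/docs/public" in terms)  # True
print("/docs/public" in terms)       # True
print("/Home/Docs/Public" in terms)  # False: terms are stored lowercased
```

Since a term query is not analyzed, the value you search for would need to be lowercase to match terms produced by this analyzer.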

Edited on 2020-11-26