Unable to understand Elasticsearch analyzer regex

Can someone help me understand why my understanding of an Elasticsearch analyzer is wrong?

I have an index containing various fields, one in particular being:

"categories": {
    "type": "text",
    "analyzer": "words_only_analyser",
    "copy_to": "all",
    "fields": {
         "tokens": {
             "type": "text",
             "analyzer": "words_only_analyser",
             "term_vector": "yes",
             "fielddata" : True
          }
      }
}

The words_only_analyser looks like:

"words_only_analyser":{
    "type":"custom",
    "tokenizer":"words_only_tokenizer",
    "char_filter" : ["html_strip"],
    "filter":[ "lowercase", "asciifolding", "stop_filter", "kstem" ]
},

And the words_only_tokenizer looks like:

"tokenizer":{
    "words_only_tokenizer":{
        "type":"pattern",
        "pattern":"[^\\w-]+"
    }
}

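My understanding of the pattern [^\\w-] in the tokenizer is that it will tokenize a sentence by splitting it wherever any number of \, w, or - characters occur. For example, given that pattern, the sentence: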

seasonal-christmas-halloween this is a description about halloween

I expect to see:

[seasonal, christmas, hallo, een this is a description about hallo, een]

I can confirm the above with https://regex101.com/

However, when I run words_only_analyser on the sentence above:

curl -XGET localhost:9200/contextual/_analyze?pretty -H 'Content-Type: application/json' -d '{"analyzer":"words_only_analyser","text":"seasonal-christmas-halloween this is a description about halloween"}'

I get:

{
  "tokens" : [
    {
      "token" : "seasonal-christmas-halloween",
      "start_offset" : 0,
      "end_offset" : 28,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "description",
      "start_offset" : 39,
      "end_offset" : 50,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "halloween",
      "start_offset" : 57,
      "end_offset" : 66,
      "type" : "word",
      "position" : 6
    }
  ]
}

This tells me that the sentence was tokenized to:

[seasonal-christmas-halloween, description, halloween]

To my mind, the tokenizer pattern isn't being applied. Can someone explain where my understanding is incorrect?

Answer:

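A few things determine the final tokens produced by an analyzer: first the tokenizer, and then the token filters (for example, your stop_filter removes stop words such as this, is, a).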

You can also use the analyze API to test your tokenizer on its own. I recreated your configuration, and it produces the tokens below.

POST _analyze

{
    "tokenizer": "words_only_tokenizer", // Note `tokenizer` here
    "text": "seasonal-christmas-halloween this is a description about halloween"
}

Result

{
    "tokens": [
        {
            "token": "seasonal-christmas-halloween",
            "start_offset": 0,
            "end_offset": 28,
            "type": "word",
            "position": 0
        },
        {
            "token": "this",
            "start_offset": 29,
            "end_offset": 33,
            "type": "word",
            "position": 1
        },
        {
            "token": "is",
            "start_offset": 34,
            "end_offset": 36,
            "type": "word",
            "position": 2
        },
        {
            "token": "a",
            "start_offset": 37,
            "end_offset": 38,
            "type": "word",
            "position": 3
        },
        {
            "token": "description",
            "start_offset": 39,
            "end_offset": 50,
            "type": "word",
            "position": 4
        },
        {
            "token": "about",
            "start_offset": 51,
            "end_offset": 56,
            "type": "word",
            "position": 5
        },
        {
            "token": "halloween",
            "start_offset": 57,
            "end_offset": 66,
            "type": "word",
            "position": 6
        }
    ]
}

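You'll notice that the stop words are still present, because the tokenizer only breaks the text on whitespace and does not treat - as a delimiter. After JSON unescaping, the pattern is built on [^\w-], a negated character class that matches any character that is neither a word character nor a hyphen. The pattern tokenizer splits on whatever the pattern matches, so in this sentence only the spaces separate tokens.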
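The behaviour described above can be reproduced outside Elasticsearch. The following is a minimal sketch, not the tokenizer's actual implementation: it uses Python's re module as a stand-in for the Java regex engine behind the pattern tokenizer (the two agree on this ASCII input but are not identical in general), and it assumes the configured pattern is [^\\w-]+. The last step shows what happens when the still-escaped text is pasted verbatim into a regex tester: the class then means "anything except a backslash, w, or -", and its matches coincide with the list expected in the question.

```python
import json
import re

text = "seasonal-christmas-halloween this is a description about halloween"

# The pattern as it is written in the index settings (a JSON string literal).
json_literal = r'"[^\\w-]+"'
pattern = json.loads(json_literal)  # JSON decoding collapses \\ into \
print(pattern)

# The pattern describes the delimiters, not the tokens: runs of characters
# that are neither word characters nor '-'. Here that means the spaces.
tokens = [t for t in re.split(pattern, text) if t]
print(tokens)

# Using the escaped text as-is gives a different character class:
# anything except a backslash, 'w', or '-'. These are matches, not tokens.
escaped_matches = re.findall(r"[^\\w-]+", text)
print(escaped_matches)
```

So the expected list in the question likely comes from two things at once: reading the double backslash literally, and reading the pattern's matches as tokens rather than as separators.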

Now, if you run the same text through the analyzer, which also applies the filters, the stop words are removed and you get the tokens below.

POST _analyze

{
    "analyzer": "words_only_analyser",
    "text": "seasonal-christmas-halloween this is a description about halloween"
}

Result

{
    "tokens": [
        {
            "token": "seasonal-christmas-halloween",
            "start_offset": 0,
            "end_offset": 28,
            "type": "word",
            "position": 0
        },
        {
            "token": "description",
            "start_offset": 39,
            "end_offset": 50,
            "type": "word",
            "position": 4
        },
        {
            "token": "about",
            "start_offset": 51,
            "end_offset": 56,
            "type": "word",
            "position": 5
        },
        {
            "token": "halloween",
            "start_offset": 57,
            "end_offset": 66,
            "type": "word",
            "position": 6
        }
    ]
}
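The two-stage behaviour seen in these results (tokenizer first, token filters second) can be mimicked with a short sketch. It is only an illustration under stated assumptions: Python's re module stands in for the pattern tokenizer, the stop list is hypothetical (the real stop_filter definition is not shown in the question, and this list is chosen so that "about" survives as in the result above), and the html_strip, asciifolding and kstem steps are omitted for brevity.

```python
import re

# Hypothetical stop list; the actual stop_filter settings are not shown.
STOP_WORDS = {"this", "is", "a"}


def analyze(text):
    """Tokenize first, then filter, keeping each token's original position."""
    # Stage 1: tokenizer. Split on runs of non-word, non-hyphen characters.
    raw_tokens = [t for t in re.split(r"[^\w-]+", text) if t]

    # Stage 2: token filters, applied in order: lowercase, then stop words.
    kept = []
    for position, token in enumerate(raw_tokens):
        token = token.lower()
        if token not in STOP_WORDS:
            kept.append((position, token))
    return kept


result = analyze("seasonal-christmas-halloween this is a description about halloween")
print(result)
```

Removed stop words leave gaps in the position numbers, which is why description sits at position 4 rather than 1 in the analyzer output.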

