Can someone help me understand why my understanding of an Elasticsearch analyzer is wrong?
I have an index with various fields, one in particular:
"categories": {
"type": "text",
"analyzer": "words_only_analyser",
"copy_to": "all",
"fields": {
"tokens": {
"type": "text",
"analyzer": "words_only_analyser",
"term_vector": "yes",
"fielddata" : True
}
}
}
The words_only_analyser looks like:
"words_only_analyser":{
"type":"custom",
"tokenizer":"words_only_tokenizer",
"char_filter" : ["html_strip"],
"filter":[ "lowercase", "asciifolding", "stop_filter", "kstem" ]
},
The words_only_tokenizer looks like:
"tokenizer":{
"words_only_tokenizer":{
"type":"pattern",
"pattern":"[^\\w-]+"
}
}
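My understanding of the pattern [^\\w-] in the tokenizer is that it tokenizes a sentence by splitting it wherever any number of \, w, or - characters occur. For example, given that pattern, the sentence: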
seasonal-christmas-halloween this is a description about halloween
I would expect to see:
[seasonal, christmas, hallo, een this is a description about hallo, een]
I can confirm the above on https://regex101.com/
However, when I run words_only_analyser on the sentence above:
curl -XGET localhost:9200/contextual/_analyze?pretty -H 'Content-Type: application/json' -d '{"analyzer":"words_only_analyser","text":"seasonal-christmas-halloween this is a description about halloween"}'
I get:
{
"tokens" : [
{
"token" : "seasonal-christmas-halloween",
"start_offset" : 0,
"end_offset" : 28,
"type" : "word",
"position" : 0
},
{
"token" : "description",
"start_offset" : 39,
"end_offset" : 50,
"type" : "word",
"position" : 4
},
{
"token" : "halloween",
"start_offset" : 57,
"end_offset" : 66,
"type" : "word",
"position" : 6
}
]
}
This tells me the sentence was tokenized to:
[seasonal-christmas-halloween, description, halloween]
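It seems to me that the tokenizer pattern is not being applied? Can someone explain where my understanding is incorrect?
Answer from a uj5u.com user:
A few things determine the final tokens an analyzer produces: first the tokenizer, and then the token filters (for example, your stop_filter removes stop words such as this, is, and a).
You can also use the analyze API to test your tokenizer on its own. I recreated your configuration, and it generates the following tokens.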
POST contextual/_analyze
{
"tokenizer": "words_only_tokenizer", // Note `tokenizer` here
"text": "seasonal-christmas-halloween this is a description about halloween"
}
Result:
{
"tokens": [
{
"token": "seasonal-christmas-halloween",
"start_offset": 0,
"end_offset": 28,
"type": "word",
"position": 0
},
{
"token": "this",
"start_offset": 29,
"end_offset": 33,
"type": "word",
"position": 1
},
{
"token": "is",
"start_offset": 34,
"end_offset": 36,
"type": "word",
"position": 2
},
{
"token": "a",
"start_offset": 37,
"end_offset": 38,
"type": "word",
"position": 3
},
{
"token": "description",
"start_offset": 39,
"end_offset": 50,
"type": "word",
"position": 4
},
{
"token": "about",
"start_offset": 51,
"end_offset": 56,
"type": "word",
"position": 5
},
{
"token": "halloween",
"start_offset": 57,
"end_offset": 66,
"type": "word",
"position": 6
}
]
}
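Notice that the stop words are still present, because the tokenizer only splits on whitespace and leaves the hyphens alone. Inside the brackets, the leading ^ negates the character class, so [^\w-] matches any character that is neither a word character nor a hyphen; it does not mean "split on \, w, or -". In this sentence only the spaces match.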
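The tokenizer behaviour can be checked outside Elasticsearch. Below is a minimal sketch in Python, whose re module stands in for the Java regex engine used by the pattern tokenizer (for this ASCII sentence the two should agree). It assumes the configured regex is `[^\w-]+`, which is written `[^\\w-]+` inside JSON. The helper `split_on` and the non-negated pattern `[\\w-]+` are hypothetical, introduced only because the latter happens to reproduce the split the question expected.

```python
import json
import re

TEXT = "seasonal-christmas-halloween this is a description about halloween"

# The settings are JSON, so "\\w" in the file decodes to the regex escape \w.
# Assumption: the configured pattern is [^\w-]+ (one or more separator characters).
pattern = json.loads(r'{"pattern": "[^\\w-]+"}')["pattern"]


def split_on(separator_regex, text):
    """Split text wherever separator_regex matches, dropping empty pieces."""
    return [piece for piece in re.split(separator_regex, text) if piece]


# Negated class: separators are characters that are neither word characters nor "-".
actual = split_on(pattern, TEXT)

# Non-negated class with a literal backslash: separators are "\", "w" and "-".
# Hypothetical pattern, shown only because it yields the split the question expected.
expected_in_question = split_on(r"[\\w-]+", TEXT)

# Plain \W+ treats the hyphen as a separator as well.
hyphen_split = split_on(r"\W+", TEXT)

print(pattern)
print(actual)
print(expected_in_question)
print(hyphen_split)
```

The first split keeps seasonal-christmas-halloween as a single token, which matches the analyze API output. If the goal is to split on hyphens too, a class that does not exempt the hyphen (such as `\W+`) would be one option to try.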
Now, if you run the same text through the analyzer, which also applies the token filters, the stop words are removed and you get the tokens below.
POST contextual/_analyze
{
"analyzer": "words_only_analyser",
"text": "seasonal-christmas-halloween this is a description about halloween"
}
Result:
{
"tokens": [
{
"token": "seasonal-christmas-halloween",
"start_offset": 0,
"end_offset": 28,
"type": "word",
"position": 0
},
{
"token": "description",
"start_offset": 39,
"end_offset": 50,
"type": "word",
"position": 4
},
{
"token": "about",
"start_offset": 51,
"end_offset": 56,
"type": "word",
"position": 5
},
{
"token": "halloween",
"start_offset": 57,
"end_offset": 66,
"type": "word",
"position": 6
}
]
}
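The gap in the position values (0, then 4) follows from the order of the stages: the tokenizer assigns positions first, and the stop filter then drops tokens without renumbering the rest. Here is a rough Python sketch of that two-stage flow, not Elasticsearch's actual implementation. The stop list is an assumption (the real stop_filter settings are not shown in the thread), the function names are hypothetical, and the html_strip, asciifolding, and kstem steps are left out because they do not change this sentence's tokens in the output above.

```python
import re

TEXT = "seasonal-christmas-halloween this is a description about halloween"

# Assumption: the real stop_filter configuration is not shown in the thread.
# This hypothetical stop list matches the stop words named in the answer.
STOP_WORDS = {"this", "is", "a"}

# A token is a maximal run of word characters or hyphens, i.e. the text
# between matches of the separator pattern [^\w-]+.
TOKEN_RE = re.compile(r"[\w-]+")


def tokenize(text):
    """Mimic the pattern tokenizer: one dict per token with offsets and position."""
    return [
        {
            "token": match.group(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": index,
        }
        for index, match in enumerate(TOKEN_RE.finditer(text))
    ]


def analyze(text):
    """Tokenize, lowercase, then drop stop words while keeping original positions."""
    kept = []
    for tok in tokenize(text):
        lowered = dict(tok, token=tok["token"].lower())
        if lowered["token"] not in STOP_WORDS:
            kept.append(lowered)
    return kept


for tok in analyze(TEXT):
    print(tok["position"], tok["start_offset"], tok["end_offset"], tok["token"])
```

With this assumed stop list the surviving tokens keep positions 0, 4, 5, and 6, the same shape as the analyzer result above. A stop list that also contained about would drop position 5 as well, which is one possible reason the output in the question differs.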
