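I need to get the count of word X across all texts in index Y, which has only one field, "content". Note that I need the count of a specific word: how many times it occurs in total across all documents. From what I've found, ES isn't well optimized for this (since it is a text type), but this is for a university assignment, so I have no choice.
So far I have tried (taken from here):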
{
"script_fields": {
"phrase_Count": {
"script": {
"lang": "painless",
"source": "int count = 0; if(doc['content.keyword'].size() > 0 && doc['content'].value.indexOf(params.phrase)!=-1) count++; return count;",
"params": {
"phrase": "ustawa"
}
}
}
}
}
The script approach returns:
{
"error": {
"root_cause": [
{
"type": "script_exception",
"reason": "runtime error",
"script_stack": [
"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:88)",
"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:40)",
"if(doc['content.keyword'].size() > 0 && doc['content'].value.indexOf(params.phrase)!=-1) ",
" ^---- HERE"
],
"script": "int count = 0; if(doc['content.keyword'].size() > 0 && doc['content'].value.indexOf(params.phrase)!=-1) count++; return count;",
"lang": "painless",
"position": {
"offset": 22,
"start": 15,
"end": 104
}
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "bills",
"node": "MXtcD7-zT-mhDyxMeRTMLw",
"reason": {
"type": "script_exception",
"reason": "runtime error",
"script_stack": [
"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:88)",
"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:40)",
"if(doc['content.keyword'].size() > 0 && doc['content'].value.indexOf(params.phrase)!=-1) ",
" ^---- HERE"
],
"script": "int count = 0; if(doc['content.keyword'].size() > 0 && doc['content'].value.indexOf(params.phrase)!=-1) count++; return count;",
"lang": "painless",
"position": {
"offset": 22,
"start": 15,
"end": 104
},
"caused_by": {
"type": "illegal_argument_exception",
"reason": "No field found for [content.keyword] in mapping with types []"
}
}
}
]
},
"status": 400
}
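I used content.keyword above because, with plain content, ES complained that the text type is not optimized for this kind of search.
I also tried using text statistics (from here), but I couldn't get it to work; it only counts the documents containing the word (which is not what I'm looking for).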
As my last approach I tried search with aggregation (from here), but it also just returned the count of documents, not words:
{
"query": {
"query_string": {
"fields": ["content"],
"query": "ustawa"
}
},
"aggs": {
"my-terms": {
"terms": {
"field": "content.keyword"
}
}
}
}
How can I achieve this? I'm using Python, if it matters.
EDIT: Mapping for the index I'm using:
"mappings": {
"properties": {
"content": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
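Answer from a uj5u.com user:
Elasticsearch 7.11 announced runtime_mappings. With this feature you can build a new field at query time and then use a regular "sum" aggregation to count the word across all documents.
For example: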
PUT test/_doc/1
{
"field" : "test test test ss"
}
PUT test/_doc/2
{
"field" : "test test test ss"
}
GET test/_search
{
"size": 0,
"runtime_mappings": {
"phrase_count": {
"type": "long",
"script": """
String tmp = doc['field.keyword'].value;
Matcher m = /(test)/.matcher(tmp);
int count = 0;
while (m.find()){
count++;
}
emit(count);
"""
}
},
"query": {
"match_all": {}
},
"aggs": {
"word_count": {
"sum": {
"field": "phrase_count"
}
}
}
}
The "test" inside the Matcher pattern is the word you are looking for and want to count.
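Since the question mentions Python, here is a minimal sketch of building the same request body from Python with the word as a parameter. The helper names build_word_count_search and read_word_count are made up for illustration, and sending the body to Elasticsearch is left out so the sketch runs without a cluster. It keeps the answer's assumption that a `<field>.keyword` sub-field exists; the mapping shown in the question has no such sub-field, so it would have to be added first. Note also that the pattern matches substrings, so "test" would be counted inside "testing" as well.

```python
import json

# Painless source copied from the console example above, with the field name
# and the word left as placeholders. Doubled braces are literal braces.
SCRIPT_TEMPLATE = """
String tmp = doc['{field}.keyword'].value;
Matcher m = /({word})/.matcher(tmp);
int count = 0;
while (m.find()) {{
  count++;
}}
emit(count);
"""


def build_word_count_search(field, word):
    """Build a _search body that sums per-document occurrences of `word`.

    Hypothetical helper. Assumes `<field>.keyword` exists in the mapping.
    """
    # The word is pasted into a regex literal, so only plain letters and
    # digits are accepted here to avoid breaking the Painless script.
    if not word.isalnum():
        raise ValueError("word must consist of letters and digits only")
    return {
        "size": 0,
        "runtime_mappings": {
            "phrase_count": {
                "type": "long",
                "script": SCRIPT_TEMPLATE.format(field=field, word=word),
            }
        },
        "query": {"match_all": {}},
        "aggs": {"word_count": {"sum": {"field": "phrase_count"}}},
    }


def read_word_count(response):
    """Pull the summed count out of a parsed _search response."""
    return int(response["aggregations"]["word_count"]["value"])


# Print the body that would be sent to GET test/_search.
print(json.dumps(build_word_count_search("field", "test"), indent=2))
```

For the two sample documents indexed above, each containing "test" three times, the sum aggregation would be expected to report 6.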
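Answer from a uj5u.com user:
Elasticsearch has a built-in API for retrieving this kind of information, because document and term frequencies are highly relevant to BM25 scoring in Elasticsearch. See the term vectors API and its term statistics option. The value you are looking for there is the "total term frequency".
If you only want term statistics for one specific term, rather than for all terms in an existing document, you can send the API an "artificial document" that contains only the term you are looking for.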
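A minimal Python sketch of that idea follows. It builds the body for a `POST /<index>/_termvectors` request with an artificial document and reads the `ttf` (total term frequency) value from a parsed response. The helper names are made up for illustration, the sample response is hand-written rather than real output, and sending the request is left out so the sketch runs without a cluster. Two caveats, as far as I know: the term in the response is the analyzed token, which may differ from the word you typed depending on the analyzer, and the statistics are computed on a single shard, so they are only a total for a single-shard index.

```python
import json


def build_termvectors_request(field, word):
    """Body for POST /<index>/_termvectors using an artificial document.

    Hypothetical helper: the artificial document holds only `word`, so the
    response lists statistics for that term alone.
    """
    return {
        "doc": {field: word},
        "fields": [field],
        "term_statistics": True,   # adds doc_freq and ttf for each term
        "field_statistics": False,
        "positions": False,
        "offsets": False,
        "payloads": False,
    }


def total_term_frequencies(response, field):
    """Map each returned term to its ttf; terms without statistics get 0."""
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    return {term: stats.get("ttf", 0) for term, stats in terms.items()}


# Request body for the word from the question.
print(json.dumps(build_termvectors_request("content", "ustawa"), indent=2))

# Hand-written response in the shape the API returns; the numbers are
# placeholders, not measurements.
sample_response = {
    "term_vectors": {
        "content": {
            "terms": {"ustawa": {"doc_freq": 2, "ttf": 5, "term_freq": 1}}
        }
    }
}
print(total_term_frequencies(sample_response, "content"))
```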
Tags: python elasticsearch full-text-search
