Elasticsearch - Count occurrences of a word in all texts in an index

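I need to get the count of word X across all texts in index Y, which has only one field, "content". Note that I need the count of one specific word: how many times it occurs in total across all documents. From what I have found, ES is not well optimized for this (because the field is a text type), but this is for a university assignment, so I have no choice.

So far I have tried this (taken from here):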

{
  "script_fields": {
    "phrase_Count": {
      "script": {
        "lang": "painless",
        "source": "int count = 0; if(doc['content.keyword'].size() > 0 && doc['content'].value.indexOf(params.phrase)!=-1) count++; return count;",
        "params": {
          "phrase": "ustawa"
        }
      }
    }
  }
}

The script approach returns:

{
  "error": {
    "root_cause": [
      {
        "type": "script_exception",
        "reason": "runtime error",
        "script_stack": [
          "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:88)",
          "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:40)",
          "if(doc['content.keyword'].size() > 0 && doc['content'].value.indexOf(params.phrase)!=-1) ",
          "       ^---- HERE"
        ],
        "script": "int count = 0; if(doc['content.keyword'].size() > 0 && doc['content'].value.indexOf(params.phrase)!=-1) count++; return count;",
        "lang": "painless",
        "position": {
          "offset": 22,
          "start": 15,
          "end": 104
        }
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "bills",
        "node": "MXtcD7-zT-mhDyxMeRTMLw",
        "reason": {
          "type": "script_exception",
          "reason": "runtime error",
          "script_stack": [
            "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:88)",
            "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:40)",
            "if(doc['content.keyword'].size() > 0 && doc['content'].value.indexOf(params.phrase)!=-1) ",
            "       ^---- HERE"
          ],
          "script": "int count = 0; if(doc['content.keyword'].size() > 0 && doc['content'].value.indexOf(params.phrase)!=-1) count++; return count;",
          "lang": "painless",
          "position": {
            "offset": 22,
            "start": 15,
            "end": 104
          },
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "No field found for [content.keyword] in mapping with types []"
          }
        }
      }
    ]
  },
  "status": 400
}

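I used content.keyword above because, with plain content, ES complains that the text type is not optimized for this kind of search.

I also tried using text statistics (from here), but I could not get it to work: it only counts the documents that contain the word, which is not what I am looking for.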

As my last approach I tried search with aggregation (from here), but it also just returned the count of documents, not words:

{
  "query": {
    "query_string": {
      "fields": ["content"],
      "query": "ustawa"
    }
  },  
  "aggs": {
    "my-terms": {
      "terms": {
        "field": "content.keyword"
      }
    }
  }
}

How can I achieve this? I'm using Python, if it matters.

EDIT Mapping for index I'm using:

  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }

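Reply from a uj5u.com user:

Elasticsearch 7.11 announced runtime_mappings. With this feature you can build a new field at query time and then use a regular "sum" aggregation to count the word across all documents.

For example: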

PUT test/_doc/1
{
  "field" : "test test test ss"

}
PUT test/_doc/2
{
  "field" : "test test test ss"

}
GET test/_search
{
  "size": 0, 
  "runtime_mappings": {
    "phrase_count": {
      "type": "long",
      "script": """
         String tmp = doc['field.keyword'].value;
         Matcher m = /(test)/.matcher(tmp);
         int count = 0;
         while (m.find()){
           count++;
         }
         emit(count);
          """
    }
  },
  "query": {
    "match_all": {}
  }, 
  "aggs": {
    "word_count": {
      "sum": {
        "field": "phrase_count"
      }
    }
  }
}

The "test" inside the Matcher is the word you are looking for and want to count.
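Since the asker is using Python, the request above could be built along these lines. This is only a sketch and has not been run against a cluster. It makes two assumptions that are not in the answer: the text is read from params._source (the asker's mapping has no content.keyword sub-field), and the word is passed as a script parameter and counted with indexOf instead of a regex literal. The helper names build_search_body, count_occurrences and total_from_response are made up for illustration.

```python
import json

# Painless source for the runtime field (assumption: reading from _source
# works for a plain text field that has no keyword sub-field).
PAINLESS_COUNT = """
def text = params._source[params.field];
int count = 0;
if (text != null) {
  String s = text.toString();
  int idx = s.indexOf(params.word);
  while (idx != -1) {
    count++;
    idx = s.indexOf(params.word, idx + params.word.length());
  }
}
emit(count);
"""


def build_search_body(field: str, word: str) -> dict:
    """Build a _search body that sums per-document occurrences of `word`."""
    if not word:
        raise ValueError("word must be non-empty")
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "runtime_mappings": {
            "phrase_count": {
                "type": "long",
                "script": {
                    "source": PAINLESS_COUNT,
                    "params": {"field": field, "word": word},
                },
            }
        },
        "query": {"match_all": {}},
        "aggs": {"word_count": {"sum": {"field": "phrase_count"}}},
    }


def count_occurrences(text: str, word: str) -> int:
    """Python mirror of the Painless loop, useful for checking it locally."""
    if not word:
        raise ValueError("word must be non-empty")
    count = 0
    idx = text.find(word)
    while idx != -1:
        count += 1
        idx = text.find(word, idx + len(word))
    return count


def total_from_response(response: dict) -> int:
    """Read the sum aggregation value out of a _search response."""
    return int(response["aggregations"]["word_count"]["value"])


if __name__ == "__main__":
    # Print the body; it would be sent to GET bills/_search.
    print(json.dumps(build_search_body("content", "ustawa"), indent=2))
```

Like the regex in the answer, this counts case-sensitive substring matches in the raw text, so "ustawa" would also be counted inside a longer word that contains it.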

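Reply from a uj5u.com user:

Elasticsearch has built-in APIs for retrieving this kind of information, because document and term frequencies are closely tied to BM25 scoring in Elasticsearch. See the term vectors API and its term statistics option. The value you are looking for there is the "total term frequency".

If you only want term statistics for one specific term, rather than for all terms in an existing document, you can send the API an "artificial document" that contains only the term you are looking for.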
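A rough Python sketch of the artificial-document request and of reading the total term frequency (the "ttf" key) from the reply is shown here. The helper names and the numbers in the sample response are invented for illustration. The body would be posted to bills/_termvectors, for example through the termvectors method of the official Python client. Keep in mind that the term in the reply is whatever my_analyzer produces from the word, and that, according to the term vectors documentation, the statistics come from a single shard and ignore deletions, so they are approximate on a multi-shard index.

```python
def build_termvectors_body(field: str, word: str) -> dict:
    """Request body for POST <index>/_termvectors with an artificial document."""
    return {
        "doc": {field: word},        # artificial document holding only the term
        "term_statistics": True,     # adds doc_freq and ttf per term
        "field_statistics": False,
        "positions": False,
        "offsets": False,
        "payloads": False,
    }


def total_term_frequencies(response: dict, field: str) -> dict:
    """Map each analyzed term in the reply to its total term frequency."""
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    return {term: stats.get("ttf", 0) for term, stats in terms.items()}


if __name__ == "__main__":
    # Illustrative response shape; the numbers are made up.
    sample = {
        "_index": "bills",
        "found": True,
        "term_vectors": {
            "content": {
                "terms": {
                    "ustawa": {"doc_freq": 40, "ttf": 123, "term_freq": 1}
                }
            }
        },
    }
    print(build_termvectors_body("content", "ustawa"))
    print(total_term_frequencies(sample, "content"))
```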


Tags: python elasticsearch full-text-search
