1、線上實戰問題

前置說明：本文是線上環境的實戰問題拆解，涉及復雜 DSL，看著會很長，但強烈建議您耐心讀完，

問題描述：

有個復雜的場景涉及到按照求和后過濾，user_id是用戶編號，gender是性別，time_label是時間標簽，時間標簽是nested結構，intent_order_count是意向訂單數量，time是對應時間，

現在要篩選出在20210510~20210610，意向訂單數總和為26的男性用戶，請問應該怎么寫dsl陳述句？

感覺這個場景很復雜，涉及到array判斷后求和，然后求和結果做篩選條件，

請幫忙看看有什么好的dsl陳述句，或者改變現有mapping結構，

這個是mapping結構如下：

PUT index_personal
{
  "mappings": {
    "properties": {
      "time_label": {
        "type": "nested",
        "properties": {
          "intent_order_count": {
            "type": "long"
          },
          "time": {
            "type": "long"
          }
        }
      },
      "user_id": {
        "type": "keyword"
      },
      "gender": {
        "type": "keyword"
      }
    }
  }
}




下面是我構造的資料：


PUT index_personal/_doc/1
{
  "user_id": "1",
  "gender": "male",
  "time_label": [
    {
      "time": 20210601,
      "intent_order_count": 3
    },
    {
      "time": 20210602,
      "intent_order_count": 2
    },
    {
      "time": 20210605,
      "intent_order_count": 20
    },
    {
      "time": 20210606,
      "intent_order_count": 1
    },
    {
      "time": 20210611,
      "intent_order_count": 15
    }
  ]
}

PUT index_personal/_doc/2
{
  "user_id": "2",
  "gender": "female"
}


PUT index_personal/_doc/3
{
  "user_id": "3",
  "gender": "male",
  "time_label": [
    {
      "time": 20210102,
      "intent_order_count": 12
    },
    {
      "time": 20210202,
      "intent_order_count": 33
    }
  ]
}

問題擴展解釋：

1、"intent_order_count"代表：是訂單數，不過都可以抽象成這個用戶某個時間買了幾個，

比如第三條資料，表示用戶編號為 3 的用戶，是男性用戶，曾經在 20210102 時有12個意向訂單（跟訂單一個意思），在 20210202 有 33 個意向訂單，

2、每個用戶除了性別還有很多屬性，篇幅受限，沒有列出，

問題來源：https://t.zsxq.com/FmEeaIY

2、資料建模探討

2.1 原問題 Nested 模型

原有資料，以 Nested 建模，存盤結構如下：

user_id	gender	time_label {time:intent_order_count}
1	male	[ {20210601:3} {20210602:2}{20210605:20}{20210606:1}{20210611:15}]
2	female
3	male	{ 20210102:12}{20210202:33}

以上表示并不嚴謹，僅是為了更直觀的闡述問題，

2.2 寬表建模方案

拿到問題后，我的第一反應：建模可能有問題，

第一：time 存盤的是日期，應該是日期型別：date，
第二：寬表拉平存盤是不是更好？！也就是說：針對：“user_id” 的用戶，一個時間資料，對應一個 document 檔案，

原有的 nested 結構，改成如下的一條條的記錄，也就是“寬表”，類似簡化存盤如下：

user_id	gender	time	intent_order_count
1	male	20210601	3
1	male	20210602	2
1	male	20210605	20
1	male	20210606	1
1	male	20210611	15
2	female
3	male	20210102	12
3	male	20210202	33

“寬表”是典型的以空間換時間的方案，我們肉眼看到的：對于 user_id=1 的用戶，user_id, gender 資訊會存盤 N 份（每多一次 time，就多存盤一次），

如前所述，每個用戶除了性別還有很多屬性，也就是屬性非常多的話，會產生大量的冗余存盤，

寬表方案優缺點如下：

優點：更利用用戶理解，寫入和更新非常方便且效率高，
缺點：存在大量冗余存盤，耗費空間大，

針對“寬表”方案，問題提出者球友的反饋如下：

“這確實也是個思路，但是我的這個場景下，每個用戶除了性別還有很多屬性，這樣會每天都會產生大量的冗余資料，

是否有辦法將一個用戶的時間資訊聚集到一個檔案下，然后也能夠查詢，對查詢效率要求不高，”

所以，還得從 Nested 建模角度基礎上，考慮如何實作查詢？

2.3 Nested 建模方案

原有建模問題無大礙，只需將：time 欄位由 long 型別改為 date 型別，其他保持不變，

# 新的 Mapping 結構（微調）
PUT index_personal_02
{
 "mappings": {
  "properties": {
   "time_label": {
    "type": "nested",
    "properties": {
     "intent_order_count": {
      "type": "long"
     },
     "time": {
      "type": "date"
     }
    }
   },
   "user_id": {
    "type": "keyword"
   },
   "gender": {
    "type": "keyword"
   }
  }
 }
}


# 還是原來的構造資料，改成bulk，占據行數更少

PUT index_personal_02/_bulk
{"index":{"_id":1}}
{"user_id":"1","gender":"male","time_label":[{"time":20210601,"intent_order_count":3},{"time":20210602,"intent_order_count":2},{"time":20210605,"intent_order_count":20},{"time":20210606,"intent_order_count":1},{"time":20210611,"intent_order_count":15}]}
{"index":{"_id":2}}
{"user_id":"2","gender":"female"}
{"index":{"_id":3}}
{"user_id":"3","gender":"male","time_label":[{"time":20210102,"intent_order_count":12},{"time":20210202,"intent_order_count":33}]}

良好的資料建模就好比蓋大樓的地基，地基自然是越穩、越實、越牢靠越好！

3、查詢方案拆解

3.1 分步驟拆解用戶查詢需求

問題拆解成如下幾個部分：

3.1.1 篩選出在20210510~20210610

銘毅拆解：這是個范圍查詢，range query 搞定，

DSL 寫法如下：

{
    "nested": {
      "path": "time_label",
      "query": {
        "bool": {
          "must": [
            {
              "range": {
                "time_label.time": {
                  "gte": 20210510,
                  "lte": 20210601
                }
              }
            }
          ]
        }
      }
    }
  }

正常寫 Query 不會涉及 Nested，只有涉及 Nested 資料型別，才必須在檢索的前半部分加上 Nested 宣告，其目的無非告訴 Elasticsearch 后臺，這是針對 Nested 型別的檢索，

Path 指定的Nested 最外層，在本文指定的是：time_label，

3.1.2 意向訂單數總和為26的男性用戶

銘毅拆解：

關于男性用戶，這里可以基于性別檢索做過濾，

DSL 寫法如下：

{
  "term": {
    "gender": {
      "value": "male"
    }
  }
}

關于意向訂單:對于 user_id = 1 的用戶，意向訂單總數就等于 3 + 2 + 20 + 1 + 15 = 41，

要實作類似的求和，得需要借助 sum Metric 指標聚合實作，

sum Metric 聚合的前提是：針對某一特定用戶形成一個結果，所以其外層是基于用戶維度（本文使用：user_id）層面的terms聚合，

為了顯示出除了聚合結果之外的其他屬性列，需要借助 top_hits 的 _source 中的 include 實作，

DSL 寫法大致如下：

"aggs": {
    "user_id_aggs": {
      "terms": {
        "field": "user_id"
      },
      "aggs": {
        "top_sales_hits": {
          "top_hits": {
            "_source": {
              "includes": [
                "user_id",
                "gender"
              ]
            }
          }
        },
        "resellers": {
          "nested": {
            "path": "time_label"
          },
          "aggs": {
            "sum_count": {
              "sum": {
                "field": "time_label.intent_order_count"
              }
            }
          }
        }

如上：

最外層 terms 聚合：是基于 user_id 的分桶聚合，每個 user_id 的結果聚成一桶，
內層的聚合包含兩個，兩個是平級的，

其一：top_hits 指標聚合，用于顯示聚合結果之外的欄位，

其二：sum 指標聚合，用于對“time_label.intent_order_count”統計結果求和，

除了上面的兩層聚合，又涉及總和結果和 26 進行比較，所以要基于聚合的聚合，也就是子聚合的實作，

DSL 寫法如下：

     "count_bucket_filter": {
          "bucket_selector": {
            "buckets_path": {
              "totalcount": "resellers.sum_count"
            },
            "script": "params.totalcount >= 26"
          }
        }

文中給的實際例子沒有滿足 26 的檔案，所以，這里為了直觀顯示結果，使用了 >= 26 實作，

3.1.3 應該怎么寫dsl陳述句？

銘毅拆解：

基于上面幾個步驟整合到一起，即可實作，

查詢 DSL ——即用戶最終期望，查詢 DSL 就類似“圖紙”、“導航”或“路徑”，給出了達到給定目的的可行性路徑，后面無非就是：java 或者 Python 代碼的“堆砌”實作，

3.2 最終 DSL

POST index_personal_02/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "time_label",
            "query": {
              "bool": {
                "must": [
                  {
                    "range": {
                      "time_label.time": {
                        "gte": 20210510,
                        "lte": 20210601
                      }
                    }
                  }
                ]
              }
            }
          }
        },
        {
          "term": {
            "gender": {
              "value": "male"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "user_id_aggs": {
      "terms": {
        "field": "user_id"
      },
      "aggs": {
        "top_sales_hits": {
          "top_hits": {
            "_source": {
              "includes": [
                "user_id",
                "gender"
              ]
            }
          }
        },
        "resellers": {
          "nested": {
            "path": "time_label"
          },
          "aggs": {
            "sum_count": {
              "sum": {
                "field": "time_label.intent_order_count"
              }
            }
          }
        },
        "count_bucket_filter": {
          "bucket_selector": {
            "buckets_path": {
              "totalcount": "resellers.sum_count"
            },
            "script": "params.totalcount >= 26"
          }
        }
      }
    }
  }
}

要強調的點是：

第一：涉及 Nested 的 query 檢索以及 aggs 聚合，都需要明確指定 Nested Path，
第二：復雜檢索和聚合出錯多數是：子聚合的位置放的不對、后括號和前括弧不匹配等，需要多在 Kibana 測驗驗證，
第三：Kibana 的一鍵 DSL 美化快捷鍵：“ctrl + i” 要掌握和靈活使用，

相信經過上面的拆解，這個相對“復雜”的 DSL 會變得非但不那么“復雜”，反而非常容易讀懂，

3.3 查詢后結果

"aggregations" : {
    "user_id_aggs" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "1",
          "doc_count" : 1,
          "top_sales_hits" : {
            "hits" : {
              "total" : {
                "value" : 1,
                "relation" : "eq"
              },
              "max_score" : 1.4418328,
              "hits" : [
                {
                  "_index" : "index_personal_02",
                  "_type" : "_doc",
                  "_id" : "1",
                  "_score" : 1.4418328,
                  "_source" : {
                    "gender" : "male",
                    "user_id" : "1"
                  }
                }
              ]
            }
          },
          "resellers" : {
            "doc_count" : 5,
            "sum_count" : {
              "value" : 41.0
            }
          }
        }
      ]
    }
  }

由于檢索 size = 0，所以，只回傳了聚合結果，沒有回傳檢索結果，
由于二層聚合設定了 top_hits,所以回傳結果里除了sum_count的聚合結果，還包含的其下鉆資料欄位：“gender”、“user_id” 資訊，如果實際業務還有更多需要召回欄位，可以一并 include 包含后回傳即可，

4、有沒有更簡單的方案？

第 3 小節的實作是基于聚合，但實際檔案是 Nested 型別的，基于 userr_id 聚合顯得非常的多余，

這里自然想到，用檢索能否實作？

如果簡單檢索不行，那么腳本檢索呢？

4.1 擴展方案 1：腳本檢索實戰

搞一把試試，

GET index_personal_02/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "time_label",
            "query": {
              "bool": {
                "must": [
                  {
                    "range": {
                      "time_label.time": {
                        "gte": 20210510,
                        "lte": 20210613
                      }
                    }
                  },
                  {
                    "script": {
                      "script": """
                        int sum = 0;
                        for (obj in doc['time_label.intent_order_count']) {
                          sum += obj;
                        } 
                        sum >= 10;"""
                    }
                  }
                ]
              }
            }
          }
        },
        {
          "term": {
            "gender": {
              "value": "male"
            }
          }
        }
      ]
    }
  }
}

如上邏輯看似非常嚴謹的腳本，實際是行不通的，

sum += obj; 本質上只求了一個值，

Elastic 官方工程師給出了詳細的解釋：“無法在查詢時訪問腳本中所有嵌套物件的值，腳本查詢一次僅適用于一個嵌套物件，”

詳細討論參見：

https://stackoverflow.com/questions/64140179/elasticsearch-sum-up-nested-object-field

https://discuss.elastic.co/t/help-for-painless-iterate-nested-fields/162394

結論：腳本檢索不適用 Nested 嵌套物件求和，

官方推薦用 Ingest pipeline 預處理方式實作，那就再搞一把，

4.2 擴展方案 2：Ingest pipeline 方式實戰

4.2.1 步驟 1——設定求和的 pipeline，

sum_pipeline 用途：將 nested 嵌套的 intent_order_count 欄位進行求和，

# 設定pipeline，統計計數總和


PUT _ingest/pipeline/sum_pipeline
{
  "processors": [
    {
      "script": {
        "source": """
          ctx.sum_count = ctx.time_label.stream()
            .mapToInt(thing -> thing.intent_order_count)
            .sum()
          """
      }
    }
  ]
}

4.2.2 步驟 2——結合 pipeline 更新資料

注意一下：nested 添加資料需要借助 script 實作，不能直接指定 id 插入，

若指定 id 插入資料會覆寫掉之前的資料，


# 新插入資料
POST index_personal_02/_update_by_query?pipeline=sum_pipeline
{
  "query":{
    "term": {
      "user_id": {
        "value": "1"
      }
    }
  },
  "script": {
    "source": "ctx._source.time_label.add(params.newlabel)",
    "params": {
      "newlabel": {
        "time": 20210702,
        "intent_order_count": 88
      }
    }
  }
}

4.2.3 步驟 3——結合文章開頭要求進行檢索

借助 pipeline 新增的欄位 sum_count 可以檢索條件之一，



# 檢索結果
GET index_personal_02/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "time_label",
            "query": {
              "bool": {
                "must": [
                  {
                    "range": {
                      "time_label.time": {
                        "gte": 20210510,
                        "lte": 20210601
                      }
                    }
                  }
                ]
              }
            }
          }
        },
        {
          "term": {
            "gender": {
              "value": "male"
            }
          }
        },
        {
          "range": {
            "sum_count": {
              "gte": 26
            }
          }
        }
      ]
    }
  }
}

Ingest pipeline 方案小結：

通過預處理管道新增欄位，以空間換時間，
新增的欄位作為檢索的條件之一，不再需要聚合，

5、小結

分解是計算思維的核心思想之一，“大事化小，逐個擊破”，本文的拆解思路也是基于分解的思想一步步拆解，

本文針對線上問題，拋轉引玉，給出了方案拆解和完整的步驟實作，

共探索出兩種可行的方案：

方案一：聚合實作，

方案一本質：兩重嵌套聚合（terms分桶 + 分桶內 sum 指標聚合）+ 子聚合（基于聚合的聚合 bucket_selector）實作，

方案二：預處理管道 pipeline 實作，

方案二本質：新增求和欄位，以空間換時間，

實戰環境類似本文問題，銘毅推薦使用方案二，

細節問題待進一步結合線上需求進行擴展修改 DSL，

歡迎就問題及方案進行留言，說一下您的思考和思路反饋，

https://discuss.elastic.co/t/script-processor-ingest-pipelines-on-nested-fields/172092/2

干貨 | 拆解一個 Elasticsearch Nested 型別復雜查詢問題

1、線上實戰問題

2、資料建模探討

2.1 原問題 Nested 模型

2.2 寬表建模方案

2.3 Nested 建模方案

3、查詢方案拆解

3.1 分步驟拆解用戶查詢需求

3.1.1 篩選出在20210510~20210610

3.1.2 意向訂單數總和為26的男性用戶

3.1.3 應該怎么寫dsl陳述句？

3.2 最終 DSL

3.3 查詢后結果

4、有沒有更簡單的方案？

4.1 擴展方案 1：腳本檢索實戰

4.2 擴展方案 2：Ingest pipeline 方式實戰

4.2.1 步驟 1——設定求和的 pipeline，

4.2.2 步驟 2——結合 pipeline 更新資料

4.2.3 步驟 3——結合文章開頭要求進行檢索

5、小結

推薦