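1 What is data modeling?
2 How to model data in ES
- 2.1 Modeling field types
- 2.2 Modeling for search, aggregation, and sorting
- 2.3 Modeling extra storage
3 ES data modeling walkthrough
- 3.1 Creating the mapping dynamically
- 3.2 Creating the mapping manually
- 3.3 New requirement: adding a large field
- 3.4 Solving the performance problem caused by the large field
- 3.5 Common mapping parameters for fields
- 3.6 Mapping settings summary
4 ES data modeling best practices
- 4.1 Handling relationships
- 4.2 Avoid too many fields
- 4.3 Avoid regular-expression queries
- 4.4 Avoid inaccurate aggregations caused by null values
References
Copyright notice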

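1 What is data modeling?

Data modeling is the process of creating a data model.

A data model is a tool and method for describing the real world in abstract form, mapping real-world things into data: for example, films and TV shows, actors, and audience reviews.

Data modeling goes through three stages: conceptual model => logical model => data model (third normal form).

The final data model has to take the type of database into account and be defined so that it meets the business requirements, such as read and write performance.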

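2 How to model data in ES

Data modeling in ES:

From functional requirements such as storage and search, derive the attributes of the entities and the relationships between them => this gives the logical model;

From performance requirements, derive the index templates and index mappings (including field settings and how relationships are handled) => this gives the physical model.

The basic unit of storage and search in ES is the document, and a document is made up of fields, so modeling in ES means modeling fields.

A document is similar to a row in a relational database, and a field corresponds to a column.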

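2.1 Modeling field types

(1) text vs. keyword:

text: for full-text fields; the text is tokenized by an analyzer. Aggregations and sorting are not supported by default; setting "fielddata": true enables them, at the cost of heap memory;
keyword: for ids, enumerations, and text that should not be tokenized, such as ID card numbers, phone numbers, and email addresses; suited to filters (exact-match filtering), sorting, and aggregations.
Setting up multi-fields:

By default, string values are mapped as text with a keyword sub-field;
When handling natural language, you can add sub-fields with analyzers such as "english", "pinyin", and "standard" to improve the accuracy of search results.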
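
A multi-field mapping along these lines might look like the sketch below. The index name products and the field name are made up for illustration; english and standard are built-in analyzers, while a pinyin analyzer would need a separate plugin and is left out here.

```
# Hypothetical index: one text field, indexed three ways
PUT products
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 },
          "english": { "type": "text", "analyzer": "english" }
        }
      }
    }
  }
}

# Search the standard and the English-stemmed variants together
GET products/_search
{
  "query": {
    "multi_match": {
      "query": "running shoes",
      "fields": ["name", "name.english"]
    }
  }
}
```

The name.keyword sub-field stays available for exact-match filters, sorting, and aggregations.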

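(2) Structured data:

Numeric types: pick the narrowest type that fits, for example do not use long where byte is enough;
Enumerations: map them as keyword, even when the values are numbers, for better performance on exact-match (term) lookups; keep a numeric type when the field needs range queries, because range queries are optimized for numeric fields rather than keyword;
Other types: date, binary, boolean, geo, and so on.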
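
As a rough illustration of these choices, a mapping for structured data could be sketched as follows. The index orders and all of its field names are assumptions, not part of the original example.

```
# Hypothetical index showing one field per structured type
PUT orders
{
  "mappings": {
    "properties": {
      "quantity":    { "type": "byte" },
      "status_code": { "type": "keyword" },
      "created_at":  { "type": "date" },
      "paid":        { "type": "boolean" },
      "location":    { "type": "geo_point" }
    }
  }
}
```

Here status_code is an enumeration that happens to be numeric, so it is mapped as keyword; quantity stays numeric because it may be used in range queries.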

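2.2 Modeling for search, aggregation, and sorting

If a field needs neither search, sorting, nor aggregation, set "enabled": false (this parameter applies to object fields and to the mapping as a whole);
If it does not need to be searched, set "index": false ;
If it does not need sorting or aggregation, set "doc_values": false / "fielddata": false ;
For keyword fields that are updated often and aggregated often, "eager_global_ordinals": true is recommended.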
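
Put together, these switches might be applied as in the sketch below. The index logs and its fields are hypothetical and only serve to show one parameter each.

```
# Hypothetical index: each field turns off what it does not need
PUT logs
{
  "mappings": {
    "properties": {
      "raw_payload": { "type": "object", "enabled": false },
      "trace_url":   { "type": "keyword", "index": false },
      "session_id":  { "type": "keyword", "doc_values": false },
      "service":     { "type": "keyword", "eager_global_ordinals": true }
    }
  }
}
```

raw_payload is kept only in _source, trace_url can be aggregated but not searched, session_id can be searched but not sorted or aggregated, and service builds its global ordinals at refresh time instead of on the first aggregation.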

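2.3 Modeling extra storage

Does the field's data need to be stored separately?

"store": true stores the original value of the field;

It is usually combined with "_source": { "enabled": false }, because the default is "_source": { "enabled": true }, meaning the original JSON of every indexed document is stored in _source.

Disabling _source: turning off the _source metadata field saves disk space. It suits metrics-style data, such as identifier and timestamp fields, that is never updated or highlighted and is mostly used for filtering down to a smaller result set quickly, which in turn supports faster aggregations.

Official advice: if disk space is the main concern, prefer raising the compression level over disabling _source;

Without _source, reindex, update, and update_by_query cannot be used;

So far, Kibana's Discover cannot explore an index whose _source is disabled.

Disable _source with caution; see: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html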
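
One way to follow the official advice is to raise the compression level through the index.codec setting when the index is created, keeping _source enabled. The sketch below assumes a hypothetical metrics index; the field names are illustrative.

```
# Hypothetical metrics index: keep _source, trade some CPU for smaller stored fields
PUT metrics
{
  "settings": {
    "index": { "codec": "best_compression" }
  },
  "mappings": {
    "properties": {
      "host":      { "type": "keyword" },
      "timestamp": { "type": "date" },
      "cpu":       { "type": "float" }
    }
  }
}
```

Because index.codec is a static setting, it has to be chosen at index creation (or on a closed index), which is why it appears in the create request.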

3 ES data modeling walkthrough

3.1 Creating the mapping dynamically

# Index a book directly:
POST books/_doc
{
  "title": "Thinking in Elasticsearch 7.2.0",
  "author": "Heal Chow",
  "publish_date": "2019-10-01",
  "description": "Master the searching, indexing, and aggregation features in Elasticsearch.",
  "cover_url": "https://healchow.com/images/29dMkliO2a1f.jpg"
}

# View the automatically created mapping:
GET books/_mapping
# The response:
{
  "books" : {
    "mappings" : {
      "properties" : {
        "author" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "cover_url" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "description" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "publish_date" : {
          "type" : "date"
        },
        "title" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

3.2 Creating the mapping manually

# Delete the automatically created books index:
DELETE books

# Tune the field mappings by hand.
# cover_url sets index to false: it cannot be searched, but terms aggregations still work.
PUT books
{
  "mappings": {
    "_source": { "enabled": true },
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 100
          }
        }
      },
      "author": { "type": "keyword" },
      "publish_date": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyyMMddHHmmss||yyyy-MM-dd||epoch_millis"
      },
      "description": { "type": "text" },
      "cover_url": {
        "type": "keyword",
        "index": false
      }
    }
  }
}

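Note: the _source metadata field is enabled by default. Once it is disabled, hits no longer carry the original document for display, and reindex, update, and update_by_query stop working.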
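
To see the effect of "index": false on cover_url, a terms aggregation like the sketch below should still return buckets, because doc_values stay enabled on the keyword field. The aggregation name covers is an arbitrary label.

```
# Aggregate on a field that is not searchable (index: false) but still has doc_values
GET books/_search
{
  "size": 0,
  "aggs": {
    "covers": {
      "terms": { "field": "cover_url" }
    }
  }
}
```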

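3.3 New requirement: adding a large field

Requirement: add a field holding the book's content, with full-text search and highlighting.
Analysis: the new field makes _source very large. Source filtering can trim the fields returned for each hit:
```
# "excludes": ["xxx"] drops fields instead; when both are given, excludes is applied to the fields selected by includes
"_source": {
    "includes": ["title"]
}
```
But source filtering only trims what is returned. During the fetch phase each data node still has to load and parse the full _source of every hit, so the cost of the large field does not fundamentally go away.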
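
In a complete request, the source filter sits next to the query. The sketch below assumes the books index has gained a content field as described in the requirement; the field choices are illustrative.

```
# Return only title and author for each hit, leaving the large content field out
GET books/_search
{
  "_source": {
    "includes": ["title", "author"],
    "excludes": ["content"]
  },
  "query": {
    "match": { "content": "data modeling" }
  }
}
```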

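3.4 Solving the performance problem caused by the large field

(1) Disable the _source metadata field when creating the mapping: "_source": { "enabled": false } ;

(2) Then set "store": true on every field.

# Delete the index from 3.2, then recreate it with _source disabled and store=true on every field:
DELETE books
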
PUT books
{
  "mappings": {
    "_source": { "enabled": false },
    "properties": {
      "title": {
        "type": "text",
        "store": true,
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 100
          }
        }
      },
      "author": { "type": "keyword", "store": true },
      "publish_date": {
        "type": "date",
        "store": true,
        "format": "yyyy-MM-dd HH:mm:ss||yyyyMMddHHmmss||yyyy-MM-dd||epoch_millis"
      },
      "description": { "type": "text", "store": true },
      "cover_url": {
        "type": "keyword",
        "index": false,
        "store": true
      },
      "content": { "type": "text", "store": true }
    }
  }
}

(3) Add data and run a highlight query:

# Index a document that includes the new field:
POST books/_doc
{
  "title": "Thinking in Elasticsearch 7.2.0",
  "author": "Heal Chow",
  "publish_date": "2019-10-01",
  "description": "Master the searching, indexing, and aggregation features in Elasticsearch.",
  "cover_url": "https://healchow.com/images/29dMkliO2a1f.jpg",
  "content": "1. Revisiting Elasticsearch and the Changes. 2. The Improved Query DSL. 3. Beyond Full Text Search. 4. Data Modeling and Analytics. 5. Improving the User Search Experience. 6. The Index Distribution Architecture.  .........."
}

# Use stored_fields to choose which fields to return:
GET books/_search
{
  "stored_fields": ["title", "author", "publish_date"],
  "query": {
    "match": { "content": "data modeling" }
  },
  "highlight": {
    "fields": { "content": {} }
  }
}

The query result:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.5753642,
    "hits" : [
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "dukLoG0BdfGBNhbF13CJ",
        "_score" : 0.5753642,
        "highlight" : {
          "content" : [
            "<em>Data</em> <em>Modeling</em> and Analytics. 5. Improving the User Search Experience. 6."
          ]
        }
      }
    ]
  }
}

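(4) About the result:

The response does not contain a _source field;

Any field to be displayed must be listed in the query with "stored_fields": ["xxx", "yyy"] ;

With _source disabled, highlighting still works, because the content field is stored.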

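3.5 Common mapping parameters for fields

See: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-params.html

enabled – when set to false, the field is only stored and supports neither search nor aggregation (the data is kept in _source);
index – whether to build an inverted index; when false, the field cannot be searched, but it still supports aggregations and still appears in _source;
norms – for fields used only for filtering and aggregation (metrics data) where scoring does not matter, turning norms off is recommended to save storage;
doc_values – whether to enable doc_values, which are used for sorting and aggregation;
fielddata – must be set to true to sort or aggregate on a text field;
coerce – whether to convert data types automatically (for example, string to number); enabled by default;
fields (multi-fields) – index the same value in more than one way through sub-fields;
dynamic – controls how the mapping handles new fields: true / false / strict.

doc_values vs. fielddata:

doc_values: required on fields used for aggregation and sorting; enabled by default for all non-text field types; built at index time and stored on disk, then read through the operating system's file cache instead of the JVM heap;

fielddata: enables sorting and aggregation on text fields; disabled by default; built at query time and loaded entirely into memory (JVM heap).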
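
The difference can be sketched with a hypothetical reviews index: one text field turns on fielddata, the other keeps fielddata off and aggregates through a keyword sub-field backed by doc_values. All names here are assumptions for illustration.

```
# Hypothetical index: two ways to aggregate on text content
PUT reviews
{
  "mappings": {
    "properties": {
      "tag": { "type": "text", "fielddata": true },
      "comment": {
        "type": "text",
        "norms": false,
        "fields": {
          "raw": { "type": "keyword", "ignore_above": 256 }
        }
      }
    }
  }
}

# by_tag buckets on analyzed tokens held in heap; by_comment buckets on whole values read from doc_values
GET reviews/_search
{
  "size": 0,
  "aggs": {
    "by_tag":     { "terms": { "field": "tag" } },
    "by_comment": { "terms": { "field": "comment.raw" } }
  }
}
```

The keyword sub-field is usually the safer choice, since it avoids loading fielddata into the heap.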

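3.6 Mapping settings summary

(1) A mapping can be extended in place: new fields (including sub-fields) can be added, and the search_analyzer of a field can be changed; a different index-time analyzer has to go on a new sub-field:

update_by_query can then be used to reprocess the existing documents.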
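
As a sketch of that flow, the requests below add an English-analyzed sub-field to title and then reprocess the existing documents. They assume the books index still has the mapping from 3.2 (with _source enabled, which update_by_query needs); the sub-field name english is an illustrative choice.

```
# Add a new sub-field; the existing keyword sub-field is repeated unchanged
PUT books/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "fields": {
        "keyword": { "type": "keyword", "ignore_above": 100 },
        "english": { "type": "text", "analyzer": "english" }
      }
    }
  }
}

# Re-index every existing document in place so the new sub-field is populated
POST books/_update_by_query?conflicts=proceed
```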

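(2) Index Template: applies different mappings and settings according to the index name;

(3) Dynamic Template: sets field types dynamically within a mapping;

(4) Reindex: to modify or delete an existing field, or to change parameters such as the number of shards, the index has to be rebuilt.

Rebuilding in place means downtime, and it can take a long time on large data sets.

An index alias can be used to achieve zero-downtime maintenance.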
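
A zero-downtime rebuild with an alias could be sketched as below. The names books_v1, books_v2, and books_alias are hypothetical; the sketch assumes clients already read and write through books_alias, which currently points at books_v1.

```
# 1. Create the new index with the corrected mapping
PUT books_v2
{
  "mappings": {
    "properties": {
      "title":  { "type": "text" },
      "author": { "type": "keyword" }
    }
  }
}

# 2. Copy the documents across
POST _reindex
{
  "source": { "index": "books_v1" },
  "dest":   { "index": "books_v2" }
}

# 3. Switch the alias from the old index to the new one in a single atomic call
POST _aliases
{
  "actions": [
    { "remove": { "index": "books_v1", "alias": "books_alias" } },
    { "add":    { "index": "books_v2", "alias": "books_alias" } }
  ]
}
```

Since both alias actions are applied atomically, clients never see a moment where the alias points at no index.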

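4 ES data modeling best practices

4.1 Handling relationships

(1) Normalized design:

Relational databases have the concept of normalized design, with 1NF, 2NF, 3NF, BCNF, and so on. The main goal is to reduce unnecessary updates. It saves storage space, but reads can be slower, especially across tables, where many joins may be needed.

(2) Denormalized design: the data is flat; instead of using relationships, redundant copies of the data are kept inside the document (in _source).

Pros: no joins to handle, and good read performance;

Cons: not suited to data that is modified frequently.

=> ES is not good at handling relationships; they are usually handled with the object type, the nested type, or parent/child relationships (the join field type).

The details take up a lot of space and are omitted here.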
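
Without going into the full detail, a minimal sketch of the nested option might look like this. The movies index, its fields, and the sample names are all assumptions made for illustration.

```
# Hypothetical index: each actor is stored as its own nested object
PUT movies
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "actors": {
        "type": "nested",
        "properties": {
          "first_name": { "type": "keyword" },
          "last_name":  { "type": "keyword" }
        }
      }
    }
  }
}

# Both conditions must match inside the same actor object
GET movies/_search
{
  "query": {
    "nested": {
      "path": "actors",
      "query": {
        "bool": {
          "filter": [
            { "term": { "actors.first_name": "Keanu" } },
            { "term": { "actors.last_name": "Reeves" } }
          ]
        }
      }
    }
  }
}
```

With a plain object field the two terms could match across different actors of the same movie; nested keeps each actor's fields together.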

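4.2 Avoid too many fields

(1) A document should not have a large number of fields:

Too many fields make the data hard to maintain;

Mapping information is kept in the cluster state; when it grows too large it affects cluster performance (the cluster state has to be synchronized to every node);

Deleting or modifying a field requires a reindex;

(2) The default maximum number of fields in a single index is 1000; it can be changed with the index.mapping.total_fields.limit setting.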
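
If the limit really has to be raised, the setting can be updated on an existing index, for example as sketched below. The value 2000 is an arbitrary illustration, not a recommendation.

```
# Raise the per-index field limit (dynamic setting, no restart needed)
PUT books/_settings
{
  "index.mapping.total_fields.limit": 2000
}
```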

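Think about it: what would lead to hundreds or thousands of fields in a document?

ES is schemaless: by default, every time a new field appears, ES adds a mapping for it automatically, based on the type it infers for the value.

If the application is not careful, this leads to a field explosion. To prevent it, choose a dynamic policy:

true - unknown fields are added to the mapping automatically; this is the default;

false - new fields are not indexed, but they are still kept in _source;

strict - new fields are not indexed; the write fails and an exception is thrown.

In production, try not to rely on the default "dynamic": true .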
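
A strict policy could be applied to the books index roughly as follows. The isbn field and its value are hypothetical and stand in for any field that is missing from the mapping.

```
# Switch the existing mapping to strict
PUT books/_mapping
{
  "dynamic": "strict"
}

# This write is expected to be rejected, because isbn is not defined in the mapping
POST books/_doc
{
  "title": "Thinking in Elasticsearch 7.2.0",
  "isbn": "000-0-00-000000-0"
}
```

With "dynamic": false the same document would be accepted, and isbn would be kept in _source (when it is enabled) without being indexed.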

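4.3 Avoid regular-expression queries

Regexp, prefix, and wildcard queries are all term-level queries, but they perform poorly because they have to match the pattern against the terms in the index one by one; a pattern with a leading wildcard is especially bad and can be a performance disaster.

(1) Example case:

A field in the documents holds the Elasticsearch version, for example version: "7.2.0" ;

How do we find the bug-fix releases of a series (versions whose last number is not 0)? How do we find the documents that belong to each major version?

(2) Wildcard query example:

# Insert 2 documents: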
PUT softwares/_doc/1
{
  "version": "7.2.0",
  "doc_url": "https://www.elastic.co/guide/en/elasticsearch/.../.html"
}

PUT softwares/_doc/2
{
  "version": "7.3.0",
  "doc_url": "https://www.elastic.co/guide/en/elasticsearch/.../.html"
}

# Wildcard query:
GET softwares/_search
{
  "query": {
    "wildcard": {
      "version": "7*"
    }
  }
}

(3) Solution: turn the string into an object:

# Recreate the index with an object mapping (the version field becomes an object):
DELETE softwares

PUT softwares
{
  "mappings": {
    "properties": {
      "version": {
        "properties": {
          "display_name": { "type": "keyword" },
          "major": { "type": "byte" },
          "minor": { "type": "byte" },
          "bug_fix": { "type": "byte" }
        }
      },
      "doc_url": { "type": "text" }
    }
  }
}

# Add data:
PUT softwares/_doc/1
{
  "version": {
    "display_name": "7.2.0",
    "major": 7,
    "minor": 2,
    "bug_fix": 0
  },
  "doc_url": "https://www.elastic.co/guide/en/elasticsearch/.../.html"
}

PUT softwares/_doc/2
{
  "version": {
    "display_name": "7.3.0",
    "major": 7,
    "minor": 3,
    "bug_fix": 0
  },
  "doc_url": "https://www.elastic.co/guide/en/elasticsearch/.../.html"
}

# Filter on the numeric sub-fields instead of using a pattern query, which is much faster:
GET softwares/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": { "version.major": 7 }
        },
        {
          "term": { "version.minor": 2 }
        }
      ]
    }
  }
}
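
The two questions from the example case can be answered on the same object mapping without any pattern matching. The requests below are a sketch; by_major is an arbitrary aggregation name.

```
# Bug-fix releases of the 7.x series: last number greater than 0
GET softwares/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "version.major": 7 } },
        { "range": { "version.bug_fix": { "gt": 0 } } }
      ]
    }
  }
}

# Number of documents per major version
GET softwares/_search
{
  "size": 0,
  "aggs": {
    "by_major": {
      "terms": { "field": "version.major" }
    }
  }
}
```

With the two sample documents above, the first query returns no hits, since both have bug_fix set to 0.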

4.4 Avoid inaccurate aggregations caused by null values

(1) Example:

# Add data, including one document with a null value:
PUT ratings/_doc/1
{
  "rating": 5
}
PUT ratings/_doc/2
{
  "rating": null
}

# Aggregate on the field that contains a null value:
GET ratings/_search
{
  "size": 0,
  "aggs": {
    "avg_rating": {
      "avg": { "field": "rating"}
    }
  }
}

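# The result: both documents are hits, but the null value is skipped, so avg_rating only reflects one rating, which is not what we want:
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,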
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "avg_rating" : {
      "value" : 5.0
    }
  }
}

(2) Use null_value to handle nulls:

# Recreate the index and set null_value in the mapping:
DELETE ratings

PUT ratings
{
  "mappings": {
    "properties": {
      "rating": {
        "type": "float",
        "null_value": 1.0
      }
    }
  }
}

# Index the same documents and aggregate again; the null is now counted as 1.0, so the average is 3.0:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "avg_rating" : {
      "value" : 3.0
    }
  }
}
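
If changing the mapping is not an option, a similar effect can be had at query time with the missing parameter of the aggregation. This is an alternative to the null_value approach above, not something the original example uses; the default of 1.0 is chosen only to mirror it.

```
# Treat documents without a rating value as 1.0 for this aggregation only
GET ratings/_search
{
  "size": 0,
  "aggs": {
    "avg_rating": {
      "avg": { "field": "rating", "missing": 1.0 }
    }
  }
}
```

Unlike null_value, this leaves the indexed data untouched and lets each query pick its own default.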

References

Geek Time (極客時間) video course "Elasticsearch Core Technology and Practice" (《Elasticsearch核心技術與實戰》)

Copyright notice

Author: 馬瘦風 (https://healchow.com)

Source: 馬瘦風's blog on cnblogs (https://www.cnblogs.com/shoufeng)

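The copyright of this article belongs to the author. Reposting is welcome, but [the link to the original article must be shown in a prominent place on the page]; otherwise the author reserves the right to pursue legal liability.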

ES 32 - Exploring and Practicing Data Modeling in Elasticsearch