Why is the embarkation_point_2 column added when one_hot_encoder is applied to the training data?

The example below is from the Vertica documentation at https://www.vertica.com/docs/11.0.x/HTML/Content/Authoring/AnalyzingData/MachineLearning/DataPreparation/EncodingCategoricalColumns.htm?tocpath=Analyzing Data|Machine Learning for Predictive Analytics|Data Preparation|_____3

It uses the Titanic dataset from Kaggle.

The ONE_HOT_ENCODER_FIT function analyzes the categorical columns and creates a model that records how each category level is encoded:

SELECT one_hot_encoder_fit('public.titanic_encoder','titanic_training','sex, embarkation_point'  USING PARAMETERS exclude_columns='', output_view='', extra_levels='{}');

==================
varchar_categories
==================
  category_name  |category_level|category_level_index
----------------- -------------- --------------------
embarkation_point|      C       |         0
embarkation_point|      Q       |         1
embarkation_point|      S       |         2 <- note S is 2
embarkation_point|              |         3
       sex       |    female    |         0
       sex       |     male     |         1 <-- note male is 1

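When I then apply the titanic_encoder model to the titanic_training data, why is embarkation_point_2 added? Shouldn't the output contain only the categorical value (for example S) and its encoded value? Why do I see the values 0 and 1 rather than 2, which is the encoded value of S, in the same way that sex male gives sex_1 = 1?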

dbadmin@2e4e746b3e6c(*)=> select * from titanic_training limit 1;
 passenger_id | survived | pclass |          name           | sex  | age | sibling_and_spouse_count | parent_and_child_count |  ticket   | fare | cabin | embarkation_point
-------------- ---------- -------- ------------------------- ------ ----- -------------------------- ------------------------ ----------- ------ ------- -------------------
            1 |        0 |      3 | Braund, Mr. Owen Harris | male |  22 |                        1 |                      0 | A/5 21171 | 7.25 |       | S <-- note S
(1 row)



dbadmin@2e4e746b3e6c(*)=> SELECT APPLY_ONE_HOT_ENCODER(* USING PARAMETERS model_name='titanic_encoder') from titanic_training limit 1;
 passenger_id | survived | pclass |          name           | sex  | sex_1 | age | sibling_and_spouse_count | parent_and_child_count |  ticket   | fare | cabin | embarkation_point | embarkation_point_1 | embarkation_point_2 (<-- why this is here)?
-------------- ---------- -------- ------------------------- ------ ------- ----- -------------------------- ------------------------ ----------- ------ ------- ------------------- --------------------- ---------------------
            1 |        0 |      3 | Braund, Mr. Owen Harris | male <- note male|     1 <- note  encoded value of male |  22 |                        1 |                      0 | A/5 21171 | 7.25 |       | S <- note S                 |                   0 <- why this is here |                   1 <-- why this is here. Where is 2?
(1 row)

And why is there no embarkation_point_3?

Answer:

There are several reasons for the output you see. First, read the documentation for APPLY_ONE_HOT_ENCODER: https://www.vertica.com/docs/11.0.x/HTML/Content/Authoring/SQLReferenceManual/Functions/MachineLearning/APPLY_ONE_HOT_ENCODER.htm?tocpath=SQL Reference Manual|SQL Functions|Machine Learning Functions|Transformation Functions|_____5

Two parameters let you get the result you want:

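drop_first: set it to false to get a column for every level. By default one column is dropped because it is perfectly correlated with the others. This has pros and cons; see this article: https://inmachineswetrust.com/posts/drop-first-columns/
column_naming: set it to 'values', but be careful. If your categories contain special characters, you may run into some difficulties.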
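
As a rough sketch of the two settings above, the query from the question could be rerun as follows. It reuses the model and table names from the question and assumes the parameter values are passed as quoted strings, as in the question's own call; check the documentation for your Vertica version before relying on it.

```sql
-- Sketch only: same call as in the question, plus the two parameters
-- named in this answer. Values are written as quoted strings (assumption
-- based on the style of the question's own model_name parameter).
SELECT APPLY_ONE_HOT_ENCODER(
         * USING PARAMETERS
           model_name    = 'titanic_encoder',
           drop_first    = 'false',   -- keep a column for every level, including the first
           column_naming = 'values'   -- name new columns after category values, not indices
       )
FROM titanic_training
LIMIT 1;
```

With drop_first left at its default, the first level of each column (index 0, that is C and female) acts as the reference and gets no column of its own. That is why a passenger who embarked at S shows embarkation_point_1 = 0 (not Q) and embarkation_point_2 = 1 (is S): each new column is a 0/1 indicator for one level, not the level index itself. As far as I understand the documentation, the empty level (index 3) gets no column because nulls are ignored by default, which would explain the missing embarkation_point_3.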

Badr
