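ML: Feature Engineering (FE): automatic feature generation / feature derivation with featuretools from a single CSV dataset (automatically split into two DataFrame tables)
Contents
Automatic feature generation / feature derivation with featuretools from a single CSV dataset (automatically split into two DataFrame tables)
Design approach
1. Define the dataset
2. DFS design
Output
feature_matrix_cats_df.csv
feature_matrix_nums.csv
Recommended articles
Py: featuretools: a detailed guide to the featuretools library (introduction, installation, usage)
ML: FE: automatic feature generation / feature derivation with featuretools from a single CSV dataset (automatically split into two DataFrame tables)
ML: FE: automatic feature generation / feature derivation with featuretools from a single CSV dataset (automatically split into two DataFrame tables), implementation
Automatic feature generation / feature derivation with featuretools from a single CSV dataset (automatically split into two DataFrame tables)
Design approach
1. Define the dataset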
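import numpy as np
import pandas as pd

# Sample data; np.nan, pd.NaT and '' are the missing-value markers being tried.
contents = {"name": ['Bob', 'LiSa', 'Mary', 'Alan'],
            "ID": [1, 2, 3, 4],
            "age": [np.nan, 28, 38, ''],  # np.nan prints as NaN
            "born": [pd.NaT, pd.Timestamp("1990-01-01"), pd.Timestamp("1980-01-01"), ''],  # pd.NaT prints as NaT
            "sex": ['男', '女', '女', '男'],  # 男 = male, 女 = female
            "hobbey": ['打籃球', '打羽毛球', '打乒乓球', ''],  # basketball, badminton, table tennis, (empty)
            "money": [200.0, 240.0, 290.0, 300.0],
            "weight": [140.5, 120.8, 169.4, 155.6],
            }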
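The code that splits the single table into the two DataFrames is not shown here, so the following is only a minimal sketch of one way to do it with pandas, assuming the split is by column type (numeric columns go to `nums_df`, the rest to `cats_df`). The cleaning steps and the helper names `num_cols` and `cat_cols` are assumptions for illustration. The printed `cats_df` further down has only three rows, which this sketch does not try to reproduce.

```python
import numpy as np
import pandas as pd

contents = {
    "name": ["Bob", "LiSa", "Mary", "Alan"],
    "ID": [1, 2, 3, 4],
    "age": [np.nan, 28, 38, ""],
    "born": [pd.NaT, pd.Timestamp("1990-01-01"), pd.Timestamp("1980-01-01"), ""],
    "sex": ["男", "女", "女", "男"],
    "hobbey": ["打籃球", "打羽毛球", "打乒乓球", ""],
    "money": [200.0, 240.0, 290.0, 300.0],
    "weight": [140.5, 120.8, 169.4, 155.6],
}
df = pd.DataFrame(contents)

# Assumption: empty strings mean "missing", so coerce them to NaN / NaT.
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df["born"] = pd.to_datetime(df["born"], errors="coerce")
df["hobbey"] = df["hobbey"].mask(df["hobbey"] == "")

# Numeric columns (ID, age, money, weight) go to one table, the rest to the other.
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = [c for c in df.columns if c not in num_cols and c != "name"]

nums_df = df[["name"] + num_cols]   # one row per person, keyed by name
cats_df = df[["ID"] + cat_cols]     # non-numeric attributes, keyed by ID

print(nums_df)
print(cats_df)
```

Both tables keep the `ID` column, which is what lets them be related to each other in the DFS step.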
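2. DFS design
- (1) Specify a dictionary containing all the entities in the dataset.
- (2) Specify how the entities are related: when two entities have a one-to-many relationship, the "one" entity is called the "parent entity".
- (3) Run Deep Feature Synthesis (DFS): the minimum input to DFS is a set of entities, a list of relationships, and the "target_entity" for which features are computed. The output of DFS is a feature matrix and the corresponding list of feature definitions. Creating a feature matrix for each customer in the data first yields dozens of new features that describe a customer's behavior.
- (4) Change the target entity: one reason DFS is so powerful is that it can create a feature matrix for any entity in the data, for example if we want to build features for sessions.
- (5) Understand the feature output: in general, Featuretools refers to generated features by their feature names. To make features easier to understand, Featuretools provides two additional tools, featuretools.graph_feature() and featuretools.describe_feature(), which help explain what a feature is and the steps Featuretools took to generate it.
- (6) Feature lineage graphs: a feature lineage graph visually walks through the feature generation process. Starting from the base data, it shows step by step the primitives applied and the intermediate features generated to create the final feature.
- (7) Feature descriptions: Featuretools can also automatically generate an English sentence describing a feature. Feature descriptions help explain what a feature is, and they can be improved further by including manually defined custom definitions. For details on customizing the automatically generated descriptions, see "Generating Feature Descriptions".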
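To make the aggregation step concrete, the sketch below re-creates in plain pandas what the aggregation features for the parent table look like. It is an analogy, not featuretools itself: the `groupby` call and the column-naming scheme are assumptions chosen to mirror feature names such as `MAX(nums.age)` that appear in the output below.

```python
import numpy as np
import pandas as pd

# Child table: the same values as the nums_df printed in the output section.
nums_df = pd.DataFrame({
    "name": ["Bob", "LiSa", "Mary", "Alan"],
    "ID": [1, 2, 3, 4],
    "age": [np.nan, 28.0, 38.0, np.nan],
    "money": [200.0, 240.0, 290.0, 300.0],
    "weight": [140.5, 120.8, 169.4, 155.6],
})

# One row per parent ID; every (column, function) pair becomes one feature.
agg = nums_df.groupby("ID")[["age", "money", "weight"]].agg(["max", "mean", "min", "sum"])

# Flatten the two-level column index into DFS-style names, e.g. "MAX(nums.age)".
agg.columns = [f"{func.upper()}(nums.{col})" for col, func in agg.columns]

# COUNT(nums): number of child rows that belong to each parent ID.
agg.insert(0, "COUNT(nums)", nums_df.groupby("ID").size())

print(agg)
```

Because every ID has exactly one child row in this dataset, MAX, MEAN and MIN all collapse to the single child value, which is why so many columns in the feature matrix are identical.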
Output
name ID age born sex hobbey money weight
0 Bob 1 NaN NaT 男 打籃球 200.0 140.5
1 LiSa 2 28 1990-01-01 女 打羽毛球 240.0 120.8
2 Mary 3 38 1980-01-01 女 打乒乓球 290.0 169.4
3 Alan 4 NaT 男 300.0 155.6
-------------------------------------------
nums_df:----------------------------------
name ID age money weight
0 Bob 1 NaN 200.0 140.5
1 LiSa 2 28.0 240.0 120.8
2 Mary 3 38.0 290.0 169.4
3 Alan 4 NaN 300.0 155.6
cats_df:----------------------------------
ID hobbey sex born
0 4 NaN 男 NaN
1 1 打籃球 男 NaN
2 2 打羽毛球 女 1990-01-01
---------------------------------DFS design:-----------------------------------
feature_matrix_nums
ID age money weight cats.hobbey cats.sex cats.COUNT(nums) \
name
Bob 1 NaN 200.0 140.5 打籃球 男 1.0
LiSa 2 28.0 240.0 120.8 打羽毛球 女 1.0
Mary 3 38.0 290.0 169.4 NaN NaN NaN
cats.MAX(nums.age) cats.MAX(nums.money) cats.MAX(nums.weight) \
name
Bob NaN 200.0 140.5
LiSa 28.0 240.0 120.8
Mary NaN NaN NaN
cats.MEAN(nums.age) cats.MEAN(nums.money) cats.MEAN(nums.weight) \
name
Bob NaN 200.0 140.5
LiSa 28.0 240.0 120.8
Mary NaN NaN NaN
cats.MIN(nums.age) cats.MIN(nums.money) cats.MIN(nums.weight) \
name
Bob NaN 200.0 140.5
LiSa 28.0 240.0 120.8
Mary NaN NaN NaN
cats.SKEW(nums.age) cats.SKEW(nums.money) cats.SKEW(nums.weight) \
name
Bob NaN NaN NaN
LiSa NaN NaN NaN
Mary NaN NaN NaN
cats.STD(nums.age) cats.STD(nums.money) cats.STD(nums.weight) \
name
Bob NaN NaN NaN
LiSa NaN NaN NaN
Mary NaN NaN NaN
cats.SUM(nums.age) cats.SUM(nums.money) cats.SUM(nums.weight) \
name
Bob 0.0 200.0 140.5
LiSa 28.0 240.0 120.8
Mary NaN NaN NaN
cats.DAY(born) cats.MONTH(born) cats.WEEKDAY(born) cats.YEAR(born)
name
Bob NaN NaN NaN NaN
LiSa 1.0 1.0 0.0 1990.0
Mary NaN NaN NaN NaN
features_defs_nums: 29 [<Feature: ID>, <Feature: age>, <Feature: money>, <Feature: weight>, <Feature: cats.hobbey>, <Feature: cats.sex>, <Feature: cats.COUNT(nums)>, <Feature: cats.MAX(nums.age)>, <Feature: cats.MAX(nums.money)>, <Feature: cats.MAX(nums.weight)>, <Feature: cats.MEAN(nums.age)>, <Feature: cats.MEAN(nums.money)>, <Feature: cats.MEAN(nums.weight)>, <Feature: cats.MIN(nums.age)>, <Feature: cats.MIN(nums.money)>, <Feature: cats.MIN(nums.weight)>, <Feature: cats.SKEW(nums.age)>, <Feature: cats.SKEW(nums.money)>, <Feature: cats.SKEW(nums.weight)>, <Feature: cats.STD(nums.age)>, <Feature: cats.STD(nums.money)>, <Feature: cats.STD(nums.weight)>, <Feature: cats.SUM(nums.age)>, <Feature: cats.SUM(nums.money)>, <Feature: cats.SUM(nums.weight)>, <Feature: cats.DAY(born)>, <Feature: cats.MONTH(born)>, <Feature: cats.WEEKDAY(born)>, <Feature: cats.YEAR(born)>]
feature_matrix_cats_df
hobbey sex COUNT(nums) MAX(nums.age) MAX(nums.money) MAX(nums.weight) \
ID
4 NaN 男 1 NaN 300.0 155.6
1 打籃球 男 1 NaN 200.0 140.5
2 打羽毛球 女 1 28.0 240.0 120.8
MEAN(nums.age) MEAN(nums.money) MEAN(nums.weight) MIN(nums.age) \
ID
4 NaN 300.0 155.6 NaN
1 NaN 200.0 140.5 NaN
2 28.0 240.0 120.8 28.0
MIN(nums.money) MIN(nums.weight) SKEW(nums.age) SKEW(nums.money) \
ID
4 300.0 155.6 NaN NaN
1 200.0 140.5 NaN NaN
2 240.0 120.8 NaN NaN
SKEW(nums.weight) STD(nums.age) STD(nums.money) STD(nums.weight) \
ID
4 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
SUM(nums.age) SUM(nums.money) SUM(nums.weight) DAY(born) MONTH(born) \
ID
4 0.0 300.0 155.6 NaN NaN
1 0.0 200.0 140.5 NaN NaN
2 28.0 240.0 120.8 1.0 1.0
WEEKDAY(born) YEAR(born)
ID
4 NaN NaN
1 NaN NaN
2 0.0 1990.0
features_defs_cats_df: 25 [<Feature: hobbey>, <Feature: sex>, <Feature: COUNT(nums)>, <Feature: MAX(nums.age)>, <Feature: MAX(nums.money)>, <Feature: MAX(nums.weight)>, <Feature: MEAN(nums.age)>, <Feature: MEAN(nums.money)>, <Feature: MEAN(nums.weight)>, <Feature: MIN(nums.age)>, <Feature: MIN(nums.money)>, <Feature: MIN(nums.weight)>, <Feature: SKEW(nums.age)>, <Feature: SKEW(nums.money)>, <Feature: SKEW(nums.weight)>, <Feature: STD(nums.age)>, <Feature: STD(nums.money)>, <Feature: STD(nums.weight)>, <Feature: SUM(nums.age)>, <Feature: SUM(nums.money)>, <Feature: SUM(nums.weight)>, <Feature: DAY(born)>, <Feature: MONTH(born)>, <Feature: WEEKDAY(born)>, <Feature: YEAR(born)>]
<Feature: SUM(nums.age)>
The sum of the "age" of all instances of "nums" for each "ID" in "cats".
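The printed matrices show `SUM(nums.age)` as 0.0 where `MEAN`, `MAX` and `MIN` are NaN, and `STD` / `SKEW` as NaN everywhere. The pandas reductions below behave the same way on one-element groups. This is an analogy that matches the output above, under the assumption that the primitives reduce each group much like pandas does; it is not a statement about featuretools internals.

```python
import numpy as np
import pandas as pd

single_missing = pd.Series([np.nan])  # e.g. the only "age" value for ID 1 (Bob)
single_value = pd.Series([28.0])      # e.g. the only "age" value for ID 2 (LiSa)

results = {
    # Summing with NaN skipped leaves an empty sum, which pandas reports as 0.0.
    "sum_of_missing": single_missing.sum(),
    # Mean and max of "nothing" are undefined, so they stay NaN.
    "mean_of_missing": single_missing.mean(),
    "max_of_missing": single_missing.max(),
    # Sample standard deviation needs at least 2 values, skewness at least 3.
    "std_of_single": single_value.std(),
    "skew_of_single": single_value.skew(),
}

for name, value in results.items():
    print(name, value)
```

With only one child row per ID, the spread-based features carry no information on this dataset; they become meaningful once a parent has several children.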
feature_matrix_cats_df.csv
features_defs_cats_df: 25
[<Feature: hobbey>, <Feature: sex>, <Feature: COUNT(nums)>, <Feature: MAX(nums.age)>, <Feature: MAX(nums.money)>, <Feature: MAX(nums.weight)>, <Feature: MEAN(nums.age)>, <Feature: MEAN(nums.money)>, <Feature: MEAN(nums.weight)>, <Feature: MIN(nums.age)>, <Feature: MIN(nums.money)>, <Feature: MIN(nums.weight)>, <Feature: SKEW(nums.age)>, <Feature: SKEW(nums.money)>, <Feature: SKEW(nums.weight)>, <Feature: STD(nums.age)>, <Feature: STD(nums.money)>, <Feature: STD(nums.weight)>, <Feature: SUM(nums.age)>, <Feature: SUM(nums.money)>, <Feature: SUM(nums.weight)>, <Feature: DAY(born)>, <Feature: MONTH(born)>, <Feature: WEEKDAY(born)>, <Feature: YEAR(born)>]
Empty cells are missing values (NaN). The 25 feature columns are shown in four groups, each keyed by ID.

| ID | hobbey | sex | COUNT(nums) |
| --- | --- | --- | --- |
| 4 |  | 男 | 1 |
| 1 | 打籃球 | 男 | 1 |
| 2 | 打羽毛球 | 女 | 1 |

| ID | MAX(nums.age) | MAX(nums.money) | MAX(nums.weight) | MEAN(nums.age) | MEAN(nums.money) | MEAN(nums.weight) | MIN(nums.age) | MIN(nums.money) | MIN(nums.weight) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4 |  | 300 | 155.6 |  | 300 | 155.6 |  | 300 | 155.6 |
| 1 |  | 200 | 140.5 |  | 200 | 140.5 |  | 200 | 140.5 |
| 2 | 28 | 240 | 120.8 | 28 | 240 | 120.8 | 28 | 240 | 120.8 |

| ID | SKEW(nums.age) | SKEW(nums.money) | SKEW(nums.weight) | STD(nums.age) | STD(nums.money) | STD(nums.weight) | SUM(nums.age) | SUM(nums.money) | SUM(nums.weight) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4 |  |  |  |  |  |  | 0 | 300 | 155.6 |
| 1 |  |  |  |  |  |  | 0 | 200 | 140.5 |
| 2 |  |  |  |  |  |  | 28 | 240 | 120.8 |

| ID | DAY(born) | MONTH(born) | WEEKDAY(born) | YEAR(born) |
| --- | --- | --- | --- | --- |
| 4 |  |  |  |  |
| 1 |  |  |  |  |
| 2 | 1 | 1 | 0 | 1990 |
Field descriptions:
- <Feature: hobbey> : The "hobbey".
- <Feature: sex> : The "sex".
- <Feature: COUNT(nums)> : The number of all instances of "nums" for each "ID" in "cats".
- <Feature: MAX(nums.age)> : The maximum of the "age" of all instances of "nums" for each "ID" in "cats".
- <Feature: MAX(nums.money)> : The maximum of the "money" of all instances of "nums" for each "ID" in "cats".
- <Feature: MAX(nums.weight)> : The maximum of the "weight" of all instances of "nums" for each "ID" in "cats".
- <Feature: MEAN(nums.age)> : The average of the "age" of all instances of "nums" for each "ID" in "cats".
- <Feature: MEAN(nums.money)> : The average of the "money" of all instances of "nums" for each "ID" in "cats".
- <Feature: MEAN(nums.weight)> : The average of the "weight" of all instances of "nums" for each "ID" in "cats".
- <Feature: MIN(nums.age)> : The minimum of the "age" of all instances of "nums" for each "ID" in "cats".
- <Feature: MIN(nums.money)> : The minimum of the "money" of all instances of "nums" for each "ID" in "cats".
- <Feature: MIN(nums.weight)> : The minimum of the "weight" of all instances of "nums" for each "ID" in "cats".
- <Feature: SKEW(nums.age)> : The skewness of the "age" of all instances of "nums" for each "ID" in "cats".
- <Feature: SKEW(nums.money)> : The skewness of the "money" of all instances of "nums" for each "ID" in "cats".
- <Feature: SKEW(nums.weight)> : The skewness of the "weight" of all instances of "nums" for each "ID" in "cats".
- <Feature: STD(nums.age)> : The standard deviation of the "age" of all instances of "nums" for each "ID" in "cats".
- <Feature: STD(nums.money)> : The standard deviation of the "money" of all instances of "nums" for each "ID" in "cats".
- <Feature: STD(nums.weight)> : The standard deviation of the "weight" of all instances of "nums" for each "ID" in "cats".
- <Feature: SUM(nums.age)> : The sum of the "age" of all instances of "nums" for each "ID" in "cats".
- <Feature: SUM(nums.money)> : The sum of the "money" of all instances of "nums" for each "ID" in "cats".
- <Feature: SUM(nums.weight)> : The sum of the "weight" of all instances of "nums" for each "ID" in "cats".
- <Feature: DAY(born)> : The day of the month of the "born".
- <Feature: MONTH(born)> : The month of the "born".
- <Feature: WEEKDAY(born)> : The day of the week of the "born".
- <Feature: YEAR(born)> : The year of the "born".
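The four date features at the end of the list are plain calendar parts of `born`. A minimal pandas sketch of the same decomposition, using the `.dt` accessor as a stand-in for the DAY / MONTH / WEEKDAY / YEAR primitives (an assumption for illustration; weekday 0 is Monday in pandas):

```python
import pandas as pd

# One known birth date and one missing value.
born = pd.Series(pd.to_datetime(["1990-01-01", None]))

parts = pd.DataFrame({
    "DAY(born)": born.dt.day,
    "MONTH(born)": born.dt.month,
    "WEEKDAY(born)": born.dt.weekday,  # Monday = 0 ... Sunday = 6
    "YEAR(born)": born.dt.year,
})

print(parts)
```

1 January 1990 was a Monday, so the weekday is 0, in line with the `WEEKDAY(born)` value in the table above. A missing date propagates as NaN into all four columns.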
feature_matrix_nums.csv
features_defs_nums: 29
[<Feature: ID>, <Feature: age>, <Feature: money>, <Feature: weight>, <Feature: cats.hobbey>, <Feature: cats.sex>, <Feature: cats.COUNT(nums)>, <Feature: cats.MAX(nums.age)>, <Feature: cats.MAX(nums.money)>, <Feature: cats.MAX(nums.weight)>, <Feature: cats.MEAN(nums.age)>, <Feature: cats.MEAN(nums.money)>, <Feature: cats.MEAN(nums.weight)>, <Feature: cats.MIN(nums.age)>, <Feature: cats.MIN(nums.money)>, <Feature: cats.MIN(nums.weight)>, <Feature: cats.SKEW(nums.age)>, <Feature: cats.SKEW(nums.money)>, <Feature: cats.SKEW(nums.weight)>, <Feature: cats.STD(nums.age)>, <Feature: cats.STD(nums.money)>, <Feature: cats.STD(nums.weight)>, <Feature: cats.SUM(nums.age)>, <Feature: cats.SUM(nums.money)>, <Feature: cats.SUM(nums.weight)>, <Feature: cats.DAY(born)>, <Feature: cats.MONTH(born)>, <Feature: cats.WEEKDAY(born)>, <Feature: cats.YEAR(born)>]
Empty cells are missing values (NaN). The 29 feature columns are shown in five groups, each keyed by name.

| name | ID | age | money | weight |
| --- | --- | --- | --- | --- |
| Bob | 1 |  | 200 | 140.5 |
| LiSa | 2 | 28 | 240 | 120.8 |
| Mary | 3 | 38 | 290 | 169.4 |
| Alan | 4 |  | 300 | 155.6 |

| name | cats.hobbey | cats.sex | cats.COUNT(nums) |
| --- | --- | --- | --- |
| Bob | 打籃球 | 男 | 1 |
| LiSa | 打羽毛球 | 女 | 1 |
| Mary |  |  |  |
| Alan |  | 男 | 1 |

| name | cats.MAX(nums.age) | cats.MAX(nums.money) | cats.MAX(nums.weight) | cats.MEAN(nums.age) | cats.MEAN(nums.money) | cats.MEAN(nums.weight) | cats.MIN(nums.age) | cats.MIN(nums.money) | cats.MIN(nums.weight) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bob |  | 200 | 140.5 |  | 200 | 140.5 |  | 200 | 140.5 |
| LiSa | 28 | 240 | 120.8 | 28 | 240 | 120.8 | 28 | 240 | 120.8 |
| Mary |  |  |  |  |  |  |  |  |  |
| Alan |  | 300 | 155.6 |  | 300 | 155.6 |  | 300 | 155.6 |

| name | cats.SKEW(nums.age) | cats.SKEW(nums.money) | cats.SKEW(nums.weight) | cats.STD(nums.age) | cats.STD(nums.money) | cats.STD(nums.weight) | cats.SUM(nums.age) | cats.SUM(nums.money) | cats.SUM(nums.weight) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bob |  |  |  |  |  |  | 0 | 200 | 140.5 |
| LiSa |  |  |  |  |  |  | 28 | 240 | 120.8 |
| Mary |  |  |  |  |  |  |  |  |  |
| Alan |  |  |  |  |  |  | 0 | 300 | 155.6 |

| name | cats.DAY(born) | cats.MONTH(born) | cats.WEEKDAY(born) | cats.YEAR(born) |
| --- | --- | --- | --- | --- |
| Bob |  |  |  |  |
| LiSa | 1 | 1 | 0 | 1990 |
| Mary |  |  |  |  |
| Alan |  |  |  |  |
Field descriptions:
- <Feature: ID> : The "ID".
- <Feature: age> : The "age".
- <Feature: money> : The "money".
- <Feature: weight> : The "weight".
- <Feature: cats.sex> : The "sex" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.hobbey> : The "hobbey" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.COUNT(nums)> : The number of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.MAX(nums.age)> : The maximum of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.MAX(nums.money)> : The maximum of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.MAX(nums.weight)> : The maximum of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.MEAN(nums.age)> : The average of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.MEAN(nums.money)> : The average of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.MEAN(nums.weight)> : The average of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.MIN(nums.age)> : The minimum of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.MIN(nums.money)> : The minimum of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.MIN(nums.weight)> : The minimum of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.SKEW(nums.age)> : The skewness of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.SKEW(nums.money)> : The skewness of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.SKEW(nums.weight)> : The skewness of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.STD(nums.age)> : The standard deviation of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.STD(nums.money)> : The standard deviation of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.STD(nums.weight)> : The standard deviation of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.SUM(nums.age)> : The sum of the "age" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.SUM(nums.money)> : The sum of the "money" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.SUM(nums.weight)> : The sum of the "weight" of all instances of "nums" for each "ID" in "cats" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.DAY(born)> : The day of the month of the "born" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.MONTH(born)> : The month of the "born" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.WEEKDAY(born)> : The day of the week of the "born" for the instance of "cats" associated with this instance of "nums".
- <Feature: cats.YEAR(born)> : The year of the "born" for the instance of "cats" associated with this instance of "nums".
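Every `cats.*` feature above is a parent-level value looked up through the shared `ID`. In pandas terms this resembles a left join from the child table onto the parent's features. The sketch below uses a small hypothetical `cats_features` table with just two parent features to show the effect; it is an illustration of the idea, not the featuretools implementation.

```python
import pandas as pd

# Child table (a subset of the nums columns).
nums_df = pd.DataFrame({
    "name": ["Bob", "LiSa", "Mary", "Alan"],
    "ID": [1, 2, 3, 4],
    "money": [200.0, 240.0, 290.0, 300.0],
})

# Hypothetical parent-level features; note that ID 3 has no parent row.
cats_features = pd.DataFrame({
    "ID": [4, 1, 2],
    "sex": ["男", "男", "女"],
    "SUM(nums.money)": [300.0, 200.0, 240.0],
})

# Prefix parent columns with "cats." and attach them to each child row.
joined = (
    nums_df
    .merge(cats_features.add_prefix("cats."), left_on="ID", right_on="cats.ID", how="left")
    .drop(columns="cats.ID")
    .set_index("name")
)

print(joined)
```

Mary (ID 3) has no matching row in the parent table, so all of her `cats.*` columns come out as NaN, which matches the empty Mary rows in the tables above.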
