I have a dataset of shape (4200, 8). The first row looks like this:
X0 X1 X2 X3 X4 X5 X6 X8
32 23 17 0 3 24 9 14
Each value is a categorical code that corresponds to a list of 30 values, like this:
[ 0.06405287, -0.1176078 , -0.06206927, 0.08389127, -0.18036067,
0.35158703, -0.0928449 , -0.0974429 , -0.06705306, -0.17196381,
-0.03776502, 0.09204011, 0.47813812, 0.16258538, 0.2699648 ,
0.07496626, -0.09791522, -0.31499937, -0.24898018, 0.06126055,
0.13187763, 0.21042736, -0.1585868 , 0.08355565, -0.13935572,
0.12408883, 0.2043313 , -0.12544186, -0.09223691, 0.00720569 ]
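My goal is to create one column for each value in that list, placed where the corresponding categorical code used to be. The list above corresponds to the value 14 in column X8, so instead of X8: 14 I have: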
X8_1   X8_2   X8_3   ...  X8_29  X8_30
0.06   -0.11  -0.06  ...  -0.09  0.007
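The end result is that my 8-column dataframe becomes a 240-column dataframe. Each row, of course, has a different set of values. This is how I do it: I take each unique value in a column and build a nested dictionary of the form colname:uniqueval:indexoflist:listvalatindex. Then I turn each row of the dataframe into a dictionary, and for every column and value in that row I look up the corresponding list and concatenate it. Finally I concatenate that row onto the previous rows.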
weights = {}
for index, x in enumerate(encoded.columns):  # this is the dataset with the original encoded values
    weights[x] = {}
    for id, val in enumerate(encoded[x].unique()):
        weights[x][val] = {}
        for weightid, weightval in enumerate(model_full.get_layer(embeddings[index]).get_weights()[0][id]):  # this is where I get the list of 30 values from
            weights[x][val][weightid] = weightval

mappedembeddings = pd.DataFrame()
encodedindex = []
for row in encoded.iterrows():  # iterate over original dataset
    encodedindex.append(row[0])  # store index for later
    df0 = pd.DataFrame()
    for k, v in row[1].to_dict().items():  # for each key/val in row
        names = []
        for z in weights[k][v].keys():
            names.append(str(k) + '_' + str(z))  # naming (z is key of list value)
        tempdf = pd.DataFrame([weights[k][v]])  # dataframe of list at column/value key in dictionary made from embedding layer list
        tempdf.columns = names
        df0 = pd.concat([df0, tempdf], axis=1)
    mappedembeddings = pd.concat([mappedembeddings, df0], axis=0)  # concat row to previous row
mappedembeddings.index = encodedindex
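This takes a very long time. I would like to vectorize this operation, but I am not sure how to go about it, so I would appreciate some insight.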
Answer from a uj5u.com community member:
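- map each of your column values to the corresponding list
- explode the lists into individual rows and stack
- create the required column names, using groupby to number the values within each list
- pivot to get the output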
# weights[col][code] must be a list (or array) of values
values = df.apply(lambda x: x.map(weights[x.name]))
values = values.explode(list(values.columns)).stack().reset_index()
values["column"] = values["level_1"] + "_" + (values.groupby(["level_0", "level_1"]).cumcount() + 1).astype(str)
output = values.pivot(index="level_0", columns="column", values=0)
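The steps above can be followed on a tiny frame where the result is easy to check by hand. This is only a sketch: the data, the two-value lists, and the names `long_df`, `row`, `source` and `value` are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical toy data: two encoded columns and a 2-value list per code.
df = pd.DataFrame({"X0": [0, 1], "X1": [1, 1]})
weights = {
    "X0": {0: [0.1, 0.2], 1: [0.3, 0.4]},
    "X1": {1: [0.5, 0.6]},
}

# Step 1: replace every code with its list of values.
mapped = df.apply(lambda col: col.map(weights[col.name]))

# Step 2: one row per list element, then long format (row, source, value).
long_df = mapped.explode(list(mapped.columns)).stack().reset_index()
long_df.columns = ["row", "source", "value"]

# Step 3: number the elements within each (row, source) group, starting at 1,
# and use that number to build the new column name.
position = long_df.groupby(["row", "source"]).cumcount() + 1
long_df["column"] = long_df["source"] + "_" + position.astype(str)

# Step 4: back to wide format, one column per (source, position) pair.
wide = long_df.pivot(index="row", columns="column", values="value").astype(float)
print(wide)
```

Printing `mapped` and `long_df` after each step shows how the lists are first spread over rows and then folded back into columns.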
Complete working example:
import pandas as pd
import numpy as np

np.random.seed(100)

# random dataframe with three columns X0, X1 and X2
df = pd.DataFrame(data=np.random.randint(30, size=(2, 3)),
                  columns=[f"X{i}" for i in range(3)]
                  )

# creating weights dictionary
# weights[col][number]: list of 5 numbers
weights = dict()
for c in df.columns:
    weights[c] = {num: np.random.rand(5) for num in df[c].unique()}

values = df.apply(lambda x: x.map(weights[x.name]))
values = values.explode(list(values.columns)).stack().reset_index()
values["column"] = values["level_1"] + "_" + (values.groupby(["level_0", "level_1"]).cumcount() + 1).astype(str)
output = values.pivot(index="level_0", columns="column", values=0)
>>> output
column       X0_1      X0_2      X0_3  ...      X2_3      X2_4      X2_5
level_0                                ...
0        0.844776  0.004719  0.121569  ...  0.372832  0.005689  0.252426
1        0.136707  0.575093  0.891322  ...  0.598843  0.603805  0.105148

[2 rows x 15 columns]
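If the lists come from embedding matrices anyway, another option that may be worth trying is to skip the dictionary and index the matrices directly with NumPy. The sketch below is not part of the answer above and rests on a strong assumption: each code is itself the row number of its list in the matrix. If the codes are not row numbers (the question picks rows by position in `unique()`), they would first have to be translated to row positions. `encoded` mirrors the name used in the question; `matrices` and the random values are hypothetical stand-ins for the real embedding weights.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: `encoded` holds integer codes, and matrices[col]
# plays the role of the embedding weight matrix of that column, with one
# row per code (assumption: code c maps to matrices[col][c]).
encoded = pd.DataFrame({"X0": [2, 0, 1], "X1": [1, 1, 0]})
rng = np.random.default_rng(0)
matrices = {
    "X0": rng.random((3, 4)),  # 3 codes, 4 values each
    "X1": rng.random((2, 4)),  # 2 codes, 4 values each
}

blocks, names = [], []
for col in encoded.columns:
    matrix = matrices[col]
    # Integer-array indexing looks up the rows for the whole column at once.
    blocks.append(matrix[encoded[col].to_numpy()])
    names.extend(f"{col}_{i + 1}" for i in range(matrix.shape[1]))

# Place the per-column blocks side by side and keep the original row index.
mapped = pd.DataFrame(np.hstack(blocks), columns=names, index=encoded.index)
print(mapped.shape)
```

This loops only over the columns, never over the rows, and it keeps the new columns in their natural order (X0_1, X0_2, ...) rather than the sorted order that pivot produces.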
