I have the following data structure, in which category names and product names are mixed together:
df = pd.DataFrame(data={'name':['Category A', 'Subcategory A.A', 'Product A', 'Product B', 'Category B', 'Product C'],'values':["", "", 1,2,"", 3]})
name             values
Category A
Subcategory A.A
Product A        1
Product B        2
Category B
Product C        3
Every row whose values entry is empty is a category name.
Is there a way to transform the pandas DataFrame into the following structure?
name       values  category
Product A  1       Category A, Subcategory A.A
Product B  2       Category A, Subcategory A.A
Product C  3       Category B
Any help is appreciated.
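A uj5u.com user replied:
Use cumsum to create a custom grouping per category block, then use groupby.apply to return the non-category rows with a new category column: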
# create custom grouping per category block
newgroup = df['values'].eq('') & df['values'].shift().ne('')
groups = newgroup.cumsum()
# given group g, return the subframe of non-category rows with the category name
def categorize(g):
    is_category = g['values'].eq('')
    category = ', '.join(g.loc[is_category, 'name'])  # join category names by comma
    return g.loc[~is_category].assign(category=category)  # return non-category rows with new category column
# apply custom function to each group
df.groupby(groups).apply(categorize).droplevel(0)
Output:
name values category
2 Product A 1 Category A, Subcategory A.A
3 Product B 2 Category A, Subcategory A.A
5 Product C 3 Category B
Details
Each category block starts where the current values entry is empty and the previous one is not, so we can use cumsum to generate pseudo-groups. The groups are shown here as a column for visual reference only:
newgroup = df['values'].eq('') & df['values'].shift().ne('')
groups = newgroup.cumsum()
#               name values  groups
# 0       Category A              1
# 1  Subcategory A.A              1
# 2        Product A      1       1
# 3        Product B      2       1
# 4       Category B              2
# 5        Product C      3       2
Within each group, build the category string by joining the name values of all category rows. We can then assign the new category column and return only the non-category rows:
def categorize(g):
    is_category = g['values'].eq('')
    category = ', '.join(g.loc[is_category, 'name'])  # join category rows by comma
    return g.loc[~is_category].assign(category=category)  # return non-category rows with new category column
Pass this function to groupby.apply:
df.groupby(groups).apply(categorize).droplevel(0)
#         name values                     category
# 2  Product A      1  Category A, Subcategory A.A
# 3  Product B      2  Category A, Subcategory A.A
# 5  Product C      3                   Category B
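For comparison, here is a minimal self-contained sketch of the same grouping idea that avoids groupby.apply by aggregating the category names once per block and mapping them back onto the product rows. This is a variant, not the answer's own code; the names is_category, labels, and result are introduced here for illustration only, and it assumes empty values entries are exactly the empty string '' as in the question.

```python
import pandas as pd

df = pd.DataFrame(data={
    'name': ['Category A', 'Subcategory A.A', 'Product A', 'Product B', 'Category B', 'Product C'],
    'values': ["", "", 1, 2, "", 3],
})

# True for rows that hold a category name (empty 'values' entry)
is_category = df['values'].eq('')

# A block starts at a category row whose previous row is not a category row
groups = (is_category & ~is_category.shift(fill_value=False)).cumsum()

# One comma-joined label per block, built from the category rows only
labels = df.loc[is_category, 'name'].groupby(groups[is_category]).agg(', '.join)

# Keep the product rows and look up each row's block label
result = df.loc[~is_category].assign(category=groups[~is_category].map(labels))
print(result)
```

The result keeps the original index labels of the product rows (2, 3, 5), just like the groupby.apply version after droplevel(0).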
A uj5u.com user replied:
I'm not sure whether there is a particularly pandas-like way to handle this, so I came up with a plain Python solution:
new_data = {'name': [], 'values': [], 'category': []}
lasts = {}
for idx, row in df.iterrows():
    tp, val = row['name'].split(' ')
    if row['values'] == '':
        lasts[tp] = val
        # Reset the subcategory if a new category is encountered
        if tp == 'Category' and 'Subcategory' in lasts:
            del lasts['Subcategory']
    else:
        new_data['category'].append(', '.join(f'{k} {v}' for k, v in lasts.items()))
        for k in row.index:
            new_data[k].append(row[k])
df = pd.DataFrame(new_data)
Output:
>>> df
name values category
0 Product A 1 Category A, Subcategory A.A
1 Product B 2 Category A, Subcategory A.A
2 Product C 3 Category B
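The loop above depends on every name being exactly two space-separated words and on the literal prefixes Category and Subcategory. As a hedged sketch, the same plain-Python idea can rely only on the rule stated in the question (an empty values entry marks a category row). The names pending, prev_was_category, rows, and result are hypothetical and introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame(data={
    'name': ['Category A', 'Subcategory A.A', 'Product A', 'Product B', 'Category B', 'Product C'],
    'values': ["", "", 1, 2, "", 3],
})

rows = []                  # collected product rows
pending = []               # category names of the current block
prev_was_category = False  # whether the previous row was a category row

for name, value in zip(df['name'], df['values']):
    if value == '':
        # A category row that follows a product row starts a new block
        if not prev_was_category:
            pending = []
        pending.append(name)
        prev_was_category = True
    else:
        rows.append({'name': name, 'values': value, 'category': ', '.join(pending)})
        prev_was_category = False

result = pd.DataFrame(rows)
print(result)
```

Because the rows are rebuilt from scratch, result gets a fresh 0-based index, as in the output above.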
