I have a Pandas df with 2 columns:
   name       Count_Relationship
0  allicin    DOWNREGULATE: 1
1  allicin    DOWNREGULATE: 2
2  allicin    UPREGULATE: 1 | DOWNREGULATE: 1
3  aspirin    UPREGULATE: 5 | DOWNREGULATE: 1
4  albuterol  DOWNREGULATE: 1
5  albuterol  UPREGULATE: 3
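I want to keep only the rows for which, after grouping by "name", the total DOWNREGULATE count in the "Count_Relationship" column is greater than the total UPREGULATE count. In this case, allicin has DOWNREGULATE 1 + 2 + 1 = 4 and UPREGULATE = 1, so num_downregulate > num_upregulate, which is not the case for the others (aspirin, albuterol). I want to return this filtered df: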
   name     Count_Relationship
0  allicin  DOWNREGULATE: 1
1  allicin  DOWNREGULATE: 2
2  allicin  UPREGULATE: 1 | DOWNREGULATE: 1
The Count_Relationship column is a string, so I have to parse the numeric part of the string and convert it to int.
I tried this:
import pandas as pd
data = {'name': ['allicin', 'allicin', 'allicin', 'aspirin', 'albuterol', 'albuterol'],
        'Count_Relationship': ['DOWNREGULATE: 1', 'DOWNREGULATE: 2', 'UPREGULATE: 1 | DOWNREGULATE: 1', 'UPREGULATE: 5 | DOWNREGULATE: 1', 'DOWNREGULATE: 1' , 'UPREGULATE: 3']
        }
df = pd.DataFrame(data)
substances = df["name"].tolist()
substances = list(set(substances)) # to get the unique names
result_substances = []
for substance in (substances):
    try:
        numberOfdownregulate = df[(df["name"] == substance) & (\
        (df["Count_Relationship"].str.match(pat = '("DOWNREGULATE:"([0-9]))')).values[0].astype(int)
    except:
        pass
    try:
        numberOfupregulate = df[(df["name"] == substance) & (\
        (df["Count_Relationship"].str.match(pat = '("UPREGULATE:"([0-9]))')).values[0].astype(int)
    except:
        pass
    result = numberOfdownregulate - numberOfupregulate
    if result > 0:
        result_substances.append(substance)
df_filtered = df[df["name"].isin(result_substances)]
But I get a syntax error on the numberOfdownregulate line, where my regular expression is. How can I fix the algorithm? Thank you very much.
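On the syntax error itself: in the attempt above, the `[` and `(` opened on the numberOfdownregulate line are never closed before `except:`, and `str.match` returns booleans rather than the matched digits. A minimal sketch of the same loop idea with those two points addressed might look like the following. The helper name `total` is hypothetical, and the sketch assumes every entry has the form `LABEL: n`, with entries separated by ` | `.

```python
import pandas as pd

data = {'name': ['allicin', 'allicin', 'allicin', 'aspirin', 'albuterol', 'albuterol'],
        'Count_Relationship': ['DOWNREGULATE: 1', 'DOWNREGULATE: 2',
                               'UPREGULATE: 1 | DOWNREGULATE: 1',
                               'UPREGULATE: 5 | DOWNREGULATE: 1',
                               'DOWNREGULATE: 1', 'UPREGULATE: 3']}
df = pd.DataFrame(data)

def total(rows, label):
    """Sum the integer that follows `label: ` over a Series of strings (hypothetical helper)."""
    # str.extract returns the captured digits as strings, NaN where the label is absent.
    digits = rows.str.extract(rf"\b{label}: (\d+)", expand=False)
    # Fill missing entries with the string "0" so the whole column converts cleanly to int.
    return digits.fillna("0").astype(int).sum()

result_substances = []
for substance in df["name"].unique():
    rows = df.loc[df["name"] == substance, "Count_Relationship"]
    if total(rows, "DOWNREGULATE") > total(rows, "UPREGULATE"):
        result_substances.append(substance)

df_filtered = df[df["name"].isin(result_substances)]
print(df_filtered)
```

The explicit loop is kept only to stay close to the original attempt; a grouped, vectorized version is usually shorter.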
Reply from a uj5u.com user:
You can extract the information, compare up and down, and build a mask to select the data:
drugs = (df.join(df['Count_Relationship']
                 .str.extractall(r'(?P<down>(?<=DOWNREGULATE: )\d+)|(?P<up>(?<=UPREGULATE: )\d+)')
                 .groupby(level=0).first().fillna(0).astype(int)
                 )
           .groupby('name').agg({'down': 'sum', 'up': 'sum'})
           .query('down > up')  # strictly greater, as the question asks
           .index
         )
df[df['name'].isin(drugs)]
Output:
   name     Count_Relationship
0  allicin  DOWNREGULATE: 1
1  allicin  DOWNREGULATE: 2
2  allicin  UPREGULATE: 1 | DOWNREGULATE: 1
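To make the chained expression easier to follow, here is a sketch that splits the extract, collapse, and sum steps into separate variables. The variable names `pattern`, `matches`, `counts`, and `totals` are illustrative and not part of the answer, and missing values are filled with the string "0" (rather than the integer 0) as an assumption, so that each column holds only strings before conversion.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['allicin', 'allicin', 'allicin', 'aspirin', 'albuterol', 'albuterol'],
    'Count_Relationship': ['DOWNREGULATE: 1', 'DOWNREGULATE: 2',
                           'UPREGULATE: 1 | DOWNREGULATE: 1',
                           'UPREGULATE: 5 | DOWNREGULATE: 1',
                           'DOWNREGULATE: 1', 'UPREGULATE: 3'],
})

# Two named groups; each lookbehind only accepts digits preceded by its label.
pattern = r'(?P<down>(?<=DOWNREGULATE: )\d+)|(?P<up>(?<=UPREGULATE: )\d+)'

# Step 1: one row per regex match, indexed by (original row, match number).
matches = df['Count_Relationship'].str.extractall(pattern)

# Step 2: collapse back to one row per original row, keeping the first
# non-null value in each column, then convert the digit strings to int.
counts = matches.groupby(level=0).first().fillna("0").astype(int)

# Step 3: attach the counts to the original rows and sum them per name.
totals = df.join(counts).groupby('name')[['down', 'up']].sum()

# Step 4: names whose total down count is strictly greater than the up count.
drugs = totals.query('down > up').index
print(df[df['name'].isin(drugs)])
```

Printing `matches` and `counts` along the way shows how a row containing both labels yields two matches that are merged back into a single row.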
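Reply from a uj5u.com user:
I suggest extracting the DOWNREGULATE and UPREGULATE values into separate columns, then summing the values grouped by name and checking which total is greater.
The example below creates an additional boolean column named UP_gt_DOWN, which is True when UPREGULATE is greater than DOWNREGULATE: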
# Extract the digits after each label; rows without the label get 0
df['UPREGULATE'] = df['Count_Relationship'].str.extract(r"UPREGULATE: (\d+)", expand=False).fillna(0).astype(int)
df['DOWNREGULATE'] = df['Count_Relationship'].str.extract(r"DOWNREGULATE: (\d+)", expand=False).fillna(0).astype(int)
# Sum only the two numeric columns so the string column is left out
summed_df = df.groupby('name')[['UPREGULATE', 'DOWNREGULATE']].sum()
summed_df['UP_gt_DOWN'] = summed_df['UPREGULATE'] > summed_df['DOWNREGULATE']
print(summed_df)
# Output
#            UPREGULATE  DOWNREGULATE  UP_gt_DOWN
# name
# albuterol           3             1        True
# allicin             1             4       False
# aspirin             5             1        True
# Keep the names where UPREGULATE is not greater than DOWNREGULATE (ties are kept as well)
filtered_drugs = summed_df[~summed_df['UP_gt_DOWN']].index
print(df[df['name'].isin(filtered_drugs)])
# Output
#       name               Count_Relationship  UPREGULATE  DOWNREGULATE
# 0  allicin                  DOWNREGULATE: 1           0             1
# 1  allicin                  DOWNREGULATE: 2           0             2
# 2  allicin  UPREGULATE: 1 | DOWNREGULATE: 1           1             1
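If adding helper columns to df is not wanted, one possible variation (a sketch, not part of the answer above) keeps the counts in standalone Series and uses groupby(...).transform('sum') to build a row-level mask directly. It uses a strict comparison, so names with equal totals are dropped, which is an assumption based on the wording of the question. The names `up`, `down`, `up_total`, `down_total`, and `filtered` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['allicin', 'allicin', 'allicin', 'aspirin', 'albuterol', 'albuterol'],
    'Count_Relationship': ['DOWNREGULATE: 1', 'DOWNREGULATE: 2',
                           'UPREGULATE: 1 | DOWNREGULATE: 1',
                           'UPREGULATE: 5 | DOWNREGULATE: 1',
                           'DOWNREGULATE: 1', 'UPREGULATE: 3'],
})

col = df['Count_Relationship']
# Digits after each label as int; rows without the label count as 0.
up = col.str.extract(r"\bUPREGULATE: (\d+)", expand=False).fillna("0").astype(int)
down = col.str.extract(r"\bDOWNREGULATE: (\d+)", expand=False).fillna("0").astype(int)

# transform('sum') broadcasts each name's total back onto its own rows,
# so the result lines up with df and can be used as a boolean mask.
up_total = up.groupby(df['name']).transform('sum')
down_total = down.groupby(df['name']).transform('sum')

filtered = df[down_total > up_total]
print(filtered)
```

Because nothing is assigned back to df, the filtered frame keeps only the original two columns.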
Please credit the source when reposting. Link to this article: https://www.uj5u.com/gongcheng/321138.html
