我有一個資料框,其中包含生物列,其中包含與特定正交群相關的基因名稱串列。我還有一個候選基因串列,我想用它來詢問這個資料框。
我想要做的是遍歷資料框,只保留與基因串列中任何候選基因匹配的字串部分。如果一行中的任何字串都沒有匹配項,我想完全洗掉該行。
我嘗試過的: 我嘗試使用此處鏈接的解決方案,該解決方案非常有效,但僅適用于一列 - 我想對所有列都執行此操作。
我還嘗試使用 np.where() 的一些組合而沒有任何收益。
可重現的資料:
資料框示例:
{'Orthogroup': {0: 'OG0000000',
1: 'OG0000001',
2: 'OG0000002',
3: 'OG0000003',
4: 'OG0000004'},
'acidithiobacillus_ferrooxidans': {0: 'WP_012536716.1, WP_113526628.1',
1: 'WP_009560941.1, WP_009567366.1, WP_012536141.1, WP_012536296.1, WP_012536581.1, WP_012537464.1, WP_012537498.1, WP_041645259.1, WP_229129755.1',
2: nan,
3: 'WP_009562369.1',
4: 'WP_009564938.1, WP_012535757.1, WP_012535999.1, WP_012536379.1, WP_012536404.1, WP_012536894.1, WP_012537015.1, WP_012537128.1, WP_232027572.1'},
'acidithiobacillus_thiooxidans': {0: 'WP_010639275.1, WP_010639739.1, WP_010640967.1, WP_029316246.1, WP_157999017.1, WP_157999033.1, WP_157999054.1',
1: 'WP_010637203.1, WP_010637257.1, WP_010639175.1, WP_010639301.1, WP_010639864.1, WP_010639966.1, WP_010639971.1, WP_010640895.1, WP_010641185.1, WP_050816103.1, WP_226825859.1, WP_226826127.1',
2: 'WP_235179950.1',
3: 'WP_010639365.1, WP_010642423.1',
4: 'WP_010637024.1, WP_010637323.1, WP_010637554.1, WP_010637935.1, WP_010639011.1, WP_010639453.1, WP_010640567.1, WP_010642114.1, WP_024893638.1'},
'burkholderia_pseudomallei': {0: 'WP_004185903.1, WP_004190283.1, WP_004528082.1, WP_004530921.1, WP_004534726.1, WP_011204972.1',
1: 'WP_004187662.1, WP_004187694.1, WP_004188321.1, WP_004189039.1, WP_004192300.1, WP_004198029.1, WP_004526108.1, WP_004538115.1, WP_004551242.1, WP_004553906.1, WP_004554106.1, WP_009931394.1',
2: 'WP_004187329.1, WP_004191606.1, WP_004195354.1, WP_004524029.1, WP_004524162.1, WP_004524990.1, WP_004525719.1, WP_004526156.1, WP_004529761.1, WP_004534748.1, WP_004552066.1, WP_004552880.1, WP_004553008.1, WP_004553113.1, WP_004554092.1, WP_014696855.1, WP_024431032.1, WP_024431563.1, WP_038742977.1, WP_038794545.1, WP_038803065.1',
3: 'WP_004185837.1, WP_004187646.1, WP_004188318.1, WP_004189230.1, WP_004189405.1, WP_004192049.1, WP_004192063.1, WP_004194581.1, WP_004197652.1, WP_004198340.1, WP_004266649.1, WP_004521296.1, WP_004522591.1, WP_004522925.1, WP_004523005.1, WP_004524197.1, WP_004525408.1, WP_004525453.1, WP_004528137.1, WP_004529948.1, WP_004531197.1, WP_004536377.1, WP_004541514.1, WP_004543954.1, WP_004553290.1, WP_004553756.1, WP_004553993.1, WP_011204859.1',
4: 'WP_004193481.1, WP_004195334.1, WP_004195802.1, WP_004196705.1, WP_004196791.1, WP_004200681.1, WP_004521379.1, WP_004524701.1, WP_004526731.1, WP_004529133.1, WP_004535483.1, WP_004537932.1, WP_004552076.1, WP_038729337.1'},
'e_coli': {0: 'WP_000043761.1, WP_000859945.1, WP_000969032.1, WP_001266293.1, WP_001301264.1, WP_001310896.1, WP_001360132.1',
1: 'WP_000125282.1, WP_000148503.1, WP_000417791.1, WP_000941041.1, WP_001188777.1, WP_001251544.1, WP_001295369.1, WP_001301108.1, WP_001301332.1',
2: 'WP_000483239.1, WP_000919536.1, WP_001098559.1, WP_001297437.1',
3: 'WP_000164036.1, WP_000357790.1, WP_000440317.1, WP_000648572.1, WP_000817708.1, WP_000935206.1, WP_001061575.1, WP_001242684.1',
4: 'WP_000365791.1, WP_000423261.1, WP_000442949.1, WP_000548294.1, WP_000952503.1, WP_001011462.1, WP_001019525.1, WP_001109794.1, WP_001300638.1, WP_001350493.1'}}
候選基因示例:
["WP_012536716.1","WP_004196791.1","WP_001297437.1"]
理想輸出: 理想情況下,這應該如下所示:
{'Orthogroup': {0: 'OG0000000',
2: 'OG0000002',
4: 'OG0000004'},
'acidithiobacillus_ferrooxidans': {0: 'WP_012536716.1',
2: nan,
4: nan},
'acidithiobacillus_thiooxidans': {0: nan,
2: nan,
4: nan},
'burkholderia_pseudomallei': {0: nan,
2: nan,
4: 'WP_004196791.1'},
'e_coli': {0: nan,
2: 'WP_001297437.1',
4: nan}}
任何幫助是極大的贊賞!
謝謝。
uj5u.com熱心網友回復:
使用自定義 lambda 函式DataFrame.applymap來過濾分割值candidates,如果沒有匹配得到空字串,所以最后添加DataFrame.replace替換空字串到缺失值:
candidates = set(["WP_012536716.1","WP_004196791.1","WP_001297437.1"])
f = lambda x: ', '.join(y for y in x.split(', ') if y in candidates)
df = (df.set_index('Orthogroup')
.fillna('')
.applymap(f)
.replace('', np.nan)
.dropna(how='all')
.reset_index())
print (df)
Orthogroup acidithiobacillus_ferrooxidans acidithiobacillus_thiooxidans \
0 OG0000000 WP_012536716.1 NaN
1 OG0000002 NaN NaN
2 OG0000004 NaN NaN
burkholderia_pseudomallei e_coli
0 NaN NaN
1 NaN WP_001297437.1
2 WP_004196791.1 NaN
uj5u.com熱心網友回復:
您別無選擇,只能使用applymap.
為此,您可以使用自定義函式:
candidates = ["WP_012536716.1","WP_004196791.1","WP_001297437.1"]
S = set(candidates)
def process(x):
if pd.isna(x):
return x
out = ', '.join(S.intersection(x.split(', ')))
if out:
return out
df2 = df.set_index('Orthogroup').applymap(process).dropna(how='all').reset_index()
輸出:
Orthogroup acidithiobacillus_ferrooxidans acidithiobacillus_thiooxidans burkholderia_pseudomallei e_coli
0 OG0000000 WP_012536716.1 NaN NaN NaN
1 OG0000002 NaN NaN NaN WP_001297437.1
2 OG0000004 NaN NaN WP_004196791.1 NaN
轉載請註明出處,本文鏈接:https://www.uj5u.com/qiye/461335.html
