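I have a data frame that is the output of an application that overlaps mutations with genes. A large mutation can sometimes overlap several genes, so the data frame is structured like this: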
mutation1 1gene_affected # mut1 only affected one gene
mutation2 1gene_affected # mut2 has affected 2 genes
mutation2 2gene_affected
mutation3 NO_gene_affected # there is also this. This can be filtered previously.
How can I count the following:
number of mutations that affect 1 gene,
number of mutations that affect 2 genes,
number of mutations that affect 3 genes,
number of mutations that affect 4 genes,
number of mutations that affect 5 genes,
number of mutations that affect > 5 but <10,
number of mutations that affect >10 but <20,
number of mutations that affect >30 genes,
I would like to store these values in variables and then call a function I have already written that saves the statistics to a file.
Answer from a uj5u.com user:
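Assuming your data frame has the columns ["mutation", "gene"], calling value_counts on the mutation column gives the number of rows for each mutation, which is the number of genes it affects (rows such as NO_gene_affected should be filtered out first, since each would otherwise count as one gene). A comparison method such as eq (or ge for "at least X") is then enough. For example, to select all mutations that affect exactly X genes: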
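counts = df["mutation"].value_counts()  # rows per mutation = genes affected
mask_eq_X = df["mutation"].map(counts).eq(X)  # boolean mask aligned with df's rows
print(df[mask_eq_X])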
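As a minimal sketch of this idea, assuming the columns are named "mutation" and "gene" as above and that NO_gene_affected rows are dropped first, the per-mutation counts can be tallied a second time to find how many mutations affect exactly 1, 2, 3, ... genes. The toy data and variable names here are illustrative only.

```python
import pandas as pd

# Toy data mirroring the question; the column names are an assumption.
df = pd.DataFrame({
    "mutation": ["mutation1", "mutation2", "mutation2", "mutation3"],
    "gene": ["1gene_affected", "1gene_affected",
             "2gene_affected", "NO_gene_affected"],
})

# Drop placeholder rows for mutations that overlap no gene.
hits = df[df["gene"] != "NO_gene_affected"]

# Rows per mutation = number of genes that mutation overlaps.
genes_per_mutation = hits["mutation"].value_counts()

# Tally how many mutations share each gene count.
mutations_per_gene_count = genes_per_mutation.value_counts().sort_index()

# Store individual values in variables; .get avoids a KeyError for absent counts.
n_affecting_1 = int(mutations_per_gene_count.get(1, 0))
n_affecting_2 = int(mutations_per_gene_count.get(2, 0))
print(n_affecting_1, n_affecting_2)
```

Using .get with a default of 0 keeps the lookup safe when no mutation happens to affect a given number of genes.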
Answer from a uj5u.com user:
Clean up the second column, then use pd.cut:
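import numpy as np
import pandas as pd

# Replace the 'NO_' prefix with '0', then pull out the leading integer.
count = df['mutation'].str.replace('NO_', '0') \
          .str.extract(r'^(\d+)', expand=False).astype(int)
lbls = ['No gene', '1 gene', '2 genes', '3 genes', '4 genes', '5 genes',
        'between 10 and 20', 'between 20 and 30', 'more than 30 genes']
# right=False makes each bin left-closed, e.g. [1, 2); note that the
# '5 genes' label therefore covers the whole [5, 10) range.
bins = [-np.inf, 1, 2, 3, 4, 5, 10, 20, 30, np.inf]
df['group'] = pd.cut(count, bins=bins, labels=lbls, right=False)
out = df.value_counts('group', sort=False)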
Output:
>>> out
group
No gene               1
1 gene                2
2 genes               1
3 genes               0
4 genes               0
5 genes               0
between 10 and 20     0
between 20 and 30     0
more than 30 genes    0
dtype: int64
Setup:
>>> df
        name          mutation
0  mutation1    1gene_affected
1  mutation2    1gene_affected
2  mutation2    2gene_affected
3  mutation3  NO_gene_affected
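If the goal is to bin the number of genes per mutation rather than the digit parsed from each row, pd.cut can be applied to the per-mutation counts instead. The sketch below is only one possible reading of the ranges in the question: the counts are made-up sample values, and the bin edges (6-9, 10-19, 20-29, 30+) are an assumption, since the question leaves 10 and the 20 to 30 range unspecified.

```python
import numpy as np
import pandas as pd

# Hypothetical per-mutation gene counts, e.g. the result of
# value_counts() on the mutation column after filtering.
genes_per_mutation = pd.Series(
    {"mutA": 1, "mutB": 2, "mutC": 2, "mutD": 7, "mutE": 12, "mutF": 45}
)

# Assumed bin edges. Intervals are right-closed by default,
# so (5, 9] holds mutations affecting 6 to 9 genes.
bins = [0, 1, 2, 3, 4, 5, 9, 19, 29, np.inf]
labels = ["1", "2", "3", "4", "5", "6-9", "10-19", "20-29", "30+"]

groups = pd.cut(genes_per_mutation, bins=bins, labels=labels)

# sort=False keeps the categories in bin order rather than by frequency.
stats = groups.value_counts(sort=False).to_dict()
print(stats)
```

Collecting the counts in one dictionary, rather than in nine separate variables, makes it straightforward to hand them to whatever existing function writes the statistics to a file.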
