Iterate over two columns and count how many values in one column match exact values in the second column?

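I have a data frame that is the output of an application that overlaps mutations with genes. A large mutation can sometimes overlap several genes, so the data frame is structured like this: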

mutation1        1gene_affected # mut1 only affected one gene
mutation2        1gene_affected # mut2 has affected 2 genes
mutation2        2gene_affected
mutation3        NO_gene_affected # there is also this. This can be filtered previously.

How can I count the following:

number of mutations that affect 1 gene,
number of mutations that affect 2 genes,
number of mutations that affect 3 genes,
number of mutations that affect 4 genes,
number of mutations that affect 5 genes,
number of mutations that affect > 5 but <10,
number of mutations that affect >10 but <20,
number of mutations that affect >30 genes,

I would like to store these values in variables and then call a function I have already written that saves the statistics to a file.

A uj5u.com user replied:

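Assuming your data frame has the columns ["mutation", "gene"], calling value_counts on the mutation column gives the number of rows, and therefore the number of genes, for each mutation. A comparison method such as eq (or ge for "at least") is then enough. For example, to find all mutations that affect exactly X genes: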

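counts = df["mutation"].value_counts()  # rows (genes) per mutation
mask_eq_X = counts.eq(X)                # True where a mutation has exactly X rows
print(counts[mask_eq_X])                # the mask is indexed by mutation, so apply it to counts, not df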
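
A minimal self-contained sketch of this approach, assuming the example data from the question and the column names used above. The frame contents, the choice X = 2, and the pre-filtering of the NO_gene_affected rows are illustrative assumptions. Because value_counts is indexed by mutation name, the mask is applied to the counts, and map carries the counts back to the original rows.

```python
import pandas as pd

# Example data modelled on the question (assumed column names).
df = pd.DataFrame({
    "mutation": ["mutation1", "mutation2", "mutation2", "mutation3"],
    "gene": ["1gene_affected", "1gene_affected",
             "2gene_affected", "NO_gene_affected"],
})

# Drop mutations that overlap no gene, so they are not counted as one gene.
hits = df[df["gene"] != "NO_gene_affected"]

# One row per overlapped gene, so the row count per mutation is its gene count.
genes_per_mutation = hits["mutation"].value_counts()

X = 2  # illustrative value
exactly_x = genes_per_mutation[genes_per_mutation.eq(X)]   # exactly X genes
at_least_x = genes_per_mutation[genes_per_mutation.ge(X)]  # X genes or more

# Rows of the filtered frame that belong to mutations with exactly X genes.
rows_exactly_x = hits[hits["mutation"].map(genes_per_mutation).eq(X)]

print(len(exactly_x))
print(rows_exactly_x)
```

Here len(exactly_x) is the number of mutations affecting exactly X genes, which is the kind of value the question wants to store in a variable.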

A uj5u.com user replied:

Clean up the second column, then use pd.cut:

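import numpy as np
import pandas as pd

# Turn 'NO_' into '0', then pull out the leading number as an integer.
count = df['mutation'].str.replace('NO_', '0') \
                      .str.extract(r'^(\d+)', expand=False).astype(int)

lbls = ['No gene', '1 gene', '2 genes', '3 genes', '4 genes', '5 genes',
        'between 10 and 20', 'between 20 and 30', 'more than 30 genes']
# Bins are left-closed because of right=False, so '5 genes' covers 5 to 9.
bins = [-np.inf, 1, 2, 3, 4, 5, 10, 20, 30, np.inf]

df['group'] = pd.cut(count, bins=bins, labels=lbls, right=False)

# Count rows per group, keeping the label order.
out = df.value_counts('group', sort=False)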

Output:

>>> out
group
No gene               1
1 gene                2
2 genes               1
3 genes               0
4 genes               0
5 genes               0
between 10 and 20     0
between 20 and 30     0
more than 30 genes    0
dtype: int64
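
To see where pd.cut with right=False places values that sit exactly on a bin edge, here is a small sketch with made-up counts. The values, edges, and labels are illustrative and not taken from the answer's data. Each bin is closed on the left and open on the right, so 5 falls into the bin that starts at 5 and 10 into the bin that starts at 10.

```python
import numpy as np
import pandas as pd

# Made-up gene counts, chosen to hit the bin edges.
counts = pd.Series([0, 1, 5, 9, 10, 30])

# Illustrative edges: [-inf, 1), [1, 5), [5, 10), [10, inf)
bins = [-np.inf, 1, 5, 10, np.inf]
labels = ["0", "1 to 4", "5 to 9", "10 or more"]

binned = pd.cut(counts, bins=bins, labels=labels, right=False)
print(binned.tolist())
```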

Setup:

>>> df
        name          mutation
0  mutation1    1gene_affected
1  mutation2    1gene_affected
2  mutation2    2gene_affected
3  mutation3  NO_gene_affected
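
Putting both answers together, one possible way to get the per-mutation statistics the question asks for is to count rows per mutation first and bin those counts afterwards. This is only a sketch under assumptions: the bin edges are one reading of the ranges in the question (which leave exactly 10 and the range 20 to 30 unspecified), and save_stats is a hypothetical stand-in for the asker's existing function.

```python
import numpy as np
import pandas as pd

# Same frame as in the setup above.
df = pd.DataFrame({
    "name": ["mutation1", "mutation2", "mutation2", "mutation3"],
    "mutation": ["1gene_affected", "1gene_affected",
                 "2gene_affected", "NO_gene_affected"],
})

# Number of gene rows per mutation; NO_gene_affected rows contribute 0.
is_hit = df["mutation"].ne("NO_gene_affected")
genes_per_mutation = is_hit.groupby(df["name"]).sum()

# Assumed left-closed bins: [0,1), [1,2), ..., [5,6), [6,10), [10,20), [20,30), [30,inf)
bins = [0, 1, 2, 3, 4, 5, 6, 10, 20, 30, np.inf]
labels = ["0 genes", "1 gene", "2 genes", "3 genes", "4 genes", "5 genes",
          "6 to 9 genes", "10 to 19 genes", "20 to 29 genes", "30 or more genes"]

groups = pd.cut(genes_per_mutation, bins=bins, labels=labels, right=False)

# One entry per label, including labels with no mutations.
stats = groups.value_counts(sort=False).to_dict()


def save_stats(stats):
    """Hypothetical placeholder for the asker's own file-writing function."""
    for label, n in stats.items():
        print(f"{label}\t{n}")


save_stats(stats)
```

Counting per mutation before binning means each mutation lands in exactly one group, whereas binning the rows directly counts a mutation once for every gene row it has.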

Please credit the source when reposting. Link to this article: https://www.uj5u.com/gongcheng/446240.html

Tags: Python, pandas
