I have a DataFrame that follows this format:
df = pd.DataFrame({'subtype': ['AC', 'SCC', 'SCC', 'AC', 'AC', 'SCC', 'AC'],
                   'geneA': ['0.56', '0.74', '0.89', '0.99', '0.24', '0.76', '0.60'],
                   'geneB': ['0.54', '0.73', '0.82', '0.99', '0.23', '0.74', '0.61'],
                   'geneC': ['0.53', '0.72', '0.84', '0.97', '0.23', '0.76', '0.62'],
                   'geneD': ['0.52', '0.77', '0.89', '0.99', '0.23', '0.75', '0.64'],
                   'geneE': ['0.51', '0.77', '0.89', '0.93', '0.23', '0.76', '0.64'],
                   'geneF': ['0.50', '0.79', '0.89', '0.96', '0.26', '0.73', '0.65'],
                   'geneG': ['0.56', '0.78', '0.89', '0.99', '0.23', '0.76', '0.64']})
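The real one is much larger (it has about 1000 genes, i.e. columns). Each number is an mRNA abundance value.
I need to compare the AC and SCC subtypes for each gene using the Wilcoxon rank-sum test. I need to do this for every gene in the dataset, so essentially 1000 times. Here group1 holds the mRNA values of the AC subtype for a gene, and group2 holds the mRNA values of the SCC subtype for the same gene.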
from scipy.stats import ranksums
ranksums(group1, group2)
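I need to write a for loop that uses the rank-sum test to compare the mRNA values between the two subtypes/groups, AC and SCC, and produces a list of p-values. In other words, I need to run the Wilcoxon rank-sum test 1000 times, once per gene (each column is a gene), to build a list of p-values comparing AC with SCC.
How can I do this in Python? Here is what I tried, without luck.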
p_vals = []
for i in range(1000):
    new_data = subset.copy()
    permuted_labels = list(subset['subtype'].sample(n=subset.shape[0], replace=False))
    new_data['subtype'] = permuted_labels
    group1 = new_data.loc[new_data.subtype == 'AC']
    group2 = new_data.loc[new_data.subtype == 'SCC']
    ranksums = ranksums(group1, group2)
    p_vals.append(ranksums)
print(p_vals)
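I also need to do something similar, but instead of computing p-values I need to compute, for each gene, the fold change (FC) in mean mRNA abundance between the AC and SCC subtypes (with the AC value in the numerator of the FC). I then need to combine the gene FCs and the p-values from the rank-sum test into one table. In addition, I need to use: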
from statsmodels.stats.multitest import fdrcorrection
fdrcorrection(list_of_pvalues, alpha=0.05, method='indep', is_sorted=False)
def geneFC(df, geneColumnName):
    # function to return fold change (AC mean / SCC mean) for a given gene
    ac = df[(df['subtype'] == 'AC')]
    scc = df[(df['subtype'] == 'SCC')]
    acGene = ac[geneColumnName]
    sccGene = scc[geneColumnName]
    return acGene.mean()/sccGene.mean()

genes = list(df.columns)  # list of genes from df columns
genes.remove('subtype')  # removes "subtype" from list
fc_values = []  # list of FC values to fill
for gene in genes:  # loops through list of genes
    fc_values.append(geneFC(df, gene))  # adds FC value of gene to list
A uj5u.com user replied:
I think I have a working solution, but I'm not sure why it returns exactly the same p-value for every gene. Is this a property of the data you provided?
import pandas as pd
from scipy.stats import ranksums

df = pd.DataFrame({'subtype': ['AC', 'SCC', 'SCC', 'AC', 'AC', 'SCC', 'AC'],
                   'geneA': ['0.56', '0.74', '0.89', '0.99', '0.24', '0.76', '0.60'],
                   'geneB': ['0.54', '0.73', '0.82', '0.99', '0.23', '0.74', '0.61'],
                   'geneC': ['0.53', '0.72', '0.84', '0.97', '0.23', '0.76', '0.62'],
                   'geneD': ['0.52', '0.77', '0.89', '0.99', '0.23', '0.75', '0.64'],
                   'geneE': ['0.51', '0.77', '0.89', '0.93', '0.23', '0.76', '0.64'],
                   'geneF': ['0.50', '0.79', '0.89', '0.96', '0.26', '0.73', '0.65'],
                   'geneG': ['0.56', '0.78', '0.89', '0.99', '0.23', '0.76', '0.64']})

def geneRankSum(df, geneColumnName):
    # function to return the rank-sum p-value for a given gene
    ac = df[(df['subtype'] == 'AC')]
    scc = df[(df['subtype'] == 'SCC')]
    # the values are stored as strings, so convert them to float before testing
    acGene = ac[geneColumnName].astype(float)
    sccGene = scc[geneColumnName].astype(float)
    return ranksums(acGene, sccGene).pvalue

genes = list(df.columns)  # list of genes from df columns
genes.remove('subtype')  # removes "subtype" from list
pvalues = []  # list of pvalues to fill
for gene in genes:  # loops through list of genes
    pvalues.append(geneRankSum(df, gene))  # adds pvalue of gene to list
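
On the identical p-values: the rank-sum test looks only at the ranks of the pooled values, not at the values themselves. A quick check, assuming the seven-sample example data from the question, suggests that the AC samples occupy the same rank positions in every gene, so every gene yields the same statistic and the same p-value. The names `values`, `ranks` and `ac_rank_sums` are introduced here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({'subtype': ['AC', 'SCC', 'SCC', 'AC', 'AC', 'SCC', 'AC'],
                   'geneA': ['0.56', '0.74', '0.89', '0.99', '0.24', '0.76', '0.60'],
                   'geneB': ['0.54', '0.73', '0.82', '0.99', '0.23', '0.74', '0.61'],
                   'geneC': ['0.53', '0.72', '0.84', '0.97', '0.23', '0.76', '0.62'],
                   'geneD': ['0.52', '0.77', '0.89', '0.99', '0.23', '0.75', '0.64'],
                   'geneE': ['0.51', '0.77', '0.89', '0.93', '0.23', '0.76', '0.64'],
                   'geneF': ['0.50', '0.79', '0.89', '0.96', '0.26', '0.73', '0.65'],
                   'geneG': ['0.56', '0.78', '0.89', '0.99', '0.23', '0.76', '0.64']})

# Convert the gene columns from strings to floats.
values = df.drop(columns='subtype').astype(float)

# Rank the seven samples within each gene column (1 = smallest value).
ranks = values.rank()

# Sum the ranks of the AC samples for each gene; this sum drives the test statistic.
ac_rank_sums = ranks[df['subtype'] == 'AC'].sum()
print(ac_rank_sums)  # every gene gives 13.0
```

So for this toy data the identical p-values are a property of the data rather than a bug. With about 1000 real genes and more samples, the rank patterns, and therefore the p-values, would normally differ.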
def geneFC(df, geneColumnName):
    # function to return fold change (AC mean / SCC mean) for a given gene
    ac = df[(df['subtype'] == 'AC')]
    scc = df[(df['subtype'] == 'SCC')]
    acGene = ac[geneColumnName]
    sccGene = scc[geneColumnName]
    return acGene.mean()/sccGene.mean()

genes = list(df.columns)  # list of genes from df columns
genes.remove('subtype')  # removes "subtype" from list
data = df[genes].astype(float)  # convert the string values to floats
data['subtype'] = df['subtype']
fc_values = []  # list of FC values to fill
for gene in genes:  # loops through list of genes
    fc_values.append(geneFC(data, gene))  # adds FC value of gene to list
print(fc_values)
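
The question also asks for the fold changes and p-values in one table, plus FDR correction with `fdrcorrection`. One possible way to put the pieces together is sketched below, assuming the same example data and that statsmodels is installed. The column names `fold_change`, `pvalue`, `pvalue_fdr` and `significant`, and the variable `results`, are choices made for this sketch and do not come from the original post.

```python
import pandas as pd
from scipy.stats import ranksums
from statsmodels.stats.multitest import fdrcorrection

df = pd.DataFrame({'subtype': ['AC', 'SCC', 'SCC', 'AC', 'AC', 'SCC', 'AC'],
                   'geneA': ['0.56', '0.74', '0.89', '0.99', '0.24', '0.76', '0.60'],
                   'geneB': ['0.54', '0.73', '0.82', '0.99', '0.23', '0.74', '0.61'],
                   'geneC': ['0.53', '0.72', '0.84', '0.97', '0.23', '0.76', '0.62'],
                   'geneD': ['0.52', '0.77', '0.89', '0.99', '0.23', '0.75', '0.64'],
                   'geneE': ['0.51', '0.77', '0.89', '0.93', '0.23', '0.76', '0.64'],
                   'geneF': ['0.50', '0.79', '0.89', '0.96', '0.26', '0.73', '0.65'],
                   'geneG': ['0.56', '0.78', '0.89', '0.99', '0.23', '0.76', '0.64']})

# Convert the gene columns from strings to floats once, up front.
genes = [col for col in df.columns if col != 'subtype']
data = df[genes].astype(float)
is_ac = df['subtype'] == 'AC'
is_scc = df['subtype'] == 'SCC'

# Collect one row per gene: fold change (AC in the numerator) and rank-sum p-value.
rows = []
for gene in genes:
    ac = data.loc[is_ac, gene]
    scc = data.loc[is_scc, gene]
    rows.append({'gene': gene,
                 'fold_change': ac.mean() / scc.mean(),
                 'pvalue': ranksums(ac, scc).pvalue})
results = pd.DataFrame(rows).set_index('gene')

# Benjamini-Hochberg FDR correction across all genes.
rejected, adjusted = fdrcorrection(results['pvalue'].to_numpy(),
                                   alpha=0.05, method='indep', is_sorted=False)
results['pvalue_fdr'] = adjusted
results['significant'] = rejected
print(results)
```

Building a list of dicts and creating the DataFrame once at the end avoids growing a DataFrame row by row, which should matter more with about 1000 genes than with the seven shown here.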
