這是我擁有的資料集的示例:
行代表不同的種質,列代表不同的基因。
gene1 gene2 gene3
1 PRESENT LOST PRESENT
2 PRESENT PRESENT LOST
3 LOST PRESENT PRESENT
我想找到加入之間的共同基因的數量,并創建一個資料框加入與加入,如下所示:
> test_self
1 2 3
1 2 1 1
2 1 2 1
3 1 1 2
我曾經使用R以下代碼創建此資料框
gene1 = c("PRESENT", "PRESENT", "LOST")
gene2 = c("LOST", "PRESENT", "PRESENT")
gene3 = c("PRESENT","LOST","PRESENT")
test_PAV <- data.frame (gene1,gene2,gene3)
test_self <- data.frame(matrix(ncol = nrow(test_PAV), nrow = nrow(test_PAV)))
colnames(test_self)<-row.names(test_PAV)
row.names(test_self)<-row.names(test_PAV)
for (rowInQuestion in 1:nrow(test_PAV)){
for (rowBeingComparedTo in 1:nrow(test_PAV)){
commonGeneCount <- 0
for (col in 1:ncol(test_PAV)){
if(test_PAV[rowInQuestion,col]=='PRESENT'&&test_PAV[rowBeingComparedTo,col]=='PRESENT'){
commonGeneCount <- commonGeneCount 1
}
test_self[rowInQuestion,rowBeingComparedTo]<-commonGeneCount
}
}
}
但我很清楚這是一個非常低效的解決方案。首先,由于上下三角形是對稱的,因此不必計算兩次。
我在這里找到了一個類似的解決方案,但是共性是連續的,但是在我的情況下,為了將其稱為共性,相同的基因(列)必須在兩個加入中都是“存在”的。
謝謝你。
uj5u.com熱心網友回復:
首先,讓我們只保留 PRESENT 基因并frozenset每行構建一個現有基因:
s = (
df.where(df.eq('PRESENT'))
.stack()
.reset_index(level=1)
.groupby(level=0)['level_1'].agg(frozenset)
)
# 1 (gene3, gene1)
# 2 (gene2, gene1)
# 3 (gene3, gene2)
# Name: level_1, dtype: object
現在我們可以從基因集的長度獲得itertools.product(或者,更快combinations)來構建預期的資料框:intersection
from itertools import product
out = (
pd.Series({(i,j): len(a.intersection(b))
for (i,a),(j,b) in product(zip(s.index, s), repeat=2)})
.unstack()
)
# 1 2 3
# 1 2 1 1
# 2 1 2 1
# 3 1 1 2
組合更快,因為只計算 A/B 或 B/A 中的 1 個,而不是 A/A:
from itertools import combinations
out = (
pd.Series({(i,j): len(a.intersection(b))
for (i,a),(j,b) in combinations(zip(s.index, s), r=2)})
.unstack()
)
# 2 3
# 1 1.0 1.0
# 2 NaN 1.0
uj5u.com熱心網友回復:
向量化將提高更大資料集的速度:
test_PAV <- as.data.frame(matrix(sample(c("PRESENT", "LOST"), 3e4, replace = TRUE), ncol = 150))
fCommonVec <- function(test_PAV) {
m <- as.matrix(test_PAV) == "PRESENT"
n <- nrow(m) - 1L
test_self <- matrix(nrow = nrow(test_PAV), ncol = nrow(test_PAV))
test_self[lower.tri(test_self)] <- rowSums(m[rep.int(1:n, n:1),] & m[sequence(n:1, 2:nrow(m)),])
diag(test_self) <- rowSums(m)
return(test_self)
}
與回圈版本比較:
fCommonLoop <- function(test_PAV) {
pavMatrix_transposeBinary <- as.matrix(test_PAV) == "PRESENT"
pavSelfSimilarity <- matrix(nrow = nrow(pavMatrix_transposeBinary), ncol = nrow(pavMatrix_transposeBinary))
for (rowInQuestion in 1:nrow(pavMatrix_transposeBinary)){
for (rowBeingComparedTo in 1:nrow(pavMatrix_transposeBinary)){
pavSelfSimilarity[rowInQuestion,rowBeingComparedTo] <- sum(pavMatrix_transposeBinary[rowInQuestion,]*pavMatrix_transposeBinary[rowBeingComparedTo,])
}
}
return(pavSelfSimilarity)
}
test_self <- fCommonVec(test_PAV)
pavSelfSimilarity <- fCommonLoop(test_PAV)
all.equal(test_self[lower.tri(test_self, diag = TRUE)], pavSelfSimilarity[lower.tri(pavSelfSimilarity, diag = TRUE)])
#> [1] TRUE
microbenchmark::microbenchmark(fCommonVec = fCommonVec(test_PAV),
fCommonLoop = fCommonLoop(test_PAV))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> fCommonVec 25.5148 30.6751 36.76842 32.6326 36.0579 76.0979 100
#> fCommonLoop 215.4542 226.2288 235.84177 229.6783 235.9802 304.1668 100
uj5u.com熱心網友回復:
因此,如果有人想使用它,我也想出了另一個解決方案;
我沒有使用帶有“PRESENT”和“LOST”的資料框,而是sed將它們轉換為 1 和 0。
sed -i '' 's/PRESENT/1/g' pav-matrix.binary.txt
sed -i '' 's/LOST/0/g' pav-matrix.binary.txt
然后使用這個二進制矩陣來比較每一行的每個乘積并得到乘積陣列的總和。
for (rowInQuestion in 1:nrow(pavMatrix_transposeBinary)){
for (rowBeingComparedTo in 1:nrow(pavMatrix_transposeBinary)){
pavSelfSimilarity[rowInQuestion,rowBeingComparedTo] <- sum(pavMatrix_transposeBinary[rowInQuestion,]*pavMatrix_transposeBinary[rowBeingComparedTo,])
}
}
uj5u.com熱心網友回復:
這個for回圈awk應該可以作業。
for i in {2..4}; do
for j in {2..4};do
tmp=$(awk '{print $'$i',$'$j'}' input |sed 1d|awk '$1==$2 {print}'|grep -c "PRESENT");echo -en "$tmp\t"
done
echo -en "\n"
done >> test_self
輸出:
2 1 1
1 2 1
1 1 2
您可能必須自己添加標題,順序應該相同。
謝謝!!
轉載請註明出處,本文鏈接:https://www.uj5u.com/houduan/422933.html
標籤:
