I would like to know whether there is a way to convert a plain text file like the one shown below
target: locus9_window12
length: 120
miRNA : hsa-miR-4458
length: 19
mfe: -23.7 kcal/mol
p-value: 0.033901
target: locus104_window172
length: 120
miRNA : hsa-let-7b-5p
length: 22
mfe: -26.2 kcal/mol
p-value: 0.015466
target: locus119_window193
length: 120
miRNA : hsa-let-7b-5p
length: 22
mfe: -32.8 kcal/mol
p-value: 0.00028
into a comma-separated, CSV-style format like this:
target length miRNA length mfe p-value
locus9_window12 120 hsa-miR-4458 19 -23.7 0.033901
locus104_window172 120 hsa-let-7b-5p 22 -26.2 0.015466
locus119_window193 120 hsa-let-7b-5p 22 -32.8 0.00028
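If the plain text file can be converted to a comma-separated CSV file, I would be grateful for any help and contributions.
A uj5u.com user replied:
Here is a potential solution using a regular expression and pandas methods. I renamed the second length field to miRNA_length (assuming it is the length of the miRNA) to avoid duplicate column names.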
import re
import pandas as pd

with open('filename.txt') as f:
    t = f.read()

# each match is a (key, value) pair; the value stops at the first space, which drops "kcal/mol"
df = (pd.DataFrame(re.findall(r'([^\s:]+)\s*: (\S*)', t), columns=['col', 'value'])
        # rename the length field that follows miRNA
        .assign(col=lambda d: d['col'].mask(d['col'].shift().eq('miRNA'), 'miRNA_length'))
        # group the data by row
        .assign(index=lambda d: d.groupby('col').cumcount())
        # reshape to wide format
        .pivot(index='index', columns='col', values='value')
        .rename_axis(index=None, columns=None)
        # convert to the best available dtypes
        .convert_dtypes()
     )
Output:
length mfe miRNA miRNA_length p-value target
0 120 -23.7 hsa-miR-4458 19 0.033901 locus9_window12
1 120 -26.2 hsa-let-7b-5p 22 0.015466 locus104_window172
2 120 -32.8 hsa-let-7b-5p 22 0.00028 locus119_window193
If the input is a string rather than a file:
t = '''target: locus9_window12
length: 120
miRNA : hsa-miR-4458
length: 19
mfe: -23.7 kcal/mol
p-value: 0.033901
target: locus104_window172
length: 120
miRNA : hsa-let-7b-5p
length: 22
mfe: -26.2 kcal/mol
p-value: 0.015466
target: locus119_window193
length: 120
miRNA : hsa-let-7b-5p
length: 22
mfe: -32.8 kcal/mol
p-value: 0.00028
'''
Save as CSV:
df.to_csv('out.csv') # check the doc for more options
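For comparison, here is a minimal standard-library sketch of the same idea (the regular expression plus the miRNA_length rename) that writes the columns in the order the question asked for. The names SAMPLE, COLUMNS, parse_records and to_csv are made up for illustration, and the sketch assumes every record starts with a target line.

```python
import csv
import io
import re

# Two records from the question, used only as sample input.
SAMPLE = """target: locus9_window12
length: 120
miRNA : hsa-miR-4458
length: 19
mfe: -23.7 kcal/mol
p-value: 0.033901
target: locus104_window172
length: 120
miRNA : hsa-let-7b-5p
length: 22
mfe: -26.2 kcal/mol
p-value: 0.015466
"""

# Column order requested in the question (hypothetical constant name).
COLUMNS = ["target", "length", "miRNA", "miRNA_length", "mfe", "p-value"]


def parse_records(text):
    """Turn 'key: value' lines into one dict per 'target' block."""
    records = []
    previous_key = None
    for key, value in re.findall(r"([^\s:]+)\s*: (\S*)", text):
        # Assumption: a 'target' line opens every record.
        if key == "target" or not records:
            records.append({})
        # The length that directly follows miRNA is the miRNA's length.
        if key == "length" and previous_key == "miRNA":
            key = "miRNA_length"
        records[-1][key] = value
        previous_key = key
    return records


def to_csv(text):
    """Return the parsed records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(parse_records(text))
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(SAMPLE), end="")
```

As in the pandas version, the value pattern stops at the first space, so the mfe column keeps the number and drops the unit.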
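A uj5u.com user replied:
Miller is well suited to this file format. The input file needs a small adjustment first: add a blank line between records and remove the colons.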
awk -F: 'NR > 1 && $1 == "target" {print ""}; {sub(/:/,""); print}' file \
| mlr --ixtab --ocsv cat
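Output (both length lines share the same key, so only the second one, the miRNA length, appears, and mfe keeps its unit):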
target,length,miRNA,mfe,p-value
locus9_window12,19,hsa-miR-4458,-23.7 kcal/mol,0.033901
locus104_window172,22,hsa-let-7b-5p,-26.2 kcal/mol,0.015466
locus119_window193,22,hsa-let-7b-5p,-32.8 kcal/mol,0.00028
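To make the preprocessing step easier to follow, here is a rough Python sketch of what the awk one-liner does, followed by a simplified reading of the resulting key/value blocks. It is not Miller itself: to_xtab and xtab_to_records are hypothetical helper names, and letting a repeated key keep its last value is an assumption chosen because it matches the output shown above.

```python
# Two records from the question, used only as sample input.
SAMPLE = """target: locus9_window12
length: 120
miRNA : hsa-miR-4458
length: 19
mfe: -23.7 kcal/mol
p-value: 0.033901
target: locus104_window172
length: 120
miRNA : hsa-let-7b-5p
length: 22
mfe: -26.2 kcal/mol
p-value: 0.015466
"""


def to_xtab(text):
    """Add a blank line before every record after the first; drop the first colon per line."""
    out = []
    for number, line in enumerate(text.splitlines(), start=1):
        key = line.split(":", 1)[0]  # same role as $1 with -F: in awk
        if number > 1 and key == "target":
            out.append("")
        out.append(line.replace(":", "", 1))
    return "\n".join(out) + "\n"


def xtab_to_records(xtab):
    """One dict per blank-line-separated block; a repeated key keeps its last value (assumption)."""
    records = []
    for block in xtab.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            key, value = line.split(None, 1)  # key, then the rest of the line
            record[key] = value.strip()
        records.append(record)
    return records


if __name__ == "__main__":
    for record in xtab_to_records(to_xtab(SAMPLE)):
        print(record)
```

If both lengths are needed, one option is to rename the second length key during preprocessing, as the pandas answer does.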
A uj5u.com user replied:
Assuming there are no commas in your text file and no spaces in the fields you want to print, GNU awk can help:
# foo.awk
BEGIN {
    # header in the same order as the six lines of each record
    print "target,length,miRNA,miRNA_length,mfe,p-value"
}
{
    # store the last field of each line at index NR%6
    fields[NR%6] = $NF
}
NR%6 == 0 {
    # every sixth line, print the six stored fields as one row
    for(i=1; i<=6; i++) printf("%s%c", fields[i%6], i==6 ? "\n" : OFS)
}
Then:
awk -v OFS=, -f foo.awk foo.txt
target,length,miRNA,miRNA_length,mfe,p-value
locus9_window12,120,hsa-miR-4458,19,kcal/mol,0.033901
locus104_window172,120,hsa-let-7b-5p,22,kcal/mol,0.015466
locus119_window193,120,hsa-let-7b-5p,22,kcal/mol,0.00028
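Explanation: we fill the fields array with the last field ($NF) of each line, stored at index "line number modulo 6" (NR%6). Note that line numbers start at 1, so the last line of each group of 6 lands at index 0 in the array, not 6. When the current record number is a multiple of 6, we print the contents of the fields array. The output field separator is set to a comma (-v OFS=,). Because $NF is the last whitespace-separated field, the mfe line contributes its unit (kcal/mol) rather than the number, as the output shows; this is where the "no spaces in the fields" assumption matters.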
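The modulo-6 bookkeeping described above can be sketched in Python as follows. This is only an illustration of the indexing, not a replacement for the awk script: group_last_fields is a hypothetical name, and the sketch assumes the input has no blank lines and exactly six lines per record.

```python
# One record from the question, used only as sample input.
SAMPLE_LINES = [
    "target: locus9_window12",
    "length: 120",
    "miRNA : hsa-miR-4458",
    "length: 19",
    "mfe: -23.7 kcal/mol",
    "p-value: 0.033901",
]


def group_last_fields(lines, size=6):
    """Collect the last whitespace-separated field of each line into rows of `size`."""
    fields = [""] * size
    rows = []
    for number, line in enumerate(lines, start=1):  # awk's NR also starts at 1
        fields[number % size] = line.split()[-1]    # same role as $NF
        if number % size == 0:
            # Indices 1..size-1 come first, then index 0 (the last line of the group).
            rows.append([fields[i % size] for i in range(1, size + 1)])
    return rows


if __name__ == "__main__":
    for row in group_last_fields(SAMPLE_LINES):
        print(",".join(row))
```

Like the awk version, this takes the last field of every line, so the mfe position holds the unit rather than the number.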
