我的作業是處理生物資訊學檔案,這些檔案中的字符和數字組合成一排,如下圖所示
EDIT更新的例子chr1 1 100
chr10 1 500
chr2 33 52
chr5 11 66
chr22 99 1052
CHR11 444 2141
CHR2 555 1200
chr7 300 444
chr7 44 222
chr21 24 6023
CHR16 224 5521
chr3 3 200
chrX 6234 79593
chr1 5291 5500
chrY 204 310
我希望像這樣進行排序
chr1 1 100
chr1 5291 5500
chr2 33 52
chr2 555 1200
chr3 3 200
chr5 11 66
chr7 44 222
CHR7 300 444
chr10 1 500
CHR11 444 2141
CHR16 224 5521
chr21 24 6023
chr22 99 1052
chrX 6234 79593
chrY 204 310
我正在使用受控陣列橫向排序,因為我用for回圈來掃描它們,但既定的排序方法是數字、字串或型別排序。 我希望能有像 GNU sort -V 排序那樣的方法。 在下面的鏈接中,我看到我可以為排序建立一個自定義函式,但我不確定如何為字母數字值進行排序。
我有一個想法,就是做這樣的事情,把字母值和數字部分分開排序,然后再想辦法把它們結合起來,但是我不確定到底該怎么做。
echo "d23FgE55" | awk '{alph=$1;num=$1;gsub("[0-9] ", "-", alph);gsub("[A-z] ", "-", num);print alph, num}'
d-FgE- -23-55
編輯部
我很抱歉,其中一個答案讓我想起了我有時遇到的一些更復雜的例子,讓我問到了更接近于sort -V行為的東西。
未排序的例子
chr1_KI270708v1_random 9240458 9393655
chrUn_KI270589v1 5789405 6182867
chr19_GL383576v1_alt 4363702 4753141
chr8_KI270820v1_alt 1008865 1426444
chrUn_GL000220v1 7612088 7825236
chrUn_KI270591v1 9457975 9812609
chr19_GL949747v2_alt 1578276 2033118
chr13_KI270841v1_alt 8680841 9033557
chr17_KI270859v1_alt 3996864 4344945
chr20_GL383577v2_alt 3002112 3480396
chrUn_KI270322v1 1563181 1629375
chrUn_KI270419v1 5482364 5893900
chrUn_KI270310v1 4229845 4626802
chr17_KI270907v1_alt 1735306 2201566
chr9 2052238 2476827
chr1_KI270713v1_random 7260088 7358876
chr19_KI270891v1_alt 7485890 7719006
chr19_KI270917v1_alt 7816269 7864474
chrUn_KI270378v1 8848225 9158581
chr19_KI270933v1_alt 1600444 2095219
chr3_KI270937v1_alt 9902343 10113942
chr12_KI270904v1_alt 6924313 7067502
chrUn_GL000214v1 2468418 2728693
chr19_KI270884v1_alt 2760167 3027068
chrUn_KI270582v1 6390491 6398266
chr5_KI270897v1_alt 324409 548890
chr1_GL383520v2_alt 3718510 3948906
chr19_KI270888v1_alt 3809944 3824324
chr11_JH159137v1_alt 6090599 6480896
chr1_KI270762v1_alt 4196600 4684821
chr11_KI270831v1_alt 2112401 2490081
chr5_GL339449v2_alt 7220557 7718111
chr19_KI270889v1_alt 8709969 8823256
chr19_KI270931v1_alt 6356216 6811953
chr5_GL383530v1_alt 109389 128439
chr11_KI270927v1_alt 2119470 2348459
chr17_KI270730v1_random 404268 854145
chrUn_KI270387v1 7430161 7648856
chr10 2656835 2873499
chr19_GL383573v1_alt 7863497 7896279
chrUn_KI270741v1 1292450 1371680
chrUn_KI270335v1 7360266 7748998
chr9_GL383539v1_alt 3394859 3499461
chrX 4490524 4892623
chrUn_KI270311v1 963681 1069745
chr11_JH159136v1_alt 8171978 8319851
chr17_JH159148v1_alt 3708868 3984631
chrUn_KI270544v1 7025954 7392905
chr19_KI270888v1_alt 9683785 10166473
chrUn_KI270521v1 6924036 7001552
chr1_KI270710v1_random 8843336 9304602
chr19_GL949746v1_alt 8572018 8832793
chrUn_KI270322v1 9392841 9512920
chrUn_KI270366v1 3332191 3576201
chr13_KI270841v1_alt 9033828 9276044
chr19_GL949748v2_alt 3321575 3743545
chr16 3704062 4122526
chr17_GL383563v3_alt 7476487 7845527
chr8_KI270810v1_alt 8499569 8873953
chr22_KI270732v1_random 9187626 9418154
chr20_GL383577v2_alt 3031797 3421627
chr9_GL383541v1_alt 3771485 3927905
chr9_KI270720v1_random 8948742 9157627
chr16_KI270854v1_alt 7278604 7297845
chr1_KI270763v1_alt 1518275 1527443
chrUn_KI270517v1 3377374 3454859
chr15_KI270850v1_alt 6358541 6822565
chr22_KB663609v1_alt 4944645 4971886
chr7_KI270806v1_alt 9032201 9471652
chrUn_KI270438v1 8523562 8944980
chr17_KI270730v1_random 3544067 3796807
chr18 1856815 2144546
chr20_KI270869v1_alt 2269342 2353172
chr5_GL383530v1_alt 2139701 2285854
chr8_KI270819v1_alt 7048265 7503415
chr17_JH159148v1_alt 7040113 7042904
chr5_KI270793v1_alt 9363008 9816819
chr19_KI270931v1_alt 1794178 2143519
chr3_KI270778v1_alt 2228100 2549359
chr19_KI270932v1_alt 8320855 8486835
chr12_GL877876v1_alt 3736839 3820171
chr12_GL877876v1_alt 2805577 2974710
chr4_KI270789v1_alt 3309756 3669565
chr19_KI270917v1_alt 9443280 9678387
chr11_KI270831v1_alt 8603751 9028904
chrUn_KI270387v1 5189812 5439563
chrUn_KI270507v1 599827 666674
chr1_KI270711v1_random 6111532 6446935
chr2_KI270773v1_alt 8604675 8922311
chr16_KI270856v1_alt 5855373 6089898
chr3 3097856 3436204
chr13_KI270840v1_alt 3127654 3295739
chr15_KI270849v1_alt 8948327 9336376
chr18_KI270863v1_alt 3984764 4166006
chr19_KI270920v1_alt 5554949 5919294
chr12 6624798 7106478
chr4_GL383528v1_alt 2099770 2280753
chr2_KI270769v1_alt 8706329 9147304
chr8_KI270812v1_alt 6146462 6388606
分類輸出
chr1_GL383520v2_alt 3718510 3948906
chr1_KI270708v1_random 9240458 9393655
chr1_KI270710v1_random 8843336 9304602
chr1_KI270711v1_random 6111532 6446935
chr1_KI270713v1_random 7260088 7358876
chr1_KI270762v1_alt 4196600 4684821
chr1_KI270763v1_alt 1518275 1527443
chr2_KI270769v1_alt 8706329 9147304
chr2_KI270773v1_alt 8604675 8922311
chr3 3097856 3436204
chr3_KI270778v1_alt 2228100 2549359
chr3_KI270937v1_alt 9902343 10113942
chr4_GL383528v1_alt 2099770 2280753
chr4_KI270789v1_alt 3309756 3669565
chr5_GL339449v2_alt 7220557 7718111
chr5_GL383530v1_alt 109389 128439
chr5_GL383530v1_alt 2139701 2285854
chr5_KI270793v1_alt 9363008 9816819
chr5_KI270897v1_alt 324409 548890
chr7_KI270806v1_alt 9032201 9471652
chr8_KI270810v1_alt 8499569 8873953
chr8_KI270812v1_alt 6146462 6388606
chr8_KI270819v1_alt 7048265 7503415
chr8_KI270820v1_alt 1008865 1426444
chr9 2052238 2476827
chr9_GL383539v1_alt 3394859 3499461
chr9_GL383541v1_alt 3771485 3927905
chr9_KI270720v1_random 8948742 9157627
chr10 2656835 2873499
chr11_JH159136v1_alt 8171978 8319851
chr11_JH159137v1_alt 6090599 6480896
chr11_KI270831v1_alt 2112401 2490081
chr11_KI270831v1_alt 8603751 9028904
chr11_KI270927v1_alt 2119470 2348459
chr12 6624798 7106478
chr12_GL877876v1_alt 2805577 2974710
chr12_GL877876v1_alt 3736839 3820171
chr12_KI270904v1_alt 6924313 7067502
chr13_KI270840v1_alt 3127654 3295739
chr13_KI270841v1_alt 8680841 9033557
chr13_KI270841v1_alt 9033828 9276044
chr15_KI270849v1_alt 8948327 9336376
chr15_KI270850v1_alt 6358541 6822565
chr16 3704062 4122526
chr16_KI270854v1_alt 7278604 7297845
chr16_KI270856v1_alt 5855373 6089898
chr17_GL383563v3_alt 7476487 7845527
chr17_JH159148v1_alt 3708868 3984631
chr17_JH159148v1_alt 7040113 7042904
chr17_KI270730v1_random 404268 854145
chr17_KI270730v1_random 3544067 3796807
chr17_KI270859v1_alt 3996864 4344945
chr17_KI270907v1_alt 1735306 2201566
chr18 1856815 2144546
chr18_KI270863v1_alt 3984764 4166006
chr19_GL383573v1_alt 7863497 7896279
chr19_GL383576v1_alt 4363702 4753141
chr19_GL949746v1_alt 8572018 8832793
chr19_GL949747v2_alt 1578276 2033118
chr19_GL949748v2_alt 3321575 3743545
chr19_KI270884v1_alt 2760167 3027068
chr19_KI270888v1_alt 3809944 3824324
chr19_KI270888v1_alt 9683785 10166473
chr19_KI270889v1_alt 8709969 8823256
chr19_KI270891v1_alt 7485890 7719006
chr19_KI270917v1_alt 7816269 7864474
chr19_KI270917v1_alt 9443280 9678387
chr19_KI270920v1_alt 5554949 5919294
chr19_KI270931v1_alt 1794178 2143519
chr19_KI270931v1_alt 6356216 6811953
chr19_KI270932v1_alt 8320855 8486835
chr19_KI270933v1_alt 1600444 2095219
chr20_GL383577v2_alt 3002112 3480396
chr20_GL383577v2_alt 3031797 3421627
chr20_KI270869v1_alt 2269342 2353172
chr22_KB663609v1_alt 4944645 4971886
chr22_KI270732v1_random 9187626 9418154
chrUn_GL000214v1 2468418 2728693
chrUn_GL000220v1 7612088 7825236
chrUn_KI270310v1 4229845 4626802
chrUn_KI270311v1 963681 1069745
chrUn_KI270322v1 1563181 1629375
chrUn_KI270322v1 9392841 9512920
chrUn_KI270335v1 7360266 7748998
chrUn_KI270366v1 3332191 3576201
chrUn_KI270378v1 8848225 9158581
chrUn_KI270387v1 5189812 5439563
chrUn_KI270387v1 7430161 7648856
chrUn_KI270419v1 5482364 5893900
chrUn_KI270438v1 8523562 8944980
chrUn_KI270507v1 599827 666674
chrUn_KI270517v1 3377374 3454859
chrUn_KI270521v1 6924036 7001552
chrUn_KI270544v1 7025954 7392905
chrUn_KI270582v1 6390491 6398266
chrUn_KI270589v1 5789405 6182867
chrUn_KI270591v1 9457975 9812609
chrUn_KI270741v1 1292450 1371680
chrX 4490524 4892623
https://www.gnu.org/software/gawk/manual/html_node/Controlling-Array-Traversal.html
uj5u.com熱心網友回復:
一種gnu-awk方法,將輸入分成數字和字串陣列,然后使用2種不同的排序型別:
awk '{
v = $1
gsub(/[a-z] /, "", v)
如果(v 0 == v)
narr[v * 10^8 $2] = $0
否則
sarr[v,2美元] = 0美元
}
END {
PROCINFO["sorted_in"]="@ind_num_asc"
for (i in narr) print narr[i].
PROCINFO["sorted_in"]="@ind_str_asc"
for (i in sarr) print sarr[i].
}'檔案
chr1 1 100
chr1 5291 5500
chr2 33 52
chr2 555 1200
chr3 3 200
chr5 11 66
chr7 44 222
CHR7 300 444
chr10 1 500
CHR11 444 2141
CHR16 224 5521
chr21 24 6023
chr22 99 1052
chrX 6234 79593
chrY 204 310
uj5u.com熱心網友回復:
如果你不介意在數字排序中字母排序在數字之前,這可能是你所需要的全部(或者可以找到一個不存在這種情況的地區):
$ cat tst.awk
{ vals[gensub(/chr/,"",1,1)][2][3] }
END {
OFS = " "
PROCINFO["sorted_in"] = "@ind_num_asc"
for (chr in vals) {
for (beg in vals[chr]) {
for (end in vals[chr][beg]) {
print "chr "chr, beg, end
}
}
}
}
$ awk -f tst.awk file
chrX 6234 79593
chrY 204 310
chr1 1 100
chr1 5291 5500
chr2 33 52
chr2 555 1200
chr3 3 200
chr5 11 66
chr7 44 222
CHR7 300 444
chr10 1 500
CHR11 444 2141
CHR16 224 5521
chr21 24 6023
chr22 99 1052
以上使用GNU awk來處理陣列的陣列和排序的_in。
uj5u.com熱心網友回復:
假設:
[:alpha:] [:digit:]*)d23FgE55不會出現)[:alnum:])awk陣列中可能會出現 "記憶體不足 "的錯誤)在輸入中添加一些更多的資料點:
$ cat dat.raw
chr1 1 100
chr10 1 500
chr2 33 52
bec9 3 17
chr5 11 66
CHR22 99 1052
CHR11 444 2141
CHR2 555 1200
chr7 300 444
defA 3 15
def7 13 15
def7 3 15
chr7 44 222
chr21 24 6023
chr16 224 5521
chr3 3 200
chrX 6234 79593
chr1 5291 5500
chrY 204 310
使用GNU awk處理陣列的陣列和PROCINFO["sorted_in"]:
awk'
{ keyL=gensub(/([[:digit:]])/,"", "g",1) # 從第一個欄位中剝離數字
keyN=gensub(/([[:alpha:]])/,"", "g",1) # 從第一個欄位中剝離字母。
if ( keyN == "" ) keyN=999999999 # 如果第一個欄位中沒有數字,則使用一個非常大的數字,以確保這些行被放到串列的最后。
arr[keyL][keyN][$2]=$0 # arr[] index = 第一欄字母/第一欄數字/第二欄
}
END { PROCINFO["sorted_in"]="@ind_str_asc" # 將第一個索引排序為字串
for (i in arr) {
PROCINFO["sorted_in"]="@ind_num_asc" # 將第2/3個索引排序為數字
for (j in arr[i])
for (k in arr[i][j])
print arr[i][j][k]
}
}
' dat.raw
這就產生了:
bec9 3 17
chr1 1 100
chr1 5291 5500
chr2 33 52
chr2 555 1200
chr3 3 200
chr5 11 66
chr7 44 222
CHR7 300 444
chr10 1 500
CHR11 444 2141
CHR16 224 5521
chr21 24 6023
chr22 99 1052
chrX 6234 79593
chrY 204 310
def7 3 15
def7 13 15
defA 3 15
uj5u.com熱心網友回復:
GNU實用程式的版本排序規則要完全實作并不容易。
這里有一個 ruby。它適用于這個例子,但沒有完全實作 GNU -V 方法。它可能比 GNU awk 更容易定制。
ruby -e 'ar=$<.read.split(/)
?
/);
puts ar.sort_by { |v| [v[/^[a-zA-Z] /]] 。 v.scan(/d /)。
map{|e|e.to_i}。}.
join("
")' 檔案
chr1 1 100
chr1 5291 5500
chr2 33 52
chr2 555 1200
chr3 3 200
chr5 11 66
chr7 44 222
CHR7 300 444
chr10 1 500
CHR11 444 2141
CHR16 224 5521
chr21 24 6023
chr22 99 1052
chrX 6234 79593
chrY 204 310
轉載請註明出處,本文鏈接:https://www.uj5u.com/qianduan/328363.html
標籤:
上一篇:無法在pyspark中實體化com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem。
下一篇:如何隨機化div的位置
