Most repeated string in a CSV file based on two arguments (Bash)

This is my first time practicing with bash, and it is taking a lot of time...

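I'm trying to create a script that, given 2 arguments (height, weight) for sports.csv, returns the number of matching rows and the predominant nationality among them. As if that weren't enough, if 2 nationalities are tied for predominance, it should echo the one with the lowest id.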

I also can't use awk, grep, sed or csvkit.

Here is the CSV header, followed by a few sample rows:

id,name,nationality,sex,date_of_birth,height,weight,sport,gold,silver,bronze,info
736041664,A Jesus Garcia,ESP,male,1969-10-17,1.72,64,athletics,0,0,0,
532037425,A Lam Shin,KOR,female,1986-09-23,1.68,56,fencing,0,0,0,
435962603,Aaron Brown,CAN,male,1992-05-27,1.98,79,athletics,0,0,1,
521041435,Aaron Cook,MDA,male,1991-01-02,1.83,80,taekwondo,0,0,0,
33922579,Aaron Gate,NZL,male,1990-11-26,1.81,71,cycling,0,0,0,
173071782,Aaron Royle,AUS,male,1990-01-26,1.80,67,triathlon,0,0,0,
266237702,Aaron Russell,USA,male,1993-06-04,2.05,98,volleyball,0,0,1,

What I have so far:

count=0

while IFS=, read -a id _ nation _ _ height weight _ _ _ _; do

    if (( $height == "$2" )) && (( "$weight" == $3 )) ; then

        ((count++))
    fi

done < athletes.csv

echo "$count"

I've seen a similar question, but couldn't find a way to return the most common nationality (a string).

I'm looking for something like:

Count, Predominant_nationality 1.85 130
8460, BRA

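Should I try using arrays instead of doing the whole exercise with loops? Maybe I could use indexing, but it looks like arrays are one-dimensional here?

Any help is a blessing.

A uj5u.com user replied:

This is a sort-and-count problem that can be solved with the standard Linux text utilities.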

csv='athletes.csv'
crit='1\.85,90'

echo "Count Predominant_nationality $crit"
# Get fields from csv, keep rows whose height,weight match, sort by nation then numeric id
cut -d ',' -f 1,3,6,7 "$csv" | grep ",$crit\$" | sort -t ',' -k2,2 -k1,1n | tr ',' ' ' | \
# Count unique skipping first field, get first
uniq -f 1 -c | sort -n -k1,1nr -k2n | head -n1 | tr -s ' ' | \
# print result
cut -d ' ' -f 2,4 --output-delimiter='    '

Result:

Count Predominant_nationality 1.85,90
2    BRA
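
To look at the sort-and-count step in isolation, here is a minimal sketch. The function name most_common and the inline sample values are illustrative assumptions; ties are broken alphabetically here, not by lowest id as in the pipeline above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read one value per line on stdin and print
# "<count> <value>" for the most frequent value.
most_common() {
    local count value
    # sort groups equal lines, uniq -c counts each group,
    # the second sort orders by count (descending), then by value.
    sort | uniq -c | sort -k1,1nr -k2,2 | head -n 1 | {
        read -r count value      # also strips the padding added by uniq -c
        echo "$count $value"
    }
}

# Illustrative input only
printf '%s\n' BRA USA BRA CAN USA BRA | most_common
```

For the sample input this prints `3 BRA`.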

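A uj5u.com user replied:

Some issues with the current code:

read -a tells bash to read the values into an array, but what you really want is to read them into individual variables
read -r is typical in this case (-r disables backslash as an escape character)
the if (( ... )) construct is normally used for integer comparisons, and since height is a non-integer (e.g. 1.85) it is best to stick with string comparisons (especially since we are only interested in equality matches)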
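
Putting these three points together, a minimal sketch of a corrected counting loop might look like the following. The function name count_matches and the temporary sample file are illustrative assumptions, not part of OP's script, and the sketch assumes no field contains an embedded comma.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count rows whose height and weight fields equal the
# given strings. Field positions follow the CSV header shown in the question.
count_matches() {
    local want_height=$1 want_weight=$2 file=$3
    local count=0 id nation height weight
    # read -r: no backslash processing; each field lands in its own variable,
    # and "_" swallows the fields we do not need.
    while IFS=, read -r id _ nation _ _ height weight _; do
        # String comparison, so 1.85 is matched literally, not parsed as a number.
        if [[ $height == "$want_height" && $weight == "$want_weight" ]]; then
            count=$(( count + 1 ))
        fi
    done < "$file"
    echo "$count"
}

# Illustrative data only: a header plus three rows, two of which match.
sample=$(mktemp)
cat > "$sample" <<'EOF'
id,name,nationality,sex,date_of_birth,height,weight,sport,gold,silver,bronze,info
134,Aaron XX1,USA,male,1993-06-04,1.85,130,volleyball,0,0,1,
435962603,Aaron Brown,CAN,male,1992-05-27,1.98,79,athletics,0,0,1,
34,Aaron XX3,USA,male,1993-06-04,1.85,130,volleyball,0,0,1,
EOF
count_matches 1.85 130 "$sample"
rm -f "$sample"
```

With the three sample rows above this prints `2`. The header row needs no special handling because its height field is the literal string "height", which never equals a numeric argument.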

Setup: instead of downloading the linked data file, I'll add 4 dummy rows to OP's sample input, making sure all 4 rows match OP's sample search arguments (1.85 and 130):

$ cat athletes.csv
id,name,nationality,sex,date_of_birth,height,weight,sport,gold,silver,bronze,info
736041664,A Jesus Garcia,ESP,male,1969-10-17,1.72,64,athletics,0,0,0,
532037425,A Lam Shin,KOR,female,1986-09-23,1.68,56,fencing,0,0,0,
435962603,Aaron Brown,CAN,male,1992-05-27,1.98,79,athletics,0,0,1,
521041435,Aaron Cook,MDA,male,1991-01-02,1.83,80,taekwondo,0,0,0,
33922579,Aaron Gate,NZL,male,1990-11-26,1.81,71,cycling,0,0,0,
173071782,Aaron Royle,AUS,male,1990-01-26,1.80,67,triathlon,0,0,0,
266237702,Aaron Russell,USA,male,1993-06-04,2.05,98,volleyball,0,0,1,
134,Aaron XX1,USA,male,1993-06-04,1.85,130,volleyball,0,0,1,
127,Aaron XX2,CAD,male,1993-06-04,1.85,130,volleyball,0,0,1,
34,Aaron XX3,USA,male,1993-06-04,1.85,130,volleyball,0,0,1,
27,Aaron XX4,CAD,male,1993-06-04,1.85,130,volleyball,0,0,1,

One bash idea:

arg1="1.85"
arg2="130"

maxid=99999999999

unset counts ids maxcount
declare -A counts ids
maxcount=0

while IFS=, read -r id _ nation _ _ height weight _
do
    if [[ "${height}" == "${arg1}" && "${weight}" == "${arg2}" ]]
    then
        (( counts[${nation}]++ ))

        # keep track of overall max count

        [[ "${counts[${nation}]}" -gt "${maxcount}"  ]] && maxcount="${counts[${nation}]}"

        # keep track of min(id) for each nation

        [[ "${id}" -lt "${ids[${nation}]:-${maxid}}" ]] && ids[${nation}]="${id}"
    fi
done < athletes.csv

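Alternatively, since our two search values sit next to each other and can only occur in one place in a row, we can use grep to filter out just the matching rows: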

$ grep ",${arg1},${arg2}," athletes.csv
134,Aaron XX1,USA,male,1993-06-04,1.85,130,volleyball,0,0,1,
127,Aaron XX2,CAD,male,1993-06-04,1.85,130,volleyball,0,0,1,
34,Aaron XX3,USA,male,1993-06-04,1.85,130,volleyball,0,0,1,
27,Aaron XX4,CAD,male,1993-06-04,1.85,130,volleyball,0,0,1,

We can then feed this result to the while/read loop and eliminate the need to test the height/weight variables, e.g.:

while IFS=, read -r id _ nation _
do
   (( counts[${nation}]++ ))
   [[ "${counts[${nation}]}" -gt "${maxcount}"  ]] && maxcount="${counts[${nation}]}"
   [[ "${id}" -lt "${ids[${nation}]:-${maxid}}" ]] && ids[${nation}]="${id}"
done < <(grep ",${arg1},${arg2}," athletes.csv)

At this point both while/read loops produce:

$ typeset -p counts ids maxcount
declare -A counts=([USA]="2" [CAD]="2" )
declare -A ids=([USA]="34" [CAD]="27" )
declare -- maxcount="2"

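From here OP can loop through the list of nations ("${!counts[@]}") looking for counts equal to maxcount and, when one is found, apply an additional check to see whether that nation has the lowest id (ids[]) seen so far in the loop. At the end of the loop OP should have the nation that a) has a count equal to maxcount and b) has the lowest id.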
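
That final loop could be sketched roughly as follows. The function name pick_predominant and the hard-coded arrays (copied from the typeset -p output above) are assumptions for illustration; associative arrays require bash 4 or later.

```bash
#!/usr/bin/env bash
# State reproduced from the `typeset -p counts ids maxcount` output above.
declare -A counts=([USA]=2 [CAD]=2)
declare -A ids=([USA]=34 [CAD]=27)
maxcount=2

# Hypothetical helper: print the nation whose count equals maxcount,
# breaking ties by the lowest id recorded for each nation.
pick_predominant() {
    local nation best_nation="" best_id=""
    for nation in "${!counts[@]}"; do
        # skip nations that do not have the highest count
        [[ ${counts[$nation]} -eq $maxcount ]] || continue
        # first candidate, or a candidate with a lower minimum id than the best so far
        if [[ -z $best_id || ${ids[$nation]} -lt $best_id ]]; then
            best_nation=$nation
            best_id=${ids[$nation]}
        fi
    done
    echo "$best_nation"
}

echo "Count, Predominant_nationality"
echo "${maxcount}, $(pick_predominant)"
```

With the sample state this prints `2, CAD`, since USA and CAD are tied at 2 and CAD has the lower id (27 versus 34).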

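A uj5u.com user replied:

You can try rq (https://github.com/fuyuncat/rquery/releases).
counta(;1) does the counting, mina(;id) returns the minimum id, f height=@h and weight=@w filters the records matching the given arguments, and e @2=@3 trim @1, @4, @h, @w matches the minimum id and displays the result.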

[ rquery]$ ./rq -n -v "h:1.85;w:130" -q "p d/,/\"\"/ | s counta(;1) ,mina(;id),id, nationality | f height=@h and weight=@w | e @2=@3 trim @1, @4, @h, @w " samples/athletes.csv
2       HON     1.85    130
