我有一個包含數十億條記錄和標題的輸入檔案。標題由元資訊、總行數和第六列的總和組成。我將檔案拆分為小尺寸,因此我的標題記錄必須隨著第六列和總行數的總和而更新。
這是樣本記錄
檔案名:testFile.text
00|STMT|08-09-2022 13:24:56||5|13.10|SHA2
10|000047290|8ddcf4b2356dfa7f326ca8004a9bdb6096330fc4f3b842a971deaf660a395f65|18-01-2020|12:36:57|3.10|00004729018-01-20201|APP
10|000052736|cce280392023b23df2a00ace4b82db8eb61c112bb14509fb273c523550059317|07-02-2017|16:27:49|2.00|00005273607-02-20171|APP
10|000070355|f2e86d2731d32f9ce960a0f5883e9b688c7e57ab9c2ead86057f98426407d87a|17-07-2019|20:25:02|1.00|00007035517-07-20192|APP
10|000070355|54c1fc2667e160a11ae1dbf54d3ba993475cd33d6ececdd555fb5c07e64a241b|17-07-2019|20:25:02|5.00|00007035517-07-20192|APP
10|000072420|f5dac143082631a1693e0fb5429d3a185abcf3c47b091be2f30cd50b5cf4be11|14-06-2021|20:52:21|2.00|00007242014-06-20212|APP
預期的:
檔案名:testFile_1.text
00|STMT|08-09-2022 13:24:56||3|6.10|SHA2
10|000047290|8ddcf4b2356dfa7f326ca8004a9bdb6096330fc4f3b842a971deaf660a395f65|18-01-2020|12:36:57|3.10|00004729018-01-20201|APP
10|000052736|cce280392023b23df2a00ace4b82db8eb61c112bb14509fb273c523550059317|07-02-2017|16:27:49|2.00|00005273607-02-20171|APP
10|000070355|f2e86d2731d32f9ce960a0f5883e9b688c7e57ab9c2ead86057f98426407d87a|17-07-2019|20:25:02|1.00|00007035517-07-20192|APP
檔案名:testFile_2.text
00|STMT|08-09-2022 13:24:56||2|7.00|SHA2
10|000070355|54c1fc2667e160a11ae1dbf54d3ba993475cd33d6ececdd555fb5c07e64a241b|17-07-2019|20:25:02|5.00|00007035517-07-20192|APP
10|000072420|f5dac143082631a1693e0fb5429d3a185abcf3c47b091be2f30cd50b5cf4be11|14-06-2021|20:52:21|2.00|00007242014-06-20212|APP
我能夠拆分檔案并計算總和,但無法替換標題部分中的值。這是我制作的腳本
#!/bin/bash
splitRowCount=$1
transactionColumn=$2
filename=$(basename -- "$3")
extension="${filename##*.}"
nameWithoutExt="${filename%.*}"
echo "splitRowCount: $splitRowCount"
echo "transactionColumn: $transactionColumn"
awk 'NR == 1 { head = $0 } NR % '$splitRowCount' == 2 { filename = "'$nameWithoutExt'_" int((NR-1)/'$splitRowCount') 1 ".'$extension'"; print head > filename } NR != 1 { print >> filename }' $filename
ls *.txt | while read line
do
firstLine=$(head -n 1 $line);
awk -F '|' 'NR !=1 {sum = '$transactionColumn'}END {print sum} ' $line
done
uj5u.com熱心網友回復:
這是awk將原始檔案拆分為n記錄檔案的解決方案。這個想法是累積記錄直到達到給定的計數,然后生成一個帶有更新的標題和累積記錄的檔案:
n=3
file=./testFile.text
awk -v numRecords="$n" '
BEGIN {
FS = OFS = "|"
if ( match(ARGV[1],/[^\/]\.[^\/]*$/) ) {
filePrefix = substr(ARGV[1],1,RSTART)
fileSuffix = substr(ARGV[1],RSTART 1)
} else {
filePrefix = ARGV[1]
fileSuffix = ""
}
if (getline headerStr <= 0)
exit 1
split(headerStr, headerArr)
}
(NR-2) % numRecords == 0 && recordsCount {
outfile = filePrefix "_" filesCount fileSuffix
print headerArr[1],headerArr[2],headerArr[3],headerArr[4],recordsCount,recordsSum,headerArr[7] > outfile
printf("%s", records) > outfile
close(outfile)
records = ""
recordsCount = recordsSum = 0
}
{
records = records $0 ORS
recordsCount
recordsSum = $6
}
END {
if (recordsCount) {
outfile = filePrefix "_" filesCount fileSuffix
print headerArr[1],headerArr[2],headerArr[3],headerArr[4],recordsCount,recordsSum,headerArr[7] > outfile
printf("%s", records) > outfile
close(outfile)
}
}
' "$file"
使用給定的樣本,您將獲得:
testFile_1.text
00|STMT|08-09-2022 13:24:56||3|6.1|SHA2
10|000047290|8ddcf4b2356dfa7f326ca8004a9bdb6096330fc4f3b842a971deaf660a395f65|18-01-2020|12:36:57|3.10|00004729018-01-20201|APP
10|000052736|cce280392023b23df2a00ace4b82db8eb61c112bb14509fb273c523550059317|07-02-2017|16:27:49|2.00|00005273607-02-20171|APP
10|000070355|f2e86d2731d32f9ce960a0f5883e9b688c7e57ab9c2ead86057f98426407d87a|17-07-2019|20:25:02|1.00|00007035517-07-20192|APP
testFile_2.text
00|STMT|08-09-2022 13:24:56||2|7|SHA2
10|000070355|54c1fc2667e160a11ae1dbf54d3ba993475cd33d6ececdd555fb5c07e64a241b|17-07-2019|20:25:02|5.00|00007035517-07-20192|APP
10|000072420|f5dac143082631a1693e0fb5429d3a185abcf3c47b091be2f30cd50b5cf4be11|14-06-2021|20:52:21|2.00|00007242014-06-20212|APP
uj5u.com熱心網友回復:
使用您顯示的示例,請嘗試以下awk代碼(在 GNU 中撰寫和測驗awk)。在這里,我定義了awk名為的變數fileInitials,其中包含輸出檔案的初始名稱,例如:testFile然后extension包含輸出檔案的擴展名,例如:.txthere。然后是lines你想要在輸出檔案中有多少行的值。
您不需要運行 shell awk代碼,這可以awk像下面顯示的那樣一次性完成。
awk -v count="1" -v fileInitials="testFile" -v extension=".txt" -v lines="3" '
BEGIN { FS=OFS="|" }
FNR==1{
match($0,/^([^|]*\|[^|]*\|[^|]*\|[^|]*\|[^|]*)\|[^|]*(.*)/,arr)
header1=arr[1]
header2=arr[2]
outputFile=(fileInitials count extension)
next
}
{
if(prev!=count){
print (header1,sum header2 ORS val) > (outputFile)
close(outputFile)
outputFile=(fileInitials count extension)
sum=0
val=""
}
sum =$6
val=(val?val ORS:"") $0
prev=count
count=( countline%lines==0? count:count)
}
END{
if(count && val){
print (header1,sum header2 ORS val) > (outputFile)
close(outputFile)
}
}
' Input_file
轉載請註明出處,本文鏈接:https://www.uj5u.com/gongcheng/512846.html
