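I have time-series data in which measurements from different sensors were captured asynchronously into a single ASCII file. The values are space-separated.
The raw file looks like this: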
2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:45:30 NOTSAMPLED 460 NOTSAMPLED
2022-04-03 21:46:30 NOTSAMPLED NOTSAMPLED CLOSE
2022-04-03 21:47:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:48:30 NOTSAMPLED 460 NOTSAMPLED
2022-04-03 21:49:30 NOTSAMPLED NOTSAMPLED CLOSE
2022-04-03 21:50:30 10.19 NOTSAMPLED NOTSAMPLED
2022-04-03 21:51:30 NOTSAMPLED 460 NOTSAMPLED
2022-04-03 21:52:30 NOTSAMPLED NOTSAMPLED OPEN
2022-04-03 21:53:30 10.19 NOTSAMPLED NOTSAMPLED
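Now I need to replace each "NOTSAMPLED" string with the previous actual value from the same sensor. Where a sensor has no earlier measurement yet, the string stays as it is. The result should look like this: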
2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:45:30 10.20 460 NOTSAMPLED
2022-04-03 21:46:30 10.20 460 CLOSE
2022-04-03 21:47:30 10.20 460 CLOSE
2022-04-03 21:48:30 10.20 460 CLOSE
2022-04-03 21:49:30 10.20 460 CLOSE
2022-04-03 21:50:30 10.19 460 CLOSE
2022-04-03 21:51:30 10.19 460 CLOSE
2022-04-03 21:52:30 10.19 460 OPEN
2022-04-03 21:53:30 10.19 460 OPEN
Can this be done with sed/awk or any other bash shell scripting command?
uj5u.com user reply:
Update 1:
Benchmark results using a synthetic version of the sample data (2.66 GB, 60.4 mn rows):
in0: 2.66GiB 0:00:36 [75.0MiB/s] [75.0MiB/s] [===..====>] 100%
out9: 2.01GiB 0:00:36 [56.5MiB/s] [56.5MiB/s] [ <=> ]
60,406,830 lines 2054.705 MB (2154514147) /dev/stdin
% pvE0 < sample3.txt | mawk2 '
BEGIN { ____["NOTSAMPLED"]
OFS=sprintf("%c",(___=++_+(++_))^--___)
} {
if (NR<___) {
NF = split($!_,__)
} else { _=NF
do { __[_]=$_=($_ in ____)? \
__[_]:$_ } while(___<--_) } }_' | pvE9 | wc4
Input throughput:
75.0 MB/s, ~1.66 mn rows/sec
=================================================
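< sample2.txt gawk -e '
# ____ holds the sentinel as an array key, so "in ____" tests for NOTSAMPLED
BEGIN { ____["NOTSAMPLED"]
# ___ is set to 3, OFS becomes sprintf("%c", 3^2), then ___ drops to 2
OFS=sprintf("%c",(___=++_+(++_))^--___)
} {
# row 1 seeds the fill-down array __; later rows walk fields NF down to 3
if (NR<___) {
NF = split($!_,__)
} else { _=NF
do { __[_]=$_=($_ in ____) ? \
__[_]:$_ } while(___<--_) } }_'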
2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:45:30 10.20 460 NOTSAMPLED
2022-04-03 21:46:30 10.20 460 CLOSE
2022-04-03 21:47:30 10.20 460 CLOSE
2022-04-03 21:48:30 10.20 460 CLOSE
2022-04-03 21:49:30 10.20 460 CLOSE
2022-04-03 21:50:30 10.19 460 CLOSE
2022-04-03 21:51:30 10.19 460 CLOSE
2022-04-03 21:52:30 10.19 460 OPEN
2022-04-03 21:53:30 10.19 460 OPEN
Tested and confirmed working on gawk 5.1.1, mawk 1.3.4, mawk 1.996, and macOS nawk.
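For readers who find the one-liner hard to follow, here is a minimal readable sketch of what it appears to do. It assumes that the output separator computed by sprintf("%c", ...) is character 9 (a tab) and that only fields 3 and up are filled; the function name fill_down_tab is hypothetical, and this is an illustration, not the author's code.

```shell
#!/bin/bash
# fill_down_tab: hypothetical name for a readable take on the one-liner above.
fill_down_tab() {
  awk '
    BEGIN { OFS = "\t" }                  # assumption: the one-liner sets OFS to a tab
    NR == 1 { split($0, last); $1 = $1 }  # seed last[] from row 1, rebuild $0 with OFS
    NR > 1 {
      for (i = NF; i > 2; i--) {          # fields 1-2 (date, time) are never touched
        $i = ($i == "NOTSAMPLED") ? last[i] : $i
        last[i] = $i                      # remember the value now shown in column i
      }
    }
    { print }
  ' "$@"
}

# Usage on two rows of the sample data (reads stdin when no file is given)
printf '%s\n' \
  '2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED' \
  '2022-04-03 21:45:30 NOTSAMPLED 460 NOTSAMPLED' | fill_down_tab
```

Because row 1 seeds the array with whatever it contains, a sensor that has not reported yet keeps showing NOTSAMPLED until its first real value arrives.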
--The 4Chan Teller
uj5u.com user reply:
First of all: this is not an efficient answer, but it gets the job done and shows what is going on.
The file test.txt contains the sample input.
#!/bin/bash
#set -x
# first set the variables for the first run
oldfield1="NOTSAMPLED"
oldfield2="NOTSAMPLED"
oldfield3="NOTSAMPLED"
oldfield4="NOTSAMPLED"
oldfield5="NOTSAMPLED"
NOTSAMPLED="NOTSAMPLED"
# read each line verbatim (no backslash processing, no whitespace trimming)
while IFS= read -r line; do
field1=$(echo "${line}" | cut -d ' ' -f 1)
field2=$(echo "${line}" | cut -d ' ' -f 2)
field3=$(echo "${line}" | cut -d ' ' -f 3)
field4=$(echo "${line}" | cut -d ' ' -f 4)
field5=$(echo "${line}" | cut -d ' ' -f 5)
[[ ${field1} == ${NOTSAMPLED} ]] && field1=${oldfield1}
[[ ${field2} == ${NOTSAMPLED} ]] && field2=${oldfield2}
[[ ${field3} == ${NOTSAMPLED} ]] && field3=${oldfield3}
[[ ${field4} == ${NOTSAMPLED} ]] && field4=${oldfield4}
[[ ${field5} == ${NOTSAMPLED} ]] && field5=${oldfield5}
echo "${field1} ${field2} ${field3} ${field4} ${field5}"
oldfield1="${field1}"
oldfield2="${field2}"
oldfield3="${field3}"
oldfield4="${field4}"
oldfield5="${field5}"
done <test.txt
Output:
2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:45:30 10.20 460 NOTSAMPLED
2022-04-03 21:46:30 10.20 460 CLOSE
2022-04-03 21:47:30 10.20 460 CLOSE
2022-04-03 21:48:30 10.20 460 CLOSE
2022-04-03 21:49:30 10.20 460 CLOSE
2022-04-03 21:50:30 10.19 460 CLOSE
2022-04-03 21:51:30 10.19 460 CLOSE
2022-04-03 21:52:30 10.19 460 OPEN
2022-04-03 21:53:30 10.19 460 OPEN
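uj5u.com user reply:
Here is an awk solution that fills down all NOTSAMPLED fields (starting from field #3):
Edit: moved NR==1{split($0, filldown)} into the BEGIN block, because it slowed down the processing of large files.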
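awk '
    # read row 1 up front, seed filldown[] from it and print it unchanged
    BEGIN { if ((getline) > 0) { split($0, filldown); print } }
    {
        for (i = 3; i <= NF; i++)
            if ($i != "NOTSAMPLED")
                filldown[i] = $i    # remember the latest real value
            else
                $i = filldown[i]    # reuse the previous value
    } 1
' file.txt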
2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:45:30 10.20 460 NOTSAMPLED
2022-04-03 21:46:30 10.20 460 CLOSE
2022-04-03 21:47:30 10.20 460 CLOSE
2022-04-03 21:48:30 10.20 460 CLOSE
2022-04-03 21:49:30 10.20 460 CLOSE
2022-04-03 21:50:30 10.19 460 CLOSE
2022-04-03 21:51:30 10.19 460 CLOSE
2022-04-03 21:52:30 10.19 460 OPEN
2022-04-03 21:53:30 10.19 460 OPEN
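If the sentinel string or the first data column differs in your files, the same idea can be parameterized. The following is only a sketch: the function name fill_down and the awk variables missing and first are illustrative assumptions, not part of the answer above. It also keeps the sentinel for a column that has no earlier value, instead of seeding the array from the first row.

```shell
#!/bin/bash
# fill_down SENTINEL FIRST_COLUMN [FILE...]
# Hypothetical helper; "missing" and "first" are illustrative awk variables.
fill_down() {
  local missing=$1 first=$2
  shift 2
  awk -v missing="$missing" -v first="$first" '
    {
      for (i = first; i <= NF; i++) {
        if ($i != missing) {
          last[i] = $i        # remember the latest real value per column
        } else if (i in last) {
          $i = last[i]        # fill the gap; awk rebuilds $0 with OFS
        }                     # no earlier value yet: keep the sentinel
      }
      print
    }
  ' "$@"
}

# Usage on three rows of the sample data (reads stdin when no file is given)
printf '%s\n' \
  '2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED' \
  '2022-04-03 21:45:30 NOTSAMPLED 460 NOTSAMPLED' \
  '2022-04-03 21:46:30 NOTSAMPLED NOTSAMPLED CLOSE' | fill_down NOTSAMPLED 3
```

Passing the values with -v keeps the awk program itself free of file-specific constants, so the same function can be reused for other log layouts.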
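Here is a bash version implementing the same logic. Text processing with bash is slow and not considered good practice, but it may be easier to follow if you are not familiar with awk: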
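#!/bin/bash
{
    # row 1 seeds the fill-down array and is printed unchanged
    read -ra filldown
    printf '%s\n' "${filldown[*]}"
    while read -ra fields
    do
        # array index 2 is field #3, the first sensor column
        for ((i = 2; i < ${#fields[@]}; i++))
        do
            if [[ ${fields[i]} != NOTSAMPLED ]]
            then
                filldown[i]=${fields[i]}
            else
                fields[i]=${filldown[i]}
            fi
        done
        printf '%s\n' "${fields[*]}"
    done
} < file.txt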
