How can I use a bash script to replace not-sampled values with the previous actual values in a time series data file?

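I have time series data in which measurements from different sensors were captured asynchronously in the same ASCII file. The values are space-delimited.

The original file looks like this.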

2022-04-03 21:42:30  10.20      NOTSAMPLED      NOTSAMPLED
2022-04-03 21:45:30  NOTSAMPLED 460     NOTSAMPLED
2022-04-03 21:46:30  NOTSAMPLED NOTSAMPLED      CLOSE
2022-04-03 21:47:30  10.20      NOTSAMPLED      NOTSAMPLED
2022-04-03 21:48:30  NOTSAMPLED 460     NOTSAMPLED
2022-04-03 21:49:30  NOTSAMPLED NOTSAMPLED      CLOSE
2022-04-03 21:50:30  10.19      NOTSAMPLED      NOTSAMPLED
2022-04-03 21:51:30  NOTSAMPLED 460  NOTSAMPLED
2022-04-03 21:52:30  NOTSAMPLED NOTSAMPLED      OPEN
2022-04-03 21:53:30  10.19      NOTSAMPLED      NOTSAMPLED

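Now I need to replace each "NOTSAMPLED" string with the previous actual value from the same sensor column, except where no earlier measurement exists for that sensor yet, as shown below.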

2022-04-03 21:42:30  10.20      NOTSAMPLED      NOTSAMPLED
2022-04-03 21:45:30  10.20      460     NOTSAMPLED
2022-04-03 21:46:30  10.20      460     CLOSE
2022-04-03 21:47:30  10.20      460     CLOSE
2022-04-03 21:48:30  10.20      460     CLOSE
2022-04-03 21:49:30  10.20      460     CLOSE
2022-04-03 21:50:30  10.19      460     CLOSE
2022-04-03 21:51:30  10.19      460     CLOSE
2022-04-03 21:52:30  10.19      460     OPEN
2022-04-03 21:53:30  10.19      460     OPEN

Can this be done with sed/awk or any other bash shell scripting commands?

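A uj5u.com user replied:

Update 1:

Benchmarked on a synthetic version of the sample data with 2.66 GB and 60.4 million rows (pvE0, pvE9, mawk2 and wc4 appear to be the answerer's own shell wrappers and are not standard commands):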

 in0: 2.66GiB 0:00:36 [75.0MiB/s] [75.0MiB/s] [===..====>] 100%            
out9: 2.01GiB 0:00:36 [56.5MiB/s] [56.5MiB/s] [ <=> ]

   60,406,830 lines 2054.705 MB (2154514147) /dev/stdin

% pvE0 < sample3.txt | mawk2 '

       # ____ holds the placeholder; ___ ends up as 2 and OFS as a tab (3^2 = 9)
       BEGIN { ____["NOTSAMPLED"]
             OFS=sprintf("%c",(___=++_+(++_))^--___)
       } {
           if (NR<___) {
                     NF = split($!_,__)
       } else {    _=NF
               do { __[_]=$_=($_ in ____)? \
                    __[_]:$_ } while(___<--_) } }_' | pvE9 | wc4

Input throughput:

75.0 MB/s
~1.66 mn rows/sec

=================================================

    < sample2.txt gawk -e '

      # Row 1 seeds the array __; later rows are filled from field NF down to field 3
      BEGIN { ____["NOTSAMPLED"]
              OFS=sprintf("%c",(___=++_+(++_))^--___)
          } {
              if (NR<___) {
                        NF = split($!_,__)
          } else {    _=NF
                  do { __[_]=$_=($_ in ____) ? \
                       __[_]:$_ } while(___<--_) } }_'

2022-04-03  21:42:30    10.20   NOTSAMPLED  NOTSAMPLED
2022-04-03  21:45:30    10.20   460     NOTSAMPLED
2022-04-03  21:46:30    10.20   460     CLOSE
2022-04-03  21:47:30    10.20   460     CLOSE
2022-04-03  21:48:30    10.20   460     CLOSE
2022-04-03  21:49:30    10.20   460     CLOSE
2022-04-03  21:50:30    10.19   460     CLOSE
2022-04-03  21:51:30    10.19   460     CLOSE
2022-04-03  21:52:30    10.19   460     OPEN
2022-04-03  21:53:30    10.19   460     OPEN

Tested and confirmed working on gawk 5.1.1, mawk 1.3.4, mawk 1.996, and macOS nawk.

--The 4Chan Teller
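
The script above is hard to read because of its underscore-only variable names. As a rough sketch of what it seems to do, here is the same forward fill with descriptive names. The function name forward_fill_tabs and the array name last are made up for illustration and do not come from the answer; the sketch assumes the first two fields are the date and time and that output should be tab-separated, as in the output shown above.

```shell
#!/bin/sh
# Readable sketch of the forward fill; all names here are illustrative only.
forward_fill_tabs() {
    awk '
        BEGIN { OFS = "\t" }
        NR == 1 {
            # First row: remember every field and rebuild the row with tabs
            split($0, last)
            $1 = $1
            print
            next
        }
        {
            # Walk the sensor columns from the last one down to the third
            for (i = NF; i > 2; i--) {
                if ($i == "NOTSAMPLED")
                    $i = last[i]
                last[i] = $i
            }
            $1 = $1   # force a rebuild with OFS even if nothing changed
            print
        }
    ' "$@"
}
```

Calling forward_fill_tabs file.txt, or piping data into forward_fill_tabs, should print each row with placeholders replaced by the most recent value in the same column.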

A uj5u.com user replied:

First of all, this is not an efficient answer, but it gets the job done and shows what is going on.

The file test.txt contains the sample input.

#!/bin/bash
#set -x

# first set the variables for the first run
oldfield1="NOTSAMPLED"
oldfield2="NOTSAMPLED"
oldfield3="NOTSAMPLED"
oldfield4="NOTSAMPLED"
oldfield5="NOTSAMPLED"
NOTSAMPLED="NOTSAMPLED"

while read -r line; do

        # ${line} is deliberately unquoted so runs of spaces collapse before cut
        field1=$(echo ${line}| cut -d ' ' -f 1)
        field2=$(echo ${line}| cut -d ' ' -f 2)
        field3=$(echo ${line}| cut -d ' ' -f 3)
        field4=$(echo ${line}| cut -d ' ' -f 4)
        field5=$(echo ${line}| cut -d ' ' -f 5)

        [[ ${field1} == ${NOTSAMPLED} ]] && field1=${oldfield1}
        [[ ${field2} == ${NOTSAMPLED} ]] && field2=${oldfield2}
        [[ ${field3} == ${NOTSAMPLED} ]] && field3=${oldfield3}
        [[ ${field4} == ${NOTSAMPLED} ]] && field4=${oldfield4}
        [[ ${field5} == ${NOTSAMPLED} ]] && field5=${oldfield5}

        echo "${field1} ${field2} ${field3} ${field4} ${field5}"

        oldfield1="${field1}"
        oldfield2="${field2}"
        oldfield3="${field3}"
        oldfield4="${field4}"
        oldfield5="${field5}"
done <test.txt

Output:

2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:45:30 10.20 460 NOTSAMPLED
2022-04-03 21:46:30 10.20 460 CLOSE
2022-04-03 21:47:30 10.20 460 CLOSE
2022-04-03 21:48:30 10.20 460 CLOSE
2022-04-03 21:49:30 10.20 460 CLOSE
2022-04-03 21:50:30 10.19 460 CLOSE
2022-04-03 21:51:30 10.19 460 CLOSE
2022-04-03 21:52:30 10.19 460 OPEN
2022-04-03 21:53:30 10.19 460 OPEN

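A uj5u.com user replied:

Here is an awk solution that fills in all the NOTSAMPLED fields (starting from field #3):

(Edit: moved NR==1{split($0, filldown)} into the BEGIN block, because it was slowing down the processing of large files.)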

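awk '
    # Read the first row up front: it seeds filldown and is printed single-spaced.
    # Without the print, the row consumed by getline would be missing from the output.
    BEGIN { if ((getline) > 0) { split($0, filldown); $1 = $1; print } }
    {
        for (i = 3; i <= NF; i++)
            if ($i != "NOTSAMPLED")
                filldown[i] = $i      # remember the latest real value
            else
                $i = filldown[i]      # reuse the last real value
    } 1
' file.txt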

2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:45:30 10.20 460 NOTSAMPLED
2022-04-03 21:46:30 10.20 460 CLOSE
2022-04-03 21:47:30 10.20 460 CLOSE
2022-04-03 21:48:30 10.20 460 CLOSE
2022-04-03 21:49:30 10.20 460 CLOSE
2022-04-03 21:50:30 10.19 460 CLOSE
2022-04-03 21:51:30 10.19 460 CLOSE
2022-04-03 21:52:30 10.19 460 OPEN
2022-04-03 21:53:30 10.19 460 OPEN

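Here is a bash version that implements the same logic. Text processing in bash is slow and not considered good practice, but you may find it easier to follow if you are not familiar with awk: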

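#!/bin/bash
{
    # The first row seeds filldown and is printed single-spaced
    read -ra filldown && printf '%s\n' "${filldown[*]}"

    while read -ra fields
    do
        # Index 2 is the third column, i.e. the first sensor
        for ((i = 2; i < ${#fields[@]}; i++))
        do
            if [[ ${fields[i]} != NOTSAMPLED ]]
            then
                filldown[i]=${fields[i]}
            else
                fields[i]=${filldown[i]}
            fi
        done
        printf '%s\n' "${fields[*]}"
    done
} < file.txt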
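
If another file uses a different placeholder string or a different first sensor column, both could be passed to awk as variables instead of being hard-coded. The following is only a sketch under that assumption: the function name fill_down, the awk variables placeholder and first, and the array name last are hypothetical and do not appear in the answers above. Unlike those answers, it does not seed the array from the first row; it just leaves a placeholder alone until a real value has been seen in that column.

```shell
#!/bin/sh
# fill_down PLACEHOLDER FIRST_COLUMN [FILE...]
# Hypothetical helper: forward-fills PLACEHOLDER from column FIRST_COLUMN onward.
fill_down() {
    placeholder=$1
    first=$2
    shift 2
    awk -v placeholder="$placeholder" -v first="$first" '
        {
            for (i = first; i <= NF; i++) {
                if ($i == placeholder) {
                    # Reuse the last real value, if one has been seen in this column
                    if (i in last)
                        $i = last[i]
                } else {
                    last[i] = $i
                }
            }
            $1 = $1   # rebuild every row so the spacing is uniform
            print
        }
    ' "$@"
}
```

For the sample data in the question, the call would be fill_down NOTSAMPLED 3 file.txt.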
