Using awk to ignore/remove lines from the output if a value in one column exists in another column

I want to check whether a column 2 value exists anywhere in column 3, and vice versa; if it does, I want to remove that line from the output.

sample.txt

3   abc def
5   ghk lmn
8   opq abc
10  lmn rst
15  uvw xyz
4   bcd abc
89  ntz uhg

What I have so far:

awk ' {
    x[$2]=$2
    y[$3]=$3
    if (!(($2 in y) ||( $3 in x)) )
    {
     print $1,$2,$3
    }


} ' sample.txt

I would like the output to be as follows.

15  uvw xyz
89  ntz uhg

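I know awk reads the file line by line, and my code does not work because it cannot check array indexes it has not seen yet, so the first occurrence of a value still gets reported. I would like to know whether this can be done in awk in a simpler way, since my real data set is huge (up to 5 million lines and 400-500 MB). Thanks!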

Answer from a uj5u.com user:

One awk idea that uses two passes of the input file:

awk '

# 1st pass:

FNR==NR { seen[$2]++           # increment our seen counter for $2
          if ($2 != $3)        # do not increment seen[] if $2==$3
             seen[$3]++        # increment our seen counter for $3
          next
        }

# 2nd pass:

seen[$2] <= 1 &&               # if seen[] counts are <= 1 for both
seen[$3] <= 1                  # $2 and $3 then print current line
' sample.txt sample.txt

This generates:

15  uvw xyz
89  ntz uhg

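Copying the first 4 lines over and over until sample.txt contains ~4 million lines, then running this awk script, generates the same 2 lines of output and takes ~3 seconds on my system (a VM running on a low-end 9xxx i7).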
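For anyone who wants to repeat the timing, a test file like the one described could be built along these lines. This is only a sketch: the function name make_big_sample, its arguments, and the default of 1,000,000 copies are placeholders of mine, and it assumes the copies of the first 4 lines are appended after the original lines (so the uvw/xyz and ntz/uhg rows are still present in the big file).

```shell
# Hypothetical helper: write SRC to DEST, followed by COPIES extra
# copies of the first 4 lines of SRC. SRC and DEST must be different files.
make_big_sample() {
    src=$1
    dest=$2
    copies=${3:-1000000}       # assumed default: 4 x 1,000,000 extra lines

    awk -v n="$copies" '
        { print }                        # keep every original line
        NR <= 4 { block[++k] = $0 }      # remember (up to) the first 4 lines
        END {
            for (i = 1; i <= n; i++)     # append the remembered block n times
                for (j = 1; j <= k; j++)
                    print block[j]
        }
    ' "$src" > "$dest"
}
```

Something like `make_big_sample sample.txt big.txt` followed by `time awk '...' big.txt big.txt` should then give a rough comparison on your own hardware.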

Another awk idea that uses some extra memory but only requires a single pass through the input file:

awk '
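    { seen[$2]++                              # count occurrences of $2
      if ($2 != $3)                           # count $3 only if it differs from $2
         seen[$3]++
      if (seen[$2] <= 1 && seen[$3] <= 1)     # still a candidate: save the line
         lines[++c]=$0
    }
END { for (i=1;i<=c;i++) {                    # recheck candidates against final counts
          split(lines[i],arr)
          if (seen[arr[2]] <= 1 && seen[arr[3]] <= 1)
             print lines[i]
      }
    }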
' sample.txt

This also generates:

15  uvw xyz
89  ntz uhg

Performance on this one is going to depend on the number of unique $2/$3 values and thus the amount of memory that has to be allocated/processed. For my 4 million-row sample.txt (where nearly all of the rows are duplicates, so little additional memory is used) the run time comes in at ~1.7 seconds, a tad better than the 2-pass solution (~3 secs), but for real-world data (with a large volume of unique $2/$3 values) I'm guessing the times will be a bit closer.
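One thing to be aware of: both scripts count how often a value occurs across columns 2 and 3 combined, so a value that repeats only within the same column also causes its rows to be dropped. If the requirement is read literally (drop a row only when its column 2 value appears somewhere in column 3, or its column 3 value appears somewhere in column 2), a two-pass variant with one lookup array per column might look like the sketch below. The function name filter_cross_column and the array names col2/col3 are made up for illustration, and note that this reading also drops a row whose own $2 equals its $3, which the scripts above do not.

```shell
# Hypothetical helper: print the rows of FILE whose column 2 value never
# appears in column 3 and whose column 3 value never appears in column 2.
filter_cross_column() {
    awk '
        # 1st pass: record which values occur in each column
        FNR == NR { col2[$2]; col3[$3]; next }

        # 2nd pass: keep the row only if neither value crosses columns
        !($2 in col3) && !($3 in col2)
    ' "$1" "$1"
}
```

For the 7-line sample.txt from the question, `filter_cross_column sample.txt` prints the same two rows (15 uvw xyz and 89 ntz uhg); the two approaches only diverge when a value repeats within a single column.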
