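I am looking for a way to grep largefile.txt for the words in queryfile.txt. However, instead of outputting/saving the whole line in which each query word is found, I want to save only that query word plus a second word. I only know the beginning of that second word (e.g. "ABC"), but I know for sure that it appears on the same line as the first word.
For example, if queryfile.txt contains: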
this
next
and largefile.txt has the following lines:
this is the first line with an ABCword # contents of first line will be saved
and there is an ABCword2 in this one as well # contents of 2nd line will be saved
and the next line has an ABCword2 too # contents of this line will be saved as well
third line has an ABCword3 # contents of this line won't
(Note that every line of largefile.txt always contains a word that starts with ABC. A query word can never start with "ABC" either.)
The saved file should look like:
this ABCword1
this ABCword2
next ABCword2
So far I have looked at suggestions from other similar posts, which combine grep and awk with a command similar to:
LC_ALL=C grep -f queryfile.txt largefile.txt | awk -F"," '$2~/ABC/' > results.txt
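The problem is that not only is the query word not saved, but the -F"," '$2~/ABC/' part also doesn't seem to be the right way to get the word starting with 'ABC'.
I also found a way that uses only awk, but I still haven't managed to adapt the code to save word #2 instead of the whole line: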
awk 'FNR==NR{A[$1]=$1;next} ($1 in A){print}' queryfile.txt largefile.txt > results.txt
Reply from a uj5u.com user:
Second attempt, based on the updated sample input/output:
$ cat tst.awk
FNR==NR { words[$1]; next }
{
    queryWord = otherWord = ""
    for (i=1; i<=NF; i++) {
        if ( $i in words ) {
            queryWord = $i
        }
        else if ( $i ~ /^ABC/ ) {
            otherWord = $i
        }
    }
    if ( (queryWord != "") && (otherWord != "") ) {
        print queryWord, otherWord
    }
}
$ awk -f tst.awk queryfile.txt largefile.txt
this ABCword
next ABCword2
Original answer:
This may be what you're trying to do (untested):
awk '
    FNR==NR { word2lgth[$1] = length($1); next }
    ($1 in word2lgth) && (match(substr($0,word2lgth[$1]+1),/ ABC[[:alnum:]_]+/) ) {
        # RSTART is relative to the substring, which begins at character word2lgth[$1]+1
        print substr($0,1,word2lgth[$1]+RSTART+RLENGTH-1)
    }
' queryfile.txt largefile.txt > results.txt
Reply from a uj5u.com user:
Given:
cat large_file
this is the first line with an ABCword
and the next line has an ABCword2 too CRABCAKE
third line has an ABCword3
ABCword4 and this is behind
cat query_file
this
next
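(The comments on each line of your large_file have been removed; otherwise ABCword3 would print, because there is a "this" in its comment.)
You can actually do this entirely with GNU sed, plus tr to manipulate the query file: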
pat=$(gsed -E 's/^(.+)$/\\b\1\\b/' query_file | tr '\n' '|' | gsed 's/|$//')
gsed -nE "s/.*(${pat}).*(\<ABC[a-zA-Z0-9]*).*/\1 \2/p; s/.*(\<ABC[a-zA-Z0-9]*).*(${pat}).*/\1 \2/p" large_file
Prints:
this ABCword
next ABCword2
ABCword4 this
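To see what the first command produces, the pattern can be built and printed on its own. The sketch below is a slightly simplified variant that joins the lines with paste, so no trailing | has to be removed afterwards. It assumes the query file has no blank lines, and qf is a hypothetical scratch file standing in for query_file.

```shell
# Hypothetical scratch file standing in for query_file.
qf=$(mktemp)
printf 'this\nnext\n' > "$qf"

# Wrap each word in \b...\b, then join the lines with "|".
# paste -s puts no separator after the last line, so nothing needs trimming.
pat=$(sed 's/.*/\\b&\\b/' "$qf" | paste -sd'|' -)
printf '%s\n' "$pat"    # prints: \bthis\b|\bnext\b
rm -f "$qf"
```

The resulting string is what gets interpolated into the two s/// commands as the alternation of whole query words.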
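Reply from a uj5u.com user:
This one assumes that your query file has more entries than there are words on a line of the large file. Also, it does not treat your comments as comments but processes them as regular data, so if you cut and paste the sample, the third record is a match as well.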
$ awk '
NR==FNR {                              # process queryfile
    a[$0]                              # hash those query words
    next
}
{                                      # process largefile
    for(i=1;i<=NF && !(f1 && f2);i++)  # iterate until both words found
        if(!f1 && ($i in a))           # f1 holds the matching query word
            f1=$i
        else if(!f2 && ($i~/^ABC/))    # f2 holds the ABC starting word
            f2=$i
    if(f1 && f2)                       # if both were found
        print f1,f2                    # output them
    f1=f2=""
}' queryfile largefile
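If the trailing comments should be ignored rather than treated as data, one possible variation is to stop scanning a line at the first standalone # field. This is only a sketch: it assumes a comment is always introduced by a # surrounded by spaces, and the scratch directory and the out variable are placeholders added for illustration.

```shell
# Scratch directory with sample files (names are placeholders).
tmpdir=$(mktemp -d)
printf 'this\nnext\n' > "$tmpdir/queryfile"
cat > "$tmpdir/largefile" <<'EOF'
this is the first line with an ABCword # contents of this line will be saved
and the next line has an ABCword2 too # contents of this line will be saved as well
third line has an ABCword3 # contents of this line won't
EOF

out=$(awk '
NR==FNR { a[$0]; next }                  # hash the query words
{
    f1 = f2 = ""
    for (i = 1; i <= NF; i++) {
        if ($i == "#") break             # assumption: a lone "#" field starts a comment
        if (f1 == "" && ($i in a)) f1 = $i
        else if (f2 == "" && $i ~ /^ABC/) f2 = $i
    }
    if (f1 != "" && f2 != "") print f1, f2
}' "$tmpdir/queryfile" "$tmpdir/largefile")
printf '%s\n' "$out"
rm -rf "$tmpdir"
```

With this change the third sample line no longer matches, because the "this" inside its comment is never inspected.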
Reply from a uj5u.com user:
Using sed in a while loop:
$ cat queryfile.txt
this
next
$ cat largefile.txt
this is the first line with an ABCword # contents of this line will be saved
and the next line has an ABCword2 too # contents of this line will be saved as well
third line has an ABCword3 # contents of this line won't
$ while read -r line; do sed -n "s/.*\($line\).*\(ABC[^ ]*\).*/\1 \2/p" largefile.txt; done < queryfile.txt
this ABCword
next ABCword2
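Note that this loop matches each query word as a substring (so this would also match inside thistle) and accepts ABC anywhere inside a word. A possible tightening is sketched below. It assumes words are separated by single spaces and that query words contain no regex metacharacters; the thistle line and the scratch file locations are made up for illustration.

```shell
# Scratch copies of the two files (tmpdir is a hypothetical location).
tmpdir=$(mktemp -d)
printf 'this\nnext\n' > "$tmpdir/queryfile.txt"
cat > "$tmpdir/largefile.txt" <<'EOF'
this is the first line with an ABCword
thistle grows near an ABCword9
and the next line has an ABCword2 too
EOF

result=$(
  while read -r word; do
    # \(.* \)\{0,1\} is an optional run of text ending in a space, so both the
    # query word and the ABC word must begin at the start of a word, and the
    # query word must be followed by a space.
    sed -n "s/^\(.* \)\{0,1\}\($word\) \(.* \)\{0,1\}\(ABC[^ ]*\).*/\2 \4/p" "$tmpdir/largefile.txt"
  done < "$tmpdir/queryfile.txt"
)
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Like the original loop, this still reads largefile.txt once per query word and only finds a query word that comes before the ABC word.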
