Counting grep matches in a curl response while excluding occurrences inside comments

I am very new to Linux and bash scripting. I am trying to use curl to read an XML file and count the occurrences of the string </entity> in it.

curl -s "https://server:port/app/collection/admin/file?wt=xml&_=12334343432&file=samplefile.xml&contentType=text/xml;charset=utf-8" | grep '</entity>' -oP | wc -l

This works, but the XML file contains comments like the one below, which throw the count off.

Sample XML file:

.........
........
 <entity>
.......
.......
</entity>
........
........
<!--
.......
<entity>
........
</entity>
.......
.......
-->
<entity>
.......
........
</entity>

The expected output is 2, because one of the matches is inside a comment block.

Reply from a uj5u.com user:

Since you are using GNU grep, here is a PCRE regex solution for your problem:

curl -s "https://server:port/app/collection/admin/file?wt=xml&_=12334343432&file=samplefile.xml&contentType=text/xml;charset=utf-8" |
grep -ZzoP '(?s)<!--.*?-->(*SKIP)(*F)|</entity>' |
tr '\0' '\n' |
wc -l

2

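Regex details:

(?s): enables DOTALL mode so that the dot also matches newlines
<!--.*?-->: matches a comment block (non-greedy, so it stops at the first -->)
(*SKIP)(*F): skips the comment block just matched and fails, so nothing inside it is reported
|: or
</entity>: matches </entity> outside a comment block
tr '\0' '\n': converts the NUL bytes that grep -z writes after each match into newlines
wc -l: counts the lines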
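
To try the pipeline without access to the server, the curl call can be replaced by a local sample. The following is a minimal sketch, assuming GNU grep built with PCRE support (the -P option). The sample_xml and count_entities_grep function names, and the sample contents, are hypothetical stand-ins and not part of the original answer.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the curl response: two live <entity> blocks
# and one that is commented out.
sample_xml() {
  cat <<'EOF'
<root>
<entity>
  <field name="a"/>
</entity>
<!--
<entity>
  <field name="b"/>
</entity>
-->
<entity>
  <field name="c"/>
</entity>
</root>
EOF
}

# Count closing </entity> tags on stdin, ignoring any inside <!-- ... -->.
# -z reads the whole input as one record so the pattern can span lines;
# -o prints each match, terminated by NUL because of -z.
count_entities_grep() {
  grep -zoP '(?s)<!--.*?-->(*SKIP)(*F)|</entity>' |
    tr '\0' '\n' |
    wc -l
}

sample_xml | count_entities_grep
```

On this sample the pipeline should report 2, since the middle </entity> sits inside the comment.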

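Reply from a uj5u.com user:

As usual when dealing with XML, regular expressions are the wrong tool for the job. Use something that understands the format, for example xmllint with a bit of XPath: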

curl ... | xmllint --xpath 'count(//entity)' -

(Note the trailing -; unlike many programs, xmllint does not read from standard input automatically when no file name is given on the command line.)
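
The XPath approach can be checked locally in the same way. This is a sketch under a few assumptions: xmllint (shipped with libxml2) is installed and supports --xpath, the document has a single root element, and the elements are not in an XML namespace. The sample_xml and count_entities_xpath names and the sample data are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the curl response. A real XML parser needs
# a single root element, so the sample is wrapped in <root>.
sample_xml() {
  cat <<'EOF'
<root>
<entity>
  <field name="a"/>
</entity>
<!--
<entity>
  <field name="b"/>
</entity>
-->
<entity>
  <field name="c"/>
</entity>
</root>
EOF
}

# Count <entity> elements read from stdin. Commented-out markup is parsed
# as a comment node, not as elements, so it does not contribute.
count_entities_xpath() {
  xmllint --xpath 'count(//entity)' -
}

# xmllint may not be installed everywhere, so check before calling it.
if command -v xmllint >/dev/null 2>&1; then
  sample_xml | count_entities_xpath
else
  echo "xmllint not found" >&2
fi
```

Because //entity selects elements at any depth, nested entity elements would also be counted; a more specific path may be preferable if that matters.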

Reply from a uj5u.com user:

With the samples you have shown, please try the following awk code. Written and tested in GNU awk.

your_curl_command | 
awk -v RS="" '
match($0,/(^|\n)<!--[^-]*-->/){
  val=substr($0,RSTART,RLENGTH)
  gsub(val,"")
}
END{
  while(match($0,/(\n|^)[[:space:]]*<entity>[^<]*<\/entity>/)){
    count++
    $0=substr($0,RSTART+RLENGTH)
  }
  print count
}
'

Explanation: the same code with a comment on every line.

your_curl_command |                ## Run the curl command and pipe its output to awk.
awk -v RS="" '                     ## Empty RS selects paragraph mode: records are separated by blank lines, so input without blank lines is one record.
match($0,/(^|\n)<!--[^-]*-->/){    ## If the record contains a comment block (regex explained below),
  val=substr($0,RSTART,RLENGTH)    ## copy the matched comment text into val
  gsub(val,"")                     ## and delete it from the record (val is used as a regex here).
}
END{                               ## The END block runs after all input is read; in GNU awk $0 still holds the last record.
  while(match($0,/(\n|^)[[:space:]]*<entity>[^<]*<\/entity>/)){  ## Loop while an <entity>...</entity> pair is found in the remaining text.
    count++                        ## Add 1 to count.
    $0=substr($0,RSTART+RLENGTH)   ## Keep only the text after this match so it is not counted again.
  }
  print count                      ## Print the count.
}
'

Explanation of the first regex, (^|\n)<!--[^-]*-->:

(^|\n)    ##Matches either the start of the record OR a newline.
<!--[^-]* ##Followed by <!-- and then any run of characters other than -.
-->       ##Followed by the closing -->.

Explanation of the second regex, (\n|^)[[:space:]]*<entity>[^<]*<\/entity>:

(\n|^)                ##Matches a newline OR the start of the record.
[[:space:]]*<entity>  ##Followed by zero or more whitespace characters and then <entity>.
[^<]*                 ##Followed by any run of characters other than <.
<\/entity>            ##Followed by </entity>.
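
The first regex uses [^-]*, so it only recognises comments whose body contains no hyphen, and gsub(val,"") interprets the comment text as a regex. One possible variation that avoids both is to locate the comment delimiters with index(), which does a plain string search. The sketch below is not the original author's code; it assumes a POSIX awk, and the count_entities_awk name and the sample input are hypothetical.

```shell
#!/usr/bin/env bash
# Count </entity> tags on stdin after removing every <!-- ... --> span.
count_entities_awk() {
  awk '
    { doc = doc $0 "\n" }                 # slurp the whole input into doc
    END {
      # Cut out every comment using plain string search.
      while ((start = index(doc, "<!--")) > 0) {
        rest = substr(doc, start + 4)     # text after the opening delimiter
        stop = index(rest, "-->")
        if (stop == 0) {                  # unterminated comment: drop the tail
          doc = substr(doc, 1, start - 1)
          break
        }
        doc = substr(doc, 1, start - 1) substr(rest, stop + 3)
      }
      # gsub returns the number of replacements, i.e. the number of tags.
      n = gsub(/<\/entity>/, "", doc)
      print n
    }'
}

# Hypothetical sample: the commented block contains a hyphen (x-y).
printf '%s\n' '<entity>' 'a-b' '</entity>' \
  '<!--' '<entity>' 'x-y' '</entity>' '-->' \
  '<entity>' '</entity>' |
  count_entities_awk
```

Like the grep version, this counts closing tags only and does not validate the XML, so it remains a text-level heuristic.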

Reply from a uj5u.com user:

gawk/mawk/mawk2/nawk '
BEGIN {
    FS = RS = "^$"
    _____ = "[<][\\/]entity[>]"
    ____ = "\23\4"
    ___ = "\32"
    __ = ("[\\n][<][!]")(_="[-][-][\\n]")
    sub("......","[\\n]&[>]",_)
}

# Rule(s)

($!-_=gsub(_____,"&",
     $((  gsub(__,____)*gsub(_, ___)*\
          gsub(____"[^"(___)"]*"___,""))~"")))_'
