I am very new to Linux and bash scripting. I am trying to use a curl command to read an XML file and count the occurrences of the string </entity> in it.
curl -s "https://server:port/app/collection/admin/file?wt=xml&_=12334343432&file=samplefile.xml&contentType=text/xml;charset=utf-8" | grep '</entity>' -oP | wc -l
This works, but the XML file contains comments like the one below, which makes the count wrong.
Sample XML file
.........
........
<entity>
.......
.......
</entity>
........
........
<!--
.......
<entity>
........
</entity>
.......
.......
-->
<entity>
.......
........
</entity>
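The expected output is 2, because one of the matches is inside a comment block.
Reply from a uj5u.com user:
Since you are using GNU grep, here is a PCRE regex solution for your problem: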
curl -s "https://server:port/app/collection/admin/file?wt=xml&_=12334343432&file=samplefile.xml&contentType=text/xml;charset=utf-8" |
grep -ZzoP '(?s)<!--.*?-->(*SKIP)(*F)|</entity>' |
tr '\0' '\n' |
wc -l
2
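Regex and pipeline details:
-z: treat the input as NUL-separated, so the whole file is read as one record and each match is printed NUL-terminated
(?s): enable DOTALL mode so that the dot also matches newlines
<!--.*?-->: match a comment block
(*SKIP)(*F): skip this comment block and fail the match
|: or
</entity>: match </entity> outside a comment block
tr '\0' '\n': convert NUL bytes to newlines
wc -l: count the lines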
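To try the pipeline without a server, the same grep command can be run on inline sample data. This is a minimal sketch that assumes GNU grep built with PCRE support (-P). The function name count_entities_grep and the sample input are made up for illustration.

```shell
# Hypothetical helper: count </entity> tags outside XML comments on stdin.
# Assumes GNU grep with PCRE support (-P).
count_entities_grep() {
  grep -ZzoP '(?s)<!--.*?-->(*SKIP)(*F)|</entity>' |
    tr '\0' '\n' |
    wc -l
}

# Made-up sample that mirrors the layout shown in the question.
count_entities_grep <<'EOF'
<entity>
one
</entity>
<!--
<entity>
commented out
</entity>
-->
<entity>
two
</entity>
EOF
```

In real use the function would sit at the end of the curl pipeline, for example: curl -s "$url" | count_entities_grep, where url is whatever address you are fetching.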
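Reply from a uj5u.com user:
As usual when dealing with XML, regular expressions are the wrong tool for the job. Use something that understands the format. For example, using xmllint and some XPath: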
curl ... | xmllint --xpath 'count(//entity)' -
(Note the trailing -; unlike many programs, xmllint does not automatically read from standard input when no filename is given on the command line.)
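A small self-contained check of the XPath approach on inline data might look like the sketch below. It assumes xmllint (from libxml2) is installed, that the document is well-formed with a single root element, and that the entity elements are not in an XML namespace (a plain //entity would not match namespaced elements). The function name count_entities_xpath and the sample document are made up for illustration.

```shell
# Hypothetical helper: count <entity> elements in the XML document on stdin.
# Comments are not elements, so commented-out entities are not counted.
count_entities_xpath() {
  xmllint --xpath 'count(//entity)' -
}

# Made-up, well-formed sample with one commented-out entity.
if command -v xmllint >/dev/null 2>&1; then
  count_entities_xpath <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <entity>one</entity>
  <!-- <entity>commented out</entity> -->
  <entity>two</entity>
</root>
EOF
else
  echo "xmllint not found; install libxml2 utilities to run this sketch" >&2
fi
```

Because the parser handles comments itself, no special case for comment blocks is needed here.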
Reply from a uj5u.com user:
With your shown samples, please try the following awk code. Written and tested in GNU awk.
your_curl_command |
awk -v RS="" '
match($0,/(^|\n)<!--[^-]*-->/){
val=substr($0,RSTART,RLENGTH)
gsub(val,"")
}
END{
while(match($0,/(\n|^)[[:space:]]*<entity>[^<]*<\/entity>/)){
count++
$0=substr($0,RSTART+RLENGTH)
}
print count
}
'
Explanation: the same code with detailed comments.
your_curl_command | ## Run the curl command and send its output to awk.
awk -v RS="" ' ## Set RS to the empty string (paragraph mode: records are separated by blank lines).
match($0,/(^|\n)<!--[^-]*-->/){ ## Use match() with the regex (^|\n)<!--[^-]*--> (explained below) to find a comment block.
val=substr($0,RSTART,RLENGTH) ## If the regex matches, store the matched substring in val.
gsub(val,"") ## Use gsub (global substitution) to replace every occurrence of val in the current record with an empty string.
}
END{ ## Start of the END block of this awk program.
while(match($0,/(\n|^)[[:space:]]*<entity>[^<]*<\/entity>/)){ ## Loop while the regex (\n|^)[[:space:]]*<entity>[^<]*<\/entity> still matches, so that every match is counted.
count++ ## Add 1 to the count variable.
$0=substr($0,RSTART+RLENGTH) ## Keep only the text after the current match so it is not matched again.
}
print count ## Print the count.
}
'
Explanation of the first regex ((^|\n)<!--[^-]*-->):
(^|\n) ##Matching either starting of value OR new line here.
<!--[^-]* ##Followed by <!-- till next value of - here.
--> ##Followed by --> here.
Explanation of the second regex ((\n|^)[[:space:]]*<entity>[^<]*<\/entity>):
(\n|^) ##Matching new line OR starting of value.
[[:space:]]*<entity> ##Followed by spaces(0 or more occurrence) followed by <entity>
[^<]* ##Followed by matching just before <
<\/entity> ##Followed by </entity> here.
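The answer above works in two steps: remove comment blocks, then count what is left. That two-step idea can also be sketched with plain string functions (index and substr) instead of regexes, which avoids regex metacharacter pitfalls and should behave the same in gawk, mawk, and other POSIX awks. This is a simplified variant, not the answer's code. The function name count_entities_awk and the sample input are made up for illustration.

```shell
# Hypothetical helper: count </entity> tags outside XML comments on stdin.
count_entities_awk() {
  awk '
    { text = text $0 "\n" }                 # collect the whole input
    END {
      # Step 1: cut out every <!-- ... --> block.
      while ((start = index(text, "<!--")) > 0) {
        rest = substr(text, start + 4)      # text after the opening <!--
        stop = index(rest, "-->")
        if (stop == 0) {                    # unterminated comment: drop the tail
          text = substr(text, 1, start - 1)
          break
        }
        text = substr(text, 1, start - 1) substr(rest, stop + 3)
      }
      # Step 2: count the remaining closing tags (9 characters each).
      n = 0
      while ((pos = index(text, "</entity>")) > 0) {
        n++
        text = substr(text, pos + 9)
      }
      print n
    }'
}

# Made-up sample that mirrors the layout shown in the question.
count_entities_awk <<'EOF'
<entity>
one
</entity>
<!--
<entity>
commented out
</entity>
-->
<entity>
two
</entity>
EOF
```

Unlike the regex [^-]* used in the answer, this variant also copes with comments that contain hyphens.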
Reply from a uj5u.com user:
gawk/mawk/mawk2/nawk '
BEGIN {
    FS = RS = "^$"
    _____ = "[<][\\/]entity[>]"
    ____ = "\23\4"
    ___ = "\32"
    __ = ("[\\n][<][!]")(_="[-][-][\\n]")
    sub("......","[\\n]&[>]",_)
}
# Rule(s)
($!-_=gsub(_____,"&",
$(( gsub(__,____)*gsub(_, ___)*\
gsub(____"[^"(___)"]*"___,""))~"")))_'
2
