How do I extract the href and the link text from <a> tags in the shell?

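I have many HTML files with a lot of different content, and I always extract parts of them with a tool called pup. The extract sometimes contains tags that look like this: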

<a href="https://www.stackoverflow.com" class="someclasses">anchor text</a>

... or like this:

<a class="someclasses" href="https://www.duckduckgo.com" target="_blank" js-class>
Visit Duck Duck Go!
</a>

... or even like this:

<a class="someclasses"
    href="mailto:[email protected]" js-class>
    email

</a>

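What I want to do is ...

... extract the href value and the anchor text (the text between <a ...> and </a>).
... put the two extracts on separate lines, but in reverse order: first the text, then the href value.
... put three characters in front of each href value: "=> " (equals sign, greater-than sign, space).

So the result looks like this: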

Visit Duck Duck Go!
=> https://www.duckduckgo.com

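If everything is on one line, as in the first example, I can get what I want with a few chained sed commands and some RegEx, by creating groups/patterns and switching the order in which they are printed. But I have no idea how to get what I want when the anchor tag is spread over several lines. I tried to reach my goal with sed, but had no luck. Yesterday I read in answers to similar questions that sed is not made for dealing with line breaks. Is that true? Can awk do this? Is there any other tool I could use?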

A uj5u.com user replied:

You can parse the HTML fragment with xmllint and an XPath expression:

frag=$(cat <<EOF
<div>
<a 
    href="mailto:[email protected]" js-class>
    email

</a>
<a 
    href="http://example.com">
    URL

</a>
<a 
    href="http://example.com/2">
    URL 2

</a>
</div>
EOF
)


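# xmllint prints each href attribute before the text of its anchor.
while read -r line; do
    # %% strips from the first "=", so URLs that contain "=" are still recognised
    if [ "${line%%=*}" == 'href' ]; then
        url=$(tr -d '"' <<<"${line#*=}")
    elif [ -n "$line" ]; then
       echo "$line"
       echo "=> $url"
    fi
done < <(echo "$frag" | xmllint --recover --html --xpath "//a/text()| //a/@href" -)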

Result:

email
=> mailto:[email protected]
URL
=> http://example.com
URL 2
=> http://example.com/2

xmllint can also be used to parse HTML files directly.
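
As a rough illustration of running xmllint on a file rather than on a shell variable, the sketch below asks for one anchor at a time. The function name links_from_file and the file name page.html are made up for this example. It assumes that xmllint prints a number or string XPath result followed by a newline, and it starts xmllint several times per link, so it is only suited to small files.

```shell
#!/usr/bin/env bash
# Sketch only: links_from_file is a hypothetical helper name.
# It requests one string per xmllint call, so the output does not depend on
# how a given xmllint version separates the nodes of a node set.
links_from_file() {
    local file=$1 n i
    # Number of <a> elements in the document
    n=$(xmllint --html --recover --xpath 'count(//a)' "$file" 2>/dev/null)
    for ((i = 1; i <= ${n:-0}; i++)); do
        # Anchor text with leading/trailing whitespace removed
        xmllint --html --recover --xpath "normalize-space((//a)[$i])" "$file" 2>/dev/null
        printf '=> '
        # href value of the same anchor
        xmllint --html --recover --xpath "string((//a)[$i]/@href)" "$file" 2>/dev/null
    done
}

# Usage (page.html is a placeholder file name):
#   links_from_file page.html
```

Indexing with (//a)[$i] keeps each text paired with the href of the same element, even when an anchor has no text or no href.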

A uj5u.com user replied:

You can try this bash script, although it may not be as efficient as the tools mentioned in the comments.

$ cat input_file
<a class="someclasses"
    href="mailto:[email protected]" js-class>
    email

</a>

<a class="someclasses" href="https://www.duckduckgo.com" target="_blank" js-class>
Visit Duck Duck Go!
</a>

<a href="https://www.stackoverflow.com" class="someclasses">anchor text</a>

#!/usr/bin/env bash

IFS=$'\n'
i=0
count=$(( $(sed -En 's/<.*>(.*)<.*>/\1/g;/<|>/!p' input_file | sed '/^$/d' | wc -l) - 1 ))
while [[ "$i" -le "$count" ]];
    do for f in input_file; do
        first=($(sed -En 's/<.*>(.*)<.*>/\1/g;/<|>/!p' "$f" | sed '/^$/d'))
        second=($(sed -En 's|.*href="(.[^ ]*)".*|\1|p;' "$f"))
        echo "${first[$i]}" $'\n' " => ${second[$i]}"
        ((i++))
    done
done

Output:

email
  => mailto:[email protected]
Visit Duck Duck Go!
  => https://www.duckduckgo.com
anchor text
  => https://www.stackoverflow.com

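A uj5u.com user replied:

"If everything is on one line, as in the first example, I can get what I want with a few chained sed commands and some RegEx, by creating groups/patterns and switching the order in which they are printed. But I have no idea how to get what I want when the anchor tag is spread over several lines."

If you need to keep what you already have, consider removing the newline characters before the actual processing, for example with tr (translate or delete characters).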
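
A minimal sketch of that idea: tr joins the lines, grep -o puts each complete anchor on a line of its own, and a single-line sed expression swaps the two captured groups. The function name extract_links is made up for this example. It assumes GNU sed (for \n in the replacement), double-quoted href values, and anchors that contain no nested tags.

```shell
#!/usr/bin/env bash
# Sketch only: extract_links is a hypothetical helper name.
# Assumes GNU sed (\n in the replacement) and anchors without nested tags.
extract_links() {
    tr '\n' ' ' |                              # join all input lines into one
    grep -o '<a[[:space:]][^>]*>[^<]*</a>' |   # one complete anchor per line
    sed -nE 's#^<a[^>]*[[:space:]]href="([^"]*)"[^>]*>[[:space:]]*([^<]*[^<[:space:]])[[:space:]]*</a>$#\2\n=> \1#p'
}

# Demo with the multi-line anchor from the question
extract_links <<'EOF'
<a class="someclasses" href="https://www.duckduckgo.com" target="_blank" js-class>
Visit Duck Duck Go!
</a>
EOF
```

Because the newlines are gone before sed runs, the same one-line pattern covers both the single-line and the multi-line anchors. It is still a regex over HTML, so unusual markup (single-quoted attributes, tags inside the anchor) will slip through.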

A uj5u.com user replied:

Using GNU awk for multi-char RS, the third argument to match(), and the \s/\S shorthand:

$ cat tst.awk
BEGIN { RS="</a>" }
match($0,/<a[^>]+href="([^"]+).*>\s*(\S.*\S)/,a) {
    print a[2] "!" ORS "=> " a[1]
}

For example, given this input file:

$ cat file
The extract contains sometimes tags which can look like this:

<a href="https://www.stackoverflow.com" class="someclasses">anchor text</a>

... or like this:

<a class="someclasses" href="https://www.duckduckgo.com" target="_blank" js-class>
Visit Duck Duck Go!
</a>

... or even like this:

<a class="someclasses"
    href="mailto:[email protected]" js-class>
    email

</a>

$ awk -f tst.awk file
anchor text!
=> https://www.stackoverflow.com
Visit Duck Duck Go!!
=> https://www.duckduckgo.com
email!
=> mailto:[email protected]

A uj5u.com user replied:

I will assume that the output of pup is well-formed XML, like this:

<root>
<a href="https://www.stackoverflow.com" class="someclasses">anchor text</a>

<a class="someclasses" href="https://www.duckduckgo.com" target="_blank" js-class="x">
Visit Duck Duck Go!
</a>

<a class="someclasses"
    href="mailto:[email protected]" js-class="x">
  email
</a>
</root>

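This means you need a root element, such as the root tag in this example, and every attribute must have a value, which is why I changed js-class to js-class="x".

The xmlstarlet command that extracts what you want is: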

xmlstarlet sel -t -m "//a" -v "normalize-space()" -n -o "== " -v "@href" -n input.xml

The output for the input above is:

anchor text
== https://www.stackoverflow.com
Visit Duck Duck Go!
== https://www.duckduckgo.com
email
== mailto:[email protected]

Since xmlstarlet cannot output a > character, as far as I know, you may want to turn the == string into => by appending a filter to the command, like this:

xmlstarlet sel -t -m "//a" -v "normalize-space()" -n -o "== " -v "@href" -n input.xml | 
sed 's/==/=>/'

Which gives the final result:

anchor text
=> https://www.stackoverflow.com
Visit Duck Duck Go!
=> https://www.duckduckgo.com
email
=> mailto:[email protected]

But once again: do not use regex and sed to process HTML files.

Tags: regex bash shell awk sed
