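I have an HTML file that I am trying to extract data from. The website is https://www.tv2.no/nyheter, and I want to get all the news articles from it.
I download the page like this: wget -O news.html https://www.tv2.no/nyheter
This creates a local file for me.
Then I try to get all the articles that have the class article--nyheter. I tried running this command: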
tr '\n' ' ' < news.html | grep -E "^<article >.*$"
But I get no results. The HTML structure looks like this:
<body>
<div>
<article class="article column large-4 small-12">
hello
</article>
</div>
<article class="article column large-4 small-12 article--nyheter">
<a class="article__link" href="/nyheter/14336304/">
<figure class="image image__responsive" style="padding-bottom:51.312%;">
<img class="image__img lazyload" itemprop="image" title="" alt=""
src="data:image/gif;base64,R0lGODlhEAAJAIAAAP///wAAACH5BAEAAAAALAAAAAAQAAkAAAIKhI py 0Po5yUFQA7"
data-src="https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=177"
data-srcset="https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=688&height=354&compression=92 2x,https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=516&height=265.5&compression=92 1.5x,https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=177&compression=92 1x">
</figure>
<div class="article__content">
<h2 class="article__title t27 tm26">IEA: Mulig ? n? 2-gradersm?let om l?ftene fra Glasgow holdes</h2>
</div>
</a>
</article>
<article class="article column large-4 small-12 article--nyheter">
<a class="article__link" href="/nyheter/14336420/">
<figure class="image image__responsive" style="padding-bottom:115.452%;">
<img class="image__img lazyload" itemprop="image" title="" alt=""
src="data:image/gif;base64,R0lGODlhEAAJAIAAAP///wAAACH5BAEAAAAALAAAAAAQAAkAAAIKhI py 0Po5yUFQA7"
data-src="https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=398"
data-srcset="https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=688&height=796&compression=92 2x,https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=516&height=597&compression=92 1.5x,https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=398&compression=92 1x">
</figure>
<div class="article__content">
<h2 class="article__title t26 tm20">Italienske jegere stoppet p? vei ut av landet med 2.027 nedfryste
troster</h2>
</div>
</a>
</article>
Example output, since both of the following articles contain the class name article--nyheter:
<article class="article column large-4 small-12 article--nyheter">
<a class="article__link" href="/nyheter/14336420/">
<figure class="image image__responsive" style="padding-bottom:115.452%;">
<img class="image__img lazyload" itemprop="image" title="" alt=""
src="data:image/gif;base64,R0lGODlhEAAJAIAAAP///wAAACH5BAEAAAAALAAAAAAQAAkAAAIKhI py 0Po5yUFQA7"
data-src="https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=398"
data-srcset="https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=688&height=796&compression=92 2x,https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=516&height=597&compression=92 1.5x,https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=398&compression=92 1x">
</figure>
<div class="article__content">
<h2 class="article__title t26 tm20">Italienske jegere stoppet p? vei ut av landet med 2.027 nedfryste
troster</h2>
</div>
</a>
</article>
<article class="article column large-4 small-12 article--nyheter">
<a class="article__link" href="/nyheter/14336304/">
<figure class="image image__responsive" style="padding-bottom:51.312%;">
<img class="image__img lazyload" itemprop="image" title="" alt=""
src="data:image/gif;base64,R0lGODlhEAAJAIAAAP///wAAACH5BAEAAAAALAAAAAAQAAkAAAIKhI py 0Po5yUFQA7"
data-src="https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=177"
data-srcset="https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=688&height=354&compression=92 2x,https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=516&height=265.5&compression=92 1.5x,https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=177&compression=92 1x">
</figure>
<div class="article__content">
<h2 class="article__title t27 tm26">IEA: Mulig ? n? 2-gradersm?let om l?ftene fra Glasgow holdes</h2>
</div>
</a>
</article>
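For this I have to use grep, sed, curl, and awk. I cannot use any other parser.
So my expected output is every article tag that has the specific class, together with everything inside those article tags.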
Answer:
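Assumptions:
- there is a valid reason for not using an HTML-centric tool to parse out the desired sections
- the input is formatted the same as in the question, otherwise the proposed sed solution may not work
- extract the <article> ... </article> blocks whose article class entry contains the string article--nyheter
- the OP's expected output lists the two article--nyheter sections in reverse order; for now I will assume this is a typo of some sort and that there is no requirement to reorder the two sections
One idea that uses a sed range to extract the desired data: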
sed -n '/<article class.*article--nyheter/,/<\/article>/p' news.html
This generates:
<article class="article column large-4 small-12 article--nyheter">
<a class="article__link" href="/nyheter/14336304/">
<figure class="image image__responsive" style="padding-bottom:51.312%;">
<img class="image__img lazyload" itemprop="image" title="" alt=""
src="data:image/gif;base64,R0lGODlhEAAJAIAAAP///wAAACH5BAEAAAAALAAAAAAQAAkAAAIKhI py 0Po5yUFQA7"
data-src="https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=177"
data-srcset="https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=688&height=354&compression=92 2x,https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=516&height=265.5&compression=92 1.5x,https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=177&compression=92 1x">
</figure>
<div class="article__content">
<h2 class="article__title t27 tm26">IEA: Mulig ? n? 2-gradersm?let om l?ftene fra Glasgow holdes</h2>
</div>
</a>
</article>
<article class="article column large-4 small-12 article--nyheter">
<a class="article__link" href="/nyheter/14336420/">
<figure class="image image__responsive" style="padding-bottom:115.452%;">
<img class="image__img lazyload" itemprop="image" title="" alt=""
src="data:image/gif;base64,R0lGODlhEAAJAIAAAP///wAAACH5BAEAAAAALAAAAAAQAAkAAAIKhI py 0Po5yUFQA7"
data-src="https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=398"
data-srcset="https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=688&height=796&compression=92 2x,https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=516&height=597&compression=92 1.5x,https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=398&compression=92 1x">
</figure>
<div class="article__content">
<h2 class="article__title t26 tm20">Italienske jegere stoppet p? vei ut av landet med 2.027 nedfryste
troster</h2>
</div>
</a>
</article>
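To see the range address at work on something smaller, here is a minimal sketch. The sample file contents and the helper name extract_nyheter are made up for illustration; only the sed expression comes from the command above. The first article lacks the article--nyheter class, so the range never opens for it.

```shell
#!/bin/sh
# Hypothetical sample input, invented for this demonstration.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
<div>
<article class="article column">
skip me
</article>
</div>
<article class="article column article--nyheter">
keep me
</article>
EOF

# Same range address as above: start printing at a line with an opening
# <article class...> tag that mentions article--nyheter, and stop after
# the next line containing </article>.
extract_nyheter() {
    sed -n '/<article class.*article--nyheter/,/<\/article>/p' "$1"
}

extract_nyheter "$sample"
```

Only the three lines of the second article are printed. Note that the range works line by line, so it relies on the opening tag and the closing tag sitting on separate lines from any unrelated markup.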
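If the input data is not formatted as in the question (e.g., it is missing carriage returns/line feeds), this sed solution may not work; a more "robust" parser would then need to be built (e.g., via awk) ...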
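One possible shape for such a parser, as a sketch rather than a complete HTML parser: flatten the file with tr (as the question attempted), let awk put each article element on its own line, then filter with grep. The question's own attempt finds nothing because, after tr, the whole file is a single line, so ^ only matches at the very start of the file, and the literal text "<article >" (with a space before the >) never occurs. The helper name extract_articles and the sample input are assumptions made for this example, and nested article elements are not handled.

```shell
#!/bin/sh
# Sketch: extraction that does not depend on where the input has line breaks.
# extract_articles FILE [CLASS]   (hypothetical helper name)
# CLASS is interpolated into a regular expression, so it should not
# contain regex metacharacters.
extract_articles() {
    class=${2:-article--nyheter}
    # 1. tr flattens the file into one long line.
    # 2. awk starts a new line before every <article ...> opening tag
    #    and ends the line after every </article> closing tag.
    # 3. grep keeps the lines whose opening tag lists CLASS as a whole
    #    word inside its class attribute.
    tr '\n' ' ' < "$1" |
    awk '{
        gsub(/<article[ >]/, "\n&")
        gsub(/<\/article>/, "&\n")
        print
    }' |
    grep -E "^<article [^>]*class=\"([^\"]* )?${class}[\" ]"
}

# Hypothetical sample with no newlines at all.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '%s' '<div><article class="article column">skip me</article></div><article class="article column article--nyheter"><h2>keep me</h2></article>' > "$sample"

extract_articles "$sample"
```

For the file from the question the call would be extract_articles news.html. Each matching article comes out on a single line, with the original line breaks turned into spaces, so the output is equivalent in content but not identical in layout to the sed result.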
