Collecting URLs from HTML code with BeautifulSoup

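I collected a list of links from a folder of files that are essentially saved Wikipedia pages. I eventually realized that my list of links is incomplete, because my code only collects some of the links from each Wikipedia page. My goal is to get all the links and then filter them. I should end up with a list of links to train-related accidents. The keywords for such accidents in the links vary (disaster, tragedy, etc.), and I don't know them in advance.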

My input is:

list_of_urls = []

for file in files:
    text = open('files_overview/' + file, encoding="utf-8").read()
    soup = BeautifulSoup(text, features="lxml")
    for item in soup.findAll("div", attrs={'class':'mw-content-ltr'}):
        url = item.find('a', attrs={'class':'href'=="accident"})
        # If I don't add something like "accident", it gives me a syntax error.
        urls = url.get("href")
        urls1 = "https://en.wikipedia.org" + urls
        list_of_urls.append(urls1)

The HTML code from one of my files, which contains several links, is below:

</div><div class="mw-category-generated" lang="en" dir="ltr"><div id="mw-pages">
<h2><span class="anchor" id="Pages_in_category"></span>Pages in category "Railway accidents in 1860"</h2>
<p>The following 3 pages are in this category, out of  3 total. This list may not reflect recent changes (<a href="/wiki/Wikipedia:FAQ/Categorization#Why_might_a_category_list_not_be_up_to_date?" title="Wikipedia:FAQ/Categorization">learn more</a>).
</p><div lang="en" dir="ltr" class="mw-content-ltr"><h3>A</h3>
<ul><li><a href="/wiki/Atherstone_rail_accident" title="Atherstone rail accident">Atherstone rail accident</a></li></ul><h3>B</h3>
<ul><li><a href="/wiki/Bull_bridge_accident" title="Bull bridge accident">Bull bridge accident</a></li></ul><h3>H</h3>
<ul><li><span class="redirect-in-category"><a href="/wiki/Helmshore_rail_accident" class="mw-redirect" title="Helmshore rail accident">Helmshore rail accident</a></span></li></ul></div>
</div></div><noscript><img src="//en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1" alt="" title="" width="1" height="1" style="border: none; position: absolute;" /></noscript>
<div class="printfooter">Retrieved from "<a dir="ltr" href="https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860&amp;oldid=895698968">https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860&amp;oldid=895698968</a>"</div></div>
        <div id="catlinks" class="catlinks" data-mw="interface"><div id="mw-normal-catlinks"

From the above, I manage to get Atherstone_rail_accident, but not Bull_bridge or Helmshore. Does anyone have a better approach?

Thanks for your time.

A uj5u.com user replied:

What happens?

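You iterate over the result set of soup.findAll("div", attrs={'class':'mw-content-ltr'}), which in the HTML shown contains only one <div>, and item.find('a', ...) returns at most one <a> per <div>. That is why you only get the first link.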
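
The difference between find() and find_all() can be seen in isolation in the following minimal sketch. The HTML string is a shortened version of the snippet from the question, and html.parser is used only so the sketch runs without lxml installed; both are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Shortened version of the category listing from the question
html = """
<div lang="en" dir="ltr" class="mw-content-ltr"><h3>A</h3>
<ul><li><a href="/wiki/Atherstone_rail_accident">Atherstone rail accident</a></li></ul><h3>B</h3>
<ul><li><a href="/wiki/Bull_bridge_accident">Bull bridge accident</a></li></ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="mw-content-ltr")

# find() stops at the first matching tag
first_link = div.find("a")["href"]

# find_all() returns every matching tag inside the <div>
all_links = [a["href"] for a in div.find_all("a")]

print(first_link)
print(all_links)
```

Here first_link holds only the Atherstone link, while all_links holds both hrefs.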

Example

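from bs4 import BeautifulSoup

list_of_urls = []
for file in files:  # files: the list of file names from the question
    text = open('files_overview/' + file, encoding="utf-8").read()
    soup = BeautifulSoup(text, features="lxml")

    # Select every <a> nested inside a <div class="mw-content-ltr">
    for a in soup.select('div.mw-content-ltr a'):
        list_of_urls.append(f'https://en.wikipedia.org{a["href"]}')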

How to fix?

Instead of selecting the <div>, select all the links inside the <div> and iterate over them:

for a in soup.select('div.mw-content-ltr a'):
    list_of_urls.append(f'https://en.wikipedia.org{a["href"]}')

Output

['https://en.wikipedia.org/wiki/Atherstone_rail_accident',
 'https://en.wikipedia.org/wiki/Bull_bridge_accident',
 'https://en.wikipedia.org/wiki/Helmshore_rail_accident']
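
If a saved page has other kinds of links inside that <div>, the selector can be narrowed further. The following is only a sketch, assuming that article links always start with /wiki/; the sample HTML, including the external link, is made up for illustration, and html.parser stands in for lxml so the sketch has one dependency less.

```python
from bs4 import BeautifulSoup

# Illustrative HTML: two article links and one external link (made up)
html = """
<div class="mw-content-ltr">
  <ul>
    <li><a href="/wiki/Atherstone_rail_accident">Atherstone rail accident</a></li>
    <li><a href="https://example.org/other">Some external page</a></li>
    <li><a href="/wiki/Bull_bridge_accident">Bull bridge accident</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# a[href^="/wiki/"] matches only <a> tags whose href starts with /wiki/
article_links = [
    a["href"] for a in soup.select('div.mw-content-ltr a[href^="/wiki/"]')
]

print(article_links)
```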

EDIT

To add the https://en.wikipedia.org prefix later in the process, simply skip that step and append only the href to your list:

for a in soup.select('div.mw-content-ltr a'):
    list_of_urls.append(a["href"])

If you want to request the URLs in a second step, you can do it like this:

import requests

for url in list_of_urls:
    response = requests.get(f'https://en.wikipedia.org{url}')

Or, if you just need a list of full URLs, you can build it with a list comprehension:

list_of_urls = [f'https://en.wikipedia.org{a["href"]}' for a in list_of_urls]
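
String concatenation works as long as every href is a site-relative path. As a hedged alternative, the sketch below uses urljoin from the standard library, which also copes with hrefs that are already absolute or protocol-relative. The name BASE_URL and the sample hrefs are assumptions chosen for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org"

# Sample hrefs: site-relative, already absolute, and protocol-relative
hrefs = [
    "/wiki/Atherstone_rail_accident",
    "https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860",
    "//en.wikipedia.org/wiki/Bull_bridge_accident",
]

# urljoin resolves each href against the base and keeps absolute URLs unchanged
full_urls = [urljoin(BASE_URL, href) for href in hrefs]

for url in full_urls:
    print(url)
```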

A uj5u.com user replied:

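You can do it like this.

First, find all <div> elements with the class name mw-content-ltr using .find_all().
For each <div> obtained above, find all <a> tags using .find_all(). This gives you the list of <a> tags inside each <div>.
Iterate over that list of <a> tags and extract the href from each one.

Here is the code.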

from bs4 import BeautifulSoup

s = """
<div class="mw-category-generated" lang="en" dir="ltr">
   <div id="mw-pages">
      <h2><span class="anchor" id="Pages_in_category"></span>Pages in category "Railway accidents in 1860"</h2>
      <p>The following 3 pages are in this category, out of  3 total. This list may not reflect recent changes (<a href="/wiki/Wikipedia:FAQ/Categorization#Why_might_a_category_list_not_be_up_to_date?" title="Wikipedia:FAQ/Categorization">learn more</a>).</p>
      <div lang="en" dir="ltr" class="mw-content-ltr">
         <h3>A</h3>
         <ul>
            <li><a href="/wiki/Atherstone_rail_accident" title="Atherstone rail accident">Atherstone rail accident</a></li>
         </ul>
         <h3>B</h3>
         <ul>
            <li><a href="/wiki/Bull_bridge_accident" title="Bull bridge accident">Bull bridge accident</a></li>
         </ul>
         <h3>H</h3>
         <ul>
            <li><span class="redirect-in-category"><a href="/wiki/Helmshore_rail_accident" class="mw-redirect" title="Helmshore rail accident">Helmshore rail accident</a></span></li>
         </ul>
      </div>
   </div>
</div>
<noscript><img src="//en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1" alt="" title="" width="1" height="1" style="border: none; position: absolute;" /></noscript>
<div class="printfooter">Retrieved from "<a dir="ltr" href="https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860&amp;oldid=895698968">https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860&amp;oldid=895698968</a>"</div>
</div>
<div id="catlinks" class="catlinks" data-mw="interface">
"""
soup = BeautifulSoup(s, 'lxml')

divs = soup.find_all('div', class_='mw-content-ltr')

for div in divs:
    for a in div.find_all('a'):
        print(a['href'])

/wiki/Atherstone_rail_accident
/wiki/Bull_bridge_accident
/wiki/Helmshore_rail_accident
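
The question also mentions filtering the collected links afterwards. A minimal sketch of that step follows. It assumes a hand-picked keyword tuple: KEYWORDS and the helper is_relevant are hypothetical names, and the question states that the real keywords are not known in advance, so the tuple would need to be extended as new terms turn up.

```python
# Hypothetical keyword list taken from the terms mentioned in the question
KEYWORDS = ("accident", "disaster", "tragedy")


def is_relevant(href, keywords=KEYWORDS):
    """Return True if the link path contains any keyword (case-insensitive)."""
    path = href.lower()
    return any(keyword in path for keyword in keywords)


# Sample hrefs as extracted from the HTML above
hrefs = [
    "/wiki/Atherstone_rail_accident",
    "/wiki/Wikipedia:FAQ/Categorization",
    "/wiki/Helmshore_rail_accident",
]

# Keep only the links whose path mentions one of the keywords
filtered = [href for href in hrefs if is_relevant(href)]
print(filtered)
```

A plain substring match is the simplest option here; it will miss articles whose titles use none of the listed words, so the result should be checked against the full link list.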

Please credit the source when reposting. Link to this article: https://www.uj5u.com/caozuo/365330.html
