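I collected a list of links from a folder of files, which are essentially saved Wikipedia pages. I eventually realized that my list of links is incomplete, because my code only collects some of the links from each Wikipedia page. My goal is to get all the links and then filter them. I should end up with a list of links to train-related accidents. The keywords for such accidents in the links vary (disaster, tragedy, and so on), and I don't know them in advance.
My code is: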
list_of_urls = []
for file in files:
    text = open('files_overview/' + file, encoding="utf-8").read()
    soup = BeautifulSoup(text, features="lxml")
    for item in soup.findAll("div", attrs={'class':'mw-content-ltr'}):
        url = item.find('a', attrs={'class':'href'=="accident"})
        # If I don't add something, like "accident", it gives me a syntax error.
        urls = url.get("href")
        urls1 = "https://en.wikipedia.org" + urls
        list_of_urls.append(urls1)
The HTML from one of my files, showing where the multiple links are located, is below:
</div><div class="mw-category-generated" lang="en" dir="ltr"><div id="mw-pages">
<h2><span class="anchor" id="Pages_in_category"></span>Pages in category "Railway accidents in 1860"</h2>
<p>The following 3 pages are in this category, out of 3 total. This list may not reflect recent changes (<a href="/wiki/Wikipedia:FAQ/Categorization#Why_might_a_category_list_not_be_up_to_date?" title="Wikipedia:FAQ/Categorization">learn more</a>).
</p><div lang="en" dir="ltr" class="mw-content-ltr"><h3>A</h3>
<ul><li><a href="/wiki/Atherstone_rail_accident" title="Atherstone rail accident">Atherstone rail accident</a></li></ul><h3>B</h3>
<ul><li><a href="/wiki/Bull_bridge_accident" title="Bull bridge accident">Bull bridge accident</a></li></ul><h3>H</h3>
<ul><li><span class="redirect-in-category"><a href="/wiki/Helmshore_rail_accident" class="mw-redirect" title="Helmshore rail accident">Helmshore rail accident</a></span></li></ul></div>
</div></div><noscript><img src="//en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1" alt="" title="" width="1" height="1" style="border: none; position: absolute;" /></noscript>
<div class="printfooter">Retrieved from "<a dir="ltr" href="https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860&oldid=895698968">https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860&oldid=895698968</a>"</div></div>
<div id="catlinks" class="catlinks" data-mw="interface"><div id="mw-normal-catlinks"
From the above, I managed to get Atherstone_rail_accident, but not Bull_bridge or Helmshore. Does anyone have a better approach?
Thanks for your time.
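Answer:
What happens?
The result set of soup.findAll("div", attrs={'class':'mw-content-ltr'}) contains a single <div>, so your loop runs once, and inside it item.find('a', ...) returns only the first matching <a>. That is why you only get the first link. Note also that 'href'=="accident" is a plain Python comparison that evaluates to False, so it does not filter on the word "accident".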
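To make both effects visible, here is a minimal sketch on a cut-down, made-up HTML snippet (not the asker's actual file). It assumes only that bs4 is installed and uses the built-in html.parser so that lxml is not required. It shows that find() returns a single tag while find_all() returns every match, and that the comparison is evaluated by Python before BeautifulSoup ever sees it.

```python
from bs4 import BeautifulSoup

# Cut-down, hypothetical stand-in for one saved category page.
html = """
<div class="mw-content-ltr">
  <ul><li><a href="/wiki/Atherstone_rail_accident">Atherstone rail accident</a></li></ul>
  <ul><li><a href="/wiki/Bull_bridge_accident">Bull bridge accident</a></li></ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="mw-content-ltr")

# Python evaluates the comparison first, so the dict holds a boolean, not a keyword.
attrs = {"class": "href" == "accident"}

first_link = div.find("a")      # a single Tag: only the first <a>
all_links = div.find_all("a")   # a ResultSet containing every <a> in the div

print(attrs)
print(first_link["href"])
print([a["href"] for a in all_links])
```

Swapping find() for find_all() (or select()) on the inner lookup is the whole fix; the outer loop over the <div> elements can stay as it is.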
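Example
from bs4 import BeautifulSoup

list_of_urls = []
for file in files:
    # Read each saved page from the question's files_overview folder
    with open('files_overview/' + file, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), features="lxml")
    # Select every <a> inside div.mw-content-ltr, not just the first one
    for a in soup.select('div.mw-content-ltr a'):
        list_of_urls.append(f'https://en.wikipedia.org{a["href"]}')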
How to fix?
Instead of selecting the <div>, select all the links inside the <div> and iterate over them:
for a in soup.select('div.mw-content-ltr a'):
    list_of_urls.append(f'https://en.wikipedia.org{a["href"]}')
Output
['https://en.wikipedia.org/wiki/Atherstone_rail_accident',
'https://en.wikipedia.org/wiki/Bull_bridge_accident',
'https://en.wikipedia.org/wiki/Helmshore_rail_accident']
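EDIT
If you want to add the https://en.wikipedia.org prefix later in the process, skip that step for now and append only the href to your list:
for a in soup.select('div.mw-content-ltr a'):
    list_of_urls.append(a["href"])
If you want to request the URLs in a second step, you can do it like this:
for url in list_of_urls:
    response = requests.get(f'https://en.wikipedia.org{url}')
Or, if you just need a list with the full URLs, you can build it with a list comprehension. The list now holds plain href strings, so use each item directly:
list_of_urls = [f'https://en.wikipedia.org{href}' for href in list_of_urls]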
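One caveat with plain string concatenation: if a selector ever picks up an href that is already absolute, the prefix ends up doubled. A possible alternative, sketched here under the assumption that every relative link belongs to en.wikipedia.org, is urljoin from the standard library. The names BASE and to_absolute are made up for this illustration and are not part of the original answer.

```python
from urllib.parse import urljoin

BASE = "https://en.wikipedia.org"  # assumed base for all relative hrefs

def to_absolute(hrefs):
    """Resolve each href against BASE; already-absolute URLs pass through unchanged."""
    return [urljoin(BASE, href) for href in hrefs]

hrefs = [
    "/wiki/Atherstone_rail_accident",                      # relative, as in the category list
    "https://en.wikipedia.org/wiki/Bull_bridge_accident",  # already absolute
]
full_urls = to_absolute(hrefs)
print(full_urls)
```

For the category lists shown in the question, both approaches give the same result; urljoin only matters once mixed relative and absolute links show up.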
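Answer:
You can do it like this.
- First, find all the <div> elements with the class name mw-content-ltr using .find_all().
- For each <div> obtained above, find all the <a> tags using .find_all(). This gives you the list of <a> tags inside each <div>.
- Iterate over that list and extract the href from each <a> tag.
Here is the code.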
from bs4 import BeautifulSoup
s = """
<div class="mw-category-generated" lang="en" dir="ltr">
<div id="mw-pages">
<h2><span class="anchor" id="Pages_in_category"></span>Pages in category "Railway accidents in 1860"</h2>
<p>The following 3 pages are in this category, out of 3 total. This list may not reflect recent changes (<a href="/wiki/Wikipedia:FAQ/Categorization#Why_might_a_category_list_not_be_up_to_date?" title="Wikipedia:FAQ/Categorization">learn more</a>).</p>
<div lang="en" dir="ltr" class="mw-content-ltr">
<h3>A</h3>
<ul>
<li><a href="/wiki/Atherstone_rail_accident" title="Atherstone rail accident">Atherstone rail accident</a></li>
</ul>
<h3>B</h3>
<ul>
<li><a href="/wiki/Bull_bridge_accident" title="Bull bridge accident">Bull bridge accident</a></li>
</ul>
<h3>H</h3>
<ul>
<li><span class="redirect-in-category"><a href="/wiki/Helmshore_rail_accident" class="mw-redirect" title="Helmshore rail accident">Helmshore rail accident</a></span></li>
</ul>
</div>
</div>
</div>
<noscript><img src="//en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1" alt="" title="" width="1" height="1" style="border: none; position: absolute;" /></noscript>
<div class="printfooter">Retrieved from "<a dir="ltr" href="https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860&oldid=895698968">https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860&oldid=895698968</a>"</div>
</div>
<div id="catlinks" class="catlinks" data-mw="interface">
"""
soup = BeautifulSoup(s, 'lxml')
divs = soup.find_all('div', class_='mw-content-ltr')
for div in divs:
    for a in div.find_all('a'):
        print(a['href'])
Output
/wiki/Atherstone_rail_accident
/wiki/Bull_bridge_accident
/wiki/Helmshore_rail_accident
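Neither answer covers the filtering step the question mentions, so here is one way it might look. This is only a sketch under stated assumptions: article links start with /wiki/, namespace pages such as Category: or Wikipedia: contain a colon (a few real article titles do too, so this check is approximate), and the keyword tuple is a placeholder built from the examples in the question. The helper names article_links and keep_by_keyword, and the sample HTML, are made up for illustration.

```python
from bs4 import BeautifulSoup

# Placeholder keywords taken from the question's examples; extend as needed.
KEYWORDS = ("accident", "disaster", "tragedy")

def article_links(html):
    """Return the hrefs of article links inside div.mw-content-ltr, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.select("div.mw-content-ltr a[href]"):
        href = a["href"]
        # Assumption: articles live under /wiki/ and namespace pages contain a colon.
        if href.startswith("/wiki/") and ":" not in href and href not in links:
            links.append(href)
    return links

def keep_by_keyword(hrefs, keywords=KEYWORDS):
    """Keep the hrefs that contain at least one keyword, ignoring case."""
    return [h for h in hrefs if any(k in h.lower() for k in keywords)]

# Made-up sample; only the first href comes from the question's HTML.
sample = """
<div class="mw-content-ltr">
  <a href="/wiki/Atherstone_rail_accident">Atherstone rail accident</a>
  <a href="/wiki/Category:Example_category">made-up subcategory</a>
  <a href="/wiki/Example_rail_disaster">made-up title</a>
  <a href="/wiki/Example_station">made-up title</a>
</div>
"""

links = article_links(sample)
print(links)
print(keep_by_keyword(links))
```

Since the keywords are not known in advance, it may be more practical to collect every article link first, inspect the titles, and only then decide which words to filter on.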
