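I'm trying to do web scraping with Python and BeautifulSoup, so I've been following a tutorial, but I got stuck right after successfully completing requests.get(url).
When I define the element I want to extract (the Excel file names shown on the site) by its tag and its class, which contains the string "file-id-..." (where ... is the file's ID), I get an empty list.
My goal is to list all the Excel file names at this URL and later open them using a for loop. All of this is to extract specific monthly data from the national labour office; the data has the same structure throughout the year.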
labour_office_web_text = requests.get("url").text
soup = BeautifulSoup(labour_office_web_text, "lxml")
file_names = soup.find_all('a[class*="file-id-"] ')
file_names
Any suggestions? Thank you!
Answer:
To get all the .xls links from that page, you can use the following example:
import requests
from bs4 import BeautifulSoup

url = "https://www.upsvr.gov.sk/statistiky/nezamestnanost-mesacne-statistiky/2020.html?page_id=971502"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Select every anchor whose href contains ".xls" (this also matches ".xlsx").
for link in soup.select('a[href*=".xls"]'):
    print(link["class"], link["href"])
Prints:
['file-id-1059252'] https://www.upsvr.gov.sk/buxus/docs/statistic/mesacne/2020/MS_2012.xlsx
['file-id-1050892'] https://www.upsvr.gov.sk/buxus/docs/statistic/mesacne/2020/MS_2011.xlsx
['file-id-1042979'] https://www.upsvr.gov.sk/buxus/docs/statistic/mesacne/2020/MS_2010.xlsx
['file-id-1034316'] https://www.upsvr.gov.sk/buxus/docs/statistic/mesacne/2020/MS_2009_okresy.xlsx
['file-id-1027296'] https://www.upsvr.gov.sk/buxus/docs/statistic/mesacne/2020/MS_2008_okresy.xlsx
['file-id-1021527'] https://www.upsvr.gov.sk/buxus/docs/statistic/mesacne/2020/MS_2007_okresy.xlsx
['file-id-1015636'] https://www.upsvr.gov.sk/buxus/docs/statistic/mesacne/2020/MS_2006_okresy.xlsx
['file-id-1009682'] https://www.upsvr.gov.sk/buxus/docs/statistic/mesacne/2020/MS_maj2020_okresy.xlsx
['file-id-1002749'] https://www.upsvr.gov.sk/buxus/docs/statistic/mesacne/2020/MS_apr2020_okresy.xlsx
['file-id-995793'] https://www.upsvr.gov.sk/buxus/docs/statistic/mesacne/2020/MS_mar_2020_okresy.xlsx
['file-id-983937'] https://www.upsvr.gov.sk/buxus/docs/statistic/mesacne/2020/MS_2002_okresy.xlsx
['file-id-971509'] https://www.upsvr.gov.sk/buxus/docs/statistic/mesacne/2020/MS_2001.xlsx
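
As for why the original attempt returned an empty list: find_all() treats its first argument as a tag name, not a CSS selector, so a string like 'a[class*="file-id-"]' matches no tag. select() is the method that evaluates CSS selectors. Below is a minimal offline sketch of the difference, which also pulls the bare file names out of the hrefs. SAMPLE_HTML is a hypothetical stand-in for the real page (the link texts and the extra "other" link are made up for illustration); the two hrefs and class names are copied from the output above.

```python
from posixpath import basename
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<ul>
  <li><a class="file-id-1059252" href="https://www.upsvr.gov.sk/buxus/docs/statistic/mesacne/2020/MS_2012.xlsx">December</a></li>
  <li><a class="file-id-971509" href="https://www.upsvr.gov.sk/buxus/docs/statistic/mesacne/2020/MS_2001.xlsx">January</a></li>
  <li><a class="other" href="/about.html">About</a></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all() compares the string against tag names, so a CSS selector
# string matches nothing and the result is empty.
no_matches = soup.find_all('a[class*="file-id-"]')

# select() evaluates CSS selectors, so the same pattern finds the links.
links = soup.select('a[class*="file-id-"]')

# Take the last path component of each href as the file name.
file_names = [basename(urlparse(link["href"]).path) for link in links]

for name in file_names:
    print(name)
```

From there, the same for loop could hand each link["href"] to whatever reads the workbook; that step is left out here because it depends on the structure of the files.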
