I am testing this code, which tries to download about 120 Excel files from a single URL.
import requests
from bs4 import BeautifulSoup

headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"}
resp = requests.get("https://healthcare.ascension.org/price-transparency/price-transparency-files",headers=headers)
soup = BeautifulSoup(resp.text,"html.parser")

for link in soup.find_all('a', href=True):
    if 'xls' in link['href']:
        print(link['href'])
        url="https://healthcare.ascension.org" + link['href']
        data=requests.get(url)
        print(data)
        output = open(f'C:/Users/ryans/Downloads/{url.split("/")[-1].split(".")[0]}.xls', 'wb')
        output.write(data.content)
        output.close()
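This line: data=requests.get(url)
always gives me a Response [406] result. According to http.cat and Mozilla, HTTP 406 means "Not Acceptable". I'm not sure what is going wrong here, but I expected to download 120 Excel files containing data. Right now I have 120 Excel files on my laptop, but none of them contain any data.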
Answer:
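The HTTP 406 error occurs when no User-Agent is specified; in OP's code the file requests are sent without the headers.
Once that is fixed and the HREFs are parsed appropriately (checked for format and relevance), OP's code should work. However, it will be very slow, because the XLSX files being fetched are many MB in size.
A multithreaded approach therefore improves matters considerably, as follows: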
import requests
from bs4 import BeautifulSoup as BS
from concurrent.futures import ThreadPoolExecutor
import os

HEADERS = {'User-Agent': 'PostmanRuntime/7.29.0'}
TARGET = 'C:/Users/ryans/Downloads'
HOST = 'https://healthcare.ascension.org'

def download(url):
    # Stream one file to disk in 4 KB chunks
    base = os.path.basename(url)
    print(f'Processing {base}')
    with requests.Session() as session:
        (r := session.get(url, headers=HEADERS, stream=True)).raise_for_status()
        with open(os.path.join(TARGET, base), 'wb') as xl:
            for chunk in r.iter_content(chunk_size=4096):
                xl.write(chunk)

with requests.Session() as session:
    (r := session.get(f'{HOST}/price-transparency/price-transparency-files', headers=HEADERS)).raise_for_status()
    soup = BS(r.text, 'lxml')
    urls = []
    for link in soup.find_all('a', href=True):
        href = link['href']
        # Skip javascript: links and make relative links absolute
        if not href.startswith('java'):
            if not href.startswith('http'):
                href = HOST + href
            if href.endswith('xlsx'):
                urls.append(href)

# Exceptions raised inside download() only surface if the results of map() are iterated
with ThreadPoolExecutor() as executor:
    executor.map(download, urls)
print('Done')
Note:
Requires Python 3.8 or later (the code uses the := assignment expression).
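The href handling this answer mentions (checking links for format and relevance) can also be sketched on its own with the standard library's urljoin and urlsplit. This is only an illustration: the function name excel_urls and the extension tuple are assumptions made for the example, and HOST is taken from the code above.

```python
from urllib.parse import urljoin, urlsplit

# HOST matches the answer above; the function name and the extension
# tuple are assumptions made for this sketch.
HOST = 'https://healthcare.ascension.org'


def excel_urls(hrefs, base=HOST, extensions=('.xls', '.xlsx')):
    """Return absolute, de-duplicated URLs for hrefs that point at Excel files."""
    seen = set()
    urls = []
    for href in hrefs:
        # Relative links become absolute; absolute links pass through unchanged.
        absolute = urljoin(base, href.strip())
        parts = urlsplit(absolute)
        # Relevance: skip javascript:, mailto: and other non-HTTP links.
        if parts.scheme not in ('http', 'https'):
            continue
        # Format: test the path only, so a query string cannot hide the extension.
        if not parts.path.lower().endswith(extensions):
            continue
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls
```

The hrefs could come from the same soup.find_all('a', href=True) loop, for example excel_urls(link['href'] for link in soup.find_all('a', href=True)).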
Answer:
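The site appears to filter on the User-Agent. You have already defined the headers in a dictionary; you just need to pass it to requests when calling the get method:
requests.get(url, headers=headers)
It seems that only the User-Agent is checked.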
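Applied to the loop in the question, that fix might look like the sketch below. It assumes session is a requests.Session (or any object with a compatible get method); the function name download_files and its parameters are made up for this example. It also calls raise_for_status(), so a 406 response stops the run instead of being written to disk, which is presumably how the question ended up with files that contain no data.

```python
import os
from urllib.parse import urlsplit


def download_files(session, urls, target_dir, headers):
    """Fetch each URL with the given headers and save it under target_dir."""
    saved = []
    for url in urls:
        # Pass the headers on every request, not only on the index page.
        response = session.get(url, headers=headers)
        # Raise on 4xx/5xx (for example 406) rather than saving an error body.
        response.raise_for_status()
        # Name the file after the last path segment, ignoring any query string.
        filename = os.path.basename(urlsplit(url).path)
        path = os.path.join(target_dir, filename)
        with open(path, 'wb') as output:
            output.write(response.content)
        saved.append(path)
    return saved
```

With requests installed, a call could look like download_files(requests.Session(), urls, 'C:/Users/ryans/Downloads', headers), where urls is the list of absolute links collected from the page.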
