Why does requests.get(url) return <Response [406]> for every request?

I am testing this code, which tries to download about 120 Excel files from a URL.

import requests
from bs4 import BeautifulSoup
headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"}
resp = requests.get("https://healthcare.ascension.org/price-transparency/price-transparency-files",headers=headers)
soup = BeautifulSoup(resp.text,"html.parser")

for link in soup.find_all('a', href=True):
    if 'xls' in link['href']:
        print(link['href'])
        url="https://healthcare.ascension.org" + link['href']
        data=requests.get(url)
        print(data)
        output = open(f'C:/Users/ryans/Downloads/{url.split("/")[-1].split(".")[0]}.xls', 'wb')
        output.write(data.content)
        output.close()

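This line, data=requests.get(url), always gives me a Response [406] result. According to HTTP.CAT and Mozilla, HTTP 406 is the "Not Acceptable" status. I am not sure what is wrong here, but I expected to download 120 Excel files containing data. Right now I have 120 Excel files on my laptop, but none of them contains any data.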

A uj5u.com user replied:

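The HTTP 406 error occurs here because the second request is sent without the custom User-Agent header. In that case requests falls back to its default python-requests/... value, which this site appears to reject.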

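Once that is fixed and the HREFs are resolved properly (for both format and relevance), the OP's code should work. It will be very slow, however, because the XLSX files being fetched are many MB each.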
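
Resolving the HREFs can be sketched with the standard library's urllib.parse.urljoin, which handles both relative and absolute links. This is only a minimal sketch, not the OP's code: the helper name excel_links and the sample hrefs are hypothetical and chosen purely for illustration.

```python
from urllib.parse import urljoin, urlparse

HOST = "https://healthcare.ascension.org"

def excel_links(hrefs, base=HOST):
    """Return absolute URLs for hrefs that point at .xls/.xlsx files."""
    urls = []
    for href in hrefs:
        absolute = urljoin(base, href)  # absolute URLs pass through unchanged
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skips javascript:, mailto: and similar links
        if parsed.path.lower().endswith((".xls", ".xlsx")):
            urls.append(absolute)
    return urls

# Hypothetical hrefs, as they might appear in the page's <a> tags.
sample = [
    "/files/example-a.xlsx",
    "https://example.com/example-b.xls",
    "javascript:void(0)",
    "/about",
]
print(excel_links(sample))
```

Checking the file extension on the parsed path rather than on the raw string means a link with a query string after the file name would still be recognized.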

A multithreaded approach can therefore improve matters considerably, as follows:

import requests
from bs4 import BeautifulSoup as BS
from concurrent.futures import ThreadPoolExecutor
import os

HEADERS = {'User-Agent': 'PostmanRuntime/7.29.0'}
TARGET = 'C:/Users/ryans/Downloads'
HOST = 'https://healthcare.ascension.org'

def download(url):
    base = os.path.basename(url)
    print(f'Processing {base}')
    with requests.Session() as session:
        (r := session.get(url, headers=HEADERS, stream=True)).raise_for_status()
        with open(os.path.join(TARGET, base), 'wb') as xl:
            for chunk in r.iter_content(chunk_size=4096):
                xl.write(chunk)

with requests.Session() as session:
    (r := session.get(f'{HOST}/price-transparency/price-transparency-files', headers=HEADERS)).raise_for_status()
    soup = BS(r.text, 'lxml')
    urls = []
    for link in soup.find_all('a', href=True):
        href = link['href']
        if not href.startswith('java'):
            if not href.startswith('http'):
                href = HOST + href
            if href.endswith('xlsx'):
                urls.append(href)
    with ThreadPoolExecutor() as executor:
        executor.map(download, urls)
    print('Done')

Notes:

Requires Python 3.8 or later (the code uses the := assignment expression).
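
One caveat with executor.map: an exception raised inside a worker (for example by raise_for_status()) only surfaces when the corresponding result is retrieved, and the code above never iterates over the results, so a failed download would go unnoticed. Below is a minimal sketch of one way to collect failures with submit and as_completed. The names fake_download and run_all and the sample file names are hypothetical stand-ins so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_download(url):
    # Hypothetical stand-in for download(); fails for one URL on purpose.
    if url.endswith("bad.xlsx"):
        raise RuntimeError(f"HTTP error for {url}")
    return url

def run_all(func, urls):
    """Run func over urls in a thread pool; return (succeeded, failed)."""
    succeeded, failed = [], {}
    with ThreadPoolExecutor() as executor:
        # Map each future back to its URL so failures can be reported.
        futures = {executor.submit(func, u): u for u in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                succeeded.append(url)
            except Exception as exc:
                failed[url] = exc
    return sorted(succeeded), failed

ok, errors = run_all(fake_download, ["a.xlsx", "bad.xlsx", "c.xlsx"])
print(ok, list(errors))
```

In the real script, run_all(download, urls) would take the place of executor.map(download, urls), and the failed dictionary could be printed or retried afterwards.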

A uj5u.com user replied:

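The site appears to filter on the User-Agent. You have already defined the headers in a dictionary, so you only need to pass it to requests when calling the get method: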

requests.get(url, headers=headers)

It seems that only the User-Agent is checked.
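
To avoid forgetting the headers on later calls, one option is to set them once on a requests.Session, so that every request made through it carries them. The sketch below is illustrative only and assumes the requests package is installed. It prepares a request without sending it, so that the outgoing User-Agent can be inspected offline.

```python
import requests

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"
    )
}

session = requests.Session()
session.headers.update(HEADERS)  # applied to every request made via this session

# Prepare (but do not send) a request to see which headers it would carry.
url = "https://healthcare.ascension.org/price-transparency/price-transparency-files"
prepared = session.prepare_request(requests.Request("GET", url))
print(prepared.headers["User-Agent"])

# For comparison: the value requests sends when no User-Agent is supplied.
print(requests.utils.default_user_agent())
```

With the session configured this way, session.get(url) inside the loop would send the same User-Agent as the first request.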
