我正在嘗試從我的 colab 環境中打開以下英國議會網站,但我無法在沒有 403 錯誤的情況下使其正常作業。標頭限制太嚴格。在對先前類似問題的幾個答案之后,我嘗試了更多擴展版本的標題,但仍然無法正常作業。
有什么辦法嗎?
from urllib.request import urlopen, Request
url = "https://members.parliament.uk/members/commons"
headers={'User-Agent': 'Mozilla/5.0'}
request= Request(url=url, headers=headers)
response = urlopen(request)
data = response.read()
較長的標題是這樣的:
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15',
'Accept': 'text/html,application/xhtml xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'
}
uj5u.com熱心網友回復:
該網站受 Cloudflare 保護。正如 Andrew Ryan 已經說明了可能的解決方案。我也使用了 cloudcraper,但沒有作業,仍然得到 403,然后我使用playwright with bs4,現在它就像一個魅力一樣作業。
例子:
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
data = []
with sync_playwright() as p:
browser = p.chromium.launch(headless=False,slow_mo=50)
page = browser.new_page()
page.goto('https://members.parliament.uk/members/commons')
page.wait_for_timeout(5000)
loc = page.locator('div[]')
html = loc.inner_html()
#print(html)
soup = BeautifulSoup(html,"lxml")
#print(soup.prettify())
for card in soup.select('.card.card-member'):
d = {
'Name':card.select_one('.primary-info').get_text(strip=True)
}
data.append(d)
print(data)
輸出:
[{'Name': 'Ms Diane Abbott'}, {'Name': 'Debbie Abrahams'}, {'Name': 'Nigel Adams'}, {'Name': 'Bim Afolami'}, {'Name': 'Adam Afriyie'}, {'Name': 'Nickie Aiken'}, {'Name': 'Peter Aldous'}, {'Name': 'Rushanara Ali'}, {'Name': 'Tahir Ali'}, {'Name': 'Lucy Allan'}, {'Name': 'Dr Rosena Allin-Khan'}, {'Name': 'Mike Amesbury'}, {'Name': 'Fleur Anderson'}, {'Name': 'Lee Anderson'}, {'Name': 'Stuart Anderson'}, {'Name': 'Stuart Andrew'}, {'Name': 'Caroline Ansell'}, {'Name': 'Tonia Antoniazzi'}, {'Name': 'Edward Argar'}, {'Name': 'Jonathan Ashworth'}]
轉載請註明出處,本文鏈接:https://www.uj5u.com/qukuanlian/525882.html
標籤:Python网页抓取http-status-code-403
