Why does Amazon API Gateway return the wrong HTML page when used with requests.get(URL)?

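I am currently building a web scraper and have run into the problem of being IP-blocked. To get around this, I am trying requests_ip_rotator, which uses AWS API Gateway's large IP pool as a proxy to generate pseudo-infinite IPs for web scraping. Following this answer, I have implemented it in my code as follows: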

import requests
from bs4 import BeautifulSoup
from requests_ip_rotator import ApiGateway, EXTRA_REGIONS

url = "https://secure.runescape.com/m=hiscore_oldschool_ironman/a=13/group-ironman/?groupSize=5&page=1"

# Plain request from my own IP
page1 = requests.get(url)
soup1 = BeautifulSoup(page1.content, "html.parser")

# Same request routed through AWS API Gateway
gateway = ApiGateway("https://secure.runescape.com/", access_key_id="****", access_key_secret="****")
gateway.start()
session = requests.Session()
session.mount("https://secure.runescape.com/", gateway)
page2 = session.get(url)
gateway.shutdown()
soup2 = BeautifulSoup(page2.content, "html.parser")

# Compare the final URLs and the page titles
print("\n" + page1.url)
print(page2.url)
print(soup1.head.title == soup2.head.title)
input()

Output:

Starting API gateways in 10 regions.
Using 10 endpoints with name 'https://secure.runescape.com/ - IP Rotate API' (10 new).
Deleting gateways for site 'https://secure.runescape.com'.
Deleted 10 endpoints with for site 'https://secure.runescape.com'.

https://secure.runescape.com/m=hiscore_oldschool_ironman/a=13/group-ironman/?groupSize=5&page=1
https://6kesqk9t6d.execute-api.eu-central-1.amazonaws.com/ProxyStage/m=hiscore_oldschool_ironman/a=13/overall
False

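So I call .get(url) twice with the same URL but receive different pages. requests.get(url) gives me the page I want, but session.get(url) through the Amazon gateway does not return that page; it returns a different page from the same site. I am confused about what the problem might be, so any help would be greatly appreciated!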

Answer from a uj5u.com user:

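When making GET requests to the "https://secure.runescape.com" domain through the AWS gateway, I noticed that if the URL path is "a=13/group-ironman/?groupSize=5&page=x" for any x, I receive a 302 (redirect) response that sends me to the URL path "/a=13/overall". This leads me to believe that the RuneScape server redirects AWS IPs for certain URLs, but fortunately it does not redirect my own IP.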
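One way to confirm this kind of redirect is to look at the redirect history that requests keeps on a response. The sketch below is only an illustration: the helper names redirect_chain and was_redirected are made up here, and the code assumes the object passed in behaves like a requests.Response (it has .history, .status_code and .url).

```python
def redirect_chain(response):
    """List (status_code, url) for each redirect hop, then the final response.

    `response` is assumed to look like a requests.Response: `.history` holds
    the intermediate redirect responses, oldest first.
    """
    hops = [(hop.status_code, hop.url) for hop in response.history]
    hops.append((response.status_code, response.url))
    return hops


def was_redirected(response):
    """True if at least one redirect was followed to reach this response."""
    return bool(response.history)
```

Calling redirect_chain(page2) on the gateway response from the question would be expected to show a 302 hop before the final page if the observation above holds, while redirect_chain(page1) should contain only the final entry.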

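So my workaround is to use requests.get() without the AWS gateway for the URLs that get redirected. Other URLs on the same site are not redirected when requested through the AWS gateway, so I still use the gateway for those to avoid being IP-blocked.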
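This workaround could be automated roughly as follows. It is a minimal sketch, not the answerer's actual code: fetch_with_fallback is a hypothetical helper, and both arguments are assumed to behave like requests.Session objects whose .get(url) returns a response with a .history list.

```python
def fetch_with_fallback(url, gateway_session, direct_session):
    """Fetch `url` through the gateway; retry directly if it was redirected.

    Returns a tuple (response, used_gateway).
    """
    response = gateway_session.get(url)
    if response.history:
        # The gateway request was redirected (for example a 302 to another
        # page), so repeat it from our own IP instead.
        return direct_session.get(url), False
    return response, True
```

With the code from the question, gateway_session would be the session that has the ApiGateway mounted and direct_session a plain requests.Session(). Note that this treats every redirect as a sign of gateway-specific behavior, so a URL that legitimately redirects for all clients would also fall back to the direct request and consume requests from your own IP.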


Tags: Python, amazon-web-services, web-scraping, python-requests
