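After a while (after analyzing roughly 100 to 200 products), Amazon and Google identify me as a bot when I scrape them with BeautifulSoup. How can I prevent this?
Changing my IP lets me start again, but after a while they block me again.
Here is my code: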
from bs4 import BeautifulSoup
import requests
cookies_goo = {
    "NID": "511=ktkACo_ZFBfZiD_DvYTKQFmYYX7R3Esh1ZtJ6A3F87KG_YzkbqlHc0NmQsGPyc78KIOXyCtVuYE9QmX-ixl-HzpbE9N9K67sGQCTZ2CFZ1oZAhe-iSFKtCcsUCsY8CHmbDu9YtxaEs7prgZqRID19DI6bqN2lxQZjog8HY6ur_M",
    "1P_JAR": "2021-11-05-13",
    "CONSENT": "YES cb.20211102-08-p0.it FX 548"
}
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36",
    "Accept-Language": "it-IT,it;q=0.9,en-US;q=0.8,en;q=0.7"
}
# `url` is not defined in this snippet; it is the page being scraped.
response = requests.get(url, headers=header, cookies=cookies_goo)
soup = BeautifulSoup(response.content, "lxml")
uj5u.com user reply:
You are a bot, so their algorithm is entirely correct. Try using their API instead.
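uj5u.com user reply:
- Rotate proxies
- Add delays between requests
- Avoid repeating the same pattern
- IP rate limiting (probably your problem)
IP rate limiting is a basic security mechanism that bans or blocks incoming requests from the same IP. The reasoning is that a regular user would not make 100 requests to the same domain within a few seconds, all following exactly the same pattern (for example: scroll, click, scroll, click, open).
Further reading: "How to reduce the chance of being blocked while web scraping search engines".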
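The first two points in the list (rotating proxies and adding delays) can be sketched roughly as follows. This is only a minimal illustration, not a guaranteed way around any site's blocking: the helper names `jittered_delays`, `fetch_all` and `requests_fetch`, the delay bounds, and the proxy addresses in the usage note are all hypothetical and do not come from the original post.

```python
import itertools
import random
import time


def jittered_delays(min_s, max_s):
    """Yield an endless stream of random pauses between min_s and max_s seconds."""
    while True:
        yield random.uniform(min_s, max_s)


def fetch_all(urls, fetch, proxies, min_s=2.0, max_s=6.0, sleep=time.sleep):
    """Fetch each URL in turn, cycling through proxies and pausing in between.

    `fetch` is any callable taking (url, proxy); `sleep` is injectable so the
    pacing can be checked without actually waiting.
    """
    proxy_pool = itertools.cycle(proxies)   # round-robin over the proxy list
    delays = jittered_delays(min_s, max_s)  # irregular pauses, not a fixed beat
    results = []
    for index, url in enumerate(urls):
        if index:  # no pause needed before the very first request
            sleep(next(delays))
        results.append(fetch(url, next(proxy_pool)))
    return results


def requests_fetch(url, proxy):
    """One possible `fetch` built on requests (hypothetical helper)."""
    import requests  # imported lazily so the helpers above work without it

    headers = {"Accept-Language": "en-US,en;q=0.9"}  # placeholder headers
    return requests.get(
        url,
        headers=headers,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )
```

Usage would look something like `fetch_all(product_urls, requests_fetch, ["http://proxy-a.example:8080", "http://proxy-b.example:8080"])`, where both `product_urls` and the proxy addresses are placeholders. Random pauses only make the traffic less regular; a site can still block on other signals.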
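Alternatively, you can use the Google Shopping Results API from SerpApi. It is a paid API with a free plan.
The difference in your case is that you don't have to spend time figuring out how to bypass Google's blocking, since that is already handled for the end user.
Example code to integrate and parse data from Google Shopping (an example is also available in the online IDE):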
import os
from serpapi import GoogleSearch

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google_product",
    "product_id": "14506091995175728218",  # can be iterated over multiple product ids
    "gl": "us",  # country to search from
    "hl": "en"   # language
}

search = GoogleSearch(params)
results = search.get_dict()

title = results['product_results']['title']
prices = results['product_results']['prices']
reviews = results['product_results']['reviews']
rating = results['product_results']['rating']
extensions = results['product_results']['extensions']
description = results['product_results']['description']
user_reviews = results['product_results']['reviews']
reviews_results = results['reviews_results']['ratings']

print(f'{title}\n'
      f'{prices}\n'
      f'{reviews}\n'
      f'{rating}\n'
      f'{extensions}\n'
      f'{description}\n'
      f'{user_reviews}\n'
      f'{reviews_results}')
'''
Google Pixel 4 White 64 GB, Unlocked
['$247.79', '$245.00', '$439.00']
526
3.7
['October 2019', 'Google', 'Pixel Family', 'Pixel 4', 'Android', '5.7″', 'Facial Recognition', '8 MP front camera', 'Smartphone', 'With Wireless Charging']
Point and shoot for the perfect photo. Capture brilliant color and control the exposure balance of different parts of your photos. Get the shot without the flash. Night Sight is now faster and easier to use it can even take photos of the Milky Way. Get more done with your voice. The new Google Assistant is the easiest way to send texts, share photos, and more. A new way to control your phone. Quick Gestures let you skip songs and silence calls – just by waving your hand above the screen. End the robocalls. With Call Screen, the Google Assistant helps you proactively filter our spam before your phone ever rings.
526
[{'stars': 1, 'amount': 101}, {'stars': 2, 'amount': 43}, {'stars': 3, 'amount': 39}, {'stars': 4, 'amount': 73}, {'stars': 5, 'amount': 270}]
'''
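Indexing the response directly, as in the snippet above, raises a `KeyError` if a field is absent. A minimal sketch of a more forgiving lookup, assuming (this is not stated in the original answer) that some products may lack some keys; the helper name `pick_fields` and its default field list are hypothetical:

```python
def pick_fields(results, fields=("title", "prices", "reviews", "rating")):
    """Return the requested product fields, using None for any that are missing."""
    product = results.get("product_results", {})
    return {name: product.get(name) for name in fields}


# A cut-down dict shaped like the response shown above, with two fields left out.
sample = {
    "product_results": {
        "title": "Google Pixel 4 White 64 GB, Unlocked",
        "rating": 3.7,
    }
}
print(pick_fields(sample))
# {'title': 'Google Pixel 4 White 64 GB, Unlocked', 'prices': None, 'reviews': None, 'rating': 3.7}
```

The same helper returns all `None` values when `product_results` is missing entirely, which lets the caller decide whether to skip or retry that product.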
Example of iterating over multiple product IDs:
import os
from serpapi import GoogleSearch

# random numbers except the first one
products = ['14506091995175728218', '1450609199517512118', '145129895175728218']

for product in products:
    # Build the request for the current product id
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google_product",
        "product_id": product,
        "gl": "us",
        "hl": "en"
    }
    search = GoogleSearch(params)
    results = search.get_dict()
    title = results['product_results']['title']
    print(title)  # prints 3 titles from 3 different products
Disclaimer: I work for SerpApi.
