Web scraping with requests/selenium/cloudscraper returns empty values

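I am trying to collect information from a website that I believe is protected by Cloudflare. I tried three approaches, and all of them return an empty value. So I don't know whether the site is blocking me or whether I am doing something wrong.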

- Update

The solution proposed by F.Hoque works; however, when I try to use it in Colab, I only get an empty value.

Using requests

import requests
import re
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.portaldoholanda.com.br/assaltante-surra/com-pedacos-de-madeira-populares-dao-surra-em-homem-assalt'

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

soup.find('h1',class_="noticia titulo").text # I tried with select too (soup.select('[]'))

Using cloudscraper

import cloudscraper

# Create the scraper session; without this line `scraper` is undefined
scraper = cloudscraper.create_scraper()

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
soup = BeautifulSoup(scraper.get(url, headers=headers).content, "html.parser")

soup.find('h1',class_="noticia titulo").text

Using Selenium

import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import InvalidSessionIdException
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument('--ignore-certificate-errors-spki-list')
options.add_argument('--ignore-ssl-errors')
options.add_experimental_option('excludeSwitches', ['enable-logging'])
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

driver = webdriver.Chrome(options=options, executable_path='/usr/bin/chromedriver')
print("Current session is {}".format(driver.session_id))

driver.get(url)
html = BeautifulSoup(driver.page_source, "html.parser")  # name the parser explicitly to avoid a bs4 warning
innerContent = html.find('h1',class_="noticia titulo").text

Answer:

Yes, the website is using Cloudflare protection.

https://www.portaldoholanda.com.br/assaltante-surra/com-pedacos-de-madeira-populares-dao-surra-em-homem-assalt is using Cloudflare CDN/Proxy!

  

https://www.portaldoholanda.com.br/assaltante-surra/com-pedacos-de-madeira-populares-dao-surra-em-homem-assalt is using Cloudflare SSL!
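
The two lines above look like the output of a Cloudflare detection tool. A rough equivalent can be sketched in Python by looking at the response itself: responses proxied by Cloudflare normally carry a `Server: cloudflare` header and a `CF-RAY` header. The helper names `served_by_cloudflare` and `looks_like_challenge` are hypothetical, and the status codes and marker strings used to spot a challenge page are assumptions based on commonly seen pages, not an official list.

```python
def served_by_cloudflare(headers):
    """Heuristic: True when the response headers suggest a Cloudflare proxy."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {k.lower(): v for k, v in headers.items()}
    server = lowered.get("server", "").lower()
    return "cloudflare" in server or "cf-ray" in lowered


def looks_like_challenge(status_code, body):
    """Heuristic: True when the body looks like a Cloudflare challenge page.

    Assumption: challenge/block pages tend to come back with 403 or 503 and
    contain one of the marker strings below. These may change over time.
    """
    markers = ("just a moment", "attention required", "cf-chl")
    text = body.lower()
    return status_code in (403, 503) and any(m in text for m in markers)
```

With requests this could be called as `served_by_cloudflare(resp.headers)` and `looks_like_challenge(resp.status_code, resp.text)`, which helps tell a blocked request apart from a selector that simply does not match.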

Here is a working solution that uses cloudscraper instead of requests.

Script:

import cloudscraper
from bs4 import BeautifulSoup
scraper = cloudscraper.create_scraper(delay=10,   browser={'custom': 'ScraperBot/1.0',})
url = "https://www.portaldoholanda.com.br/assaltante-surra/com-pedacos-de-madeira-populares-dao-surra-em-homem-assalt"
req= scraper.get(url)
#print(req)

soup = BeautifulSoup(req.content, "html.parser")
txt=soup.find('h1',class_="noticia titulo").text
print(txt)

Output:

Com pedaços de madeira, populares dão surra em homem em Manaus; veja vídeo
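
When the request is blocked, or the page comes back different in another environment such as Colab, `soup.find(...)` returns `None` and calling `.text` on it raises `AttributeError`. A small guard makes that case explicit. This is only a sketch: `text_or_none` is a hypothetical helper name, and it relies on nothing more than the `get_text` method that BeautifulSoup tags provide.

```python
def text_or_none(tag):
    """Return the stripped text of a BeautifulSoup tag, or None if there is no tag."""
    if tag is None:  # find() returns None when nothing matched
        return None
    return tag.get_text(strip=True)
```

In the script above it could be used as `txt = text_or_none(soup.find('h1', class_="noticia titulo"))`; when `txt` is `None`, printing `req.status_code` and the first few hundred characters of `req.text` usually shows whether a challenge page was returned instead of the article.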
