Collecting data that is displayed in the browser but not in the response

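Situation

I am trying to scrape a web page to get some data. For my application I need the HTML data that can be seen in the browser, taken as a whole.

Problem

When I scrape some URLs, however, I get data that cannot be seen in the browser even though it is present in the HTML code. Is there a way to scrape only the data that is visible in the browser?

Code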

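    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.common.exceptions import WebDriverException
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    service = Service("/home/nebu/selenium_drivers/chromedriver")

    URL = "https://augustasymphony.com/event/top-of-the-world/"
    html_content = ""  # stays empty if the page cannot be loaded
    driver = None
    try:
        driver = webdriver.Chrome(service=service, options=options)
        driver.implicitly_wait(2)  # only affects element lookups, so set it before loading
        driver.get(URL)
        html_content = driver.page_source
    except WebDriverException as exc:
        print(f"Could not load {URL}: {exc}")
    finally:
        # Quit only if the driver was actually created.
        if driver is not None:
            driver.quit()

    # Name the parser explicitly so the result does not depend on what is installed.
    soup = BeautifulSoup(html_content, "html.parser")
    # Drop the page header and footer before extracting the text.
    for tag_name in ["header", "footer"]:
        tag = soup.find(tag_name)
        if tag is not None:
            tag.extract()
    text = soup.get_text(separator=" ")
    print(text)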

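Question

Where am I going wrong? How can I debug this?

Answer from a uj5u.com user:

This is simply a case where you need to extract the data in a more specific way.

You really have two options:

Option 1 (better in my opinion, because it is faster and uses fewer resources):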

    import requests
    from bs4 import BeautifulSoup as bs

    # Browser-like headers sent with the request.
    headers = {'Accept': '*/*',
     'Connection': 'keep-alive',
     'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683 Safari/537.36 OPR/57.0.3098.91'}
    res = requests.get("https://augustasymphony.com/event/top-of-the-world/", headers=headers)
    soup = bs(res.text, "lxml")

    # Select only the elements that hold the data of interest.
    event_header = soup.find("h2", {"class": "rhino-event-header"}).text.strip()
    time = soup.find("p", {"class": "rhino-event-time"}).text.strip()

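You can find the data quite simply with requests, as the code above shows, by selecting exactly the data you want and saving it in a dictionary. This is the normal way to handle it. The page may contain a lot of scripts, but it does not need JavaScript to load this data dynamically.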
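As a rough sketch of the "save it in a dictionary" step, the snippet below wraps the two lookups in a helper that returns `None` when an element is missing, since `soup.find(...)` returns `None` in that case and calling `.text` on it would raise an error. The helper names `text_or_none` and `extract_event`, the inline sample markup, and the use of the built-in `html.parser` instead of `lxml` are assumptions made for illustration; only the two class names come from the answer.

```python
from bs4 import BeautifulSoup


def text_or_none(soup, tag, class_name):
    """Return the stripped text of the first matching element, or None if absent."""
    node = soup.find(tag, class_=class_name)
    return node.get_text(strip=True) if node is not None else None


def extract_event(html):
    """Build a dictionary of cleaned fields from one event page (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "event_header": text_or_none(soup, "h2", "rhino-event-header"),
        "time": text_or_none(soup, "p", "rhino-event-time"),
    }


# Made-up markup that only mimics the class names used in the answer.
sample_html = """
<h2 class="rhino-event-header"> Top of the World </h2>
<p class="rhino-event-time">Example time text</p>
"""
event = extract_event(sample_html)
missing = extract_event("<p>no event markup here</p>")
print(event)
print(missing)
```

With a real page, `res.text` from the request above would be passed to `extract_event` in place of `sample_html`.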

Option 2:

You keep using Selenium and collect the whole body of the page with one of several selectors.

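    from selenium.webdriver.common.by import By

    # Selenium 4 lookup style; the find_element_by_* helpers were removed in Selenium 4.3.
    driver.find_element(By.ID, 'wrapper').get_attribute('innerHTML')  # entire body
    driver.find_element(By.ID, 'tribe-events').get_attribute('innerHTML')  # the events list
    driver.find_element(By.ID, 'rhino-event-single-content').get_attribute('innerHTML')  # the single event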

This second option is more a matter of grabbing the entire HTML and dumping it.
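If the dumped HTML still contains text that the browser never renders, one possible way to filter it is sketched below using only the standard library. It is an approximation under stated assumptions: the markup is well formed (every non-void tag is closed), and hiding is done through `head`, `script`, `style`, `template` or `noscript` elements, the `hidden` attribute, or an inline `display:none` style. Elements hidden through external stylesheets are not detected; for those a real browser is needed, for example Selenium's `WebElement.text`, which returns the rendered text of an element. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Elements that have no closing tag and therefore never open a nested region.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose content is not rendered as page text (assumes JavaScript is enabled).
NON_RENDERED_TAGS = {"head", "script", "style", "template", "noscript"}


class VisibleTextParser(HTMLParser):
    """Collect text that is not inside an obviously hidden element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a hidden subtree
        self.chunks = []

    @staticmethod
    def _is_hidden(tag, attrs):
        attr_map = dict(attrs)
        style = (attr_map.get("style") or "").replace(" ", "").lower()
        return (tag in NON_RENDERED_TAGS
                or "hidden" in attr_map
                or "display:none" in style)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside a hidden subtree, every nested tag deepens it.
        if self.skip_depth or self._is_hidden(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up markup for illustration only.
sample = (
    '<html><head><title>T</title></head><body>'
    '<h2 class="rhino-event-header">Top of the World</h2>'
    '<div style="display: none">secret</div>'
    '<p hidden>also secret</p>'
    '<script>var x = 1;</script>'
    '<p>7:30 PM<br>Friday</p>'
    '</body></html>'
)
print(visible_text(sample))
```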

Personally, I would go with the first option and build a dictionary of cleaned data.

Edit:

To explain my example further:


    import requests
    from bs4 import BeautifulSoup as bs

    headers = {'Accept': '*/*',
     'Connection': 'keep-alive',
     'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683 Safari/537.36 OPR/57.0.3098.91'}
    # Fetch the event listing page.
    res = requests.get("https://augustasymphony.com/event/", headers=headers)
    soup = bs(res.text, "lxml")
    # Collect the unique event URLs, skipping anchors without an href and the iCal export links.
    seedlist = {
        a["href"]
        for a in soup.find("div", {"id": "tribe-events-content-wrapper"}).find_all("a", href=True)
        if '?ical=1' not in a["href"]
    }
    # Visit each event page and pull out the fields of interest.
    for seed in seedlist:
        res = requests.get(seed, headers=headers)
        soup = bs(res.text, "lxml")
        data = dict()
        data['event_header'] = soup.find("h2", {"class": "rhino-event-header"}).text.strip()
        data['time'] = soup.find("p", {"class": "rhino-event-time"}).text.strip()
        print(data)

Here I build a seed list of event URLs and then visit each URL in the list to find the information.
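The seed-list step can also be looked at in isolation. The sketch below is one possible variant, not the answer's exact code: it resolves relative links against the listing URL, drops links carrying an `ical` query parameter, and removes duplicates while keeping the original order (the set comprehension above does not preserve order). The function name `build_seed_list` and the sample links, including `example-concert`, are hypothetical.

```python
from urllib.parse import parse_qs, urljoin, urlsplit


def build_seed_list(base_url, hrefs):
    """Resolve hrefs against base_url, drop iCal export links, and de-duplicate."""
    seeds = []
    seen = set()
    for href in hrefs:
        absolute = urljoin(base_url, href)
        query = parse_qs(urlsplit(absolute).query)
        if "ical" in query:
            continue  # calendar export link, not an event page
        if absolute not in seen:
            seen.add(absolute)
            seeds.append(absolute)
    return seeds


# Made-up hrefs standing in for the anchors found on the listing page.
sample_hrefs = [
    "https://augustasymphony.com/event/top-of-the-world/",
    "/event/top-of-the-world/",
    "https://augustasymphony.com/event/top-of-the-world/?ical=1",
    "https://augustasymphony.com/event/example-concert/",
]
seeds = build_seed_list("https://augustasymphony.com/event/", sample_hrefs)
print(seeds)
```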

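Answer from a uj5u.com user:

This is because some websites detect whether the client is a web browser.

If it is not, they do not send back the HTML document.

That is why no HTML is returned.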
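A common way to test this explanation is to send browser-like headers and compare the response with one fetched without them. The sketch below only builds the request with the standard library and sends nothing. The helper name `build_request` is hypothetical, and the header values are copied from the first answer. Whether a given site actually checks these headers is an assumption that has to be verified per site.

```python
from urllib.request import Request

# Header values taken from the first answer's example.
BROWSER_HEADERS = {
    "Accept": "*/*",
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/73.0.3683 Safari/537.36 OPR/57.0.3098.91"
    ),
}


def build_request(url, headers=None):
    """Return a Request carrying browser-like headers (nothing is sent yet)."""
    merged = {**BROWSER_HEADERS, **(headers or {})}
    return Request(url, headers=merged)


request = build_request("https://augustasymphony.com/event/top-of-the-world/")
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

Passing the request to `urllib.request.urlopen` would perform the actual fetch. The `requests.get(..., headers=headers)` call in the first answer serves the same purpose.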

Please credit the source when reposting. Link to this article: https://www.uj5u.com/gongcheng/338303.html

Tags: Python, Selenium, web scraping, Beautiful Soup
