Scraping blog post titles with Selenium - Python

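I am trying to use Selenium and Python to scrape the blog post titles from the following URL: https://blog.coinbase.com/tagged/coinbase-pro. When I get the page source through Selenium, it does not contain the blog post titles, but the source Chrome shows when I right-click and choose "View page source" does contain them. I am using the following code: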

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # replaces `options.headless = True`, which is deprecated in recent Selenium 4 releases
driver = webdriver.Chrome(options=options)
driver.get("https://blog.coinbase.com/tagged/coinbase-pro")
pageSource = driver.page_source
print(pageSource)

Any help would be greatly appreciated. Thank you.

A uj5u.com user replied:

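# Assumes `driver` is the webdriver.Chrome instance created in the question.
wait = WebDriverWait(driver, 30)
driver.get("https://blog.coinbase.com/tagged/coinbase-pro")
# Block for up to 30 s until every element matching the selector is visible.
elements = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".graf.graf--h3.graf-after--figure.graf--trailing.graf--title")))
for elem in elements:
    print(elem.text)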

If you want those 8 titles, you can wait for them to become visible and then fetch them by their CSS selector.

Imports:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC

Output:

Inverse Finance (INV), Liquity (LQTY), Polyswarm (NCT) and Propy (PRO) are launching on Coinbase Pro
Goldfinch Protocol (GFI) is launching on Coinbase Pro
Decentralized Social (DESO) is launching on Coinbase Pro
API3 (API3), Bluezelle (BLZ), Gods Unchained (GODS), Immutable X (IMX), Measurable Data Token (MDT) and Ribbon…
Circuits of Value (COVAL), IDEX (IDEX), Moss Carbon Credit (MCO2), Polkastarter (POLS), ShapeShift FOX Token (FOX)…
Voyager Token (VGX) is launching on Coinbase Pro
Alchemix (ALCX), Ethereum Name Service (ENS), Gala (GALA), mStable USD (MUSD) and Power Ledger (POWR) are launching…
Crypto.com Protocol (CRO) is launching on Coinbase Pro
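
The explicit wait in this answer can be pictured as a polling loop: keep checking a condition until it yields something truthy or a timeout expires. Presumably the titles are rendered some time after the initial page load, which would explain why reading page_source immediately misses them. The sketch below is only an illustration of that idea and not Selenium's actual implementation; `wait_until` and `FakePage` are hypothetical names invented here, and the fake page simply pretends its titles appear on the third check.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Hypothetical stand-in for a page whose titles show up only after a few polls.
class FakePage:
    def __init__(self, ready_after):
        self.calls = 0
        self.ready_after = ready_after

    def find_titles(self):
        self.calls += 1
        return ["Example title"] if self.calls >= self.ready_after else []


page = FakePage(ready_after=3)
titles = wait_until(page.find_titles, timeout=5, poll_interval=0.01)
print(titles)
```

In the real answer, WebDriverWait plays the role of `wait_until` and the expected condition plays the role of `page.find_titles`.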

A uj5u.com user replied:

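There are several ways to get all the titles from that page. The fastest and most efficient is to use requests.

Here is how you can get the titles using requests: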

import re
import json
import time
import requests

link = 'https://medium.com/the-coinbase-blog/load-more'
params = {
    'sortBy': 'tagged',
    'tagSlug': 'coinbase-pro',
    'limit': 25,
    'to': int(time.time() * 1000),
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    s.headers['accept'] = 'application/json'
    s.headers['referer'] = 'https://blog.coinbase.com/tagged/coinbase-pro'
    
    while True:
        res = s.get(link, params=params)
        # Skip everything before the first "{" (a non-JSON prefix), then parse the rest as JSON.
        container = json.loads(re.findall("[^{]+(.*)", res.text)[0])
        for k,v in container['payload']['references']['Post'].items():
            title = v['title']
            print(title)

        try:
            next_page = container['payload']['paging']['next']['to']
        except KeyError:
            break

        params['to'] = next_page
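
The regex step in the requests snippet above appears to exist because the endpoint's response is not pure JSON: it seems to start with a short guard prefix before the first "{". The sketch below shows one way to handle that without a regex, plus defensive lookups for the titles and the paging cursor. It is offline and makes several assumptions: the prefix `])}while(1);</x>` is what Medium endpoints have commonly returned but is not confirmed by the answer, the sample post IDs and titles are made up, and the helper names are hypothetical.

```python
import json

# Assumed shape of a response: a non-JSON guard prefix followed by the JSON
# document. The post IDs, titles, and cursor value here are invented.
SAMPLE_RESPONSE = (
    '])}while(1);</x>'
    '{"payload": {"references": {"Post": {'
    '"id1": {"title": "First post"}, '
    '"id2": {"title": "Second post"}}}, '
    '"paging": {"next": {"to": "1640000000000"}}}}'
)


def parse_prefixed_json(text):
    """Drop everything before the first '{' and parse the rest as JSON."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in response text")
    return json.loads(text[start:])


def extract_titles(container):
    """Return post titles from payload -> references -> Post, if present."""
    posts = container.get("payload", {}).get("references", {}).get("Post", {})
    return [post["title"] for post in posts.values() if "title" in post]


def next_cursor(container):
    """Return the next 'to' cursor, or None when there is no further page."""
    paging = container.get("payload", {}).get("paging", {})
    return paging.get("next", {}).get("to")


container = parse_prefixed_json(SAMPLE_RESPONSE)
print(extract_titles(container))
print(next_cursor(container))
```

Returning None from `next_cursor` gives the pagination loop an explicit stop condition instead of relying on a KeyError.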

However, if you want to stick with Selenium, try the following:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC

def scroll_down_to_the_bottom():
    check_height = driver.execute_script("return document.body.scrollHeight;")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        try:
            # Wait up to 10 s for the page to grow after scrolling; a timeout means the bottom was reached.
            WebDriverWait(driver, 10).until(lambda driver: driver.execute_script("return document.body.scrollHeight;") > check_height)
            check_height = driver.execute_script("return document.body.scrollHeight;")
        except TimeoutException:
            break

with webdriver.Chrome() as driver:
    driver.get("https://blog.coinbase.com/tagged/coinbase-pro")
    scroll_down_to_the_bottom()
    for item in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".section-content h3.graf--title"))):
        print(item.text)

