Unable to extract/load all hrefs from an iframe (inside an HTML page) while parsing a web page

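I have been struggling with this all day and would really appreciate some help. I am trying to scrape this page: https://decisions.scc-csc.ca/scc-csc/en/d/s/index.do?cont=&ref=&d1=2012-01-01&d2=2022-01-31&p=&col=1&su=16&or= and I want to get all 137 hrefs (137 documents). The code I am using: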

    def test(self):
        final_url = 'https://decisions.scc-csc.ca/scc-csc/en/d/s/index.do?cont=&ref=&d1=2012-01-01&d2=2022-01-31&p=&col=1&su=16&or='
        self.driver.get(final_url)
        soup = BeautifulSoup(self.driver.page_source, 'html.parser')
        # the results are inside an iframe, so resolve its src and open it directly
        iframes = soup.find('iframe')
        src = iframes['src']
        base = 'https://decisions.scc-csc.ca/'
        main_url = urljoin(base, src)
        self.driver.get(main_url)
        browser = self.driver
        elem = browser.find_element_by_tag_name("body")
        # press PAGE_DOWN 20 times, hoping the page loads more rows
        no_of_pagedowns = 20
        while no_of_pagedowns:
            elem.send_keys(Keys.PAGE_DOWN)
            time.sleep(0.2)
            no_of_pagedowns -= 1

The problem is that it only loads the first 25 documents (hrefs), and I don't know how to get the rest.

Reply from a uj5u.com user:

This code scrolls down until all elements are visible, then saves the URLs of the PDFs in the list pdfs. Note that all the work is done with Selenium; BeautifulSoup is not used.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

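# options and your_chromedriver_path are not defined in the original answer;
# the two values below are placeholders to adjust for your own setup
options = webdriver.ChromeOptions()
your_chromedriver_path = '/path/to/chromedriver'
driver = webdriver.Chrome(options=options, service=Service(your_chromedriver_path))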
driver.get('https://decisions.scc-csc.ca/scc-csc/en/d/s/index.do?cont=&ref=&d1=2012-01-01&d2=2022-01-31&p=&col=1&su=16&or=')

# wait for the iframe to be loaded and then switch to it
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.ID, "decisia-iframe")))

# in this case number_of_results = 137
number_of_results = int(driver.find_element(By.XPATH, "//h2[contains(., 'result')]").text.split()[0])
pdfs = []

while len(pdfs) < number_of_results:
    pdfs = driver.find_elements(By.CSS_SELECTOR, 'a[title="Download the PDF version"]')
    # scroll down to the last visible row
    driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', pdfs[-1])
    time.sleep(1)

pdfs = [pdf.get_attribute('href') for pdf in pdfs]
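
The loop above keeps running until the number of links matches the count in the heading, so it would never finish if the page stopped loading rows early. One possible way to bound it is sketched below. This is only an illustration, not part of the original answer: the helper name collect_until_complete and its parameters are hypothetical, and it takes plain callables so that it does not depend on Selenium itself.

```python
import time
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def collect_until_complete(
    find_items: Callable[[], Sequence[T]],
    scroll_to: Callable[[T], None],
    expected_count: int,
    max_stalled_passes: int = 10,
    pause_seconds: float = 1.0,
) -> List[T]:
    """Scroll to the last found item until expected_count items are present.

    Gives up after max_stalled_passes consecutive passes that find no new
    items and returns whatever was found on the last pass.
    """
    items: List[T] = []
    stalled = 0
    while True:
        previous_count = len(items)
        items = list(find_items())
        if len(items) >= expected_count:
            return items
        # count consecutive passes that made no progress
        stalled = 0 if len(items) > previous_count else stalled + 1
        if stalled >= max_stalled_passes:
            return items
        if items:
            # scrolling to the last item is what triggers loading the next batch
            scroll_to(items[-1])
        time.sleep(pause_seconds)


# Hypothetical usage with the variables from the answer above:
#
# links = collect_until_complete(
#     lambda: driver.find_elements(By.CSS_SELECTOR, 'a[title="Download the PDF version"]'),
#     lambda el: driver.execute_script('arguments[0].scrollIntoView({block: "center"});', el),
#     number_of_results,
# )
# pdfs = [link.get_attribute('href') for link in links]
```

If the helper returns fewer links than number_of_results, the page stopped loading new rows before the expected count was reached, and the caller can decide whether to retry or to accept the partial list.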
