Indeed web scrape (Selenium): script writes only one page of the DataFrame to CSV / long run time

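I am currently learning Python for web scraping, and I've run into a problem with my current script. After closing the popup on page 2 of Indeed and looping through the pages, the script writes only one page of the DataFrame to the CSV. However, it does print every page in my terminal. Sometimes it also returns only part of a page's data. For example, for page 2, print(df_da) shows the information for the first 3 jobs but nothing for the next 12. The script also seems to take a long time to run (about 6 minutes 45 seconds on average for 5 pages, roughly 1 to 1.5 minutes per page). Any suggestions? I have attached my script, and I can also attach the output I get from print(df_da) if needed. Thanks in advance!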

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("window-size=1400,1400")

PATH = "C://Program Files (x86)//chromedriver.exe"
driver = webdriver.Chrome(PATH)

for i in range(0,50,10):
    driver.get('https://www.indeed.com/jobs?q=chemical engineer&l=united states&start=' + str(i))
    driver.implicitly_wait(5)

    jobtitles = []
    companies = []
    locations = []
    descriptions = []



    jobs = driver.find_elements_by_class_name("slider_container")

    for job in jobs:

        jobtitle = job.find_element_by_class_name('jobTitle').text.replace("new", "").strip()
        jobtitles.append(jobtitle)
        company = job.find_element_by_class_name('companyName').text.replace("new", "").strip()
        companies.append(company)
        location = job.find_element_by_class_name('companyLocation').text.replace("new", "").strip()
        locations.append(location)
        description = job.find_element_by_class_name('job-snippet').text.replace("new", "").strip()
        descriptions.append(description)
        try:
            WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "button.popover-x-button-close.icl-CloseButton"))).click()
        except:
            pass



    df_da=pd.DataFrame()
    df_da['JobTitle']=jobtitles
    df_da['Company']=companies
    df_da['Location']=locations
    df_da['Description']=descriptions
    print(df_da)
    df_da.to_csv('C:/Users/Dan/Desktop/AZNext/file_name1.csv')

Answer:

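You are defining df_da inside the outer for loop, so df_da ends up containing only the last page's data. The four lists are also reset on every page, and the CSV file is overwritten each time.
You should define it outside the loop and fill it only once, after all the data has been collected.
I guess you are not getting all the jobs on the second page because of the popup, so you should close it before collecting the job details on that page.
You can also stop waiting for the popup after every single job and wait for it only on the second iteration of the inner loop. In your script the wait of up to 5 seconds runs once per job, so a page of about 15 jobs can spend roughly 75 seconds waiting when no popup appears, which likely explains the 1 to 1.5 minutes per page.
Your code could look like this: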

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("window-size=1400,1400")

PATH = "C://Program Files (x86)//chromedriver.exe"
driver = webdriver.Chrome(PATH, options=options)  # pass the options so the window-size argument takes effect

jobtitles = []
companies = []
locations = []
descriptions = []

for i in range(0,50,10):
    driver.get('https://www.indeed.com/jobs?q=chemical engineer&l=united states&start=' + str(i))
    driver.implicitly_wait(5)

    jobs = driver.find_elements_by_class_name("slider_container")

    for idx, job in enumerate(jobs):
        # Wait for the popup only once per page (at the second job) instead of after every job.
        if idx == 1:
            try:
                WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "button.popover-x-button-close.icl-CloseButton"))).click()
            except:
                pass

        jobtitle = job.find_element_by_class_name('jobTitle').text.replace("new", "").strip()
        jobtitles.append(jobtitle)
        company = job.find_element_by_class_name('companyName').text.replace("new", "").strip()
        companies.append(company)
        location = job.find_element_by_class_name('companyLocation').text.replace("new", "").strip()
        locations.append(location)
        description = job.find_element_by_class_name('job-snippet').text.replace("new", "").strip()
        descriptions.append(description)

# Build the DataFrame once, after all pages have been scraped.
df_da=pd.DataFrame()
df_da['JobTitle']=jobtitles
df_da['Company']=companies
df_da['Location']=locations
df_da['Description']=descriptions
print(df_da)
df_da.to_csv('C:/Users/Dan/Desktop/AZNext/file_name1.csv')
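
To see why the position of the list definitions matters, here is a minimal sketch that uses neither Selenium nor pandas. The names FAKE_PAGES, scrape_resetting and scrape_accumulating are hypothetical stand-ins invented for this illustration, and the job rows are made-up sample data, not real Indeed results.

```python
# Hypothetical stand-in for scraped pages: start offset -> rows found on that page.
FAKE_PAGES = {
    0: [("Process Engineer", "Acme"), ("Chemical Engineer", "Globex")],
    10: [("Plant Engineer", "Initech")],
}


def scrape_resetting(pages):
    """Mimics the question's script: the list is recreated on every page."""
    rows = []
    for start in pages:
        rows = []  # reset on each iteration, so rows from earlier pages are lost
        rows.extend(pages[start])
    return rows


def scrape_accumulating(pages):
    """Mimics the answer's script: the list is created once, before the loop."""
    rows = []
    for start in pages:
        rows.extend(pages[start])  # rows from every page are kept
    return rows


if __name__ == "__main__":
    print(len(scrape_resetting(FAKE_PAGES)))     # only the last page survives
    print(len(scrape_accumulating(FAKE_PAGES)))  # all pages are kept
```

The same reasoning applies to df_da and to_csv: if they sit inside the page loop, each page overwrites the previous result, so only the last page reaches the CSV.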
