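I am currently learning Python for web scraping, and I have run into a problem with my current script. After closing the popup on page 2 of Indeed and looping through the pages, the script writes only one page from the DataFrame to the CSV. It does, however, print every page in my terminal. Sometimes it also returns only part of a page's data. For example, page 2 returns the information for the first 3 jobs as part of my print(df_da), but the next 12 have no information. The script also seems to take a long time to run (about 6 minutes 45 seconds on average for 5 pages, roughly 1 to 1.5 minutes per page). Any suggestions? I have attached my script, and I can also attach the output I get from print(df_da) if needed. Thank you in advance!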
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("window-size=1400,1400")
PATH = "C://Program Files (x86)//chromedriver.exe"
driver = webdriver.Chrome(PATH)

for i in range(0,50,10):
    driver.get('https://www.indeed.com/jobs?q=chemical engineer&l=united states&start=' + str(i))
    driver.implicitly_wait(5)
    jobtitles = []
    companies = []
    locations = []
    descriptions = []
    jobs = driver.find_elements_by_class_name("slider_container")
    for job in jobs:
        jobtitle = job.find_element_by_class_name('jobTitle').text.replace("new", "").strip()
        jobtitles.append(jobtitle)
        company = job.find_element_by_class_name('companyName').text.replace("new", "").strip()
        companies.append(company)
        location = job.find_element_by_class_name('companyLocation').text.replace("new", "").strip()
        locations.append(location)
        description = job.find_element_by_class_name('job-snippet').text.replace("new", "").strip()
        descriptions.append(description)
    try:
        WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "button.popover-x-button-close.icl-CloseButton"))).click()
    except:
        pass
    df_da=pd.DataFrame()
    df_da['JobTitle']=jobtitles
    df_da['Company']=companies
    df_da['Location']=locations
    df_da['Description']=descriptions
    print(df_da)
    df_da.to_csv('C:/Users/Dan/Desktop/AZNext/file_name1.csv')
A helpful uj5u.com user replied:
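You are defining df_da inside the outer for loop, so df_da only ever contains the last page's data.
You should define it outside the loop and fill it with the full data only after everything has been collected. The same applies to the four lists that feed it.
I suspect you are not getting all the jobs on the second page because of the popup, so you should close it before collecting the job details on that page.
You can also drop the wait for the popup element from the other loop iterations and keep it only for the second one.
Your code could look like this: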
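To see why only the last page survives, here is a browser-free sketch of the two patterns. The `pages` data and both function names are made up for illustration; nothing here touches Selenium or Indeed.

```python
# Hypothetical stand-in for the job titles found on three result pages.
pages = [
    ["Process Engineer", "Chemical Engineer I"],
    ["Plant Engineer"],
    ["R&D Engineer", "Chemical Engineer II"],
]


def collect_resetting(pages):
    """Mimics the question: the list is recreated on every page."""
    for page in pages:
        jobtitles = []            # reset here, so earlier pages are lost
        for title in page:
            jobtitles.append(title)
    return jobtitles


def collect_accumulating(pages):
    """Mimics the fix: the list is created once, before the page loop."""
    jobtitles = []
    for page in pages:
        for title in page:
            jobtitles.append(title)
    return jobtitles


print(collect_resetting(pages))     # only the last page's titles
print(collect_accumulating(pages))  # titles from every page
```

Passing the accumulated lists to `pd.DataFrame` once, after the loop, then gives a single frame that covers all pages.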
import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("window-size=1400,1400")
PATH = "C://Program Files (x86)//chromedriver.exe"
# Pass the options in; otherwise the window-size argument is never applied.
# The positional driver path is the older Selenium style used in the question.
driver = webdriver.Chrome(PATH, options=options)
driver.implicitly_wait(5)  # set once; it applies to the whole session

# Create the lists once, outside the page loop, so every page accumulates.
jobtitles = []
companies = []
locations = []
descriptions = []

for page_idx, start in enumerate(range(0, 50, 10)):
    driver.get('https://www.indeed.com/jobs?q=chemical engineer&l=united states&start=' + str(start))
    # The popup shows up on the second page; close it before reading the jobs.
    if page_idx == 1:
        try:
            WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "button.popover-x-button-close.icl-CloseButton"))).click()
        except TimeoutException:
            pass  # no popup appeared within 5 seconds
    jobs = driver.find_elements(By.CLASS_NAME, "slider_container")
    for job in jobs:
        jobtitle = job.find_element(By.CLASS_NAME, 'jobTitle').text.replace("new", "").strip()
        jobtitles.append(jobtitle)
        company = job.find_element(By.CLASS_NAME, 'companyName').text.replace("new", "").strip()
        companies.append(company)
        location = job.find_element(By.CLASS_NAME, 'companyLocation').text.replace("new", "").strip()
        locations.append(location)
        description = job.find_element(By.CLASS_NAME, 'job-snippet').text.replace("new", "").strip()
        descriptions.append(description)

# Build the DataFrame once, after all pages have been collected.
df_da = pd.DataFrame()
df_da['JobTitle'] = jobtitles
df_da['Company'] = companies
df_da['Location'] = locations
df_da['Description'] = descriptions
print(df_da)
df_da.to_csv('C:/Users/Dan/Desktop/AZNext/file_name1.csv')
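The order the answer recommends (load the page, dismiss the popup on the second page only, then read the job cards) can be checked without a browser by passing the browser actions in as plain callables. This is only a sketch: `scrape_pages`, `load_page`, `dismiss_popup`, and `read_jobs` are hypothetical names, and in a real script they would wrap `driver.get`, the `WebDriverWait(...).click()` call, and the element lookups.

```python
def scrape_pages(starts, load_page, dismiss_popup, read_jobs, popup_page=1):
    """Collect rows from every page, dismissing the popup on one page only."""
    rows = []
    for page_idx, start in enumerate(starts):
        load_page(start)
        if page_idx == popup_page:
            dismiss_popup()          # must run before the cards are read
        rows.extend(read_jobs())
    return rows


# Stub callables that record the order in which they are invoked.
calls = []
fake_pages = {0: ["job A"], 10: ["job B", "job C"], 20: ["job D"]}
current = {"start": None}


def load_page(start):
    calls.append(("load", start))
    current["start"] = start


def dismiss_popup():
    calls.append(("dismiss", current["start"]))


def read_jobs():
    calls.append(("read", current["start"]))
    return fake_pages[current["start"]]


rows = scrape_pages(range(0, 30, 10), load_page, dismiss_popup, read_jobs)
print(rows)   # rows gathered from all three fake pages
print(calls)  # the recorded call order
```

Keeping the browser calls behind small functions like these also means the popup wait is paid on one page only, and not once per page.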
