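I have just started learning how to use Python to scrape job portal sites, so please bear with me, as I may ask some very basic questions.
Situation: I managed to put together the following lines: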
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome('C:/Users/ - Home/Desktop/Web Scraper/chromedriver.exe')
driver.get('https://www.mycareersfuture.gov.sg/search?sortBy=relevancy&page=0')
results = []
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
listing = soup.find('div', class_='card-list')
job = listing.find('p')
print(job)
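Complication: I can't seem to extract the following items from the job cards:
- Job title
- Company name
- Salary
I have looked at several tutorials, and each one says to look for an h2 tag or a div with the corresponding class. However, the site I am scraping does not seem to expose these so clearly.
Website link: https://www.mycareersfuture.gov.sg/search?sortBy=relevancy&page=0
For example, I inspected the HTML and found that the job title is somewhere in this line; however, I can't seem to extract it.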
<span data-cy="job-card__job-title" class="f4-5 fw6 mv0 dib mr2 brand-sec JobCard__jobtitle___3HqOw" style="overflow-wrap: break-word;">2402 - IT Manager [ Amber Rd / / 5 days ]</span>
I would really appreciate any help with this. I have been looking for a solution all night, to no avail...
A uj5u.com user replied:
One possible solution:
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
options = webdriver.ChromeOptions()
# set headless mode
# options.add_argument("--headless")
# disable chromedriver log message in cmd
options.add_experimental_option("excludeSwitches", ["enable-automation", "enable-logging"])
# raw string, so the backslashes in the Windows path are not read as escapes such as \t
service = Service(executable_path=r'path\to\your\chromedriver.exe')
driver = webdriver.Chrome(service=service, options=options)
# set an explicit wait (10 sec)
wait = WebDriverWait(driver, 10)
url = 'https://www.mycareersfuture.gov.sg/search?sortBy=relevancy&page=0'
# page where parsing will stop
last_page = 5225
# loads a web page
driver.get(url)
while True:
    try:
        # wait (max 10 sec) until at least one element matching each CSS selector is present on the page
        company_names = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'p[data-testid="company-hire-info"]')))
        job_titles = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'span[data-cy="job-card__job-title"]')))
        # NOTE: the attribute filter inside 'div[]' is missing from the source; as written this is not
        # a valid CSS selector and must be completed by inspecting the salary element on the page
        salaries = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div[]')))
    except TimeoutException:
        # on timeout, refresh the page and try again
        driver.refresh()
        continue
    # get data from the received web elements
    for data in zip(company_names, job_titles, salaries):
        data = {
            'Company name': data[0].text,
            'Job title': data[1].text,
            'Salary': data[2].text
        }
        # append the row to the csv file
        with open(file='mycareersfuture.csv', mode='a', encoding="utf-8") as f:
            writer = csv.writer(f, lineterminator='\n')
            writer.writerow([data['Company name'], data['Job title'], data['Salary']])
    # wait until the Next button is present in the DOM, then click it
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'button[aria-label="Next"]'))).click()
    # stop parsing once the current URL ends with last_page
    if driver.current_url.endswith(str(last_page)):
        break
driver.quit()
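The job-title selector can also be tried with BeautifulSoup, as in the question, on HTML that has already rendered. The following is only a minimal sketch: it inlines the span quoted in the question instead of a live page, and it assumes that in practice `html` would be `driver.page_source` taken after the explicit wait has succeeded. It matches on the `data-cy` attribute rather than on the class name, which looks build-generated and may change.

```python
from bs4 import BeautifulSoup

# Inlined copy of the job-title span from the question (assumption: in a real
# run this string would come from driver.page_source after the wait succeeds).
html = (
    '<span data-cy="job-card__job-title" '
    'class="f4-5 fw6 mv0 dib mr2 brand-sec JobCard__jobtitle___3HqOw" '
    'style="overflow-wrap: break-word;">'
    '2402 - IT Manager [ Amber Rd / / 5 days ]</span>'
)

soup = BeautifulSoup(html, 'html.parser')

# Select every span whose data-cy attribute equals the job-title marker
# and collect its text content.
titles = [span.get_text(strip=True)
          for span in soup.select('span[data-cy="job-card__job-title"]')]
print(titles)
```

If the list comes back empty on the real page, the likely cause is that the page source was read before the job cards were rendered, which is what the explicit wait in the answer is meant to address.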
Output in mycareersfuture.csv:
THE SUPREME HR ADVISORY PTE. LTD.,2402 - IT Manager [ Amber Rd / / 5 days ],$6 500to$7 000
TRITON AI PTE. LTD.,"Property Executive, Town Council (Facilities Management)",$2 000to$3 000
PISTACHIO RESTAURANT PTE. LTD.,Service Crew / Supervisor,$1 700to$3 000
THE SUPREME HR ADVISORY PTE. LTD.,2402 - Quantity Surveyor [ Admiralty / 5 days ],$3 000to$3 500
THE SUPREME HR ADVISORY PTE. LTD.,2402 - WSH Co-ordinator [ 5 days / WSQ Advanced Cert ],$2 200to$3 500
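The salary column is saved as raw text such as `$6 500to$7 000`, so it usually needs a little cleanup before it can be sorted or filtered. The sketch below reads two of the rows shown above with `csv.reader` and splits the salary into numeric bounds. `parse_salary` and the dictionary keys are hypothetical names introduced for illustration, and the sketch assumes every salary follows the same `$lowto$high` pattern as these rows.

```python
import csv
import io
import re

# Two rows copied from the output above; in a real run, open
# mycareersfuture.csv instead of using this inlined sample.
sample = (
    'THE SUPREME HR ADVISORY PTE. LTD.,2402 - IT Manager [ Amber Rd / / 5 days ],$6 500to$7 000\n'
    'TRITON AI PTE. LTD.,"Property Executive, Town Council (Facilities Management)",$2 000to$3 000\n'
)


def parse_salary(text):
    """Hypothetical helper: turn '$6 500to$7 000' into (6500, 7000)."""
    # Grab the digits (with embedded spaces or commas) that follow each '$'.
    amounts = re.findall(r'\$([\d\s,]+)', text)
    # Drop the separators and convert each amount to an integer.
    return tuple(int(re.sub(r'[\s,]', '', amount)) for amount in amounts)


rows = []
for company, title, salary in csv.reader(io.StringIO(sample)):
    low, high = parse_salary(salary)
    rows.append({'company': company, 'title': title, 'min': low, 'max': high})

for row in rows:
    print(row)
```

Reading the file with `csv.reader` rather than splitting on commas matters here, because job titles such as the second row's are quoted and contain a comma themselves.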
