Scraping job descriptions from Indeed with Selenium

Similar topics exist, but I could not find an exact answer, so could you please help me?

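I copied the following code from the internet to scrape job postings from Indeed. The problem is that the code fails to scrape the job description.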

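When I use sum_div = job.find_elements_by_class_name('summary'), the code does not recognize 'summary', does not find where the job description is located, and cannot close the pop-up that Indeed shows.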

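I tried other identifiers, for example sum_div = job.find_element_by_class_name('job_seen_beacon'). With that, the code runs to the end and closes the pop-up, but it still does not locate the job description properly.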

Do you know how to solve this problem?

for i in range(0,50,10):
    driver.get('https://www.indeed.co.in/jobs?q=artificial intelligence&l=India&start=' + str(i))
    jobs = []
    driver.implicitly_wait(20)


for job in driver.find_elements_by_class_name('result'):
         
   
    #soup = BeautifulSoup(job.get_attribute('innerHTML'),'html.parser')
    result_html = job.get_attribute('innerHTML')
    soup = BeautifulSoup(result_html, 'html.parser')
    
    try:
        title = soup.find(class_="jobTitle").text
        
    except:
        title = 'None'


    try:
        location = soup.find(class_="companyLocation").text
    except:
        location = 'None'

    try:
        company = soup.find(class_="companyName").text.replace("\n","").strip()
    except:
        company = 'None'


    
    sum_div = job.find_elements_by_class_name('summary')
    #sum_div = job.find_element_by_class_name('job_seen_beacon')
    
    try: 
        sum_div.click()

    except:
        close_button = driver.find_elements_by_class_name('popover-x-button-close')
        close_button.click()
        sum_div.click()
        
    driver.implicitly_wait(2)
    
    try: 
        job_desc = driver.find_element_by_css_selector('div#vjs-desc').text
        print(job_desc)
    
    except:
        job_desc = 'None'   


    df = df.append({'Title':title,'Location':location,"Company":company,
                            "Description":job_desc},ignore_index=True)

Reply from a uj5u.com community member:

The URL is not dynamic, so there is no need to use Selenium. You can extract the required data with bs4 and requests. An example is given below.

P.S.: You can't use try/except unless every page contains 15 items.
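A side note on the URL, offered as a sketch rather than as part of the original answer: the query contains a space (artificial intelligence), and building the query string with urllib.parse.urlencode avoids hand-concatenation mistakes. The helper name build_search_url and its parameter names are hypothetical; the base URL and the q, l and start parameters are the ones used in this thread.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.indeed.co.in/jobs'  # base URL taken from the thread


def build_search_url(query, location, start):
    """Return a search URL with the query parameters percent-encoded.

    Hypothetical helper for illustration; not part of the original answer.
    """
    params = {'q': query, 'l': location, 'start': start}
    # urlencode escapes spaces as '+' and joins the pairs with '&'
    return BASE_URL + '?' + urlencode(params)


# Same offsets as range(0, 50, 10) in the thread: 0, 10, 20, 30, 40
urls = [build_search_url('artificial intelligence', 'India', i)
        for i in range(0, 50, 10)]
print(urls[0])
```

With requests, passing the same dictionary as requests.get(BASE_URL, params=params) should give an equivalent result without building the string by hand.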

from bs4 import BeautifulSoup
import requests
import pandas as pd
jobs = []
for i in range(0,50,10):
    url = 'https://www.indeed.co.in/jobs?q=artificial intelligence&l=India&start=' + str(i)
    req=requests.get(url)
    
    soup = BeautifulSoup(req.content, 'html.parser')
    for job in soup.select('.result'):
             
        try:
            title = job.find(class_="jobTitle").text
        
        except:
            title = 'None'


        try:
            location = job.find(class_="companyLocation").text
        except:
            location = 'None'

        try:
            company = job.find(class_="companyName").text.replace("\n","").strip()
        except:
            company = 'None'

        try: 
            job_desc = job.select_one('div.job-snippet ul ').get_text(strip=True)
        except:
            job_desc = 'None' 

        jobs.append({'Title':title,'Location':location,"Company":company,"Description":job_desc})  
   

df = pd.DataFrame(jobs)
print(df)
# To store the data:
# df.to_csv('data.csv', index=False)

Output:

                                                Title  ...                                      Description
0          newData Scientist: Artificial Intelligence  ...  As a Data Scientist at IBM, you will help tran...
1                             AI and Machine Learning  ...  A machine learning engineer (ML engineer) focu...
2                      newGraduate Intern - Technical  ...  DPEA enables that data center which is the und...
3   Artificial Intelligence & Machine Learning Expert  ...  Define and drive projects in AI and Machine Le...
4                              newML Data Associate I  ...  Good familiarity with the Windows desktop envi...
..                                                ...  ...                                              ...
70                                  newData Scientist  ...  Perform data analysis and modelling on data se...
71          AI, Informatics & ML – Research Scientist  ...  Years of experience 2-4 yrs.Key Responsibiliti...
72                               Software Development  ...  Software Developers at IBM are the backbone of...
73            newB2B/EDI - Map Development Specialist  ...  Software Developers at IBM are the backbone of...
74  Artificial Intelligence / Data Science/ Machin...  ...  TATA ELXSI Ltd. is conducting off campus drive...

[75 rows x 4 columns]
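
The repeated try/except blocks in the example could also be folded into a single helper. The following is a minimal sketch, assuming the same class names as above (jobTitle, companyLocation, companyName, job-snippet); the function text_or_default and the inline sample HTML are made up for illustration, and Indeed's real markup may differ or change.

```python
from bs4 import BeautifulSoup

# Made-up sample markup that mimics the class names used in the thread.
SAMPLE_HTML = """
<div class="result">
  <h2 class="jobTitle">Data Scientist</h2>
  <span class="companyName">
    Example Corp
  </span>
  <div class="job-snippet"><ul><li>Build models.</li></ul></div>
</div>
"""


def text_or_default(node, selector, default='None'):
    """Return the stripped text of the first match for a CSS selector.

    Hypothetical helper: select_one returns None when nothing matches,
    so the default is returned instead of raising AttributeError.
    """
    match = node.select_one(selector)
    return match.get_text(strip=True) if match is not None else default


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
jobs = []
for job in soup.select('.result'):
    jobs.append({
        'Title': text_or_default(job, '.jobTitle'),
        'Location': text_or_default(job, '.companyLocation'),  # absent in sample
        'Company': text_or_default(job, '.companyName'),
        'Description': text_or_default(job, 'div.job-snippet ul'),
    })
print(jobs)
```

Checking for None explicitly, instead of catching every exception, keeps unrelated errors such as network failures or typos visible.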


Tags: Python, selenium, web-scraping, beautifulsoup
