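I'm new to web scraping and am scraping a site for practice. I've run into a problem: I only want to scrape the job titles, but my code picks up other spans too, including the ones that just say "new". Here is my code: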
from bs4 import BeautifulSoup as bs
import requests


def extract(page):
    url = f'https://ph.indeed.com/jobs?q=python developer&l=Manila&start={page}'
    r = requests.get(url)
    soup = bs(r.content, 'lxml')
    return soup


def transform(soup):
    results = soup.find_all('div', class_='slider_container')
    for item in results:
        job_title = item.find('span').text
        print(job_title)


c = extract(0)
transform(c)
When I run the code, the result is:
new
new
Python Developer
Python Developer
new
Jr. Python Developer
Python Developer
Python Developer
Python Developer
new
new
Junior Web Developer (Web Scraping)
new
Junior Web Developer Fullstack
Back End Developer (Work-from-Home)
The expected output is the job titles only, without "new":
Python Developer
Python Developer
Jr. Python Developer
Python Developer
Python Developer
Python Developer
Junior Web Developer (Web Scraping)
Junior Web Developer Fullstack
Back End Developer (Work-from-Home)
Reply from a uj5u.com user:
The problem is that not every <span> contains the job title, so you have to check whether the <span> tag has a title attribute.
Instead of:
job_title=item.find('span').text
use:
job_title = item.find(lambda tag: tag.name == "span" and "title" in tag.attrs).text
Or use a CSS selector:
job_title = item.select_one("span[title]").text
Output:
Back-End Developer | Python/Django
Python Developer
Python Developer
Jr. Web Developer
PYTHON DEVELOPER
Front-End Developer - Consultant - Digital Customer - Philip...
...
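To try the selector without hitting the live site, here is a minimal sketch that runs against a small hand-written HTML snippet. The markup in SAMPLE_HTML and the helper name extract_titles are assumptions made up for illustration; the real page structure may differ and can change over time. The sketch also skips cards that have no span with a title attribute, because select_one returns None in that case and calling .text on it would raise an AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure described above;
# it is NOT copied from the real site.
SAMPLE_HTML = """
<div class="slider_container">
  <h2><span class="label">new</span><span title="Python Developer">Python Developer</span></h2>
</div>
<div class="slider_container">
  <h2><span title="Jr. Python Developer">Jr. Python Developer</span></h2>
</div>
<div class="slider_container">
  <h2><span class="label">new</span></h2>
</div>
"""


def extract_titles(html):
    """Return the text of the first <span title=...> in each card."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for card in soup.find_all("div", class_="slider_container"):
        span = card.select_one("span[title]")
        if span is None:
            # No span with a title attribute in this card; skip it
            # instead of failing on None.text.
            continue
        titles.append(span.get_text(strip=True))
    return titles


print(extract_titles(SAMPLE_HTML))
# ['Python Developer', 'Jr. Python Developer']
```

The third card is dropped because it only contains the "new" badge, which has no title attribute.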
Reply from a uj5u.com user:
You can use an if condition to exclude the word "new".
Try this:
from bs4 import BeautifulSoup as bs
import requests


def extract(page):
    url = f'https://ph.indeed.com/jobs?q=python developer&l=Manila&start={page}'
    r = requests.get(url)
    soup = bs(r.content, 'lxml')
    return soup


def transform(soup):
    results = soup.find_all('div', class_='slider_container')
    for item in results:
        job_title = item.find('span').text
        if job_title != 'new':  # <<<----- Edited line here!
            print(job_title)


c = extract(0)
transform(c)
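One thing to keep in mind with this approach: item.find('span') only returns the first span in each card, so when that span is the "new" badge, nothing is printed for that card at all. If the title sits in a later span of the same card (an assumption about the page, not something verified here), a variation is to walk through all spans and take the first one that is not a badge. The sketch below uses made-up markup, and the names BADGES and first_non_badge_text are hypothetical.

```python
from bs4 import BeautifulSoup

# Assumed set of badge texts to ignore; extend it if the site shows others.
BADGES = {"new"}

# Hypothetical markup for illustration only; the real page may differ.
SAMPLE_HTML = """
<div class="slider_container">
  <span>new</span><span>Python Developer</span>
</div>
<div class="slider_container">
  <span>Jr. Python Developer</span>
</div>
<div class="slider_container">
  <span>new</span><span>Back End Developer (Work-from-Home)</span>
</div>
"""


def first_non_badge_text(card):
    """Return the text of the first span that is not a badge, or None."""
    for span in card.find_all("span"):
        text = span.get_text(strip=True)
        if text and text.lower() not in BADGES:
            return text
    return None


def transform(html):
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for card in soup.find_all("div", class_="slider_container"):
        title = first_non_badge_text(card)
        if title is not None:
            titles.append(title)
    return titles


for job_title in transform(SAMPLE_HTML):
    print(job_title)
```

With the sample markup this prints all three titles, including the two from cards that start with a "new" badge.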
