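I am new to web scraping with Python. I wrote code that uses Selenium and BeautifulSoup to extract data from a job portal. My process is:
- Scrape all the job-posting links from the job portal.
- Loop over those links and scrape the details from each job posting.
I scrape the details by calling BeautifulSoup's find_all method on script tags with type='application/ld+json' and data-react-helmet. But I get the error "list index out of range". Does anyone know how to fix it?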
import json

import pandas as pd
import requests
from bs4 import BeautifulSoup

# URL_job_list is the list of job-posting links collected in the first step.
job_main_data = pd.DataFrame()
for i, url in enumerate(URL_job_list):
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
        'referrer': 'https://google.com',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9',
        'Pragma': 'no-cache',
    }
    response = requests.get(url=url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    script_tags = soup.find_all('script', attrs={'data-react-helmet': 'true', 'type': 'application/ld+json'})
    # This line raises "IndexError: list index out of range" when script_tags is empty.
    metadata = script_tags[-1].text
    temp_dict = {}
    try:
        job_info_json = json.loads(metadata, strict=False)
        try:
            jobID = job_info_json['identifier']['value']
            temp_dict['Job ID'] = jobID
            print('Job ID = ' + jobID)
        except (KeyError, TypeError):  # a missing key raises KeyError, not AttributeError
            jobID = ''
        try:
            jobTitle = job_info_json['title']
            temp_dict['Job Title'] = jobTitle
            print('Title = ' + jobTitle)
        except (KeyError, TypeError):
            jobTitle = ''
        try:
            occupationalCategory = job_info_json['occupationalCategory']
            temp_dict['occupationalCategory'] = occupationalCategory
            print('Occupational Category = ' + occupationalCategory)
        except (KeyError, TypeError):
            occupationalCategory = ''
        # Store the link of this posting, not the whole list of links.
        temp_dict['Job Link'] = url
        # DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
        job_main_data = pd.concat([job_main_data, pd.DataFrame([temp_dict])], ignore_index=True)
    except json.JSONDecodeError:
        print("Empty response")
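The "list index out of range" message comes from script_tags[-1]: when find_all matches nothing, it returns an empty list, and indexing an empty list raises IndexError. The following is only a minimal sketch of a guard around that lookup, not a fix for the portal itself. The function name extract_job_posting and the two sample pages are hypothetical, and the sketch uses the built-in "html.parser" instead of lxml so that it has one dependency fewer.

```python
import json

from bs4 import BeautifulSoup


def extract_job_posting(html):
    """Return the last matching JSON-LD block as a dict, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    script_tags = soup.find_all(
        "script",
        attrs={"data-react-helmet": "true", "type": "application/ld+json"},
    )
    if not script_tags:
        # Empty list: script_tags[-1] would raise IndexError here.
        return None
    raw = script_tags[-1].string or ""
    try:
        return json.loads(raw, strict=False)
    except json.JSONDecodeError:
        # The tag exists but does not contain valid JSON.
        return None


# Hypothetical sample pages, used only to show both outcomes.
page_with_data = (
    '<html><head><script data-react-helmet="true" type="application/ld+json">'
    '{"title": "Data Engineer", "identifier": {"value": "123"}}'
    "</script></head></html>"
)
page_without_data = '<html><head><script src="app.js"></script></head></html>'

job = extract_job_posting(page_with_data)
print(job["title"] if job else "no JSON-LD found")
print(extract_job_posting(page_without_data))
```

Inside the loop, a None result can be used to skip the URL (or log it) instead of letting the whole run stop on the first page without a matching script tag.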
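Reply from a uj5u.com user:
The data is loaded dynamically by JavaScript from an API call that returns JSON, so you can get all the data you want directly from that API. Below is an example of how to extract the data from the API using only the requests module.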
import json

import requests

# The facets and facetFilters values contain double quotes,
# so the params strings are single-quoted.
payload = {
    "requests": [
        {
            "indexName": "job_postings",
            "params": 'query=&hitsPerPage=20&maxValuesPerFacet=1000&page=0&facets=["*","city.work_country_name","position.name","industries.vertical_name","experience","job_type.name","is_salary_visible","has_equity","currency.currency_code","salary_min","taxonomies.slug"]&tagFilters=&facetFilters=[["city.work_country_name:Indonesia"]]'
        },
        {
            "indexName": "job_postings",
            "params": 'query=&hitsPerPage=1&maxValuesPerFacet=1000&page=0&attributesToRetrieve=[]&attributesToHighlight=[]&attributesToSnippet=[]&tagFilters=&analytics=false&clickAnalytics=false&facets=city.work_country_name'
        }
    ]
}
headers = {'content-type': 'application/x-www-form-urlencoded'}
api_url = "https://219wx3mpv4-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia for vanilla JavaScript 3.30.0;JS Helper 2.26.1&x-algolia-application-id=219WX3MPV4&x-algolia-api-key=b528008a75dc1c4402bfe0d8db8b3f8e"
jsonData = requests.post(api_url, data=json.dumps(payload), headers=headers).json()
# print(jsonData)
# The first query's hits hold the job postings.
for item in jsonData['results'][0]['hits']:
    title = item['_highlightResult']['title']['value']
    company = item['_highlightResult']['company']['name']['value']
    skill = item['_highlightResult']['job_skills'][0]['name']['value']
    salary_max = item['salary_max']
    salary_min = item['salary_min']
    print(title)
    print(company)
    print(skill)
    print(salary_max)
    print(salary_min)
Output:
Corporate PR
Rocketindo
Sales Strategy & Management
12000000
7000000
Social Media Specialist
Rocketindo
Content Marketing
12000000
7000000
Performance Marketing Analyst (Mama's Choice)
The Parent Inc (theAsianparent)
Marketing Strategy
12000000
5000000
Business Development (Associate Consultant) - CRM
Mekari (PT. Mid Solusi Nusantara)
Business Development & Partnerships
7000000
5000000
Account Payable
Ritase
Corporate Finance
0
0
Data Engineer
Topremit
Databases
0
0
Public Relation KOL
Rocketindo
Business Development & Partnerships
7000000
5000000
Graphic Designer
Rocketindo
Adobe Illustrator
12000000
7000000
Yogyakarta City Coordinator
Deliveree Indonesia
Business Operations
6000000
5250000
Marketing Manager
Deliveree Indonesia
Marketing Strategy
0
0
Graphic Designer
Deliveree Indonesia
Graphic Design
6000000
5250000
Quality Assurance
PT Rekeningku Dotcom Indonesia
Javascript
10000000
4500000
Internship Program
TADA
Attention to Detail
3700000
3000000
Product Management Support
Hangry
Data Warehouse
0
0
Content Writer
Bobobox Indonesia
Copywriting
0
0
UX Researcher
Bobobox Indonesia
UI/UX Design
0
0
UX Copywriter
Bobobox Indonesia
Problem Solving
0
0
Internship HR (Recruitment)
PT Formasi Agung Selaras (Famous Allstars)
Human Resources
1500000
1000000
Fullstack Developer - Banking Industry
SIGMATECH
React.js
12000000
8000000
REACT NATIVE DEVELOPER
BGT Solution
MySQL
16000000
6000000
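The loop above indexes straight into each hit, so a posting with an empty job_skills list would fail with the same "list index out of range" error as in the question. As a rough sketch, the hits could be flattened into plain dictionaries with None for anything missing. The helper name hit_to_row and the sample data are hypothetical: the first sample hit reuses values from the output above, and the second is invented to show the empty-skills case.

```python
def hit_to_row(hit):
    """Flatten one search hit into a flat dict, using None for missing fields."""
    highlight = hit.get("_highlightResult", {})
    skills = highlight.get("job_skills") or []
    return {
        "title": highlight.get("title", {}).get("value"),
        "company": highlight.get("company", {}).get("name", {}).get("value"),
        # Only index the list when it is non-empty.
        "first_skill": skills[0].get("name", {}).get("value") if skills else None,
        "salary_min": hit.get("salary_min"),
        "salary_max": hit.get("salary_max"),
    }


# Hypothetical fragment shaped like jsonData['results'][0]['hits'].
sample_hits = [
    {
        "_highlightResult": {
            "title": {"value": "Data Engineer"},
            "company": {"name": {"value": "Topremit"}},
            "job_skills": [{"name": {"value": "Databases"}}],
        },
        "salary_min": 0,
        "salary_max": 0,
    },
    {
        "_highlightResult": {
            "title": {"value": "Example Role"},
            "company": {"name": {"value": "Example Co"}},
            "job_skills": [],
        },
        "salary_min": 1000000,
        "salary_max": 2000000,
    },
]

rows = [hit_to_row(hit) for hit in sample_hits]
for row in rows:
    print(row)
```

A list of rows like this can be passed to pandas.DataFrame(rows) if a table is needed, as in the original question.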
