爬取重慶上萬條招聘資訊，看看你的薪酬在重慶屬于什么段位。-有解無憂

作為一個即將作業的人，想看看自己的薪酬在重慶處于什么段位，如果對如何實作的不感興趣，可以拉到后面，看第三部分，結果分析可視化部分，

一資料爬取部分

手里沒有招聘資訊，我們需要從網上爬取，

一.目標站點分析
資料來源:51job
網站超鏈接：51job重慶

二.爬取單頁網頁html
采用python +requests

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Author: yudengwu(余登武）
# @Date  : 2021/2/7
#@email:1344732766@qq.com
import requests

def get_page(url,page):#獲取請求并回傳決議頁面，offest,keyword為可變引數

    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
    try:
        response = requests.get(url,headers = headers,timeout=10)
        response.encoding = 'gbk'  # 加編碼，重要！轉換為字串編碼，read()得到的是byte格式的
        if response.status_code == 200 :
            return response.text
    except ConnectionError:
        print('程式錯誤')
        return None

def main():
    for i in range(1, 2):  # 遍歷網頁1
       base_url = 'https://jobs.51job.com/chongqing/p'+str(i)+'/'
       html=get_page(base_url,i)
       print(html)
if __name__ =='__main__':
    main()

和瀏覽器里的html一樣

三.網頁資訊提取
BeautifulSoup+re正則

一條招聘資訊對應html區域如圖
對前端的css有些熟悉，所以用 BeautifulSoup中select來選擇資訊，
教程見鏈接
爬蟲筆記：BeautifulSoup詳解
該部分的 BeautifulSoup 選擇代碼為items=soup.select('.maincenter .mcon .left .detlist.gbox .e')

打開一個招聘資訊查看

根據上圖，可以慢慢確定資訊提取代碼
如下
再遍歷每一個招聘資訊for item in items:#遍歷每一條招聘資訊
職位選擇position=item.select('.info .title a')[0]
公司名company_name=item.select('.info a')[1]
作業地workplace=item.select('.info .location.name')
工資salary=item.select('.info .location')[1]
發布日期release_time=item.select('.info .time')[0]
學歷要求degree=item.select('.order')[0] degree=degree.get_text() degree_study=re.split(r'\|',degree)[0]
作業經驗work_experience=re.split(r'\|',degree)[1]
公司性質nature_company=re.split(r'\|',degree)[2]
公司規模company_size=re.split(r'\|',degree)[3]
崗位描述JobDescribe=item.select('.text')[0]

爬蟲部分總代碼
開啟多執行緒爬蟲

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Author: yudengwu(余登武）
# @Date  : 2021/2/7
#@email:1344732766@qq.com
import requests
from bs4 import BeautifulSoup
import re
import numpy as np
import pandas as pd
import time
from requests.exceptions import  RequestException
def get_page(url,page):#獲取請求并回傳決議頁面，offest,keyword為可變引數

    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
    try:
        response = requests.get(url,headers = headers,timeout=10)
        response.encoding = 'gbk'  # 加編碼，重要！轉換為字串編碼，read()得到的是byte格式的
        if response.status_code == 200 :
            return response.text
    except RequestException:
        print('程式錯誤')
        return None

#決議網頁

def pase_page(url,page):
    html = get_page(url, page)
    html = str(html)
    if html is not None:
        soup = BeautifulSoup(html, 'lxml')
        "--先選擇招聘資訊,得到的是一組招聘資訊--"
        items=soup.select('.maincenter .mcon .left .detlist.gbox .e')
        #print(items)
        for item in items:#遍歷每一條招聘資訊
            '職位'
            position=item.select('.info .title a')[0]
            position = position.get_text()
            #print(position)

            '公司名'
            company_name=item.select('.info a')[1]
            company_name=company_name.get_text()
            #print(company_name)
            '作業地'
            workplace=item.select('.info .location.name')
            workplace=' '.join(i.get_text()for i in workplace)
            #print(workplace)
            '工資'
            salary=item.select('.info .location')[1]
            #salary=' '.join(i.get_text()for i in salary)
            salary=salary.get_text()
            # print(salary)
            '發布日期'
            release_time=item.select('.info .time')[0]
            release_time=release_time.get_text()
            #print(release_time)
            '學歷要求'
            degree=item.select('.order')[0]
            degree=degree.get_text()
            degree_study=re.split(r'\|',degree)[0]
            #print(degree_study)
            '作業經驗'
            work_experience=re.split(r'\|',degree)[1]
            #print(work_experience)
            '公司性質'
            nature_company=re.split(r'\|',degree)[2]
            #print(nature_company)
            '公司規模'
            company_size=re.split(r'\|',degree)[3]
            #print(company_size)
            '崗位描述'
            JobDescribe=item.select('.text')[0]
            JobDescribe=JobDescribe.get_text()
            #print(JobDescribe)

            '-------------寫入表格-------------------'

            information = [position,company_name,workplace,salary,release_time,degree_study,work_experience,nature_company,company_size,JobDescribe]
            information = np.array(information)
            information = information.reshape(-1, 10)
            information = pd.DataFrame(information, columns=['position','company_name','workplace','salary','release_time','degree_study','work_experience','nature_company','company_size','JobDescribe'])

            information.to_csv('51job重慶招聘資料.csv', mode='a+', index=False, header=False)  # mode='a+'追加寫入
        print('第{0}頁存盤資料成功'.format(page))
    else:
        print('決議失敗')


#多執行緒爬取
import threading

def main():
    for i in range(1, 4997,3):  # 遍歷網頁1-5000
       base_url1 = 'https://jobs.51job.com/chongqing/p'+str(i)+'/'
       base_url2 = 'https://jobs.51job.com/chongqing/p' + str(i+1) + '/'
       base_url3 = 'https://jobs.51job.com/chongqing/p' + str(i + 2) + '/'
       t1 = threading.Thread(target=pase_page, args=(base_url1, i))  # 執行緒1
       t2 = threading.Thread(target=pase_page, args=(base_url2, i+1))  # 執行緒2
       t3 = threading.Thread(target=pase_page, args=(base_url3, i + 2))  # 執行緒3
       t1.start()
       t2.start()
       t3.start()

if __name__ =='__main__':
    main()

有些頁由于網頁反映快慢問題或者其他問題沒有抓取下來，不過對資料集影響不大，至少我們抓取的資料足夠多近2萬條，

二.資料處理部分

得到的資料表格，里面有些資料需要處理，如工資有的是按千計算，有的是按萬計算，有的按月計算，有的按年計算，給我們分析帶來很大困擾，我還想把學歷要求，公式規模，公式性質這些多余的字去掉，

為了方便處理，首先給表格添加欄位

工資處理

import pandas as pd
import re
import numpy as np
#讀取資料
csv=pd.read_csv(r'D:\deeplearn\xuexicaogao\51job重慶招聘資料.csv',error_bad_lines=False,encoding='utf-8')
#洗掉空值
csv.dropna(axis=0, how='any', inplace=True)

'----工資處理---'
def get_money_max_min(x):
    try:
        if x[-3] == "萬":
            z = [float(i)*10000 for i in re.findall("[0-9]+\.?[0-9]*",x)]
        elif x[-3] == "千":
            z = [float(i) * 1000 for i in re.findall("[0-9]+\.?[0-9]*", x)]
        if x[-1] == "年":
            z = [i/12 for i in z]
        return z
    except:
        return x
salary = csv["salary"].apply(get_money_max_min)#得到的工資為[3000.0, 4500.0]是一個區間，對應我們常說的3-4.5千、月
csv['最低工資']=salary.str[0]#16446

csv['最高工資']=salary.str[1]
csv['工資水平']=csv[["最低工資","最高工資"]].mean(axis=1)
csv.to_csv('51job重慶招聘資料1.csv')

處理后續代碼

import pandas as pd
import re
import numpy as np
import jieba
#讀取資料
csv=pd.read_csv(r'D:\deeplearn\xuexicaogao\51job重慶招聘資料1.csv',error_bad_lines=False,encoding='gbk',header=0)
#洗掉空值
csv.dropna(axis=0, how='any', inplace=True)

#學歷處理
csv['degree_study']=csv['degree_study'].apply(lambda x:re.findall("本科|大專|應屆生|在校生|碩士|博士",x))
def func(x):
    if len(x) == 0:
        return np.nan
    elif len(x) == 1 or len(x) == 2:
        return x[0]
    else:
        return x[2]
csv["degree_study"] = csv["degree_study"].apply(func)
csv["degree_study"].value_counts()

'公司規模處理'
def func1(x):
    if x == '公司規模：少于50人':
        return "<50"
    elif x == '公司規模：50-150人':
        return "50-150"
    elif x == '公司規模：150-500人':
        return '150-500'
    elif x == '公司規模：500-1000人':
        return '500-1000'
    elif x == '公司規模：1000-5000人':
        return '1000-5000'
    elif x == '公司規模：5000-10000人':
        return '5000-10000'
    elif x == '公司規模：10000人以上':
        return ">10000"
    else:
        return np.nan
csv["company_size"] = csv["company_size"].apply(func1)

'公司型別理'
#csv['nature_company']=csv.loc[csv['nature_company'].apply(lambda x:len(x)<6),"公司型別"] = np.nan
csv['nature_company']=csv['nature_company'].apply(lambda x:re.sub("公司性質：","",x))
'作業經驗處理'
csv['work_experience']=csv['work_experience'].apply(lambda x:re.sub("作業經驗：","",x))

'對崗位描述進行分詞，為后續繪制詞云圖做好基礎'
csv['崗位描述詞']=csv['JobDescribe'].apply(lambda x:" ".join([w for w in list(jieba.cut(x)) if len(w)>1]))
#洗掉空值
csv.dropna(axis=0, how='any', inplace=True)

csv.to_csv('51job重慶招聘資料2.csv')

三結果分析可視化部分

1.工資水平分析
如一家公司給出的報酬是6-8千/月，則我們認為工資水平是7千/月

import pandas as pd

df = pd.read_csv('51job重慶招聘資料2.csv',encoding="utf-8")

"---工資水平分析---"
money=df['工資水平']
print(money.describe())

平均工資9304元，我快拖后腿啦，被灰心，能把招聘資訊放在求職網上的公司都是好公司，所有平均工資偏高，

在表格里面查找最高工資的是什么單位，干什么的，
專案經理300-450萬/年，5-7年作業經驗，招聘的是CEO，

import pandas as pd
import matplotlib.pyplot as plt
from pylab import *
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
df = pd.read_csv('51job重慶招聘資料2.csv',encoding="utf-8")

#頻率分析，各階段工資占比

money1=df[df['工資水平']<6000]['position'].count()/len(df) #小于6000的占比
money2=df[(df['工資水平']>=6000) &(df['工資水平']<10000)]['position'].count()/len(df) #大于6000，小于10000的占比
money3=df[(df['工資水平']>=10000) &(df['工資水平']<15000)]['position'].count()/len(df) #大于10000，小于15000的占比
money4=df[(df['工資水平']>=15000) &(df['工資水平']<20000)]['position'].count()/len(df) #大于15000，小于20000的占比
money5=df[df['工資水平']>=20000]['position'].count()/len(df) #大于20000的占比

labels = ['小于6000', '大于6000小于10000', '大于10000小于15000', '大于15000小于20000','大于20000']
sizes = [money1,money2,money3,money4,money5]
explode = (0.1, 0.1, 0, 0,0)
plt.pie(sizes, explode=explode,labels=labels, autopct = '%1.1f%%',
        shadow= False, startangle =90)
plt.title('工資分布情況')
plt.show()

由圖可以看到70.8%的人月工資少于1萬，等于說4個人中有3個人月工資少于1萬，月工資大于1萬5的人就是top10%啦，加油，小伙，

2.學歷與工資關系
學歷要求為大專（不要多想，應該是以上）

#大專工資水平
junior_collegey=df[df['degree_study']=='大專']
juniormoney1=junior_collegey[junior_collegey['工資水平']<6000]['position'].count()/len(junior_collegey) #小于6000的占比
juniormoney2=junior_collegey[(junior_collegey['工資水平']>=6000) &(junior_collegey['工資水平']<10000)]['position'].count()/len(junior_collegey) #大于6000，小于10000的占比
juniormoney3=junior_collegey[(junior_collegey['工資水平']>=10000) &(junior_collegey['工資水平']<15000)]['position'].count()/len(junior_collegey) #大于10000，小于15000的占比
juniormoney4=junior_collegey[(junior_collegey['工資水平']>=15000) &(junior_collegey['工資水平']<20000)]['position'].count()/len(junior_collegey) #大于15000，小于20000的占比
juniormoney5=junior_collegey[junior_collegey['工資水平']>=20000]['position'].count()/len(junior_collegey) #大于20000的占比
juniormean=df[df['degree_study']=='大專']['工資水平'].mean()#平均工資

labels = ['小于6000', '大于6000小于10000', '大于10000小于15000', '大于15000小于20000','大于20000']
sizes = [juniormoney1,juniormoney2,juniormoney3,juniormoney4,juniormoney5]
explode = (0.1, 0.1, 0, 0,0)
plt.pie(sizes, explode=explode,labels=labels, autopct = '%1.1f%%',
        shadow= False, startangle =90)
plt.title('大專工資分布情況,大專平均工資為{0}'.format(juniormean))
plt.show()

大專77.3%的人月收入低于1萬，平均月工資8464

本科，本科的代碼直接將大專代碼中的大專修改為本科即可
本科62.1%的人月收入低于1萬，平均工資10487

碩士
碩士64%的人月收入低于1萬，平均工資12408，

博士
博士月收入就沒有低于1萬的，平均工資23022，

讀書還是有好處，平均工資，本科比大專多2000，碩士比本科多2000，博士是碩士的2倍，
讀書對于我們這些窮苦人出身的還是最好的出路，

3.公司性質與工資關系
民營公司

nature_company=df[df['nature_company']=='民營公司']
money1=nature_company[nature_company['工資水平']<6000]['position'].count()/len(nature_company) #小于6000的占比
money2=nature_company[(nature_company['工資水平']>=6000) &(nature_company['工資水平']<10000)]['position'].count()/len(nature_company) #大于6000，小于10000的占比
money3=nature_company[(nature_company['工資水平']>=10000) &(nature_company['工資水平']<15000)]['position'].count()/len(nature_company) #大于10000，小于15000的占比
money4=nature_company[(nature_company['工資水平']>=15000) &(nature_company['工資水平']<20000)]['position'].count()/len(nature_company) #大于15000，小于20000的占比
money5=nature_company[nature_company['工資水平']>=20000]['position'].count()/len(nature_company) #大于20000的占比
mean=df[df['nature_company']=='民營公司']['工資水平'].mean()#平均工資
labels = ['小于6000', '大于6000小于10000', '大于10000小于15000', '大于15000小于20000','大于20000']
sizes = [money1,money2,money3,money4,money5]
explode = (0.1, 0.1, 0, 0,0)
plt.pie(sizes, explode=explode,labels=labels, autopct = '%1.1f%%',
        shadow= False, startangle =90)
plt.title('民營公司工資分布情況,平均工資為{0}'.format(mean))
plt.show()

民營公司73.1%月收入低于1萬，平均月收入9156

合資公司
合資公司73%月收入低于1萬，平均月收入8891

上市公司
上市公司64.3%月收入低于1萬，平均月收入10038

外資公司和非歐美外資公司

shuju=['外資','外資（非歐美）','外資（歐美）']
nature_company=df[df['nature_company'].isin(shuju)]
money1=nature_company[nature_company['工資水平']<6000]['position'].count()/len(nature_company) #小于6000的占比
money2=nature_company[(nature_company['工資水平']>=6000) &(nature_company['工資水平']<10000)]['position'].count()/len(nature_company) #大于6000，小于10000的占比
money3=nature_company[(nature_company['工資水平']>=10000) &(nature_company['工資水平']<15000)]['position'].count()/len(nature_company) #大于10000，小于15000的占比
money4=nature_company[(nature_company['工資水平']>=15000) &(nature_company['工資水平']<20000)]['position'].count()/len(nature_company) #大于15000，小于20000的占比
money5=nature_company[nature_company['工資水平']>=20000]['position'].count()/len(nature_company) #大于20000的占比
mean=df[df['nature_company'].isin(shuju)]['工資水平'].mean()#平均工資
labels = ['小于6000', '大于6000小于10000', '大于10000小于15000', '大于15000小于20000','大于20000']
sizes = [money1,money2,money3,money4,money5]
explode = (0.1, 0.1, 0, 0,0)
plt.pie(sizes, explode=explode,labels=labels, autopct = '%1.1f%%',
        shadow= False, startangle =90)
plt.title('外資公司工資分布情況,平均工資為{0}'.format(mean))
plt.show()

外資公司 67%的人月收入低于1萬，平均月收入9225

國企
國企56%的人月收入低于1萬，平均月收入10462

4.查看公司地址分布

workplace=df.groupby('workplace').count()
workplace=workplace['position'].sort_values()
workplace=pd.DataFrame(workplace)
workplace.plot(kind='barh')
plt.title('公司區域分布')
plt.show()

公司還是集中在主城，一些小縣城，招聘資訊只有個位數，

5.作業描述詞云圖

text = df['崗位描述詞']
text = (' '.join(map(str, text))).split(' ')
#print(text)

from collections import Counter
from wordcloud import WordCloud

def generate_wordcloud(tup):
    wordcloud = WordCloud(background_color='white',
                          font_path='simhei.ttf',
                          max_words=50, max_font_size=40,
                          random_state=42
                          ).generate(str(tup))
    return wordcloud
most300 = Counter(text).most_common(300)#10個高頻詞
plt.imshow(generate_wordcloud(most300), interpolation="bilinear")
plt.title("Top 300", fontsize=30)
plt.show()

沒有進行停用詞去除（主要是想減少作業量），詞云圖得到的資訊不是很滿意，
從停用詞中發現：公司對溝通能力，經驗，學歷，組織能力，專業相關比較看重，

在這里插入圖片描述
通過對這篇文章，我了解到啦我處于哪個段位啦，算是比較優秀吧
小伙子加油

作者：電氣余登武，原創屬實不易，請一鍵三連（點贊，收藏，關注）再走

轉載請註明出處，本文鏈接：https://www.uj5u.com/houduan/258188.html

標籤：python

上一篇：中國天氣網爬蟲資料可視化

下一篇：用Python分析了5萬條相親網站資料，看相親男女畫像