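Author: 韓信子@ShowMeAI
Data Analysis in Practice series: https://www.showmeai.tech/tutorials/40
AI Jobs & Strategy series: https://www.showmeai.tech/tutorials/47
Article URL: https://www.showmeai.tech/article-detail/402
Notice: All rights reserved. To reprint, contact the platform and the author and credit the source.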
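Introduction
Data science keeps gaining popularity across the internet, healthcare, telecom, retail, sports, aviation, the arts, and many other fields. On Glassdoor's list of the best jobs in America, data scientist ranks third, with nearly 10,071 job openings in 2022.
Beyond the appeal of the data itself, salaries for data science roles draw a lot of attention. In this article, ShowMeAI uses data to examine the following questions:
- What are the top jobs in data science?
- Which country pays the most and offers the most opportunities?
- What is the typical salary range?
- How much does job level matter for data scientists?
- Data science: full-time vs. freelance
- Which job in data science pays the most?
- Which job in data science has the highest average salary?
- Minimum and maximum salaries in data science
- What size are the companies that hire data science professionals?
- Is salary related to company size?
- What is the ratio of WFH (working from home) to WFO (working from office)?
- How do data science salaries grow year over year?
- If someone is looking for a data science job, what would you suggest they search for online?
- If you have a few years of junior experience, what size of company should you consider moving to?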
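Data description
The dataset used here is the Data Science Job Salaries dataset, which can be downloaded from ShowMeAI's Baidu Netdisk link.
Dataset download (Baidu Netdisk): reply 『實戰』 to the WeChat official account 『ShowMeAI研究中心』, or use the download link in the original post to get the 『ds_salaries dataset』 for this article, [37] Data Scientist Salary Analysis and Visualization with pandasql and plotly.
ShowMeAI official GitHub: https://github.com/ShowMeAI-Hub
The dataset contains 11 columns, with the following names and meanings: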
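| Column | Meaning |
|---|---|
| work_year | Year the salary was paid |
| experience_level | Experience level at the time of payment |
| employment_type | Type of employment |
| job_title | Job title |
| salary | Total gross salary paid |
| salary_currency | Currency of the salary paid |
| salary_in_usd | Salary converted to USD |
| employee_residence | Employee's primary country of residence |
| remote_ratio | Overall amount of work done remotely |
| company_location | Country/region of the employer's main office |
| company_size | Company size by number of employees |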
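This analysis uses Pandas and SQL. For systematic study and hands-on practice, see ShowMeAI's data analysis tutorials and the related cheat-sheet articles:
- Illustrated Data Analysis: From Beginner to Expert tutorial series
- Programming Language Cheat Sheets | SQL Cheat Sheet
- Data Science Tool Cheat Sheets | Pandas Cheat Sheet
- Data Science Tool Cheat Sheets | Matplotlib Cheat Sheet
Import libraries
First we import the libraries we need. We use pandas to read the data and Plotly and matplotlib for visualization. Because this article does its analysis in SQL, we also use the pandasql library.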
# For loading data
import pandas as pd
import numpy as np
# For SQL queries
import pandasql as ps
# For plotting graphs / visualization
import plotly.graph_objects as go
import plotly.express as px
from plotly.offline import iplot
import plotly.figure_factory as ff
import plotly.io as pio
import seaborn as sns
import matplotlib.pyplot as plt
# To show graph below the code or on same notebook
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
# To convert country code to country name
import country_converter as coco
import warnings
warnings.filterwarnings('ignore')
Load the dataset
The dataset we downloaded is in CSV format, so we can read it with the read_csv method.
# Loading data
salaries = pd.read_csv('ds_salaries.csv')
To view the first five records, we can call salaries.head().
The same task with pandasql looks like this:
# Helper that runs a SQL query against the DataFrames in scope
def query(sql):
    return ps.sqldf(sql)
# Showing Top 5 rows of data
query("""
SELECT *
FROM salaries
LIMIT 5
""")
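By default pandasql runs queries against an in-memory SQLite database, so the LIMIT query above can be reproduced with the standard-library sqlite3 module. The sketch below assumes a small, hypothetical set of rows in place of ds_salaries.csv and keeps only four of the eleven columns.

```python
import sqlite3

# Hypothetical toy rows standing in for ds_salaries.csv (not the real data).
rows = [
    (2020, "MI", "Data Scientist", 79833),
    (2020, "SE", "Machine Learning Scientist", 260000),
    (2021, "EN", "Data Analyst", 72000),
    (2021, "SE", "Data Engineer", 150000),
    (2022, "MI", "Data Analyst", 90000),
    (2022, "EX", "Data Engineer", 230000),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE salaries (work_year INTEGER, experience_level TEXT, "
    "job_title TEXT, salary_in_usd INTEGER)"
)
conn.executemany("INSERT INTO salaries VALUES (?, ?, ?, ?)", rows)

# Same statement as in the article: return at most five rows.
first_five = conn.execute("SELECT * FROM salaries LIMIT 5").fetchall()
conn.close()

print(len(first_five))
```

Without an ORDER BY clause, SQL does not promise which five rows come back, so LIMIT alone is best treated as a quick peek at the data.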
Data preprocessing
The first column of our dataset, "Unnamed: 0", is of no use, so we drop it before the analysis:
salaries = salaries.drop('Unnamed: 0', axis = 1)
Let's check the dataset for missing values:
salaries.isna().sum()
Output:
work_year 0
experience_level 0
employment_type 0
job_title 0
salary 0
salary_currency 0
salary_in_usd 0
employee_residence 0
remote_ratio 0
company_location 0
company_size 0
dtype: int64
The dataset has no missing values, so no missing-value handling is needed. employee_residence and company_location use short country codes, which we map to full country names for readability:
# Converting countries code to country names
salaries["employee_residence"] = coco.convert(names=salaries["employee_residence"], to="name")
salaries["company_location"] = coco.convert(names=salaries["company_location"], to="name")
The experience_level column in this dataset uses the following abbreviations:
- EN: Entry Level
- MI: Mid level
- SE: Senior Level
- EX: Expert Level
To make them easier to understand, we replace these abbreviations with the full names.
# Replacing values in column - experience_level :
salaries['experience_level'] = query("""SELECT
REPLACE(
REPLACE(
REPLACE(
REPLACE(
experience_level, 'MI', 'Mid level'),
'SE', 'Senior Level'),
'EN', 'Entry Level'),
'EX', 'Expert Level')
FROM
salaries""")
In the same way, we replace the employment types with their full names:
- FT: Full Time
- PT: Part Time
- CT: Contract
- FL: Freelance
# Replacing values in column - employment_type :
salaries['employment_type'] = query("""SELECT
REPLACE(
REPLACE(
REPLACE(
REPLACE(
employment_type, 'PT', 'Part Time'),
'FT', 'Full Time'),
'FL', 'Freelance'),
'CT', 'Contract')
FROM
salaries""")
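Nested REPLACE calls match substrings, so they work here only because no full name contains another code. As an alternative, a CASE expression matches whole values. The sketch below uses the standard-library sqlite3 module and a hypothetical four-row table to show the idea for experience_level.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (experience_level TEXT)")
# Hypothetical rows: one per experience code.
conn.executemany(
    "INSERT INTO salaries VALUES (?)", [("EN",), ("MI",), ("SE",), ("EX",)]
)

labels = [
    row[0]
    for row in conn.execute(
        """
        SELECT CASE experience_level
                   WHEN 'EN' THEN 'Entry Level'
                   WHEN 'MI' THEN 'Mid level'
                   WHEN 'SE' THEN 'Senior Level'
                   WHEN 'EX' THEN 'Expert Level'
                   ELSE experience_level   -- leave unknown codes untouched
               END
        FROM salaries
        ORDER BY rowid
        """
    )
]
conn.close()

print(labels)
```

The same pattern would apply to employment_type and company_size by swapping in their codes and names.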
The company size field is handled as follows:
- S: Small
- M: Medium
- L: Large
# Replacing values in column - company_size :
salaries['company_size'] = query("""SELECT
REPLACE(
REPLACE(
REPLACE(
company_size, 'M', 'Medium'),
'L', 'Large'),
'S', 'Small')
FROM
salaries""")
We also relabel the remote ratio field to make it easier to read:
# Replacing values in column - remote_ratio :
salaries['remote_ratio'] = query("""SELECT
REPLACE(
REPLACE(
REPLACE(
remote_ratio, '100', 'Fully Remote'),
'50', 'Partially Remote'),
'0', 'Non Remote Work')
FROM
salaries""")
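The order of the nested REPLACE calls matters for remote_ratio, because the code '0' also appears inside '100' and '50'. The sketch below, using the standard-library sqlite3 module, compares the order used above with a hypothetical reversed order on the value 100.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def relabel(sql_expr, value):
    """Evaluate a SQL expression with :v bound to value."""
    return conn.execute(f"SELECT {sql_expr}", {"v": value}).fetchone()[0]

# Order used in the article: the longest code ('100') is replaced first.
safe = ("REPLACE(REPLACE(REPLACE(:v, '100', 'Fully Remote'), "
        "'50', 'Partially Remote'), '0', 'Non Remote Work')")
# Hypothetical wrong order: '0' is replaced first and hits the zeros in '100'.
unsafe = ("REPLACE(REPLACE(REPLACE(:v, '0', 'Non Remote Work'), "
          "'50', 'Partially Remote'), '100', 'Fully Remote')")

safe_result = relabel(safe, 100)
unsafe_result = relabel(unsafe, 100)
conn.close()

print(safe_result)
print(unsafe_result)
```

With the reversed order, the two zeros in 100 are rewritten before the '100' rule can match, so the label is corrupted.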
That completes the preprocessing.
Data analysis & visualization
Which data science jobs are the most common?
top10_jobs = query("""
SELECT job_title,
Count(*) AS job_count
FROM salaries
GROUP BY job_title
ORDER BY job_count DESC
LIMIT 10
""")
We draw a bar chart to make this easier to see:
data = go.Bar(x = top10_jobs['job_title'], y = top10_jobs['job_count'],
text = top10_jobs['job_count'], textposition = 'inside',
textfont = dict(size = 12,
color = 'white'),
marker = dict(color = px.colors.qualitative.Alphabet,
opacity = 0.9,
line_color = 'black',
line_width = 1))
layout = go.Layout(title = {'text': "<b>Top 10 Data Science Jobs</b>",
'x':0.5, 'xanchor': 'center'},
xaxis = dict(title = '<b>Job Title</b>', tickmode = 'array'),
yaxis = dict(title = '<b>Total</b>'),
width = 900,
height = 600)
fig = go.Figure(data = data, layout = layout)
fig.update_layout(plot_bgcolor ='#f1e7d2',
paper_bgcolor = '#f1e7d2')
fig.show()
Market distribution of data science positions
fig = px.pie(top10_jobs, values='job_count',
names='job_title',
color_discrete_sequence = px.colors.qualitative.Alphabet)
fig.update_layout(title = {'text': "<b>Distribution of job positions</b>",
'x':0.5, 'xanchor': 'center'},
width = 900,
height = 600)
fig.update_layout(plot_bgcolor = '#f1e7d2',
paper_bgcolor = '#f1e7d2')
fig.show()
Countries with the most data science jobs
top10_com_loc = query("""
SELECT company_location AS company,
Count(*) AS job_count
FROM salaries
GROUP BY company
ORDER BY job_count DESC
LIMIT 10
""")
data = go.Bar(x = top10_com_loc['company'], y = top10_com_loc['job_count'],
textfont = dict(size = 12,
color = 'white'),
marker = dict(color = px.colors.qualitative.Alphabet,
opacity = 0.9,
line_color = 'black',
line_width = 1))
layout = go.Layout(title = {'text': "<b>Top 10 Data Science Countries</b>",
'x':0.5, 'xanchor': 'center'},
xaxis = dict(title = '<b>Countries</b>', tickmode = 'array'),
yaxis = dict(title = '<b>Total</b>'),
width = 900,
height = 600)
fig = go.Figure(data = data, layout = layout)
fig.update_layout(plot_bgcolor ='#f1e7d2',
paper_bgcolor = '#f1e7d2')
fig.show()
The chart above shows that the United States has the most data science job opportunities. Now let's look at salaries around the world. You can keep running the code to see the visualizations.
df = salaries
df["company_country"] = coco.convert(names = salaries["company_location"], to = 'name_short')
temp_df = df.groupby('company_country')['salary_in_usd'].sum().reset_index()
# Take the log of the aggregated totals in temp_df, not of the raw rows in df
temp_df['salary_scale'] = np.log10(temp_df['salary_in_usd'])
fig = px.choropleth(temp_df, locationmode = 'country names', locations = "company_country",
color = "salary_scale", hover_name = "company_country",
hover_data = ['salary_in_usd'],
color_continuous_scale = 'Jet',
)
fig.update_layout(title={'text':'<b>Salaries across the World</b>',
'xanchor': 'center','x':0.5})
fig.update_layout(plot_bgcolor = '#f1e7d2',
paper_bgcolor = '#f1e7d2')
fig.show()
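The map is coloured by the base-10 logarithm of the salary figure rather than by the figure itself, which keeps one very large value from flattening the colour scale for every other country. A small worked example, with hypothetical totals that are not taken from the dataset:

```python
import math

# Hypothetical per-country salary totals in USD (not taken from the dataset).
totals = {"Country A": 50_000, "Country B": 5_000_000, "Country C": 50_000_000}

# log10 turns a 1000x spread into a difference of 3 on the colour scale.
salary_scale = {name: math.log10(total) for name, total in totals.items()}

for name, scale in salary_scale.items():
    print(f"{name}: {scale:.2f}")
```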
Average salary (by currency)
df = salaries[['salary_currency','salary_in_usd']].groupby(['salary_currency'], as_index = False).mean().set_index('salary_currency').reset_index().sort_values('salary_in_usd', ascending = False)
#Selecting top 14
df = df.iloc[:14]
fig = px.bar(df, x = 'salary_currency',
y = 'salary_in_usd',
color = 'salary_currency',
color_discrete_sequence = px.colors.qualitative.Safe,
)
fig.update_layout(title={'text':'<b>Average salary as a function of currency</b>',
'xanchor': 'center','x':0.5},
xaxis_title = '<b>Currency</b>',
yaxis_title = '<b>Mean Salary</b>')
fig.update_layout(plot_bgcolor = '#f1e7d2',
paper_bgcolor = '#f1e7d2')
fig.show()
Salaries paid in US dollars are the highest on average, followed by Swiss francs and Singapore dollars.
df = salaries[['company_country','salary_in_usd']].groupby(['company_country'], as_index = False).mean().set_index('company_country').reset_index().sort_values('salary_in_usd', ascending = False)
#Selecting top 14
df = df.iloc[:14]
fig = px.bar(df, x = 'company_country',
y = 'salary_in_usd',
color = 'company_country',
color_discrete_sequence = px.colors.qualitative.Dark2,
)
fig.update_layout(title = {'text': "<b>Average salary as a function of company location</b>",
'x':0.5, 'xanchor': 'center'},
xaxis = dict(title = '<b>Company Location</b>', tickmode = 'array'),
yaxis = dict(title = '<b>Mean Salary</b>'),
width = 900,
height = 600)
fig.update_layout(plot_bgcolor = '#f1e7d2',
paper_bgcolor = '#f1e7d2')
fig.show()
Distribution of data science jobs by experience level
job_exp = query("""
SELECT experience_level, Count(*) AS job_count
FROM salaries
GROUP BY experience_level
ORDER BY job_count ASC
""")
data = go.Bar(x = job_exp['job_count'], y = job_exp['experience_level'],
orientation = 'h', text = job_exp['job_count'],
marker = dict(color = px.colors.qualitative.Alphabet,
opacity = 0.9,
line_color = 'white',
line_width = 2))
layout = go.Layout(title = {'text': "<b>Jobs on Experience Levels</b>",
'x':0.5, 'xanchor':'center'},
xaxis = dict(title='<b>Total</b>', tickmode = 'array'),
yaxis = dict(title='<b>Experience lvl</b>'),
width = 900,
height = 600)
fig = go.Figure(data = data, layout = layout)
fig.update_layout(plot_bgcolor ='#f1e7d2',
paper_bgcolor = '#f1e7d2')
fig.show()
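The chart above shows that most data science roles are at the senior level, and very few are at the expert level.
Distribution of data science jobs by employment type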
job_emp = query("""
SELECT employment_type,
COUNT(*) AS job_count
FROM salaries
GROUP BY employment_type
ORDER BY job_count ASC
""")
data = go.Bar(x = job_emp['job_count'], y = job_emp['employment_type'],
orientation ='h',text = job_emp['job_count'],
textposition ='outside',
marker = dict(color = px.colors.qualitative.Alphabet,
opacity = 0.9,
line_color = 'white',
line_width = 2))
layout = go.Layout(title = {'text': "<b>Jobs on Employment Type</b>",
'x':0.5, 'xanchor': 'center'},
xaxis = dict(title='<b>Total</b>', tickmode = 'array'),
yaxis =dict(title='<b>Emp Type lvl</b>'),
width = 900,
height = 600)
fig = go.Figure(data = data, layout = layout)
fig.update_layout(plot_bgcolor ='#f1e7d2',
paper_bgcolor = '#f1e7d2')
fig.show()
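From the chart above we can see that most data scientists work full time, while contract workers and freelancers are far fewer.
Trend in the number of data science jobs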
job_year = query("""
SELECT work_year, COUNT(*) AS 'job count'
FROM salaries
GROUP BY work_year
ORDER BY work_year
""")
data = go.Scatter(x = job_year['work_year'], y = job_year['job count'],
marker = dict(size = 20,
line_width = 1.5,
line_color = 'white',
color = px.colors.qualitative.Alphabet),
line = dict(color = '#ED7D31', width = 4), mode = 'lines+markers')
layout = go.Layout(title = {'text' : "<b><i>Data Science jobs Growth (2020 to 2022)</i></b>",
'x' : 0.5, 'xanchor' : 'center'},
xaxis = dict(title = '<b>Year</b>'),
yaxis = dict(title = '<b>Jobs</b>'),
width = 900,
height = 600)
fig = go.Figure(data = data, layout = layout)
fig.update_xaxes(tickvals = ['2020','2021','2022'])
fig.update_layout(plot_bgcolor = '#f1e7d2',
paper_bgcolor = '#f1e7d2')
fig.show()
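The query for this chart gives a column an alias that contains a space, which makes quoting matter. In SQLite, double quotes mark an identifier, while single quotes mark a string literal wherever a literal is allowed, including ORDER BY. The sketch below uses the standard-library sqlite3 module with hypothetical counts to compare the two forms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (work_year INTEGER)")
# Hypothetical counts: 1 row for 2020, 3 for 2021, 2 for 2022.
conn.executemany(
    "INSERT INTO salaries VALUES (?)",
    [(2020,), (2021,), (2021,), (2021,), (2022,), (2022,)],
)

# Double quotes name the alias, so rows are sorted by the count.
by_count = conn.execute(
    'SELECT work_year, COUNT(*) AS "job count" FROM salaries '
    'GROUP BY work_year ORDER BY "job count" DESC'
).fetchall()

# Single quotes make a string literal: every row gets the same sort key,
# so this ORDER BY does not rank the rows by count.
by_literal = conn.execute(
    "SELECT work_year, COUNT(*) AS 'job count' FROM salaries "
    "GROUP BY work_year ORDER BY 'job count' DESC"
).fetchall()
conn.close()

print(by_count)
```

For a line chart over years, sorting by work_year is usually the safer choice, since the points are then connected in chronological order.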
Salary distribution of data science jobs
salary_usd = query("""
SELECT salary_in_usd
FROM salaries
""")
import matplotlib.pyplot as plt
plt.figure(figsize = (20, 8))
sns.set(rc = {'axes.facecolor' : '#f1e7d2',
'figure.facecolor' : '#f1e7d2'})
p = sns.histplot(salary_usd["salary_in_usd"],
kde = True, alpha = 1, fill = True,
edgecolor = 'black', linewidth = 1)
p.axes.lines[0].set_color("orange")
plt.title("Data Science Salary Distribution \n", fontsize = 25)
plt.xlabel("Salary", fontsize = 18)
plt.ylabel("Count", fontsize = 18)
plt.show()
Top 10 highest-paid data science jobs
salary_hi10 = query("""
SELECT job_title,
MAX(salary_in_usd) AS salary
FROM salaries
GROUP BY job_title
ORDER BY salary DESC
LIMIT 10
""")
data = go.Bar(x = salary_hi10['salary'],
y = salary_hi10['job_title'],
orientation = 'h',
text = salary_hi10['salary'],
textposition = 'inside',
insidetextanchor = 'middle',
textfont = dict(size = 13,
color = 'black'),
marker = dict(color = px.colors.qualitative.Alphabet,
opacity = 0.9,
line_color = 'black',
line_width = 1))
layout = go.Layout(title = {'text': "<b>Top 10 Highest paid Data Science Jobs</b>",
'x':0.5,
'xanchor': 'center'},
xaxis = dict(title = '<b>salary</b>', tickmode = 'array'),
yaxis = dict(title = '<b>Job Title</b>'),
width = 900,
height = 600)
fig = go.Figure(data = data, layout = layout)
fig.update_layout(plot_bgcolor ='#f1e7d2',
paper_bgcolor = '#f1e7d2')
fig.show()
Principal Data Engineer is the highest-paid job in data science.
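To rank job titles by their highest salary, the query needs one group per job title, with MAX applied inside each group. A minimal sketch with the standard-library sqlite3 module and hypothetical rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (job_title TEXT, salary_in_usd INTEGER)")
# Hypothetical rows, not the real dataset.
conn.executemany(
    "INSERT INTO salaries VALUES (?, ?)",
    [
        ("Data Engineer", 150000),
        ("Data Engineer", 210000),
        ("Data Analyst", 90000),
        ("Data Scientist", 180000),
        ("Data Scientist", 120000),
    ],
)

top_paid = conn.execute(
    """
    SELECT job_title, MAX(salary_in_usd) AS top_salary
    FROM salaries
    GROUP BY job_title          -- one output row per job title
    ORDER BY top_salary DESC
    LIMIT 2
    """
).fetchall()
conn.close()

print(top_paid)
```

The alias top_salary is chosen here so that it does not collide with the dataset's own salary column.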
Average salary and ranking by job title
salary_av10 = query("""
SELECT job_title,
ROUND(AVG(salary_in_usd)) AS salary
FROM salaries
GROUP BY job_title
ORDER BY salary DESC
LIMIT 10
""")
data = go.Bar(x = salary_av10['salary'],
y = salary_av10['job_title'],
orientation = 'h',
text = salary_av10['salary'],
textposition = 'inside',
insidetextanchor = 'middle',
textfont = dict(size = 13,
color = 'white'),
marker = dict(color = px.colors.qualitative.Alphabet,
opacity = 0.9,
line_color = 'white',
line_width = 2))
layout = go.Layout(title = {'text': "<b>Top 10 Average paid Data Science Jobs</b>",
'x':0.5,
'xanchor': 'center'},
xaxis = dict(title = '<b>salary</b>', tickmode = 'array'),
yaxis = dict(title = '<b>Job Title</b>'),
width = 900,
height = 600)
fig = go.Figure(data = data, layout = layout)
fig.update_layout(plot_bgcolor ='#f1e7d2',
paper_bgcolor = '#f1e7d2')
fig.show()
Data science salary trend
salary_year = query("""
SELECT ROUND(AVG(salary_in_usd)) AS salary,
work_year AS year
FROM salaries
GROUP BY year
ORDER BY year
""")
data = go.Scatter(x = salary_year['year'],
y = salary_year['salary'],
marker = dict(size = 20,
line_width = 1.5,
line_color = 'black',
color = '#ED7D31'),
line = dict(color = 'black', width = 4), mode = 'lines+markers')
layout = go.Layout(title = {'text' : "<b>Data Science Salary Growth (2020 to 2022) </b>",
'x' : 0.5,
'xanchor' : 'center'},
xaxis = dict(title = '<b>Year</b>'),
yaxis = dict(title = '<b>Salary</b>'),
width = 900,
height = 600)
fig = go.Figure(data = data, layout = layout)
fig.update_xaxes(tickvals = ['2020','2021','2022'])
fig.update_layout(plot_bgcolor = '#f1e7d2',
paper_bgcolor = '#f1e7d2')
fig.show()
Experience level & salary
salary_exp = query("""
SELECT experience_level AS 'Experience Level',
salary_in_usd AS Salary
FROM salaries
""")
fig = px.violin(salary_exp, x = 'Experience Level', y = 'Salary', color = 'Experience Level', box = True)
fig.update_layout(title = {'text': "<b>Salary on Experience Level</b>",
'xanchor': 'center','x':0.5},
xaxis = dict(title = '<b>Experience level</b>'),
yaxis = dict(title = '<b>salary</b>',
ticktext = [-300000, 0, 100000, 200000, 300000, 400000, 500000, 600000, 700000]),
width = 900,
height = 600)
fig.update_layout(paper_bgcolor= '#f1e7d2',
plot_bgcolor = '#f1e7d2',
showlegend = False)
fig.show()
Salary trend by experience level
# Median of salary_in_usd only; other columns are non-numeric after relabeling
tmp_df = salaries.groupby(['work_year', 'experience_level'])['salary_in_usd'].median().reset_index()
fig = px.line(tmp_df, x='work_year', y='salary_in_usd', color='experience_level', symbol="experience_level")
fig.update_layout(title = {'text': "<b>Median Salary Trend By Experience Level</b>",
'x':0.5, 'xanchor': 'center'},
xaxis = dict(title = '<b>Working Year</b>', tickvals = [2020, 2021, 2022], tickmode = 'array'),
yaxis = dict(title = '<b>Salary</b>'),
width = 900,
height = 600)
fig.update_layout(plot_bgcolor = '#f1e7d2',
paper_bgcolor = '#f1e7d2')
fig.show()
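Observations:
1. During the COVID-19 pandemic (2020 to 2021), expert-level salaries were very high but showed a partial decline.
2. After 2021, salaries for expert-level and senior-level staff rose.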
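The trend lines come from a median taken per year and experience level. The sketch below shows that grouping step on a small hypothetical DataFrame. The column names follow the dataset, but the values are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; column names follow the dataset.
toy = pd.DataFrame({
    "work_year": [2020, 2020, 2020, 2021, 2021],
    "experience_level": ["Senior Level"] * 3 + ["Entry Level"] * 2,
    "salary_in_usd": [100000, 120000, 200000, 50000, 70000],
})

# Select the numeric column before aggregating, then flatten the group keys.
median_salary = (
    toy.groupby(["work_year", "experience_level"])["salary_in_usd"]
       .median()
       .reset_index()
)

print(median_salary)
```

The median is less sensitive than the mean to a single very high salary, such as the 200000 in the 2020 group here.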
Salary distribution by year
year_gp = salaries.groupby('work_year')
hist_data = [year_gp.get_group(2020)['salary_in_usd'],
year_gp.get_group(2021)['salary_in_usd'],
year_gp.get_group(2022)['salary_in_usd']]
group_labels = ['2020', '2021', '2022']
fig = ff.create_distplot(hist_data, group_labels, show_hist = False)
fig.update_layout(title = {'text': "<b>Salary Distribution By Working Year</b>",
'x':0.5, 'xanchor': 'center'},
xaxis = dict(title = '<b>Salary</b>'),
yaxis = dict(title = '<b>Kernel Density</b>'),
width = 900,
height = 600)
fig.update_layout(plot_bgcolor = '#f1e7d2',
paper_bgcolor = '#f1e7d2')
fig.show()
Employment type & salary
salary_emp = query("""
SELECT employment_type AS 'Employment Type',
salary_in_usd AS Salary
FROM salaries
""")
fig = px.box(salary_emp,x='Employment Type',y='Salary',
color = 'Employment Type')
fig.update_layout(title = {'text': "<b>Salary by Employment Type</b>",
'x':0.5, 'xanchor': 'center'},
xaxis = dict(title = '<b>Employment Type</b>'),
yaxis = dict(title = '<b>Salary</b>'),
width = 900,
height = 600)
fig.update_layout(plot_bgcolor = '#f1e7d2',
paper_bgcolor = '#f1e7d2')
fig.show()
Company size distribution
comp_size = query("""
SELECT company_size,
COUNT(*) AS count
FROM salaries
GROUP BY company_size
""")
import plotly.graph_objects as go
data = go.Pie(labels = comp_size['company_size'],
values = comp_size['count'].values,
hoverinfo = 'label',
hole = 0.5,
textfont_size = 16,
textposition = 'auto')
fig = go.Figure(data = data)
fig.update_layout(title = {'text': "<b>Company Size</b>",
'x':0.5, 'xanchor': 'center'},
xaxis = dict(title = '<b></b>'),
yaxis = dict(title = '<b></b>'),
width = 900,
height = 600)
fig.update_layout(plot_bgcolor = '#f1e7d2',
paper_bgcolor = '#f1e7d2')
fig.show()
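The pie chart turns the group counts into percentages. The same shares can be computed directly in SQL by dividing each group count by the total row count. A sketch with the standard-library sqlite3 module and a hypothetical ten-row sample:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (company_size TEXT)")
# Hypothetical sample: 5 Medium, 3 Large, 2 Small.
conn.executemany(
    "INSERT INTO salaries VALUES (?)",
    [("Medium",)] * 5 + [("Large",)] * 3 + [("Small",)] * 2,
)

shares = conn.execute(
    """
    SELECT company_size,
           ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM salaries), 1) AS pct
    FROM salaries
    GROUP BY company_size
    ORDER BY pct DESC
    """
).fetchall()
conn.close()

print(shares)
```

Multiplying by 100.0 rather than 100 forces floating-point division, since SQLite would otherwise truncate the integer quotient.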
Proportion of experience levels by company size
df = salaries.groupby(['company_size', 'experience_level']).size()
comp_s = np.round(df['Small'].values / df['Small'].values.sum(),2)
comp_m = np.round(df['Medium'].values / df['Medium'].values.sum(),2)
comp_l = np.round(df['Large'].values / df['Large'].values.sum(),2)
fig = go.Figure()
categories = ['Entry Level', 'Expert Level','Mid level','Senior Level']
fig.add_trace(go.Scatterpolar(
r = comp_s,
theta = categories,
fill = 'toself',
name = 'Company Size S'))
fig.add_trace(go.Scatterpolar(
r = comp_m,
theta = categories,
fill = 'toself',
name = 'Company Size M'))
fig.add_trace(go.Scatterpolar(
r = comp_l,
theta = categories,
fill = 'toself',
name = 'Company Size L'))
fig.update_layout(
polar = dict(
radialaxis = dict(range = [0, 0.6])),
showlegend = True,
)
fig.update_layout(title = {'text': "<b>Proportion of Experience Level In Different Company Sizes</b>",
'x':0.5, 'xanchor': 'center'},
xaxis = dict(title = '<b></b>'),
yaxis = dict(title = '<b></b>'),
width = 900,
height = 600)
fig.update_layout(plot_bgcolor = '#f1e7d2',
paper_bgcolor = '#f1e7d2')
fig.show()
Company size & salary
salary_size = query("""
SELECT company_size AS 'Company size',
salary_in_usd AS Salary
FROM salaries
""")
fig = px.box(salary_size, x='Company size', y = 'Salary',
color = 'Company size')
fig.update_layout(title = {'text': "<b>Salary by Company size</b>",
'x':0.5, 'xanchor': 'center'},
xaxis = dict(title = '<b>Company size</b>'),
yaxis = dict(title = '<b>Salary</b>'),
width = 900,
height = 600)
fig.update_layout(plot_bgcolor = '#f1e7d2',
paper_bgcolor = '#f1e7d2')
fig.show()
Ratio of WFH (remote work) to WFO
rem_type = query("""
SELECT remote_ratio,
COUNT(*) AS total
FROM salaries
GROUP BY remote_ratio
""")
data = go.Pie(labels = rem_type['remote_ratio'], values = rem_type['total'].values,
hoverinfo = 'label',
hole = 0.4,
textfont_size = 18,
textposition = 'auto')
fig = go.Figure(data = data)
fig.update_layout(title = {'text': "<b>Remote Ratio</b>",
'x':0.5, 'xanchor': 'center'},
width = 900,
height = 600)
fig.update_layout(plot_bgcolor = '#f1e7d2',
paper_bgcolor = '#f1e7d2')
fig.show()
How much remote type affects salary
salary_remote = query("""
SELECT remote_ratio AS 'Remote type',
salary_in_usd AS Salary
From salaries
""")
fig = px.box(salary_remote, x = 'Remote type', y = 'Salary', color = 'Remote type')
fig.update_layout(title = {'text': "<b>Salary by Remote Type</b>",
'x':0.5, 'xanchor': 'center'},
xaxis = dict(title = '<b>Remote type</b>'),
yaxis = dict(title = '<b>Salary</b>'),
width = 900,
height = 600)
fig.update_layout(plot_bgcolor = '#f1e7d2',
paper_bgcolor = '#f1e7d2')
fig.show()
Experience level & remote ratio
exp_remote = salaries.groupby(['experience_level', 'remote_ratio']).count()
exp_remote.reset_index(inplace = True)
fig = px.histogram(exp_remote, x = 'experience_level',
y = 'work_year', color = 'remote_ratio',
barmode = 'group',
text_auto = True)
fig.update_layout(title = {'text': "<b>Respondent Count In Different Experience Level Based on Remote Ratio</b>",
'x':0.5, 'xanchor': 'center'},
xaxis = dict(title = '<b>Experience Level</b>'),
yaxis = dict(title = '<b>Number of Respondents</b>'),
width = 900,
height = 600)
fig.update_layout(plot_bgcolor = '#f1e7d2',
paper_bgcolor = '#f1e7d2')
fig.show()
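Analysis conclusions
- The three most common data science positions are Data Scientist, Data Engineer, and Data Analyst.
- Data science jobs are growing in popularity: the share of employees rose from 11.9% in 2020 to 52.4% in 2022.
- The United States has the most data science companies.
- The interquartile range (IQR) of salaries runs from 62.7k to 150k.
- Most data science employees are at the senior level, and expert-level employees are far fewer.
- Most data science employees work full time, and contract workers and freelancers are rare.
- Principal Data Engineer is the highest-paid data science job.
- The minimum data science salary (entry-level experience) is 4,000 USD, and the maximum salary, for expert-level experience, is 600,000 USD.
- Company mix: 53.7% medium-sized, 32.6% large, and 13.7% small data science companies.
- Salary is also affected by company size: larger companies pay more.
- 62.8% of data science jobs are fully remote, 20.9% are non-remote, and 16.3% are partially remote.
- Data science salaries grow with time and accumulated experience.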
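The interquartile range quoted in the conclusions is the span between the first and third quartiles of the salary distribution. A minimal sketch of that calculation with the standard-library statistics module, on hypothetical salaries rather than the real dataset:

```python
import statistics

# Hypothetical salaries in USD (not the real dataset).
sample = [20000, 40000, 60000, 80000, 100000, 120000, 140000, 160000, 180000]

# method="inclusive" interpolates linearly between observed values.
q1, q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q3, iqr)
```

Half of the observations fall between q1 and q3, which is why the IQR is a common way to describe a typical salary range.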
References
- Glassdoor
- pandasql
- Data Science Job Salaries dataset (Kaggle)
- Illustrated Data Analysis: From Beginner to Expert tutorial series: https://www.showmeai.tech/tutorials/33
- Programming Language Cheat Sheets | SQL Cheat Sheet: https://www.showmeai.tech/article-detail/99
- Data Science Tool Cheat Sheets | Pandas Cheat Sheet: https://www.showmeai.tech/article-detail/101
- Data Science Tool Cheat Sheets | Matplotlib Cheat Sheet: https://www.showmeai.tech/article-detail/103
Recommended reading
- Data Analysis in Practice series: https://www.showmeai.tech/tutorials/40
- Machine Learning in Practice series: https://www.showmeai.tech/tutorials/41
- Deep Learning in Practice series: https://www.showmeai.tech/tutorials/42
- TensorFlow in Practice series: https://www.showmeai.tech/tutorials/43
- PyTorch in Practice series: https://www.showmeai.tech/tutorials/44
- NLP in Practice series: https://www.showmeai.tech/tutorials/45
- CV in Practice series: https://www.showmeai.tech/tutorials/46
- AI Interview Question Bank series: https://www.showmeai.tech/tutorials/48
