Pandas 11: Comprehensive Exercises

import pandas as pd
import numpy as np
np.seterr(all = 'ignore')

{'divide': 'ignore', 'over': 'ignore', 'under': 'ignore', 'invalid': 'ignore'}

【Task 1】Diversity of company revenue

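【Problem】The diversity of a company's revenue across industries can be measured with a revenue entropy index, defined by analogy with information entropy:
$I=-\sum_i p(x_i)\log(p(x_i))$
where $p(x_i)$ is the share of the company's total revenue in a given year that comes from revenue type $i$.
company.csv lists the companies and years for which the index is needed, and company_data.csv holds each company's revenue amount by revenue type and date. Using the data in the second table, add a column to the first table that gives the revenue entropy index I for each company and year.
【Data download】Link: https://pan.baidu.com/s/1leZZctxMUSW55kZY5WwgIw Password: u6fd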
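To make the definition concrete, here is a minimal sketch of the index for a single company and year. The function name revenue_entropy and the base argument are my own choices for illustration; the problem statement does not fix the base of the logarithm, so base 2 is only an assumption here.

```python
import numpy as np

def revenue_entropy(amounts, base=2):
    """Entropy of the revenue shares; missing or non-positive amounts are ignored."""
    a = np.asarray(amounts, dtype=float)
    a = a[np.isfinite(a) & (a > 0)]      # drop NaN and non-positive entries
    if a.size == 0:
        return float("nan")              # nothing to measure
    p = a / a.sum()                      # p(x_i): share of each revenue type
    return float(-(p * np.log(p)).sum() / np.log(base))

even = revenue_entropy([10, 10, 10, 10])   # four equal revenue types
single = revenue_entropy([42])             # a single revenue type
```

Four equal shares give an entropy of 2 bits, and a company with a single revenue type gives 0, so a larger index means more diversified revenue.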

My solution :

Read the two tables:

df1 = pd.read_csv('company.csv')
df2 = pd.read_csv('company_data.csv')

df1.head()

	證券代碼	日期
0	#000007	2014
1	#000403	2015
2	#000408	2016
3	#000408	2017
4	#000426	2015

df2.head()

	證券代碼	日期	收入型別	收入額
0	1	2008/12/31	1	1.084218e+10
1	1	2008/12/31	2	1.259789e+10
2	1	2008/12/31	3	1.451312e+10
3	1	2008/12/31	4	1.063843e+09
4	1	2008/12/31	5	8.513880e+08

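The security code (證券代碼) and date (日期) columns are formatted differently in the two tables, so they have to be made consistent first:
In df1, strip the leading # from 證券代碼 and convert it to int.
In df2, keep the first four characters of 日期 (the year) and convert them to int.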

df1_ = df1.copy()
df1_['證券代碼'] = df1_['證券代碼'].str[1:].astype('int64')

df2['日期'] = df2['日期'].str[:4].astype('int64')

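Define an entropy function that computes the information entropy (with a base-2 logarithm) and returns NaN for groups that have no revenue data.
Left-join df1 to df2 on 證券代碼 and 日期, group by the same two columns, apply the entropy function to the 收入額 (revenue amount) column, and reset the index.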

def entropy(x):
    if x.any():
        p = x/x.sum()
        return -(p*np.log2(p)).sum()
    return np.nan
res = df1_.merge(df2, on=['證券代碼','日期'], how='left').groupby(['證券代碼','日期'])['收入額'].apply(entropy).reset_index()
res.head()

	證券代碼	日期	收入額
0	7	2014	4.429740
1	403	2015	4.025963
2	408	2016	4.066295
3	408	2017	NaN
4	426	2015	4.449655
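The same merge-then-group pattern can be tried on a tiny hand-made example. The tables and the column names code, year, date and amount below are invented for illustration and only mimic the layout of the real files.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature versions of the two tables.
targets = pd.DataFrame({"code": ["#000007", "#000408"], "year": [2014, 2017]})
revenue = pd.DataFrame({
    "code": [7, 7, 7, 408],
    "date": ["2014/12/31", "2014/12/31", "2013/12/31", "2016/12/31"],
    "amount": [30.0, 10.0, 5.0, 8.0],
})

# Bring both key columns to the same type and granularity.
targets["code_int"] = targets["code"].str[1:].astype("int64")
revenue["year"] = revenue["date"].str[:4].astype("int64")

def entropy(x):
    x = x.dropna()
    if x.empty or x.sum() == 0:
        return np.nan                    # no usable revenue rows for this group
    p = x / x.sum()
    p = p[p > 0]                         # skip zero shares to avoid log2(0)
    return -(p * np.log2(p)).sum()

merged = targets.merge(
    revenue.rename(columns={"code": "code_int"}),
    on=["code_int", "year"], how="left",
)
result = (merged.groupby(["code", "year"])["amount"]
                .apply(entropy)
                .reset_index(name="entropy"))
```

The first company has two revenue rows for 2014 and gets an entropy of about 0.81, while the second has no rows for 2017, so the left join leaves a NaN that the function passes through.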

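Add a 收入熵指標 (revenue entropy index) column to df1, filled with the 收入額 column of the result table. The assignment matches rows by position, so it relies on df1 already being sorted by 證券代碼 and 日期, which is the order in which groupby returns its groups.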

df1['收入熵指標'] = res['收入額']
df1

	證券代碼	日期	收入熵指標
0	#000007	2014	4.429740
1	#000403	2015	4.025963
2	#000408	2016	4.066295
3	#000408	2017	NaN
4	#000426	2015	4.449655
...	...	...	...
1043	#600978	2011	4.788391
1044	#600978	2014	4.022378
1045	#600978	2015	4.346303
1046	#600978	2016	4.358608
1047	#600978	2017	NaN

1048 rows × 3 columns
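Writing the result back by position works here only because the target table is already in key order. As a hedged sketch of a sturdier alternative, the toy example below (with made-up values and the hypothetical column names code, year and entropy) contrasts positional assignment with a key-based merge.

```python
import pandas as pd

# Hypothetical target table that is NOT sorted by its keys.
targets = pd.DataFrame({"code": [408, 7], "year": [2016, 2014]})
# Hypothetical per-group results, sorted by key as groupby returns them.
scores = pd.DataFrame({"code": [7, 408], "year": [2014, 2016],
                       "entropy": [4.43, 4.07]})

# Index-aligned assignment pairs row 0 with row 0, regardless of the keys.
by_position = targets.assign(entropy=scores["entropy"])
# A key-based merge keeps every value attached to its own company and year.
by_key = targets.merge(scores, on=["code", "year"], how="left")
```

In by_position the value 4.43 ends up next to company 408 even though it belongs to company 7; by_key attaches each value to the right row and keeps the original row order.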

Wrap all of the steps above in a function and time it:

def information_entropy():
    df1 = pd.read_csv('company.csv')
    df2 = pd.read_csv('company_data.csv')
    df1_ = df1.copy()
    df1_['證券代碼'] = df1_['證券代碼'].str[1:].astype('int64')
    df2['日期'] = df2['日期'].str[:4].astype('int64')
    res = df1_.merge(df2, on=['證券代碼','日期'], how='left').groupby(['證券代碼','日期'])['收入額'].apply(entropy).reset_index()
    df1['收入熵指標'] = res['收入額']
    return df1

%timeit -n 5 information_entropy()

1.62 s ± 44.5 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)

【Task 2】Reshaping the team-learning information table

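【Problem】Reshape the team information table into the form shown below, where the 是否隊長 (is leader) column is 1 for a team leader and 0 otherwise.
[Image: target table layout]
【Data download】Link: https://pan.baidu.com/s/1ses24cTwUCbMx3rvYXaz-Q Password: iz57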

My solution :

Read the data:

df = pd.read_excel('組隊資訊匯總表_Pandas.xlsx')

The 所在群 (group) column is not needed, so drop it:

df.drop(columns='所在群', inplace=True)
df.head(2)

	隊伍名稱	隊長編號	隊長_群昵稱	隊員1 編號	隊員_群昵稱	隊員2 編號	隊員_群昵稱.1	隊員3 編號	隊員_群昵稱.2	隊員4 編號	...	隊員6 編號	隊員_群昵稱.5	隊員7 編號	隊員_群昵稱.6	隊員8 編號	隊員_群昵稱.7	隊員9 編號	隊員_群昵稱.8	隊員10編號	隊員_群昵稱.9
0	你說的都對隊	5	山楓葉紛飛	6	蔡	7.0	安慕希	8.0	信仰	20.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	熊貓人	175	魚呲呲	44	Heaven	37.0	呂青	50.0	余柳成蔭	82.0	...	25.0	Never say never	55.0	K	120.0	Y.	28.0	X.Y.Q	151.0	swrong

2 rows × 23 columns

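To turn the wide table into a long one with wide_to_long, the columns have to be renamed first.
Following the names in the target table, leaders and members are distinguished with leader and member. Because the target table codes leaders as 1 and members as 0, the flag can be built into the new names: append 1 or 0 to each name, and later simply take the last character of the string.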

col_1 = np.array(['隊伍名稱','編號_leader01','昵稱_leader01'])
col_2 = np.array([[f'編號_member{i}0', f'昵稱_member{i}0']for i in range(1,11)]).flatten()
df.columns = np.r_[col_1,col_2]
df.head(2)

	隊伍名稱	編號_leader01	昵稱_leader01	編號_member10	昵稱_member10	編號_member20	昵稱_member20	編號_member30	昵稱_member30	編號_member40	...	編號_member60	昵稱_member60	編號_member70	昵稱_member70	編號_member80	昵稱_member80	編號_member90	昵稱_member90	編號_member100	昵稱_member100
0	你說的都對隊	5	山楓葉紛飛	6	蔡	7.0	安慕希	8.0	信仰	20.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	熊貓人	175	魚呲呲	44	Heaven	37.0	呂青	50.0	余柳成蔭	82.0	...	25.0	Never say never	55.0	K	120.0	Y.	28.0	X.Y.Q	151.0	swrong

2 rows × 23 columns

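Convert the renamed table to long form with wide_to_long, choosing names that match the target table so that no further renaming is needed.
Then remove the rows that contain NaN with dropna and reset the index.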

res = pd.wide_to_long(  df.reset_index(),
                        stubnames = ['昵稱','編號'],
                        i = ['index','隊伍名稱'],
                        j = '是否隊長',
                        sep = '_',
                        suffix = '.+').dropna().reset_index().drop(columns='index')
res

	隊伍名稱	是否隊長	昵稱	編號
0	你說的都對隊	leader01	山楓葉紛飛	5.0
1	你說的都對隊	member10	蔡	6.0
2	你說的都對隊	member20	安慕希	7.0
3	你說的都對隊	member30	信仰	8.0
4	你說的都對隊	member40	biubiu🙈🙈	20.0
...	...	...	...	...
141	七星聯盟	member40	Daisy	63.0
142	七星聯盟	member50	One Better	131.0
143	七星聯盟	member60	rain	112.0
144	應如是	leader01	思無邪	54.0
145	應如是	member10	Justzer0	58.0

146 rows × 4 columns

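This is already close to the target. Take the last character of each value in 是否隊長 as the flag for that column.
Convert the 編號 (ID) column from float to int.
The 是否隊長 and 隊伍名稱 (team name) columns are in the wrong order, so swap them back.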

res['是否隊長'],res['編號'] = res['是否隊長'].str[-1],res['編號'].astype('int64')

res.reindex(columns=['是否隊長','隊伍名稱','昵稱','編號'])

	是否隊長	隊伍名稱	昵稱	編號
0	1	你說的都對隊	山楓葉紛飛	5
1	0	你說的都對隊	蔡	6
2	0	你說的都對隊	安慕希	7
3	0	你說的都對隊	信仰	8
4	0	你說的都對隊	biubiu🙈🙈	20
...	...	...	...	...
141	0	七星聯盟	Daisy	63
142	0	七星聯盟	One Better	131
143	0	七星聯盟	rain	112
144	1	應如是	思無邪	54
145	0	應如是	Justzer0	58

146 rows × 4 columns
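For readers who have not used wide_to_long before, the renaming trick is easier to follow on a small table. The sketch below uses an invented two-team table and English column names; it derives the flag with str.startswith instead of the last-character trick, which is a variation of mine rather than the approach used above.

```python
import pandas as pd

# Hypothetical two-team table with one leader slot and two member slots.
wide = pd.DataFrame({
    "team": ["A", "B"],
    "id_leader01": [5, 175],
    "name_leader01": ["ann", "bob"],
    "id_member10": [6.0, 44.0],
    "name_member10": ["cat", "dan"],
    "id_member20": [7.0, None],
    "name_member20": ["eve", None],
})

tidy = (pd.wide_to_long(wide, stubnames=["name", "id"], i="team",
                        j="role", sep="_", suffix=".+")
          .dropna()                      # empty member slots disappear
          .reset_index())
tidy["is_leader"] = tidy["role"].str.startswith("leader").astype(int)
tidy["id"] = tidy["id"].astype("int64")  # safe once the NaN rows are gone
tidy = tidy[["is_leader", "team", "name", "id"]]
```

Each stub name (name, id) becomes one output column, and whatever follows the separator becomes the value of the j column, which is why the suffix pattern has to match leader01 as well as member10.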

Wrap all of the steps above in a function and time it:

def transform_table():
    df = pd.read_excel('組隊資訊匯總表_Pandas.xlsx')
    df.drop(columns='所在群', inplace=True)
    col_1 = np.array(['隊伍名稱','編號_leader01','昵稱_leader01'])
    col_2 = np.array([[f'編號_member{i}0', f'昵稱_member{i}0']for i in range(1,11)]).flatten()
    df.columns = np.r_[col_1,col_2]
    res = pd.wide_to_long(  df.reset_index(),
                            stubnames = ['昵稱','編號'],
                            i = ['index','隊伍名稱'],
                            j = '是否隊長',
                            sep = '_',
                            suffix = '.+').dropna().reset_index().drop(columns='index')
    res['是否隊長'], res['編號'] = res['是否隊長'].str[-1], res['編號'].astype('int64')
    return res.reindex(columns=['是否隊長','隊伍名稱','昵稱','編號'])

%timeit -n 50 transform_table()

45.7 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 50 loops each)

【Task 3】US presidential election voting

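【Problem】Two tables give the population of each US county and the presidential election results by county. Answer the following questions:

How many counties have a total vote count greater than half of the county's population?
Build a table with the state as the row index and the candidates as the columns, with the columns ordered by each candidate's nationwide vote total from highest to lowest; each cell holds the total votes the candidate received in that state.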

This is only a sample; the real table uses the English state and candidate names from the original data.
		                Biden   Trump
			  Wisconsin    2      1
			  Texas        3      4

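Each state consists of a number of counties. Define a county's BT index as Biden's vote share in the county minus Trump's vote share in the county. If the median BT index over all counties of a state is greater than 0, the state is called a Biden State. Find all Biden States.

【Data download】Link: https://pan.baidu.com/s/182rr3CpstVux2CFdFd_Pcg Extraction code: q674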

My solution :

Read the two tables:

df1 = pd.read_csv('president_county_candidate.csv')
df2 = pd.read_csv('county_population.csv')

df1.head()

	state	county	candidate	party	total_votes	won
0	Delaware	Kent County	Joe Biden	DEM	44552	True
1	Delaware	Kent County	Donald Trump	REP	41009	False
2	Delaware	Kent County	Jo Jorgensen	LIB	1044	False
3	Delaware	Kent County	Howie Hawkins	GRN	420	False
4	Delaware	New Castle County	Joe Biden	DEM	195034	True

df2.head()

	US County	Population
0	.Autauga County, Alabama	55869
1	.Baldwin County, Alabama	223234
2	.Barbour County, Alabama	24686
3	.Bibb County, Alabama	22394
4	.Blount County, Alabama	57826

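To prepare for the grouping and merging that follow, first make the state and county column names and values consistent.
Split US County in df2 on the comma. Note that the comma is followed by a space, which has to be part of the separator, otherwise the resulting values will not match those in df1.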

df2[['county','state']] = pd.DataFrame([*df2['US County'].str.split(', ')])
df2.county = df2.county.str[1:]
df2.drop(columns='US County', inplace=True)
df2.head()

	Population	county	state
0	55869	Autauga County	Alabama
1	223234	Baldwin County	Alabama
2	24686	Barbour County	Alabama
3	22394	Bibb County	Alabama
4	57826	Blount County	Alabama
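An equivalent way to do the split, sketched here on the first two rows shown above, is str.split with expand=True, which returns the pieces as columns directly. The variable names pop and parts are my own.

```python
import pandas as pd

# First two rows of the population table, as shown in the output above.
pop = pd.DataFrame({
    "US County": [".Autauga County, Alabama", ".Baldwin County, Alabama"],
    "Population": [55869, 223234],
})

parts = (pop["US County"]
           .str.lstrip(".")                     # drop the leading dot
           .str.split(", ", n=1, expand=True))  # split once on comma + space
parts.columns = ["county", "state"]
pop = pop.drop(columns="US County").join(parts)
```

Limiting the split to n=1 keeps the result at exactly two columns even if a name after the first comma were to contain another comma.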

1. How many counties have a total vote count greater than half of the county's population?

Group df1 by state and county and sum to get the total votes of each county.
Then merge with df2 on state and county to bring in Population.

df_merge = df1.groupby(['state','county'])['total_votes'].sum().reset_index().merge(df2, on=['state','county'], how='left')
df_merge.head()

	state	county	total_votes	Population
0	Alabama	Autauga County	27770	55869.0
1	Alabama	Baldwin County	109679	223234.0
2	Alabama	Barbour County	10518	24686.0
3	Alabama	Bibb County	9595	22394.0
4	Alabama	Blount County	27588	57826.0

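Compare total_votes with half of Population and keep the rows that satisfy the condition. The filtered table has 1434 rows, which is the number of counties asked for: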

df_merge[df_merge['total_votes'] > 0.5*df_merge['Population']]

	state	county	total_votes	Population
11	Alabama	Choctaw County	7464	12589.0
12	Alabama	Clarke County	13135	23622.0
13	Alabama	Clay County	6930	13235.0
16	Alabama	Colbert County	27886	55241.0
17	Alabama	Conecuh County	6441	12067.0
...	...	...	...	...
4626	Wyoming	Sheridan County	16428	30485.0
4627	Wyoming	Sublette County	4970	9831.0
4629	Wyoming	Teton County	14677	23464.0
4631	Wyoming	Washakie County	4012	7805.0
4632	Wyoming	Weston County	3542	6927.0

1434 rows × 4 columns
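Because the merge is a left join, counties without a population match get NaN and silently fail the comparison. A small sketch with made-up numbers shows how indicator=True can expose such rows and how the count can be read off the boolean mask directly.

```python
import pandas as pd

# Hypothetical vote totals and populations; county C has no population row.
votes = pd.DataFrame({
    "state": ["X", "X", "Y"],
    "county": ["A County", "B County", "C County"],
    "total_votes": [600, 100, 50],
})
population = pd.DataFrame({
    "state": ["X", "X"],
    "county": ["A County", "B County"],
    "Population": [1000, 1000],
})

merged = votes.merge(population, on=["state", "county"],
                     how="left", indicator=True)
unmatched = int((merged["_merge"] == "left_only").sum())
# Comparisons against NaN are False, so unmatched counties are never counted.
n_over_half = int((merged["total_votes"] > 0.5 * merged["Population"]).sum())
```

Checking unmatched before trusting the count is a cheap way to see whether naming differences between the two tables are hiding any counties.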

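2. Build a table with the state as the row index and the candidates as the columns, with the columns ordered by each candidate's nationwide vote total from highest to lowest; each cell holds the total votes the candidate received in that state.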

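This calls for pivot_table: specify the rows and columns, aggregate each cell with sum, turn on margins to get the totals, and sort the columns in descending order of the final All row.
The first column is then the row total, that is, the total for each state; the second column is Biden with the most votes, closely followed by Trump.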

df1.pivot_table(values = ['total_votes'],
                index = ['state'],
                columns = 'candidate',
                aggfunc = 'sum',
                margins = True).sort_values('All', axis=1, ascending=False).head()

	total_votes
candidate	All	Joe Biden	Donald Trump	Jo Jorgensen	Howie Hawkins	Write-ins	Rocky De La Fuente	Gloria La Riva	Kanye West	Don Blankenship	...	Tom Hoefling	Ricki Sue King	Princess Jacob-Fambro	Blake Huber	Richard Duncan	Joseph Kishore	Jordan Scott	Gary Swing	Keith McCormic	Zachary Scalf
state
Alabama	2323304	849648.0	1441168.0	25176.0	NaN	7312.0	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Alaska	391346	153405.0	189892.0	8896.0	NaN	34210.0	318.0	NaN	NaN	1127.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Arizona	3387326	1672143.0	1661686.0	51465.0	NaN	2032.0	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Arkansas	1219069	423932.0	760647.0	13133.0	2980.0	NaN	1321.0	1336.0	4099.0	2108.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
California	17495906	11109764.0	6005961.0	187885.0	81025.0	80.0	60155.0	51036.0	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 39 columns
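If the extra All row and column are not wanted in the final table, the column order can also be derived from the column sums. This is a sketch on invented data, with placeholder candidates P, Q and R.

```python
import pandas as pd

# Hypothetical votes for three candidates in two states.
votes = pd.DataFrame({
    "state": ["X", "X", "X", "Y", "Y"],
    "candidate": ["P", "Q", "R", "P", "Q"],
    "total_votes": [10, 30, 1, 20, 25],
})

table = votes.pivot_table(values="total_votes", index="state",
                          columns="candidate", aggfunc="sum", fill_value=0)
# Order the columns by each candidate's nationwide total, largest first.
order = table.sum().sort_values(ascending=False).index
table = table[order]
```

Here fill_value=0 records a missing state and candidate pair as zero votes rather than NaN, which is a modelling choice and not something the task prescribes.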

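3. Each state consists of a number of counties. Define a county's BT index as Biden's vote share in the county minus Trump's vote share in the county. If the median BT index over all counties of a state is greater than 0, the state is called a Biden State. Find all Biden States.
Method 1:

Define a function that computes the BT index: take Biden's votes and Trump's votes, compute the county's total votes, and divide the difference by the total.
Group by state and county, select the candidate and total_votes columns, and call apply to compute BT.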

def BT(x):
    biden = x[x['candidate']=='Joe Biden']['total_votes'].values
    trump = x[x['candidate']=='Donald Trump']['total_votes'].values
    return pd.Series((biden-trump)/x['total_votes'].sum(), index=['BT'])   
bt = df1.groupby(['state','county'])[['candidate','total_votes']].apply(BT)
bt.head()

		BT
state	county
Alabama	Autauga County	-0.444184
	Baldwin County	-0.537623
	Barbour County	-0.076631
	Bibb County	-0.577280
	Blount County	-0.800022

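Reset the index of bt, group by state, and use filter to keep the states whose median county BT index is greater than 0.
Dropping duplicate states leaves all the states that meet the condition; there are only 9.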

bt.reset_index().groupby('state').filter(lambda x:x.BT.median()>0)[['state']].drop_duplicates()

	state
197	California
319	Connecticut
488	Delaware
491	District of Columbia
725	Hawaii
1878	Massachusetts
2999	New Jersey
3536	Rhode Island
4065	Vermont

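Method 2:

Use boolean masks to select all of Biden's rows and all of Trump's rows, then group by state and county to get the total votes of each county.
The three DataFrames happen to be the same size, which means that every county has votes for both Biden and Trump.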

biden_df = df1[df1['candidate']=='Joe Biden'][['state','county','total_votes']]
trump_df = df1[df1['candidate']=='Donald Trump'][['state','county','total_votes']]
sum_df = df1.groupby(['state','county'])[['total_votes']].sum().reset_index()

Merge the three equally sized DataFrames:

res = biden_df.merge(trump_df, on=['state','county'] ,suffixes=('_biden','_trump')).merge(sum_df, on=['state','county'])
res.head()

	state	county	total_votes_biden	total_votes_trump	total_votes
0	Delaware	Kent County	44552	41009	87025
1	Delaware	New Castle County	195034	88364	287633
2	Delaware	Sussex County	56682	71230	129352
3	District of Columbia	District of Columbia	39041	1725	41681
4	District of Columbia	Ward 2	29078	2918	32881

Subtract the trump column from the biden column and divide by the total column to get the BT index:

res['BT'] = (res['total_votes_biden']-res['total_votes_trump'])/res['total_votes']
res.head()

	state	county	total_votes_biden	total_votes_trump	total_votes	BT
0	Delaware	Kent County	44552	41009	87025	0.040712
1	Delaware	New Castle County	195034	88364	287633	0.370855
2	Delaware	Sussex County	56682	71230	129352	-0.112468
3	District of Columbia	District of Columbia	39041	1725	41681	0.895276
4	District of Columbia	Ward 2	29078	2918	32881	0.795596

As before, filter as required and take all the states that meet the condition; again there are 9:

res[['state','BT']].groupby('state').filter(lambda x:x['BT'].median()>0)[['state']].drop_duplicates()

	state
0	Delaware
3	District of Columbia
237	Hawaii
1390	Massachusetts
2511	New Jersey
3048	Rhode Island
3577	Vermont
4327	California
4449	Connecticut
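A third variant, shown here only as a sketch on invented data, gets the per-county table from a single pivot_table call and skips filter altogether. The state and county names are placeholders.

```python
import pandas as pd

# Hypothetical results: three counties in state X, one county in state Y.
votes = pd.DataFrame({
    "state":     ["X"] * 6 + ["Y"] * 3,
    "county":    ["a", "a", "b", "b", "c", "c", "d", "d", "d"],
    "candidate": ["Joe Biden", "Donald Trump"] * 3
                 + ["Joe Biden", "Donald Trump", "Other"],
    "total_votes": [60, 40, 30, 70, 55, 45, 20, 70, 10],
})

# One row per county, one column per candidate.
per_county = votes.pivot_table(values="total_votes", index=["state", "county"],
                               columns="candidate", aggfunc="sum", fill_value=0)
bt = ((per_county["Joe Biden"] - per_county["Donald Trump"])
      / per_county.sum(axis=1))          # difference in vote shares
median_bt = bt.groupby(level="state").median()
biden_states = median_bt[median_bt > 0].index.tolist()
```

State X has county BT values of 0.2, -0.4 and 0.1, so its median is 0.1 and it qualifies; state Y has a single county at -0.5 and does not.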

Next, wrap each of the two methods in a function and time them:

def method1():
    bt = df1.groupby(['state','county'])[['candidate','total_votes']].apply(BT)
    return bt.reset_index().groupby('state').filter(lambda x:x.BT.median()>0)[['state']].drop_duplicates()

def method2():
    biden_df = df1[df1['candidate']=='Joe Biden'][['state','county','total_votes']]
    trump_df = df1[df1['candidate']=='Donald Trump'][['state','county','total_votes']]
    sum_df = df1.groupby(['state','county'])[['total_votes']].sum().reset_index()
    res = biden_df.merge(trump_df, on=['state','county'] ,suffixes=('_biden','_trump')).merge(sum_df, on=['state','county'])
    res['BT'] = (res['total_votes_biden']-res['total_votes_trump'])/res['total_votes']
    return res[['state','BT']].groupby('state').filter(lambda x:x['BT'].median()>0)[['state']].drop_duplicates()

%timeit method1()

6.56 s ± 210 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit method2()

90.9 ms ± 4.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

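Method 2 is split into more steps, but because it never calls a custom function through apply it is far faster: about 91 ms per run against 6.56 s for Method 1.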

【Task 4】Computing the distance matrix between cities

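【Problem】The data gives the longitude and latitude of a number of cities. Build a distance DataFrame whose row and column indexes are the city names and whose values form a matrix $M$, where $M_{ij}$ is the spherical distance between city i and city j (the geodesic function in the distance module of the geopy package can be used). The distance from a city to itself is defined as 0.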

My solution :

Read the table, splitting on commas, and rename the columns:

df = pd.read_table('map.txt', sep=',', names=['city1','longitude','latitude'], skiprows=1)
df.head()

	city1	longitude	latitude
0	沈陽市	123.429092	41.796768
1	長春市	125.324501	43.886841
2	哈爾濱市	126.642464	45.756966
3	北京市	116.405289	39.904987
4	天津市	117.190186	39.125595

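Pack latitude and longitude into a (latitude, longitude) tuple for the distance computation later.
Then drop the two original coordinate columns.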

df['coord1'] = pd.Series([*zip(df.latitude, df.longitude)])

df.drop(columns=['longitude','latitude'], inplace=True)
df.head(3)

	city1	coord1
0	沈陽市	(41.796768, 123.429092)
1	長春市	(43.886841, 125.324501)
2	哈爾濱市	(45.756966, 126.64246399999999)

Make a copy of df with renamed columns to tell the two apart, and set city2 as the index in preparation for the pivot table:

df2 = df.rename(columns={'city1':'city2','coord1':'coord2'}).set_index('city2')
df2.head(3)

	coord2
city2
沈陽市	(41.796768, 123.429092)
長春市	(43.886841, 125.324501)
哈爾濱市	(45.756966, 126.64246399999999)

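Expand df against df2. Grouping df by both of its columns looks pointless, since every group is a single row, but apply then attaches a full copy of df2 to each group. Move coord1 from the index back into the data columns, use stack to move the columns down into rows, reset the index, and name the unnamed column coords. That column now holds all the coordinates, and the table is ready for pivoting.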

df_expand = df.groupby(['city1','coord1']).apply(lambda x:df2).reset_index(1).stack().reset_index().rename(columns={0:'coords'})
df_expand.head(3)

	city1	city2	level_2	coords
0	上海市	沈陽市	coord1	(31.231707, 121.472641)
1	上海市	沈陽市	coord2	(41.796768, 123.429092)
2	上海市	長春市	coord1	(31.231707, 121.472641)

Import the distance function geodesic.
Pivot the prepared table, apply geodesic to the pair of coordinates in each cell to get the distance in km, and round to two decimals.

from geopy.distance import geodesic
df_expand.pivot_table(values = 'coords',
                      index = 'city1',
                      columns = 'city2',
                      aggfunc = lambda x : geodesic(*x).km
                     ).round(2).head(3)

city2	上海市	烏魯木齊市	蘭州市	北京市	南京市	南寧市	南昌市	臺北市	合肥市	呼和浩特市	...	福州市	西寧市	西安市	貴陽市	鄭州市	重慶市	銀川市	長春市	長沙市	香港
city1
上海市	0.00	3272.69	1718.73	1065.83	271.87	1601.34	608.49	687.22	403.87	1378.10	...	609.42	1912.40	1220.03	1527.44	827.47	1449.73	1606.18	1444.77	887.53	1229.26
烏魯木齊市	3272.69	0.00	1627.85	2416.78	3010.73	3005.60	3023.06	3708.72	2907.56	2009.15	...	3466.84	1443.79	2120.86	2572.82	2448.76	2305.96	1666.74	3004.86	2850.22	3415.02
蘭州市	1718.73	1627.85	0.00	1182.61	1447.27	1529.94	1397.55	2085.95	1326.02	870.72	...	1841.48	194.64	506.74	1086.41	904.20	765.89	343.02	2023.00	1226.07	1826.19

3 rows × 34 columns
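For comparison, the whole matrix can also be computed in one vectorized step with NumPy broadcasting. The sketch below swaps in the haversine great-circle formula on a sphere instead of geopy's ellipsoidal geodesic, so its numbers will differ slightly from the table above; the function name haversine_matrix and the mean Earth radius of 6371.0088 km are my own assumptions.

```python
import numpy as np
import pandas as pd

def haversine_matrix(cities, lat_deg, lon_deg, radius_km=6371.0088):
    """Pairwise great-circle distances (km) on a sphere of the given radius."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    dlat = lat[:, None] - lat[None, :]   # all pairwise latitude differences
    dlon = lon[:, None] - lon[None, :]   # all pairwise longitude differences
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    dist = 2 * radius_km * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
    return pd.DataFrame(dist, index=cities, columns=cities)

# Coordinates of three cities taken from the table read above.
M = haversine_matrix(["沈陽市", "長春市", "北京市"],
                     [41.796768, 43.886841, 39.904987],
                     [123.429092, 125.324501, 116.405289])
```

The diagonal comes out as exactly 0 and the matrix is symmetric by construction, which matches the requirements of the task without any special handling.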

Wrap all of the steps above in a function and time it:

def calculate_M():
    df = pd.read_table('map.txt', sep=',', names=['city1','longitude','latitude'], skiprows=1)
    df['coord1'] = pd.Series([*zip(df.latitude, df.longitude)])
    df.drop(columns=['longitude','latitude'], inplace=True)
    df2 = df.rename(columns={'city1':'city2', 'coord1':'coord2'}).set_index('city2')
    df_expand = df.groupby(['city1','coord1']).apply(lambda x:df2).reset_index(1).stack().reset_index().rename(columns={0:'coords'})
    return df_expand.pivot_table(values = 'coords',
                                  index = 'city1',
                                  columns = 'city2',
                                  aggfunc = lambda x : geodesic(*x).km
                                 ).round(2)

%timeit -n 10 calculate_M()

395 ms ± 23.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Appendix:

The `geopy` package

Looking up a location by city name

Create a geolocator:

from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36")

Look up a location by name:

location = geolocator.geocode("南京市雨花臺區")
location.address

'雨花臺區, 南京市, 江蘇省, 中國'

Longitude:

location.longitude

118.7724224

Latitude:

location.latitude

31.9932018

Look up a location from coordinates:

location = geolocator.reverse("31.997858805465647, 118.78544536405718")

location.address

'雨花東路, 雨花臺區, 建鄴區, 南京市, 江蘇省, 21006, 中國'

location.raw

{'place_id': 134810031,
 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
 'osm_type': 'way',
 'osm_id': 189414212,
 'lat': '31.99705152324867',
 'lon': '118.78513775762214',
 'display_name': '雨花東路, 雨花臺區, 建鄴區, 南京市, 江蘇省, 21006, 中國',
 'address': {'road': '雨花東路',
  'suburb': '雨花臺區',
  'city': '建鄴區',
  'state': '江蘇省',
  'postcode': '21006',
  'country': '中國',
  'country_code': 'cn'},
 'boundingbox': ['31.9964788', '31.9989487', '118.7819222', '118.7866616']}

Computing distances:

from geopy.distance import distance, geodesic

wellington, salamanca = (-41.32, 174.81), (40.96, -5.50)
distance(wellington, salamanca, ellipsoid='GRS-80').miles

12402.369702934551

shanghai, beijing = (31.235929042252014,121.48053886017651), (39.910924547299565,116.4133836971231)
distance(shanghai, beijing).km

1065.985103985533

geodesic(shanghai, beijing).km

1065.985103985533

So the dataset for Task 4 can be built by hand:

cities = df['city1']
cities.head()

0     沈陽市
1     長春市
2    哈爾濱市
3     北京市
4     天津市
Name: city1, dtype: object

def get_lon_lat(city):
    location = geolocator.geocode(city)
    return location.longitude,location.latitude
longitude, latitude = [*zip(*[get_lon_lat(city) for city in cities])]
data = pd.DataFrame({'city':cities, 'longitude':longitude, 'latitude':latitude})
data.head()

	city	longitude	latitude
0	沈陽市	123.458674	41.674989
1	長春市	125.317122	43.813074
2	哈爾濱市	126.530400	45.798827
3	北京市	116.718521	39.902080
4	天津市	117.195107	39.085673
