一、聚合函式
(1)通過numpy或者pandas中統計分析方法;
(2)agg([np.sum,np.mean])
或agg({‘columns’:np.sum,np.mean],‘columns2’:np.sum,np.mean]})
(3)transform:轉換只有一個引數func
transform聚合方法:
在pandas或者numpy中沒有現成的函式可以使用,可以通過transform使用自定義的函式
data=pd.read_excel(r'meal_order_detail.xlsx')
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2779 entries, 0 to 2778
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 detail_id 2779 non-null int64
1 order_id 2779 non-null int64
2 dishes_id 2779 non-null int64
3 logicprn_name 0 non-null float64
4 parent_class_name 0 non-null float64
5 dishes_name 2779 non-null object
6 itemis_add 2779 non-null int64
7 counts 2779 non-null int64
8 amounts 2779 non-null int64
9 cost 0 non-null float64
10 place_order_time 2779 non-null datetime64[ns]
11 discount_amt 0 non-null float64
12 discount_reason 0 non-null float64
13 kick_back 0 non-null float64
14 add_inprice 2779 non-null int64
15 add_info 0 non-null float64
16 bar_code 0 non-null float64
17 picture_file 2779 non-null object
18 emp_id 2779 non-null int64
dtypes: datetime64[ns](1), float64(8), int64(8), object(2)
memory usage: 412.6+ KB
None
# 寫法一:
data['counts'].transform(lambda x:x*2)
0 2
1 2
2 2
3 2
4 2
..
2774 2
2775 2
2776 2
2777 2
2778 2
Name: counts, Length: 2779, dtype: int64
# 寫法二:
def transform1(value):
values=value*2
return values
data['counts'].transform(transform1)
0 2
1 2
2 2
3 2
4 2
..
2774 2
2775 2
2776 2
2777 2
2778 2
Name: counts, Length: 2779, dtype: int64
二、分組:
groupby(by=‘columns1’)
三、透視表
index:行分組鍵,分完組以后,分組鍵的取值在行索引的位置;
aggfunc:聚合函式==>和agg方法一致;
values:指定想要進行聚合的列;
columns:列分組鍵,分完組之后,分組鍵的取值,在列索引的位置;
fill_value:將資料為np.nan的值填充為對應的值;
margins:表示匯總開關,默認是False;
margins_name:‘ALL’,匯總的列或者行的columns,index的索引;
pd.pivot_table(data,
index=['order_id','amounts'],
aggfunc=[np.mean,np.sum],
values=['counts','dishes_name']).head()
| mean | sum | ||
|---|---|---|---|
| counts | counts | ||
| order_id | amounts | ||
| 137 | 1 | 4.0 | 4 |
| 6 | 1.0 | 1 | |
| 26 | 1.0 | 1 | |
| 27 | 1.0 | 1 | |
| 35 | 1.0 | 1 |
pd.pivot_table(data,
columns=['order_id'],
aggfunc=[np.mean,np.sum],
values=['counts','add_inprice']).head()
| mean | ... | sum | |||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| order_id | 137 | 165 | 166 | 171 | 177 | 193 | 201 | 203 | 239 | 242 | ... | 1290 | 1293 | 1298 | 1302 | 1303 | 1309 | 1314 | 1317 | 1319 | 1323 |
| add_inprice | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.00 | 0.0 | 0.0 | 0.000000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| counts | 1.5 | 1.166667 | 1.4 | 1.428571 | 1.0 | 1.0 | 1.25 | 1.0 | 1.0 | 1.833333 | ... | 15 | 5 | 9 | 22 | 9 | 15 | 12 | 18 | 9 | 15 |
2 rows × 556 columns
pd.pivot_table(data,
columns=['order_id'],
index='dishes_name',
values='counts',
aggfunc=np.sum,
fill_value=0,
margins=True
).head()
| order_id | 137 | 165 | 166 | 171 | 177 | 193 | 201 | 203 | 239 | 242 | ... | 1293 | 1298 | 1302 | 1303 | 1309 | 1314 | 1317 | 1319 | 1323 | All |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dishes_name | |||||||||||||||||||||
| 42度海之藍 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 北冰洋汽水 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 45 |
| 38度劍南春 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 6 |
| 50度古井貢酒 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 52度瀘州老窖 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 |
5 rows × 279 columns
四、交叉表
這里面不傳表,因此行索引,列索引,值都得從表開始
這個方法必須行列都傳
pd.crosstab(index=data['order_id'],
columns=data['dishes_name'],
values=data['counts'],
aggfunc=np.sum).head()
| dishes_name | 42度海之藍 | 北冰洋汽水 | 38度劍南春 | 50度古井貢酒 | 52度瀘州老窖 | 53度茅臺 | 一品香酥藕 | 三絲鱔魚 | 三色涼拌手撕兔 | 不加一滴油的酸奶蛋糕 | ... | 香辣腐乳炒蝦 | 香酥兩吃大蝦 | 魚香肉絲拌面 | 鮮美鱔魚 | 雞蛋、肉末腸粉 | 麻辣小龍蝦 | 黃尾袋鼠西拉子紅葡萄酒 | 黃油曲奇餅干 | 黃花菜炒木耳 | 黑米戀上葡萄 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| order_id | |||||||||||||||||||||
| 137 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN |
| 165 | NaN | NaN | 1.0 | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN |
| 166 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 171 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 177 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 156 columns
五、表格合并方法
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
print(left)
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
print(right)
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
concat引數:
join='outer’外連接,求并集,默認
join='inner’內連接,求交集
# 表格縱向拼接
pd.concat((left,right),axis=0,join='outer')
| key1 | key2 | A | B | C | D | |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | NaN | NaN |
| 1 | K0 | K1 | A1 | B1 | NaN | NaN |
| 2 | K1 | K0 | A2 | B2 | NaN | NaN |
| 3 | K2 | K1 | A3 | B3 | NaN | NaN |
| 0 | K0 | K0 | NaN | NaN | C0 | D0 |
| 1 | K1 | K0 | NaN | NaN | C1 | D1 |
| 2 | K1 | K0 | NaN | NaN | C2 | D2 |
| 3 | K2 | K0 | NaN | NaN | C3 | D3 |
# 表格橫向拼接,一般不用,因為是找索引,索引亂的話合并是亂的
pd.concat((left,right),axis=1)
| key1 | key2 | A | B | key1 | key2 | C | D | |
|---|---|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | K0 | K0 | C0 | D0 |
| 1 | K0 | K1 | A1 | B1 | K1 | K0 | C1 | D1 |
| 2 | K1 | K0 | A2 | B2 | K1 | K0 | C2 | D2 |
| 3 | K2 | K1 | A3 | B3 | K2 | K0 | C3 | D3 |
merge:主鍵合并方法 橫向拼接
解決行索引沒有意義情況下,資料行不匹配問題(解決concat中橫向拼接問題)
on:必須是兩張表公共的欄位,才能作為主鍵
how:連接方式,默認是inner 內連接,取交集
how=‘outer’,外連接,取并集
how=‘left’,左連接,左表為主,只關心左表的主鍵,不管右表
how=‘right’,右連接,右表為主,只關心右表的主鍵,不管左表
pd.merge(left,right,on='key1')
| key1 | key2_x | A | B | key2_y | C | D | |
|---|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | K0 | C0 | D0 |
| 1 | K0 | K1 | A1 | B1 | K0 | C0 | D0 |
| 2 | K1 | K0 | A2 | B2 | K0 | C1 | D1 |
| 3 | K1 | K0 | A2 | B2 | K0 | C2 | D2 |
| 4 | K2 | K1 | A3 | B3 | K0 | C3 | D3 |
多個主鍵on=[ , ]
pd.merge(left,right,on=['key1','key2'],how='outer')
| key1 | key2 | A | B | C | D | |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K0 | K1 | A1 | B1 | NaN | NaN |
| 2 | K1 | K0 | A2 | B2 | C1 | D1 |
| 3 | K1 | K0 | A2 | B2 | C2 | D2 |
| 4 | K2 | K1 | A3 | B3 | NaN | NaN |
| 5 | K2 | K0 | NaN | NaN | C3 | D3 |
當兩個表中主鍵名字不一樣的時候,不能使用on值
left_on:指定左表中的主鍵;
right_on:指定右表中的主鍵;
左表key1,右表中key2,作為主鍵
也就是左表的key1 找右表的key2
pd.merge(left,right,left_on='key1',right_on='key2',how='right')
| key1_x | key2_x | A | B | key1_y | key2_y | C | D | |
|---|---|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | K0 | K0 | C0 | D0 |
| 1 | K0 | K1 | A1 | B1 | K0 | K0 | C0 | D0 |
| 2 | K0 | K0 | A0 | B0 | K1 | K0 | C1 | D1 |
| 3 | K0 | K1 | A1 | B1 | K1 | K0 | C1 | D1 |
| 4 | K0 | K0 | A0 | B0 | K1 | K0 | C2 | D2 |
| 5 | K0 | K1 | A1 | B1 | K1 | K0 | C2 | D2 |
| 6 | K0 | K0 | A0 | B0 | K2 | K0 | C3 | D3 |
| 7 | K0 | K1 | A1 | B1 | K2 | K0 | C3 | D3 |
更改表格名稱的方法
inplace=True對原表進行修改,默認為False,不修改原表
df.rename(columns={‘old_column’:‘new_column’})
left.rename(columns={'key1':'鍵1'},inplace=False)
| 鍵1 | key2 | A | B | |
|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 |
| 1 | K0 | K1 | A1 | B1 |
| 2 | K1 | K0 | A2 | B2 |
| 3 | K2 | K1 | A3 | B3 |
六、重疊合并,將殘缺的表合并成完整的
dict1 = {'ID':[1,2,3,4,5,6,7,8,9],
'System':['W10','w10',np.nan,'w10',np.nan,np.nan,'w7','w7','w8']}
dict2 = {'ID':[1,2,3,4,5,6,7,8,9],
'System':[np.nan,np.nan,'w7','w7','w7','w7','w8',np.nan,np.nan]}
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
print(df1)
ID System
0 1 W10
1 2 w10
2 3 NaN
3 4 w10
4 5 NaN
5 6 NaN
6 7 w7
7 8 w7
8 9 w8
print(df2)
ID System
0 1 NaN
1 2 NaN
2 3 w7
3 4 w7
4 5 w7
5 6 w7
6 7 w8
7 8 NaN
8 9 NaN
# 用df2表補全 df1,df1缺少4、5行,就將df2的4、5行給df1
# 誰寫到前面,誰為主,只補主表空缺,當df2資料與df1不一致時,以df1為主
df1.combine_first(df2)
| ID | System | |
|---|---|---|
| 0 | 1 | W10 |
| 1 | 2 | w10 |
| 2 | 3 | w7 |
| 3 | 4 | w10 |
| 4 | 5 | w7 |
| 5 | 6 | w7 |
| 6 | 7 | w7 |
| 7 | 8 | w7 |
| 8 | 9 | w8 |
轉載請註明出處,本文鏈接:https://www.uj5u.com/houduan/245225.html
標籤:python
上一篇:pandas的檔案讀取和保存
