本篇博客將會給出大家平時使用pandas的時候經常需要用到的功能代碼,同時也會給出運行結果,以幫助大家更進一步的理解,
另外,我也以注釋的形式更進一步的補充說明代碼的功能及其作用,需要本篇博文中用到的檔案檔案以及代碼的朋友,也可以三連支持一下,并評論留下你的郵箱,我會在看到后的第一時間發送給你,
當然啦,你也可以把本篇博文當作一本小小的pandas書籍,當需要用到pandas哪些知識的時候,Ctrl+F就可以搜索到啦,現在不看的話就先收藏著,
目錄
第一部分:pandas資料結構
第二部分:資料查看 第三部分:資料輸入與輸出 3.1 csv 3.2 Excel 3.3 HDF5 3.4 SQL
第四部分:資料選取 4.1 獲取資料 4.2 標簽選擇 4.3 位置選擇 4.4 boolean索引 4.5 賦值操作
第五部分:資料集成 5.1 concat資料串聯 5.2 資料插入 5.3 Join SQL風格合并
第六部分:資料清洗 第七部分:資料轉換 7.1 軸和元素轉換 7.2 map映射元素轉變 7.3 apply映射元素轉變 7.4 transform元素轉變 7.5 重排隨機抽樣啞變數
第八部分:資料重塑 第九部分:數學和統計方法 9.1 簡單統計指標 9.2 索引標簽、位置獲取 9.3 更多統計指標 9.4 高級統計指標
第十部分:排序 結束語
第一部分:pandas資料結構
import numpy as np
import pandas as pd # pandas基于NumPy,升級
pandas的主要資料結構是 Series(?維資料)與 DataFrame(二維資料),
1.1 Series
# Series
l = np. array( [ 1 , 2 , 3 , 6 , 9 ] ) # NumPy陣列
s1 = pd. Series( data = l)
display( l, s1) # Series是一維的陣列,和NumPy陣列不一樣:Series多了索引
array([1, 2, 3, 6, 9])
0 1
1 2
2 3
3 6
4 9
dtype: int64
s2 = pd. Series( data = l, index = list ( 'ABCDE' ) )
s2
A 1
B 2
C 3
D 6
E 9
dtype: int64
s3 = pd. Series( data = { 'A' : 149 , 'B' : 130 , 'C' : 118 , 'D' : 99 , 'E' : 66 } )
s3
A 149
B 130
C 118
D 99
E 66
dtype: int64
1.2 DataFrame
# Series是一維的,功能比較少
# DataFrame是二維的,多個Series公用索引,組成了DataFrame
# 像不像 Excel,所有資料,結構化
df1 = pd. DataFrame( data = np. random. randint( 0 , 151 , size = ( 10 , 3 ) ) ,
index = list ( 'ABCDEFHIJK' ) , # 行索引
columns= [ 'Python' , 'Math' , 'En' ] , dtype= np. float16) # 列索引
df1
Python Math En A 113.0 37.0 70.0 B 92.0 22.0 11.0 C 0.0 9.0 66.0 D 40.0 145.0 23.0 E 25.0 133.0 108.0 F 124.0 16.0 130.0 H 121.0 85.0 133.0 I 84.0 125.0 39.0 J 111.0 36.0 137.0 K 55.0 26.0 85.0
df2 = pd. DataFrame( data = { 'Python' : [ 66 , 99 , 128 ] , 'Math' : [ 88 , 65 , 137 ] , 'En' : [ 100 , 121 , 45 ] } )
df2 # 字典,key作為列索引,不指定index默認從0開始索引,自動索引一樣
Python Math En 0 66 88 100 1 99 65 121 2 128 137 45
第二部分:資料查看
df = pd. DataFrame( data = np. random. randint( 0 , 151 , size = ( 100 , 3 ) ) ,
columns= [ 'Python' , 'Math' , 'En' ] )
df
Python Math En 0 133 139 141 1 82 17 130 2 51 51 145 3 127 70 11 4 93 60 91 ... ... ... ... 95 57 133 96 96 91 21 134 97 76 109 113 98 99 82 29 99 28 54 88
100 rows × 3 columns
df. shape # 查看DataFrame形狀
(100, 3)
df. head( n = 3 ) # 顯示前N個,默認N = 5
Python Math En 0 133 139 141 1 82 17 130 2 51 51 145
df. tail( ) # 顯示后n個
Python Math En 95 57 133 96 96 91 21 134 97 76 109 113 98 99 82 29 99 28 54 88
df. dtypes # 資料型別
Python int64
Math int64
En int64
dtype: object
df. info( ) # 比較詳細資訊
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Python 100 non-null int64
1 Math 100 non-null int64
2 En 100 non-null int64
dtypes: int64(3)
memory usage: 2.5 KB
df. describe( ) # 描述:平均值、標準差、中位數、四等分、最大值,最小值
Python Math En count 100.000000 100.000000 100.000000 mean 85.790000 77.410000 67.630000 std 41.375173 44.905309 43.883835 min 3.000000 0.000000 3.000000 25% 54.500000 40.250000 31.250000 50% 84.500000 81.000000 58.500000 75% 123.000000 113.250000 103.000000 max 149.000000 149.000000 147.000000
df. values # 值,回傳的是NumPy陣列
array([[133, 139, 141],
[ 82, 17, 130],
[ 51, 51, 145],
[127, 70, 11],
[ 93, 60, 91],
[103, 110, 103],
[ 27, 133, 32],
[148, 99, 128],
[139, 97, 44],
[ 64, 85, 71],
[147, 94, 37],
[114, 12, 16],
[ 16, 54, 44],
[123, 3, 76],
[137, 97, 123],
[149, 113, 74],
[ 69, 38, 7],
[ 68, 122, 4],
[ 53, 13, 47],
[113, 127, 124],
[ 55, 139, 47],
[140, 114, 14],
[ 84, 111, 115],
[ 65, 5, 136],
[ 96, 50, 89],
[145, 130, 15],
[111, 30, 66],
[132, 122, 144],
[ 79, 5, 45],
[115, 29, 49],
[ 27, 55, 83],
[ 29, 74, 38],
[ 87, 100, 45],
[132, 147, 119],
[ 66, 90, 40],
[ 67, 108, 48],
[ 78, 28, 46],
[105, 137, 110],
[132, 119, 55],
[117, 23, 79],
[ 12, 29, 12],
[114, 58, 119],
[139, 0, 42],
[ 61, 69, 142],
[141, 73, 107],
[ 49, 12, 19],
[ 8, 1, 75],
[134, 60, 25],
[138, 80, 79],
[112, 115, 26],
[ 77, 4, 120],
[140, 100, 35],
[ 82, 129, 4],
[100, 8, 25],
[ 77, 97, 78],
[ 55, 113, 53],
[ 45, 73, 37],
[ 44, 0, 80],
[ 26, 74, 52],
[ 99, 75, 147],
[111, 8, 144],
[ 55, 146, 15],
[140, 106, 74],
[ 91, 78, 92],
[130, 108, 41],
[ 34, 41, 136],
[ 3, 139, 4],
[123, 93, 4],
[ 24, 103, 3],
[ 44, 122, 92],
[ 83, 45, 50],
[ 46, 149, 103],
[ 48, 127, 92],
[ 3, 51, 57],
[136, 136, 82],
[ 65, 102, 16],
[ 23, 61, 118],
[138, 15, 6],
[ 83, 91, 4],
[109, 24, 54],
[ 40, 43, 125],
[103, 123, 141],
[116, 113, 38],
[137, 71, 126],
[ 69, 143, 83],
[ 8, 60, 60],
[ 40, 22, 95],
[ 73, 19, 17],
[137, 129, 103],
[109, 142, 94],
[ 85, 105, 10],
[ 97, 107, 19],
[ 79, 12, 27],
[143, 74, 18],
[ 32, 114, 52],
[ 57, 133, 96],
[ 91, 21, 134],
[ 76, 109, 113],
[ 99, 82, 29],
[ 28, 54, 88]])
df. columns # 列索引
Index(['Python', 'Math', 'En'], dtype='object')
df. index # 行索引 0 ~ 99
RangeIndex(start=0, stop=100, step=1)
第三部分:資料輸入與輸出
3.1 csv
df = pd. DataFrame( data = np. random. randint( 0 , 151 , size = ( 100 , 3 ) ) ,
columns= [ 'Python' , 'Math' , 'En' ] )
df # 行索引,列索引
Python Math En 0 10 128 54 1 45 47 74 2 103 33 133 3 70 24 81 4 90 143 121 ... ... ... ... 95 145 25 139 96 53 51 109 97 35 7 130 98 86 51 20 99 149 66 75
100 rows × 3 columns
df. to_csv( './data.csv' , sep = ',' ,
index = True , # 保存行索引
header= True ) # 保存列索引
df. to_csv( './data2.csv' , sep = ',' ,
index = False , # 不保存行索引
header= False ) # 不保存列索引
pd. read_csv( './data.csv' ,
index_col= 0 ) # 第一列作為行索引
Python Math En 0 10 128 54 1 45 47 74 2 103 33 133 3 70 24 81 4 90 143 121 ... ... ... ... 95 145 25 139 96 53 51 109 97 35 7 130 98 86 51 20 99 149 66 75
100 rows × 3 columns
pd. read_csv( './data2.csv' , header = None )
0 1 2 0 10 128 54 1 45 47 74 2 103 33 133 3 70 24 81 4 90 143 121 ... ... ... ... 95 145 25 139 96 53 51 109 97 35 7 130 98 86 51 20 99 149 66 75
100 rows × 3 columns
3.2 Excel
df
Python Math En 0 10 128 54 1 45 47 74 2 103 33 133 3 70 24 81 4 90 143 121 ... ... ... ... 95 145 25 139 96 53 51 109 97 35 7 130 98 86 51 20 99 149 66 75
100 rows × 3 columns
df. to_excel( './data.xls' )
pd. read_excel( './data.xls' ,
index_col= 0 ) # 第一列作為行索引
Python Math En 0 10 128 54 1 45 47 74 2 103 33 133 3 70 24 81 4 90 143 121 ... ... ... ... 95 145 25 139 96 53 51 109 97 35 7 130 98 86 51 20 99 149 66 75
100 rows × 3 columns
3.3 HDF5
df. to_hdf( './data.h5' , key = 'score' )
df2 = pd. DataFrame( data = np. random. randint( 6 , 100 , size = ( 1000 , 5 ) ) ,
columns= [ '計算機' , '化工' , '生物' , '工程' , '教師' ] )
df2
計算機 化工 生物 工程 教師 0 64 22 16 68 60 1 95 47 72 76 37 2 88 48 92 50 37 3 75 38 8 63 83 4 62 14 20 21 45 ... ... ... ... ... ... 995 89 94 88 97 27 996 52 68 21 8 50 997 76 99 10 92 56 998 66 31 55 65 94 999 8 21 38 89 14
1000 rows × 5 columns
df2. to_hdf( './data.h5' , key = 'salary' )
pd. read_hdf( './data.h5' , key = 'salary' )
計算機 化工 生物 工程 教師 0 64 22 16 68 60 1 95 47 72 76 37 2 88 48 92 50 37 3 75 38 8 63 83 4 62 14 20 21 45 ... ... ... ... ... ... 995 89 94 88 97 27 996 52 68 21 8 50 997 76 99 10 92 56 998 66 31 55 65 94 999 8 21 38 89 14
1000 rows × 5 columns
3.4 SQL
from sqlalchemy import create_engine # 資料庫引擎,構建和資料庫的連接
# PyMySQL
# 類似網頁地址
engine = create_engine( 'mysql+pymysql://root:12345678@localhost/pandas?charset=utf8' )
df2. to_sql( 'salary' , engine, index= False ) # 將Python中資料DataFrame保存到Mysql
df3 = pd. read_sql( 'select * from salary limit 50' , con = engine)
df3
計算機 化工 生物 工程 教師 0 64 22 16 68 60 1 95 47 72 76 37 2 88 48 92 50 37 3 75 38 8 63 83 4 62 14 20 21 45 5 95 41 84 37 16 6 34 45 11 93 94 7 99 10 57 30 63 8 59 60 12 37 93 9 41 58 15 67 70 10 19 8 63 96 64 11 75 61 78 49 89 12 86 84 31 68 27 13 42 98 24 20 85 14 95 15 97 80 87 15 52 22 44 35 74 16 67 20 65 24 10 17 9 46 41 62 66 18 86 76 80 72 19 19 61 81 64 26 6 20 77 92 84 18 7 21 87 16 75 14 34 22 23 82 92 42 32 23 61 89 28 21 40 24 22 12 38 89 14 25 77 12 46 89 12 26 60 45 52 71 67 27 29 76 94 26 91 28 14 60 82 88 60 29 56 36 44 60 37 30 63 77 43 42 82 31 25 71 36 51 21 32 76 86 87 83 8 33 93 59 86 25 78 34 73 40 12 86 66 35 10 30 54 13 71 36 9 48 58 75 85 37 81 41 61 12 55 38 80 68 66 92 84 39 53 36 84 26 66 40 19 62 63 47 45 41 89 39 91 31 86 42 57 43 53 48 19 43 66 16 23 19 10 44 46 28 78 81 21 45 38 53 76 49 8 46 55 94 70 6 44 47 56 33 92 17 84 48 69 68 23 87 90 49 12 47 32 80 15
第四部分:資料選取
4.1 獲取資料
df = pd. DataFrame( np. random. randint( 0 , 151 , size = ( 10 , 3 ) ) ,
index= list ( 'ABCDEFHIJK' ) , columns= [ 'Python' , 'Math' , 'En' ] )
df
Python Math En A 88 52 48 B 78 62 94 C 9 14 71 D 86 15 21 E 1 71 71 F 123 138 55 H 59 17 140 I 68 8 58 J 100 70 63 K 79 37 72
df[ 'Python' ] # 獲取資料Series
A 88
B 78
C 9
D 86
E 1
F 123
H 59
I 68
J 100
K 79
Name: Python, dtype: int64
df. Python # 屬性,DataFrame中列索引,表示屬性
A 88
B 78
C 9
D 86
E 1
F 123
H 59
I 68
J 100
K 79
Name: Python, dtype: int64
df[ [ 'Python' , 'En' ] ] # 獲取多列資料
Python En A 88 48 B 78 94 C 9 71 D 86 21 E 1 71 F 123 55 H 59 140 I 68 58 J 100 63 K 79 72
4.2 標簽選擇
# 標簽,就是行索引 location = loc 位置
df. loc[ 'A' ]
Python 88
Math 52
En 48
Name: A, dtype: int64
df. loc[ [ 'A' , 'F' , 'K' ] ]
Python Math En A 88 52 48 F 123 138 55 K 79 37 72
df. loc[ 'A' , 'Python' ]
88
df. loc[ [ 'A' , 'C' , 'F' ] , 'Python' ]
A 88
C 9
F 123
Name: Python, dtype: int64
df. loc[ 'A' : : 2 , [ 'Math' , 'En' ] ]
Math En A 52 48 C 14 71 E 71 71 H 17 140 J 70 63
df. loc[ 'A' : 'D' , : ]
Python Math En A 88 52 48 B 78 62 94 C 9 14 71 D 86 15 21
4.3 位置選擇
df. iloc[ 0 ]
Python 88
Math 52
En 48
Name: A, dtype: int64
df. iloc[ [ 0 , 2 , 4 ] ]
Python Math En A 88 52 48 C 9 14 71 E 1 71 71
df. iloc[ 0 : 4 , [ 0 , 2 ] ]
Python En A 88 48 B 78 94 C 9 71 D 86 21
df. iloc[ 3 : 8 : 2 ]
Python Math En D 86 15 21 F 123 138 55 I 68 8 58
4.4 boolean索引
cond = df. Python > 80 # 將Python大于80分的成績獲取
df[ cond]
Python Math En A 88 52 48 D 86 15 21 F 123 138 55 J 100 70 63
cond = df. mean( axis = 1 ) > 75 # 平均分大于75,優秀,篩選出來
df[ cond]
Python Math En B 78 62 94 F 123 138 55 J 100 70 63
cond = ( df. Python > 70 ) & ( df. Math > 70 )
df[ cond]
cond = df. index. isin( [ 'C' , 'E' , 'H' , 'K' ] ) # 判斷資料是否在陣列中
df[ cond] # 刪選出來了符合條件的資料
Python Math En C 9 14 71 E 1 71 71 H 59 17 140 K 79 37 72
4.5 賦值操作
df[ 'Python' ] [ 'A' ] = 150 # 修改某個位置的值
df
Python Math En A 150 52 48 B 78 62 94 C 9 14 71 D 86 15 21 E 1 71 71 F 123 138 55 H 59 17 140 I 68 8 58 J 100 70 63 K 79 37 72
df[ 'Java' ] = np. random. randint( 0 , 151 , size = 10 ) # 新增加一列
df
Python Math En Java A 150 52 48 65 B 78 62 94 25 C 9 14 71 82 D 86 15 21 139 E 1 71 71 67 F 123 138 55 145 H 59 17 140 53 I 68 8 58 141 J 100 70 63 11 K 79 37 72 127
df. loc[ [ 'C' , 'D' , 'E' ] , 'Math' ] = 147 # 修改多個人的成績
df
Python Math En Java A 150 52 48 65 B 78 62 94 25 C 9 147 71 82 D 86 147 21 139 E 1 147 71 67 F 123 138 55 145 H 59 17 140 53 I 68 8 58 141 J 100 70 63 11 K 79 37 72 127
cond = df < 60
df[ cond] = 60 # where 條件操作,符合這條件值,修改,不符合,不改變
df
Python Math En Java A 150 60 60 65 B 78 62 94 60 C 60 147 71 82 D 86 147 60 139 E 60 147 71 67 F 123 138 60 145 H 60 60 140 60 I 68 60 60 141 J 100 70 63 60 K 79 60 72 127
df. iloc[ 3 : : 3 , [ 0 , 2 ] ] += 100
df
Python Math En Java A 150 60 60 65 B 78 62 94 60 C 60 147 71 82 D 186 147 160 139 E 60 147 71 67 F 123 138 60 145 H 160 60 240 60 I 68 60 60 141 J 100 70 63 60 K 179 60 172 127
第五部分:資料集成
5.1 concat資料串聯
# np.concatenate NumPy資料集成
df1 = pd. DataFrame( np. random. randint( 0 , 151 , size = ( 10 , 3 ) ) ,
columns= [ 'Python' , 'Math' , 'En' ] ,
index = list ( 'ABCDEFHIJK' ) )
df2 = pd. DataFrame( np. random. randint( 0 , 151 , size = ( 10 , 3 ) ) ,
columns = [ 'Python' , 'Math' , 'En' ] ,
index = list ( 'QWRTUYOPLM' ) )
df3 = pd. DataFrame( np. random. randint( 0 , 151 , size = ( 10 , 2 ) ) ,
columns= [ 'Java' , 'Chinese' ] , index = list ( 'ABCDEFHIJK' ) )
pd. concat( [ df1, df2] , axis = 0 ) # axis = 0變是行合并,行增加
Python Math En A 108 74 53 B 98 16 47 C 71 77 128 D 9 123 131 E 25 90 132 F 105 106 86 H 146 42 81 I 83 4 36 J 102 79 8 K 92 11 47 Q 119 59 43 W 20 62 106 R 77 82 128 T 44 119 15 U 49 149 62 Y 94 90 88 O 105 72 133 P 87 109 123 L 125 140 149 M 148 22 102
pd. concat( [ df1, df3] , axis = 1 ) # axis = 1表示列增加
Python Math En Java Chinese A 108 74 53 61 81 B 98 16 47 117 117 C 71 77 128 48 4 D 9 123 131 149 115 E 25 90 132 113 73 F 105 106 86 140 26 H 146 42 81 117 118 I 83 4 36 103 91 J 102 79 8 43 20 K 92 11 47 93 72
df1. append( df2) # append追加,在行后面直接進行追加
Python C++ Math En A 108 59.0 74 53 B 98 4.0 16 47 C 71 27.0 77 128 D 9 17.0 123 131 E 25 60.0 90 132 F 105 136.0 106 86 H 146 112.0 42 81 I 83 120.0 4 36 J 102 28.0 79 8 K 92 53.0 11 47 Q 119 NaN 59 43 W 20 NaN 62 106 R 77 NaN 82 128 T 44 NaN 119 15 U 49 NaN 149 62 Y 94 NaN 90 88 O 105 NaN 72 133 P 87 NaN 109 123 L 125 NaN 140 149 M 148 NaN 22 102
df1. append( df3) # 出現空資料,原因在于:df1的列索引和df3列索引不一致
Python C++ Math En Java Chinese A 108.0 59.0 74.0 53.0 NaN NaN B 98.0 4.0 16.0 47.0 NaN NaN C 71.0 27.0 77.0 128.0 NaN NaN D 9.0 17.0 123.0 131.0 NaN NaN E 25.0 60.0 90.0 132.0 NaN NaN F 105.0 136.0 106.0 86.0 NaN NaN H 146.0 112.0 42.0 81.0 NaN NaN I 83.0 120.0 4.0 36.0 NaN NaN J 102.0 28.0 79.0 8.0 NaN NaN K 92.0 53.0 11.0 47.0 NaN NaN A NaN NaN NaN NaN 61.0 81.0 B NaN NaN NaN NaN 117.0 117.0 C NaN NaN NaN NaN 48.0 4.0 D NaN NaN NaN NaN 149.0 115.0 E NaN NaN NaN NaN 113.0 73.0 F NaN NaN NaN NaN 140.0 26.0 H NaN NaN NaN NaN 117.0 118.0 I NaN NaN NaN NaN 103.0 91.0 J NaN NaN NaN NaN 43.0 20.0 K NaN NaN NaN NaN 93.0 72.0
pd. concat( [ df1, df3] , axis = 0 )
Python C++ Math En Java Chinese A 108.0 59.0 74.0 53.0 NaN NaN B 98.0 4.0 16.0 47.0 NaN NaN C 71.0 27.0 77.0 128.0 NaN NaN D 9.0 17.0 123.0 131.0 NaN NaN E 25.0 60.0 90.0 132.0 NaN NaN F 105.0 136.0 106.0 86.0 NaN NaN H 146.0 112.0 42.0 81.0 NaN NaN I 83.0 120.0 4.0 36.0 NaN NaN J 102.0 28.0 79.0 8.0 NaN NaN K 92.0 53.0 11.0 47.0 NaN NaN A NaN NaN NaN NaN 61.0 81.0 B NaN NaN NaN NaN 117.0 117.0 C NaN NaN NaN NaN 48.0 4.0 D NaN NaN NaN NaN 149.0 115.0 E NaN NaN NaN NaN 113.0 73.0 F NaN NaN NaN NaN 140.0 26.0 H NaN NaN NaN NaN 117.0 118.0 I NaN NaN NaN NaN 103.0 91.0 J NaN NaN NaN NaN 43.0 20.0 K NaN NaN NaN NaN 93.0 72.0
5.2 資料插入
df1
Python Math En A 108 74 53 B 98 16 47 C 71 77 128 D 9 123 131 E 25 90 132 F 105 106 86 H 146 42 81 I 83 4 36 J 102 79 8 K 92 11 47
df1. insert( loc = 1 , # 插入位置
column= 'C++' , # 插入一列,這一列名字
value = np. random. randint( 0 , 151 , size = 10 ) ) # 插入的值
df1
Python C++ Math En A 108 59 74 53 B 98 4 16 47 C 71 27 77 128 D 9 17 123 131 E 25 60 90 132 F 105 136 106 86 H 146 112 42 81 I 83 120 4 36 J 102 28 79 8 K 92 53 11 47
5.3 Join SQL風格合并
df1 = pd. DataFrame( data = { 'name' : [ 'softpo' , 'Brandon' , 'Ella' , 'Daniel' , '張三' ] ,
'height' : [ 175 , 180 , 169 , 177 , 168 ] } ) # 身高
df2 = pd. DataFrame( data = { 'name' : [ 'softpo' , 'Brandon' , 'Ella' , 'Daniel' , '李四' ] ,
'weight' : [ 70 , 65 , 74 , 63 , 88 ] } ) # 體重
df3 = pd. DataFrame( data = { '名字' : [ 'softpo' , 'Brandon' , 'Ella' , 'Daniel' , '張三' ] ,
'salary' : np. random. randint( 20 , 100 , size = 5 ) } ) # 薪水
display( df1, df2, df3)
name height 0 softpo 175 1 Brandon 180 2 Ella 169 3 Daniel 177 4 張三 168
name weight 0 softpo 70 1 Brandon 65 2 Ella 74 3 Daniel 63 4 李四 88
名字 salary 0 softpo 64 1 Brandon 48 2 Ella 25 3 Daniel 26 4 張三 96
pd. concat( [ df1, df2] , axis = 1 )
name height name weight 0 softpo 175 softpo 70 1 Brandon 180 Brandon 65 2 Ella 169 Ella 74 3 Daniel 177 Daniel 63 4 張三 168 李四 88
# 根據共同的屬性,合并資料
# df1 和 df2 共同屬性:name
# 資料庫,合并join 共同key
# inner內合并
pd. merge( df1, df2, how = 'inner' ) # 根據共同name進行合并,兩表合并,外鍵
name height weight 0 softpo 175 70 1 Brandon 180 65 2 Ella 169 74 3 Daniel 177 63
pd. merge( df1, df2, how = 'outer' ) # 外合并,所有資料保留,不對應位置,填充了空資料
name height weight 0 softpo 175.0 70.0 1 Brandon 180.0 65.0 2 Ella 169.0 74.0 3 Daniel 177.0 63.0 4 張三 168.0 NaN 5 李四 NaN 88.0
pd. merge( df1, df2, how = 'left' )
name height weight 0 softpo 175 70.0 1 Brandon 180 65.0 2 Ella 169 74.0 3 Daniel 177 63.0 4 張三 168 NaN
pd. merge( df1, df3, left_on= 'name' , right_on= '名字' )
name height 名字 salary 0 softpo 175 softpo 64 1 Brandon 180 Brandon 48 2 Ella 169 Ella 25 3 Daniel 177 Daniel 26 4 張三 168 張三 96
df4 = pd. DataFrame( data = np. random. randint( 0 , 151 , size = ( 10 , 3 ) ) ,
columns= [ 'Python' , 'Math' , 'En' ] , index = list ( 'ABCDEFHIJK' ) )
df4
Python Math En A 71 7 89 B 145 116 40 C 56 150 139 D 88 66 41 E 87 139 117 F 141 45 18 H 93 119 114 I 110 89 2 J 2 35 96 K 125 59 9
score_mean = df4. mean( axis = 1 ) . round ( 1 )
score_mean
A 55.7
B 100.3
C 115.0
D 65.0
E 114.3
F 68.0
H 108.7
I 67.0
J 44.3
K 64.3
dtype: float64
df4. insert( loc = 2 , column= '平均分' , value= score_mean)
df4
Python Math 平均分 En A 71 7 55.7 89 B 145 116 100.3 40 C 56 150 115.0 139 D 88 66 65.0 41 E 87 139 114.3 117 F 141 45 68.0 18 H 93 119 108.7 114 I 110 89 67.0 2 J 2 35 44.3 96 K 125 59 64.3 9
df5 = df4. iloc[ : , [ 0 , 1 , 3 ] ]
df5
Python Math En A 71 7 89 B 145 116 40 C 56 150 139 D 88 66 41 E 87 139 117 F 141 45 18 H 93 119 114 I 110 89 2 J 2 35 96 K 125 59 9
score_mean. name = '平均分'
score_mean
A 55.7
B 100.3
C 115.0
D 65.0
E 114.3
F 68.0
H 108.7
I 67.0
J 44.3
K 64.3
Name: 平均分, dtype: float64
df5
Python Math En A 71 7 89 B 145 116 40 C 56 150 139 D 88 66 41 E 87 139 117 F 141 45 18 H 93 119 114 I 110 89 2 J 2 35 96 K 125 59 9
pd. merge( df5, score_mean,
left_index= True , # 資料合并根據行索引,對應
right_index= True ) # 右邊資料根據行索引,對應
Python Math En 平均分 A 71 7 89 55.7 B 145 116 40 100.3 C 56 150 139 115.0 D 88 66 41 65.0 E 87 139 117 114.3 F 141 45 18 68.0 H 93 119 114 108.7 I 110 89 2 67.0 J 2 35 96 44.3 K 125 59 9 64.3
第六部分:資料清洗
df = pd. DataFrame( data = { 'color' : [ 'red' , 'blue' , 'red' , 'green' , 'green' , 'blue' , None , np. NaN, 'green' ] ,
'price' : [ 20 , 15 , 20 , 18 , 18 , 22 , 30 , 30 , 22 ] } )
df
color price 0 red 20 1 blue 15 2 red 20 3 green 18 4 green 18 5 blue 22 6 None 30 7 NaN 30 8 green 22
# 重復資料洗掉
df. drop_duplicates( ) # 非重復資料,索引7和索引6重復資料,None和NaN一回事
color price 0 red 20 1 blue 15 3 green 18 5 blue 22 6 None 30 8 green 22
df
color price 0 red 20 1 blue 15 2 red 20 3 green 18 4 green 18 5 blue 22 6 None 30 7 NaN 30 8 green 22
df. dropna( ) # 空資料過濾
color price 0 red 20 1 blue 15 2 red 20 3 green 18 4 green 18 5 blue 22 8 green 22
# 洗掉行,或者列
df. drop( labels= [ 2 , 4 , 6 , 8 ] ) # 默認情況下洗掉行
color price 0 red 20 1 blue 15 3 green 18 5 blue 22 7 NaN 30
# 洗掉指定的列
df. drop( labels= 'color' , axis = 1 ) # 洗掉列,axis = 1
price 0 20 1 15 2 20 3 18 4 18 5 22 6 30 7 30 8 22
df. filter ( items= [ 'price' ] ) # 引數意思,保留資料price
price 0 20 1 15 2 20 3 18 4 18 5 22 6 30 7 30 8 22
df[ 'size' ] = 1024 # 廣播
df
color price size 0 red 20 1024 1 blue 15 1024 2 red 20 1024 3 green 18 1024 4 green 18 1024 5 blue 22 1024 6 None 30 1024 7 NaN 30 1024 8 green 22 1024
df. filter ( like = 'i' ) # 模糊匹配,保留了帶有i這個字母的索引
price size 0 20 1024 1 15 1024 2 20 1024 3 18 1024 4 18 1024 5 22 1024 6 30 1024 7 30 1024 8 22 1024
df[ 'hello' ] = 512
df
color price size hello 0 red 20 1024 512 1 blue 15 1024 512 2 red 20 1024 512 3 green 18 1024 512 4 green 18 1024 512 5 blue 22 1024 512 6 None 30 1024 512 7 NaN 30 1024 512 8 green 22 1024 512
# 正則運算式,方式很多
df. filter ( regex = 'e$' ) # 正則運算式,正則運算式,限制e必須在最后
price size 0 20 1024 1 15 1024 2 20 1024 3 18 1024 4 18 1024 5 22 1024 6 30 1024 7 30 1024 8 22 1024
df. filter ( regex= 'e' ) # 只要帶有e全部選出來
price size hello 0 20 1024 512 1 15 1024 512 2 20 1024 512 3 18 1024 512 4 18 1024 512 5 22 1024 512 6 30 1024 512 7 30 1024 512 8 22 1024 512
# 例外值過濾
a = np. random. randint( 0 , 1000 , size = 200 )
a
array([647, 871, 35, 738, 789, 587, 413, 559, 648, 993, 579, 129, 825,
904, 356, 316, 997, 800, 35, 601, 1, 208, 465, 614, 680, 619,
922, 346, 994, 135, 5, 650, 165, 475, 95, 194, 225, 455, 634,
717, 836, 678, 156, 203, 263, 180, 143, 248, 407, 56, 202, 947,
46, 408, 686, 530, 545, 273, 125, 964, 323, 775, 313, 238, 242,
804, 228, 322, 322, 768, 556, 9, 629, 938, 932, 859, 955, 707,
729, 541, 280, 493, 255, 681, 428, 992, 420, 650, 267, 32, 662,
185, 756, 319, 313, 271, 229, 711, 803, 85, 527, 853, 670, 685,
423, 458, 628, 701, 253, 495, 548, 879, 503, 115, 90, 978, 665,
532, 198, 482, 412, 850, 879, 913, 96, 177, 778, 337, 502, 128,
49, 747, 591, 22, 557, 105, 136, 775, 626, 515, 959, 869, 245,
437, 51, 236, 438, 489, 854, 49, 163, 687, 488, 175, 428, 517,
493, 377, 100, 728, 717, 926, 689, 186, 777, 639, 79, 83, 620,
623, 931, 918, 721, 315, 133, 423, 161, 999, 341, 55, 837, 582,
530, 805, 22, 301, 177, 322, 708, 14, 50, 864, 889, 929, 967,
497, 624, 127, 539, 14])
# 例外值,大于800,小于 100算作例外,認為定義的,根據實際情況,
cond = ( a <= 800 ) & ( a >= 100 )
a[ cond]
array([647, 738, 789, 587, 413, 559, 648, 579, 129, 356, 316, 800, 601,
208, 465, 614, 680, 619, 346, 135, 650, 165, 475, 194, 225, 455,
634, 717, 678, 156, 203, 263, 180, 143, 248, 407, 202, 408, 686,
530, 545, 273, 125, 323, 775, 313, 238, 242, 228, 322, 322, 768,
556, 629, 707, 729, 541, 280, 493, 255, 681, 428, 420, 650, 267,
662, 185, 756, 319, 313, 271, 229, 711, 527, 670, 685, 423, 458,
628, 701, 253, 495, 548, 503, 115, 665, 532, 198, 482, 412, 177,
778, 337, 502, 128, 747, 591, 557, 105, 136, 775, 626, 515, 245,
437, 236, 438, 489, 163, 687, 488, 175, 428, 517, 493, 377, 100,
728, 717, 689, 186, 777, 639, 620, 623, 721, 315, 133, 423, 161,
341, 582, 530, 301, 177, 322, 708, 497, 624, 127, 539])
# 正態分布,平均值是0,標準差是1
b = np. random. randn( 100000 )
b
array([-1.17335196, 2.02215212, -0.29891071, ..., -1.6762474 ,
-1.27071523, -1.15187761])
# 過濾例外值
cond = np. abs ( b) > 3 * 1 # 這些例外值,找到了
b[ cond]
array([ 3.46554243, 3.08127362, 3.55119821, 3.62774922, 3.11823028,
3.22620922, -3.10381164, -3.20067563, -3.04607325, -3.04427703,
3.09111414, -3.28220862, 3.00499105, -3.06179762, -3.17331972,
-3.37172359, 3.93766782, -3.22895232, -3.13737479, 3.07612751,
-3.43215209, -3.27660651, -3.35116041, 4.74328695, 3.25586636,
-3.54090785, 3.08881127, 3.00635551, 3.5018534 , -3.14463788,
-3.0182886 , -3.12145648, -3.24276219, 3.08087834, 3.04820238,
-3.24173442, -3.14648209, 3.87748281, -3.07660111, -3.16083928,
3.32641202, -3.05228179, 3.04924043, 3.02825131, -3.08360056,
-3.04890894, -3.27258041, -3.07339115, -3.38375287, -3.14267022,
-3.7207377 , 3.4813841 , -3.12866105, -3.17122631, 3.0599701 ,
3.12393087, 3.20253178, -3.05221958, -3.35532417, 3.02450167,
-3.28385568, 3.3422833 , -3.11052755, -3.09647003, 3.32353664,
-3.70215812, -3.07916575, -3.13546874, 3.20575826, -3.67982084,
-3.17055893, 3.4836615 , -3.30039879, -3.27774497, 3.02125912,
3.12332885, 3.01456477, 3.15958151, -3.34101369, 3.32444673,
3.06479889, 3.14506863, 3.15670827, 3.15066995, 3.14705869,
-3.20526898, -3.0761338 , 3.20716127, -3.20941307, -3.7212859 ,
-3.51785834, -3.06096986, -3.05425748, -3.47049261, 3.22285172,
-3.32233224, -3.04630606, 3.41215312, -3.16482337, -3.01813609,
-3.05441573, -3.10394416, 3.03469642, 3.01493847, -3.11901071,
3.5996865 , 3.48194227, -3.77734847, 3.04588004, 3.10611158,
-3.20473003, -3.4377999 , 3.22680244, -3.1536921 , -3.22798726,
3.34569796, 3.06046948, -3.16955677, 3.12613756, 3.04286964,
3.01148054, 3.18525226, -4.08971624, -3.55427596, -5.39879049,
3.05203254, 3.08944491, -3.02258209, 3.17316913, -3.1615401 ,
3.17205118, -3.24221772, -3.14421237, -3.74675036, 3.61678522,
3.59097443, -3.0302881 , 3.23236707, -3.00850012, 3.33608986,
-3.02859152, -3.7000766 , -3.10992575, -3.00412636, -3.05657102,
-3.05208781, 3.14017797, 3.46457731, 3.15619413, -3.43236114,
3.08259529, -3.84578168, 3.04203424, -3.29444028, -3.01764756,
3.11300256, 3.23071233, 3.20785451, -3.15668756, 3.44176099,
-3.19985577, -3.14126853, -3.26482841, -3.62208271, -3.55305069,
3.09639491, -3.18178713, -3.03662021, 3.17247227, 3.3908074 ,
-3.63563705, -3.56417097, 3.02823554, -3.06955375, 3.74305364,
3.63993306, -3.14193492, -3.04032527, -3.28310908, -3.37949723,
-3.25915912, -3.01206123, -3.10871377, -3.22982732, 3.8136103 ,
3.48893313, 3.9918267 , 3.4526763 , -3.46595488, -3.29996013,
-3.42965097, 3.151502 , 3.10548689, -3.44707735, 3.21881565,
3.50932999, -3.12410382, 3.30296386, 3.02454576, -3.20072608,
3.54339754, -3.17847739, -3.21475045, 3.03546088, -3.06225619,
3.48158164, 3.15243123, -3.06358376, 3.27300242, 3.32577453,
3.23535167, -3.04681725, 3.33439387, 3.10620079, 3.52883469,
-3.1790272 , 3.02641222, -3.45636819, 3.21009424, 3.08045954,
-3.59721754, 3.24693695, 3.05920919, -3.43674159, -3.00370946,
-3.48031594, -3.28748467, 3.42581649, 3.46912521, -3.28384157,
3.76358974, -3.34035865, 3.12978233, 3.44856854, -3.04074246,
3.50018071, 3.33188267, -3.09775514, -3.49356906, -3.09902374,
3.12068562, -3.1776565 , -3.44282129, 3.19286374, -3.28304596,
-3.10080963, -3.37189709, 3.77743156, 3.03547536, 3.22045459,
-3.44007263, 3.01331408, 3.49733677, 3.28831922, 3.62147013,
3.03458981, 3.15447237, -3.33931478, 3.09858431, -3.23592306,
3.3144797 , 3.37067342, -3.18749118, 3.09319307, -3.34390567,
3.29819563, 3.3120354 , 3.04166958, -3.00975323, 3.0347423 ,
-3.82502331, -3.13125028, -3.0876424 , 3.13929221, 3.570775 ,
-3.37420738, 3.17527797, 3.13396148, -3.70088631, -3.04054948,
3.05399103, 3.24908851, 3.19666266, -3.64071456, -3.85271081,
3.06864652, 3.53367592, 3.54650649, 3.6355438 , 3.657715 ,
4.03831601, 3.61651925])
第七部分:資料轉換
7.1 軸和元素轉換
import numpy as np
import pandas as pd
df = pd. DataFrame( data = np. random. randint( 0 , 10 , size = ( 10 , 3 ) ) ,
columns= [ 'Python' , 'Tensorflow' , 'Keras' ] ,
index = list ( 'ABCDEFHIJK' ) )
df
Python Tensorflow Keras A 2 5 3 B 5 0 0 C 7 0 4 D 0 4 7 E 8 6 9 F 8 2 6 H 6 7 8 I 7 6 9 J 4 7 9 K 6 7 1
df. rename( index = { 'A' : 'X' , 'K' : 'Y' } , # 行索引
columns= { 'Python' : '人工智能' } , # 列索引修改
inplace= True ) # 替換原資料
df. replace( 5 , 50 , inplace= True )
df
人工智能 Tensorflow Keras X 2 50 3 B 50 0 0 C 7 0 4 D 0 4 7 E 8 6 9 F 8 2 6 H 6 7 8 I 7 6 9 J 4 7 9 Y 6 7 1
df. replace( [ 2 , 7 ] , 1024 , inplace= True )
df
人工智能 Tensorflow Keras X 1024 50 3 B 50 0 0 C 1024 0 4 D 0 4 1024 E 8 6 9 F 8 1024 6 H 6 1024 8 I 1024 6 9 J 4 1024 9 Y 6 1024 1
df. iloc[ 4 , 2 ] = np. NaN # 空資料
df. replace( { 0 : 2048 , np. nan: - 100 } , inplace= True )
df
人工智能 Tensorflow Keras X 1024 50 3.0 B 50 2048 2048.0 C 1024 2048 4.0 D 2048 4 1024.0 E 8 6 -100.0 F 8 1024 6.0 H 6 1024 8.0 I 1024 6 9.0 J 4 1024 9.0 Y 6 1024 1.0
df. replace( { 'Tensorflow' : 1024 } , - 1024 ) # 指定某一列,進行資料替換
人工智能 Tensorflow Keras X 1024 50 3.0 B 50 2048 2048.0 C 1024 2048 4.0 D 2048 4 1024.0 E 8 6 -100.0 F 8 -1024 6.0 H 6 -1024 8.0 I 1024 6 9.0 J 4 -1024 9.0 Y 6 -1024 1.0
7.2 map映射元素轉變
# map 只能針對一列,就是Series
# 有一些沒有對應,那么回傳就是空資料
df[ '人工智能' ] . map ( { 1024 : 3.14 , 2048 : 2.718 , 6 : 1108 } ) # 跟據字典對資料進行改變
X 3.140
B NaN
C 3.140
D 2.718
E NaN
F NaN
H 1108.000
I 3.140
J NaN
Y 1108.000
Name: 人工智能, dtype: float64
df[ 'Keras' ] . map ( lambda x : True if x > 0 else False ) # 如果大于 0 回傳True,不然回傳False
X True
B True
C True
D True
E False
F True
H True
I True
J True
Y True
Name: Keras, dtype: bool
def convert ( x) :
if x >= 1024 :
return True
else :
return False
df[ 'level' ] = df[ 'Tensorflow' ] . map ( convert) # map映射,映射是Tensorflow中這一列中每一個資料,傳遞到方法中
df
人工智能 Tensorflow Keras level X 1024 50 3.0 False B 50 2048 2048.0 True C 1024 2048 4.0 True D 2048 4 1024.0 False E 8 6 -100.0 False F 8 1024 6.0 True H 6 1024 8.0 True I 1024 6 9.0 False J 4 1024 9.0 True Y 6 1024 1.0 True
7.3 apply映射元素轉變
# 既可以操作Series又可以操作DataFrame
df[ '人工智能' ] . apply ( lambda x : x + 100 )
X 1124
B 150
C 1124
D 2148
E 108
F 108
H 106
I 1124
J 104
Y 106
Name: 人工智能, dtype: int64
df[ 'level' ] . apply ( lambda x: 1 if x else 0 )
X 0
B 1
C 1
D 0
E 0
F 1
H 1
I 0
J 1
Y 1
Name: level, dtype: int64
df. apply ( lambda x : x + 1000 ) # apply對 所有的資料進行映射
人工智能 Tensorflow Keras level X 2024 1050 1003.0 1000 B 1050 3048 3048.0 1001 C 2024 3048 1004.0 1001 D 3048 1004 2024.0 1000 E 1008 1006 900.0 1000 F 1008 2024 1006.0 1001 H 1006 2024 1008.0 1001 I 2024 1006 1009.0 1000 J 1004 2024 1009.0 1001 Y 1006 2024 1001.0 1001
def convert ( x) :
return ( x. median( ) , x. count( ) , x. min ( ) , x. max ( ) , x. std( ) ) # 回傳中位數,回傳的是計數
df. apply ( convert) . round ( 1 ) # 默認操作列資料
人工智能 Tensorflow Keras level 0 29.0 1024.0 7.0 1 1 10.0 10.0 10.0 10 2 4.0 4.0 -100.0 False 3 2048.0 2048.0 2048.0 True 4 717.8 800.4 694.9 0.516398
df
人工智能 Tensorflow Keras level X 1024 50 3.0 False B 50 2048 2048.0 True C 1024 2048 4.0 True D 2048 4 1024.0 False E 8 6 -100.0 False F 8 1024 6.0 True H 6 1024 8.0 True I 1024 6 9.0 False J 4 1024 9.0 True Y 6 1024 1.0 True
df. apply ( convert, axis = 1 ) # axis = 1,操作資料就是行資料
X (26.5, 4, False, 1024, 503.68732033541073)
B (1049.0, 4, True, 2048, 1167.8622564326668)
C (514.0, 4, True, 2048, 979.1007353689405)
D (514.0, 4, False, 2048, 979.3623776042588)
E (3.0, 4, -100.0, 8, 52.443620520834884)
F (7.0, 4, True, 1024, 509.5085049993441)
H (7.0, 4, True, 1024, 509.5085049993441)
I (7.5, 4, False, 1024, 509.51373877453)
J (6.5, 4, True, 1024, 509.6773489179208)
Y (3.5, 4, 1.0, 1024, 510.6721061503164)
dtype: object
7.4 transform元素轉變
df = pd. DataFrame( np. random. randint( 0 , 10 , size = ( 10 , 3 ) ) ,
columns= [ 'Python' , 'Tensorflow' , 'Keras' ] ,
index = list ( 'ABCDEFHIJK' ) )
display( df)
# 可以針對一列資料,Series進行運算
df[ 'Python' ] . transform( lambda x : 1024 if x > 5 else - 1024 ) # 這個功能和map,apply類似的
Python Tensorflow Keras A 1 1 9 B 6 9 1 C 1 4 6 D 9 5 1 E 4 1 8 F 2 5 7 H 7 2 3 I 7 8 9 J 5 4 2 K 7 0 7
A -1024
B 1024
C -1024
D 1024
E -1024
F -1024
H 1024
I 1024
J -1024
K 1024
Name: Python, dtype: int64
df[ 'Tensorflow' ] . apply ( [ np. sqrt, np. square, np. cumsum] ) # 針對一列,進行不同的操作
sqrt square cumsum A 1.000000 1 1 B 3.000000 81 10 C 2.000000 16 14 D 2.236068 25 19 E 1.000000 1 20 F 2.236068 25 25 H 1.414214 4 27 I 2.828427 64 35 J 2.000000 16 39 K 0.000000 0 39
df[ 'Tensorflow' ] . transform( [ np. sqrt, np. square, np. cumsum] ) # 針對一列,進行不同的操作
sqrt square cumsum A 1.000000 1 1 B 3.000000 81 10 C 2.000000 16 14 D 2.236068 25 19 E 1.000000 1 20 F 2.236068 25 25 H 1.414214 4 27 I 2.828427 64 35 J 2.000000 16 39 K 0.000000 0 39
def convert ( x) :
if x > 5 :
return True
else :
return False
# 可以針對DataFrame進行運算
df. transform( { 'Python' : np. cumsum, 'Tensorflow' : np. square, 'Keras' : convert} ) # 對不同的列,執行不同的操作
Python Tensorflow Keras A 1 1 True B 7 81 False C 8 16 True D 17 25 False E 21 1 True F 23 25 True H 30 4 False I 37 64 True J 42 16 False K 49 0 True
df. apply ( { 'Python' : np. cumsum, 'Tensorflow' : np. square, 'Keras' : convert} ) # 對不同的列,執行不同的操作
Python Tensorflow Keras A 1 1 True B 7 81 False C 8 16 True D 17 25 False E 21 1 True F 23 25 True H 30 4 False I 37 64 True J 42 16 False K 49 0 True
7.5 重排隨機抽樣啞變數
df
Python Tensorflow Keras A 1 1 9 B 6 9 1 C 1 4 6 D 9 5 1 E 4 1 8 F 2 5 7 H 7 2 3 I 7 8 9 J 5 4 2 K 7 0 7
index = np. random. permutation( 10 ) # 回傳打亂順訊的索引
index
array([3, 4, 1, 2, 7, 9, 0, 8, 5, 6])
# 重排,索引打亂
df. take( index)
Python Tensorflow Keras D 9 5 1 E 4 1 8 B 6 9 1 C 1 4 6 I 7 8 9 K 7 0 7 A 1 1 9 J 5 4 2 F 2 5 7 H 7 2 3
# 從大量資料中隨機抽取資料
df. take( np. random. randint( 0 , 10 , size = 20 ) ) # 隨機抽樣20個資料
Python Tensorflow Keras J 5 4 2 J 5 4 2 D 9 5 1 K 7 0 7 H 7 2 3 I 7 8 9 J 5 4 2 A 1 1 9 C 1 4 6 J 5 4 2 I 7 8 9 D 9 5 1 I 7 8 9 K 7 0 7 A 1 1 9 B 6 9 1 H 7 2 3 D 9 5 1 B 6 9 1 H 7 2 3
df2 = pd. DataFrame( data = { 'key' : [ 'a' , 'b' , 'a' , 'b' , 'c' , 'b' , 'c' ] } )
df2
# one-hot,啞變數
# str型別資料,經過啞變數變換可以使用數字表示
pd. get_dummies( df2, prefix= '' , prefix_sep= '' ) # 1表示,有;0表示,沒有
a b c 0 1 0 0 1 0 1 0 2 1 0 0 3 0 1 0 4 0 0 1 5 0 1 0 6 0 0 1
第八部分:資料重塑
df
Python Tensorflow Keras A 1 1 9 B 6 9 1 C 1 4 6 D 9 5 1 E 4 1 8 F 2 5 7 H 7 2 3 I 7 8 9 J 5 4 2 K 7 0 7
df. T # 轉置,行變列,列變行
A B C D E F H I J K Python 1 6 1 9 4 2 7 7 5 7 Tensorflow 1 9 4 5 1 5 2 8 4 0 Keras 9 1 6 1 8 7 3 9 2 7
df2 = pd. DataFrame( np. random. randint( 0 , 10 , size = ( 20 , 3 ) ) ,
columns= [ 'Python' , 'Math' , 'En' ] ,
index = pd. MultiIndex. from_product( [ list ( 'ABCDEFHIJK' ) , [ '期中' , '期末' ] ] ) ) # 多層索引
df2
Python Math En A 期中 3 3 0 期末 6 5 8 B 期中 5 5 9 期末 7 5 2 C 期中 0 7 9 期末 9 7 5 D 期中 5 6 5 期末 7 9 6 E 期中 7 3 9 期末 9 1 4 F 期中 9 9 5 期末 0 8 9 H 期中 7 0 0 期末 1 6 6 I 期中 8 1 8 期末 7 9 9 J 期中 5 0 8 期末 3 6 6 K 期中 8 2 2 期末 3 5 2
df2. unstack( level = 1 ) # 將行索引變成列索引,-1表示最后一層
Python Math En 期中 期末 期中 期末 期中 期末 A 3 6 3 5 0 8 B 5 7 5 5 9 2 C 0 9 7 7 9 5 D 5 7 6 9 5 6 E 7 9 3 1 9 4 F 9 0 9 8 5 9 H 7 1 0 6 0 6 I 8 7 1 9 8 9 J 5 3 0 6 8 6 K 8 3 2 5 2 2
df2. unstack( level = - 1 ) # 將行索引變成列索引,-1表示最后一層
Python Math En 期中 期末 期中 期末 期中 期末 A 3 6 3 5 0 8 B 5 7 5 5 9 2 C 0 9 7 7 9 5 D 5 7 6 9 5 6 E 7 9 3 1 9 4 F 9 0 9 8 5 9 H 7 1 0 6 0 6 I 8 7 1 9 8 9 J 5 3 0 6 8 6 K 8 3 2 5 2 2
df2. stack( ) # 列變成行了
A 期中 Python 3
Math 3
En 0
期末 Python 6
Math 5
En 8
B 期中 Python 5
Math 5
En 9
期末 Python 7
Math 5
En 2
C 期中 Python 0
Math 7
En 9
期末 Python 9
Math 7
En 5
D 期中 Python 5
Math 6
En 5
期末 Python 7
Math 9
En 6
E 期中 Python 7
Math 3
En 9
期末 Python 9
Math 1
En 4
F 期中 Python 9
Math 9
En 5
期末 Python 0
Math 8
En 9
H 期中 Python 7
Math 0
En 0
期末 Python 1
Math 6
En 6
I 期中 Python 8
Math 1
En 8
期末 Python 7
Math 9
En 9
J 期中 Python 5
Math 0
En 8
期末 Python 3
Math 6
En 6
K 期中 Python 8
Math 2
En 2
期末 Python 3
Math 5
En 2
dtype: int64
df2. unstack( ) . stack( level = 0 )
期中 期末 A En 0 8 Math 3 5 Python 3 6 B En 9 2 Math 5 5 Python 5 7 C En 9 5 Math 7 7 Python 0 9 D En 5 6 Math 6 9 Python 5 7 E En 9 4 Math 3 1 Python 7 9 F En 5 9 Math 9 8 Python 9 0 H En 0 6 Math 0 6 Python 7 1 I En 8 9 Math 1 9 Python 8 7 J En 8 6 Math 0 6 Python 5 3 K En 2 2 Math 2 5 Python 8 3
df2. mean( ) # 計算的是 列
Python 5.45
Math 4.85
En 5.60
dtype: float64
df2. mean( axis = 1 )
A 期中 2.000000
期末 6.333333
B 期中 6.333333
期末 4.666667
C 期中 5.333333
期末 7.000000
D 期中 5.333333
期末 7.333333
E 期中 6.333333
期末 4.666667
F 期中 7.666667
期末 5.666667
H 期中 2.333333
期末 4.333333
I 期中 5.666667
期末 8.333333
J 期中 4.333333
期末 5.000000
K 期中 4.000000
期末 3.333333
dtype: float64
df2. mean( level= 1 ) # 計算期中期末所有學生的平均分
Python Math En 期中 5.7 3.6 5.5 期末 5.2 6.1 5.7
df2. mean( level = 0 ) # 計算每位學生期中和期末平均分
Python Math En A 4.5 4.0 4.0 B 6.0 5.0 5.5 C 4.5 7.0 7.0 D 6.0 7.5 5.5 E 8.0 2.0 6.5 F 4.5 8.5 7.0 H 4.0 3.0 3.0 I 7.5 5.0 8.5 J 4.0 3.0 7.0 K 5.5 3.5 2.0
第九部分:數學和統計方法
9.1 簡單統計指標
df = pd. DataFrame( np. random. randint( 0 , 10 , size = ( 20 , 3 ) ) ,
columns= [ 'Python' , 'Math' , 'En' ] , index = list ( 'QWERTYUIOPASDFGHJKLZ' ) )
df
Python Math En Q 1 4 3 W 8 2 0 E 3 0 9 R 9 8 9 T 9 1 3 Y 5 3 1 U 5 8 0 I 1 8 3 O 0 3 5 P 6 6 1 A 0 4 0 S 3 9 4 D 5 2 8 F 2 9 2 G 1 9 8 H 9 5 2 J 5 7 5 K 2 6 5 L 2 7 3 Z 2 9 8
df. iloc[ 6 , 2 ] = np. NAN
display( df)
Python Math En Q 1 4 3.0 W 8 2 0.0 E 3 0 9.0 R 9 8 9.0 T 9 1 3.0 Y 5 3 1.0 U 5 8 NaN I 1 8 3.0 O 0 3 5.0 P 6 6 1.0 A 0 4 0.0 S 3 9 4.0 D 5 2 8.0 F 2 9 2.0 G 1 9 8.0 H 9 5 2.0 J 5 7 5.0 K 2 6 5.0 L 2 7 3.0 Z 2 9 8.0
df. count( ) # 統計非空資料數量
Python 20
Math 20
En 19
dtype: int64
display( df. mean( ) , df. median( ) ) # 平均值,中位數
Python 3.900000
Math 5.500000
En 4.157895
dtype: float64
Python 3.0
Math 6.0
En 3.0
dtype: float64
display( df. min ( ) , df. max ( ) ) # 最小值,最大值
Python 0.0
Math 0.0
En 0.0
dtype: float64
Python 9.0
Math 9.0
En 9.0
dtype: float64
df[ 'Python' ] . unique( ) # 去除重復資料
array([1, 8, 3, 9, 5, 0, 6, 2])
df[ 'Math' ] . value_counts( ) # 統計出現的頻次
9 4
8 3
7 2
6 2
4 2
3 2
2 2
5 1
1 1
0 1
Name: Math, dtype: int64
df. quantile( q = [ 0 , 0.25 , 0.5 , 0.75 , 1 ] ) # 百分位數
Python Math En 0.00 0.00 0.0 0.0 0.25 1.75 3.0 2.0 0.50 3.00 6.0 3.0 0.75 5.25 8.0 6.5 1.00 9.00 9.0 9.0
df. describe( ) . round ( 1 )
Python Math En count 20.0 20.0 19.0 mean 3.9 5.5 4.2 std 3.0 2.9 3.0 min 0.0 0.0 0.0 25% 1.8 3.0 2.0 50% 3.0 6.0 3.0 75% 5.2 8.0 6.5 max 9.0 9.0 9.0
9.2 索引標簽、位置獲取
df[ 'Python' ] . argmax( ) # 回傳最大值索引
3
df[ 'En' ] . argmin( ) # 最小值索引
1
df. idxmax( ) # 回傳最大值的標簽
Python R
Math S
En E
dtype: object
df. idxmin( ) # 回傳最小值標簽
Python O
Math E
En W
dtype: object
9.3 更多統計指標
df. cumsum( ) # 累加和
Python Math En Q 1 4 3.0 W 9 6 3.0 E 12 6 12.0 R 21 14 21.0 T 30 15 24.0 Y 35 18 25.0 U 40 26 NaN I 41 34 28.0 O 41 37 33.0 P 47 43 34.0 A 47 47 34.0 S 50 56 38.0 D 55 58 46.0 F 57 67 48.0 G 58 76 56.0 H 67 81 58.0 J 72 88 63.0 K 74 94 68.0 L 76 101 71.0 Z 78 110 79.0
df. cumprod( ) # 累乘和
Python Math En Q 1 4 3.0 W 8 8 0.0 E 24 0 0.0 R 216 0 0.0 T 1944 0 0.0 Y 9720 0 0.0 U 48600 0 NaN I 48600 0 0.0 O 0 0 0.0 P 0 0 0.0 A 0 0 0.0 S 0 0 0.0 D 0 0 0.0 F 0 0 0.0 G 0 0 0.0 H 0 0 0.0 J 0 0 0.0 K 0 0 0.0 L 0 0 0.0 Z 0 0 0.0
df. cummin( ) # 累計最小值
Python Math En Q 1 4 3.0 W 1 2 0.0 E 1 0 0.0 R 1 0 0.0 T 1 0 0.0 Y 1 0 0.0 U 1 0 NaN I 1 0 0.0 O 0 0 0.0 P 0 0 0.0 A 0 0 0.0 S 0 0 0.0 D 0 0 0.0 F 0 0 0.0 G 0 0 0.0 H 0 0 0.0 J 0 0 0.0 K 0 0 0.0 L 0 0 0.0 Z 0 0 0.0
df. cummax( ) # 累計最大值
Python Math En Q 1 4 3.0 W 8 4 3.0 E 8 4 9.0 R 9 8 9.0 T 9 8 9.0 Y 9 8 9.0 U 9 8 NaN I 9 8 9.0 O 9 8 9.0 P 9 8 9.0 A 9 8 9.0 S 9 9 9.0 D 9 9 9.0 F 9 9 9.0 G 9 9 9.0 H 9 9 9.0 J 9 9 9.0 K 9 9 9.0 L 9 9 9.0 Z 9 9 9.0
df. std( ) # 標準差
Python 3.041814
Math 2.946898
En 3.004869
dtype: float64
df. var( )
Python 9.252632
Math 8.684211
En 9.029240
dtype: float64
df. diff( ) # 差分,當前資料減去上一個的差值
Python Math En Q NaN NaN NaN W 7.0 -2.0 -3.0 E -5.0 -2.0 9.0 R 6.0 8.0 0.0 T 0.0 -7.0 -6.0 Y -4.0 2.0 -2.0 U 0.0 5.0 NaN I -4.0 0.0 NaN O -1.0 -5.0 2.0 P 6.0 3.0 -4.0 A -6.0 -2.0 -1.0 S 3.0 5.0 4.0 D 2.0 -7.0 4.0 F -3.0 7.0 -6.0 G -1.0 0.0 6.0 H 8.0 -4.0 -6.0 J -4.0 2.0 3.0 K -3.0 -1.0 0.0 L 0.0 1.0 -2.0 Z 0.0 2.0 5.0
df. pct_change( ) . round ( 3 ) # 計算百分比變化
Python Math En Q NaN NaN NaN W 7.000 -0.500 -1.000 E -0.625 -1.000 inf R 2.000 inf 0.000 T 0.000 -0.875 -0.667 Y -0.444 2.000 -0.667 U 0.000 1.667 0.000 I -0.800 0.000 2.000 O -1.000 -0.625 0.667 P inf 1.000 -0.800 A -1.000 -0.333 -1.000 S inf 1.250 inf D 0.667 -0.778 1.000 F -0.600 3.500 -0.750 G -0.500 0.000 3.000 H 8.000 -0.444 -0.750 J -0.444 0.400 1.500 K -0.600 -0.143 0.000 L 0.000 0.167 -0.400 Z 0.000 0.286 1.667
9.4 高級統計指標
df. cov( ) # 協方差:自己和別人計算
Python Math En Python 9.252632 -2.157895 -0.695906 Math -2.157895 8.684211 1.160819 En -0.695906 1.160819 9.029240
df. var( ) # 方差: 自己和自己計算
Python 9.252632
Math 8.684211
En 9.029240
dtype: float64
df[ 'Python' ] . cov( df[ 'Math' ] )
-2.157894736842105
df. corr( ) # 相關性系數 -1 ~ 1
Python Math En Python 1.000000 -0.240731 -0.074376 Math -0.240731 1.000000 0.130217 En -0.074376 0.130217 1.000000
df. corrwith( df[ 'En' ] ) # 一列的相關性系數
Python -0.074376
Math 0.130217
En 1.000000
dtype: float64
第十部分:排序
df = pd. DataFrame( np. random. randint( 0 , 20 , size = ( 20 , 3 ) ) ,
columns= [ 'Python' , 'Tensorflow' , 'Keras' ] , index = list ( 'QWERTYUIOPASDFGHJKLZ' ) )
df
Python Tensorflow Keras Q 17 3 4 W 13 18 7 E 12 11 0 R 3 5 14 T 11 15 7 Y 5 15 4 U 18 2 7 I 7 3 6 O 1 18 5 P 12 6 0 A 4 18 4 S 15 5 8 D 8 11 14 F 3 2 17 G 4 17 8 H 12 1 4 J 1 2 6 K 17 9 16 L 11 14 4 Z 16 13 4
df. sort_index( axis = 0 , ascending= False ) # 降序
Python Tensorflow Keras Z 16 13 4 Y 5 15 4 W 13 18 7 U 18 2 7 T 11 15 7 S 15 5 8 R 3 5 14 Q 17 3 4 P 12 6 0 O 1 18 5 L 11 14 4 K 17 9 16 J 1 2 6 I 7 3 6 H 12 1 4 G 4 17 8 F 3 2 17 E 12 11 0 D 8 11 14 A 4 18 4
df. sort_index( ascending= True ) # 升序
Python Tensorflow Keras A 4 18 4 D 8 11 14 E 12 11 0 F 3 2 17 G 4 17 8 H 12 1 4 I 7 3 6 J 1 2 6 K 17 9 16 L 11 14 4 O 1 18 5 P 12 6 0 Q 17 3 4 R 3 5 14 S 15 5 8 T 11 15 7 U 18 2 7 W 13 18 7 Y 5 15 4 Z 16 13 4
df. sort_values( by = 'Python' , ascending= True ) # 根據Python屬性進行升序排列
Python Tensorflow Keras J 1 2 6 O 1 18 5 R 3 5 14 F 3 2 17 A 4 18 4 G 4 17 8 Y 5 15 4 I 7 3 6 D 8 11 14 T 11 15 7 L 11 14 4 P 12 6 0 E 12 11 0 H 12 1 4 W 13 18 7 S 15 5 8 Z 16 13 4 K 17 9 16 Q 17 3 4 U 18 2 7
df. sort_values( by = [ 'Python' , 'Tensorflow' ] , ascending= True )
# 先根據Python進行排序,如果相等在根據Tensorflow排序
Python Tensorflow Keras J 1 2 6 O 1 18 5 F 3 2 17 R 3 5 14 G 4 17 8 A 4 18 4 Y 5 15 4 I 7 3 6 D 8 11 14 L 11 14 4 T 11 15 7 H 12 1 4 P 12 6 0 E 12 11 0 W 13 18 7 S 15 5 8 Z 16 13 4 Q 17 3 4 K 17 9 16 U 18 2 7
df. nlargest( n = 5 , columns= 'Python' ) # 根據Python進行排序,獲取最大的5個數值
Python Tensorflow Keras U 18 2 7 Q 17 3 4 K 17 9 16 Z 16 13 4 S 15 5 8
df. nsmallest( 5 , columns= 'Keras' ) # 根據Keras進行排序,獲取最小的5個
Python Tensorflow Keras E 12 11 0 P 12 6 0 Q 17 3 4 Y 5 15 4 A 4 18 4
結束語
本篇博文的代碼是在jupyter上運行的,不過具體在哪運行都沒什么大的區別,
感謝收看,祝學業和作業進步! 需要本文資料的話,歡迎關注評論留下你的郵箱,
推薦關注的專欄
👨?👩?👦?👦 機器學習:分享機器學習實戰專案和常用模型講解 👨?👩?👦?👦 資料分析:分享資料分析實戰專案和常用技能整理
關注我,了解更多相關知識!
CSDN@報告,今天也有好好學習