A 9,000-Character Deep Dive: A Detailed Summary of Dataset Merging Operations in Pandas

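There is no shortage of articles on merging datasets with the pandas library, but I have always felt that they are not complete enough, especially for Python beginners. So today I want to organize and summarize the topic.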

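The article is roughly structured as follows

A brief introduction to the concat() method
A brief introduction to the append() method
A brief introduction to the merge() method
A brief introduction to the join() method
Joining tables that have a multi-level row index
Renaming columns after a merge
A brief introduction to the combine_first() method
A brief introduction to the combine() method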

A brief introduction to the `concat()` method

Before formally introducing concat(), let's look at a simple example

import pandas as pd

df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)
df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index=[4, 5, 6, 7],
)
df3 = pd.DataFrame(
    {
        "A": ["A8", "A9", "A10", "A11"],
        "B": ["B8", "B9", "B10", "B11"],
        "C": ["C8", "C9", "C10", "C11"],
        "D": ["D8", "D9", "D10", "D11"],
    },
    index=[8, 9, 10, 11],
)

Now let's see the result of calling concat()

frames = [df1, df2, df3]
result = pd.concat(frames)
result

output

      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11

By default the tables are stacked along the row axis (axis=0), one below the other.

Next, let's go through what each parameter of concat() does

pd.concat(
    objs,
    axis=0,
    join="outer",
    ignore_index=False,
    keys=None,
    levels=None,
    names=None,
    verify_integrity=False,
    copy=True,
)

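objs: the datasets to combine; they can be Series or DataFrame objects
axis: the direction of the concatenation; the default is 0 (stack rows), while 1 places the objects side by side
join: how labels on the other axis are handled, either "outer" (union) or "inner" (intersection); the default is "outer"
ignore_index: if True, the labels along the concatenation axis are discarded and replaced with 0, 1, ..., n-1; the default is False
keys: builds a multi-level (hierarchical) index along the concatenation axis

You may be wondering what a multi-level index is. Look at the following example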

result = pd.concat(frames, keys=["x", "y", "z"])
result

output

With this in place, we can use the labels "x", "y" and "z" to retrieve each part of the data, for example

result.loc["x"]

output

    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
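
To make the effect of `keys` on the index a little more concrete, here is a minimal sketch. The frames `a` and `b` are small stand-ins invented for illustration and are not part of the article's data.

```python
import pandas as pd

# Two tiny frames, invented for illustration only.
a = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
b = pd.DataFrame({"A": ["A2", "A3"]}, index=[2, 3])

# keys adds an outer index level, so the result has a two-level MultiIndex.
stacked = pd.concat([a, b], keys=["x", "y"])

print(stacked.index.nlevels)       # number of index levels
print(stacked.loc["y"])            # the rows that came from b
print(stacked.loc[("y", 3), "A"])  # one cell, addressed by both levels
```

Selecting with the outer label alone returns the original block, while a tuple of labels addresses a single row.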

In addition, the keys parameter can also be used to set column labels

s3 = pd.Series([0, 1, 2, 3], name="foo")
s4 = pd.Series([0, 1, 2, 3])
s5 = pd.Series([0, 1, 4, 5])
pd.concat([s3, s4, s5], axis=1, keys=["red", "blue", "yellow"])

output

   red  blue  yellow
0    0     0       0
1    1     1       1
2    2     2       4
3    3     3       5

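The column names become the elements of the keys list

As for the join parameter, the default is "outer", which merges the two tables by taking the union of their labels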

df4 = pd.DataFrame(
    {
        "B": ["B2", "B3", "B6", "B7"],
        "D": ["D2", "D3", "D6", "D7"],
        "F": ["F2", "F3", "F6", "F7"],
    },
    index=[2, 3, 6, 7],
)
result = pd.concat([df1, df4], axis=1)

output

When we set join to "inner", only the intersection of the labels is kept, so the result looks different

result = pd.concat([df1, df4], axis=1, join="inner")

output
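
As a rough illustration of the difference between the two settings, the sketch below compares the row labels that survive each kind of join. The frames `frame_a` and `frame_b` are simplified stand-ins (one column each), assumed here only to keep the output short.

```python
import pandas as pd

# Simplified stand-ins: the row labels overlap only at 2 and 3.
frame_a = pd.DataFrame({"A": ["A0", "A1", "A2", "A3"]}, index=[0, 1, 2, 3])
frame_b = pd.DataFrame({"F": ["F2", "F3", "F6", "F7"]}, index=[2, 3, 6, 7])

outer = pd.concat([frame_a, frame_b], axis=1)                # union of row labels
inner = pd.concat([frame_a, frame_b], axis=1, join="inner")  # intersection only

print(sorted(outer.index))       # every label from either frame
print(list(inner.index))         # only the shared labels
print(int(outer.isna().sum().sum()))  # cells filled with NaN by the outer join
```

With "outer", rows missing from one side are padded with NaN; with "inner", those rows are dropped.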

Next, let's look at the ignore_index parameter, which discards the original row labels and renumbers the rows from 0

result = pd.concat([df1, df4], ignore_index=True, sort=False)

output

concat() can also combine a DataFrame with a Series

s1 = pd.Series(["X0", "X1", "X2", "X3"], name="X")
result = pd.concat([df1, s1], axis=1)

output

Let's see what happens if we also add the ignore_index parameter

result = pd.concat([df1, s1], axis=1, ignore_index=True)

output
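
When `axis=1`, the labels being ignored are the column labels rather than the row labels. A minimal sketch, using a two-row frame `df` and a Series `s` made up for illustration:

```python
import pandas as pd

# Small inputs invented for illustration.
df = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
s = pd.Series(["X0", "X1"], name="X")

named = pd.concat([df, s], axis=1)                          # Series name becomes the column label
renumbered = pd.concat([df, s], axis=1, ignore_index=True)  # column labels replaced by 0, 1, 2

print(list(named.columns))       # ['A', 'B', 'X']
print(list(renumbered.columns))  # [0, 1, 2]
```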

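A brief introduction to the `append()` method

append() is a shorthand for the concat() method described above. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the examples in this section only run on older versions. Let's look at a simple example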

result = df1.append(df2)
result

output

append() also accepts a list of several DataFrames, as follows

result = df1.append([df2, df3])

output

Like concat() above, append() also has an ignore_index parameter

result = df1.append(df4, ignore_index=True, sort=False)

output

Similarly, append() can be used to add rows of data to a DataFrame

s2 = pd.Series(["X0", "X1", "X2", "X3"], index=["A", "B", "C", "D"])
result = df1.append(s2, ignore_index=True)

output
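
On pandas versions where `DataFrame.append` is not available, the same results can be obtained with `pd.concat`. The sketch below shows one possible translation of each call; the names `df_a`, `df_b` and `row` are stand-ins invented for illustration.

```python
import pandas as pd

# Small stand-in inputs, invented for illustration.
df_a = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
df_b = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]})
row = pd.Series(["X0", "X1"], index=["A", "B"])

# Instead of df_a.append(df_b)
stacked = pd.concat([df_a, df_b])

# Instead of df_a.append(df_b, ignore_index=True)
renumbered = pd.concat([df_a, df_b], ignore_index=True)

# Instead of df_a.append(row, ignore_index=True):
# turn the Series into a one-row frame first, then concatenate.
with_row = pd.concat([df_a, row.to_frame().T], ignore_index=True)

print(list(stacked.index))         # [0, 1, 0, 1]
print(list(renumbered.index))      # [0, 1, 2, 3]
print(with_row.iloc[-1].tolist())  # ['X0', 'X1']
```

The `to_frame().T` step matters: concatenating the Series directly along `axis=0` would treat its values as rows of a single unnamed column rather than as one new row.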

About the `merge()` method

merge() takes the following parameters

pd.merge(
    left,
    right,
    how="inner",
    on=None,
    left_on=None,
    right_on=None,
    left_index=False,
    right_index=False,
    sort=False,
    suffixes=("_x", "_y"),
    copy=True,
    indicator=False,
    validate=None,
)

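left/right: the two tables to merge
on: the name of the column (or columns) shared by both tables and used as the join key
left_on/right_on: the key columns in the left and right tables, for cases where the key columns have different names
how: the type of merge, one of "left", "right", "outer" and "inner"; the default is "inner"
suffixes: the suffixes appended to overlapping column names after the merge
left_index: if True, use the index of the left table as its join key
right_index: if True, use the index of the right table as its join key

Let's start with a simple example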
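
The signature above also lists `indicator` and `validate`, which the list does not describe. Here is a minimal sketch of how they can be combined with `left_on`/`right_on`; the `employees` and `departments` tables and their column names are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical tables whose key columns have different names.
employees = pd.DataFrame({"name": ["Bob", "Jake", "Sue"], "dept": ["ACC", "ENG", "OPS"]})
departments = pd.DataFrame({"code": ["ACC", "ENG", "HR"], "floor": [1, 2, 3]})

merged = pd.merge(
    employees,
    departments,
    how="outer",
    left_on="dept",         # key column in the left table
    right_on="code",        # key column in the right table
    indicator=True,         # adds a _merge column: left_only / right_only / both
    validate="one_to_one",  # raises an error if either key column has duplicates
)
print(merged)
```

The `_merge` column is a quick way to see which rows found a partner, and `validate` guards against accidental row multiplication when a key that should be unique is not.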

left = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)
right = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)
result = pd.merge(left, right, on="key")
result

output

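merge() handles three kinds of joins: one-to-one, many-to-one and many-to-many. In the one-to-one case, merge() looks for the column that the two tables have in common, such as "key" above, and automatically uses it as the join key. Note that the values in the shared column do not have to appear in the same order in both tables.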
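
A small sketch of the last point: the key values on the right are deliberately shuffled, and `on` is omitted so that pandas picks the shared column by itself. The two frames are cut-down versions invented for illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
# Same keys as in left, but in a different order.
right = pd.DataFrame({"key": ["K2", "K0", "K1"], "C": ["C2", "C0", "C1"]})

# on is omitted, so the column present in both tables ("key") is used.
result = pd.merge(left, right)
print(result)
```

Rows are matched by key value rather than by position, so each `A` value ends up next to the `C` value with the same number.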

Now let's look at the many-to-one case. The two tables below share the column "group", and in the first table "group" contains the same value twice

df_1:

  employee        group  hire_date
0      Bob   Accounting       2008
1     Jake  Engineering       2002
2     Mike  Engineering       2005
3    Linda           HR       2010

df_2:

         group supervisor
0   Accounting     Cathey
1  Engineering      Dylan
2           HR      James

Then we merge them

pd.merge(df_1, df_2)

output

  employee        group  hire_date supervisor
0      Bob   Accounting       2008     Cathey
1     Jake  Engineering       2002      Dylan
2     Mike  Engineering       2005      Dylan
3    Linda           HR       2010      James
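
The tables above are shown only as printed output. As a sketch, they can be rebuilt from those printed values as follows; passing `validate="many_to_one"` is an addition of this example, used to check that the right-hand key really is unique.

```python
import pandas as pd

# Rebuilt from the printed tables above.
df_1 = pd.DataFrame(
    {
        "employee": ["Bob", "Jake", "Mike", "Linda"],
        "group": ["Accounting", "Engineering", "Engineering", "HR"],
        "hire_date": [2008, 2002, 2005, 2010],
    }
)
df_2 = pd.DataFrame(
    {
        "group": ["Accounting", "Engineering", "HR"],
        "supervisor": ["Cathey", "Dylan", "James"],
    }
)

# Each supervisor row is repeated for every employee in that group.
merged = pd.merge(df_1, df_2, validate="many_to_one")
print(merged)
```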

Finally there is the many-to-many case, where the shared column contains duplicate values in both tables, for example

df3:

  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR

df4:

         group        skills
0   Accounting          math
1   Accounting  spreadsheets
2  Engineering        coding
3  Engineering         linux
4           HR  spreadsheets
5           HR  organization

Then we merge them and look at the result

df = pd.merge(df3, df4)
print(df)

output

  employee        group        skills
0      Bob   Accounting          math
1      Bob   Accounting  spreadsheets
2     Jake  Engineering        coding
3     Jake  Engineering         linux
4     Lisa  Engineering        coding
5     Lisa  Engineering         linux
6      Sue           HR  spreadsheets
7      Sue           HR  organization

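The how parameter offers four types of merge, "left", "right", "inner" and "outer", which mean

inner: the intersection of the keys; this is what merge() uses by default
outer: the union of the keys
left/right: keep every key from the left (or right) table, and fill in NaN where the other table has no match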
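
The snippets that follow merge on two columns named `key1` and `key2`, but the article never defines tables with those columns. The sketch below supplies one possible pair of inputs (an assumption, chosen so that some keys match once, some twice and some not at all) and counts the rows each merge type produces.

```python
import pandas as pd

# Assumed inputs: frames with key1/key2 columns are not defined in the article.
left = pd.DataFrame(
    {
        "key1": ["K0", "K0", "K1", "K2"],
        "key2": ["K0", "K1", "K0", "K1"],
        "A": ["A0", "A1", "A2", "A3"],
    }
)
right = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K1", "K2"],
        "key2": ["K0", "K0", "K0", "K0"],
        "C": ["C0", "C1", "C2", "C3"],
    }
)

# Merge on the pair (key1, key2) with each strategy and record the row count.
row_counts = {}
for how in ["left", "right", "outer", "inner"]:
    result = pd.merge(left, right, how=how, on=["key1", "key2"])
    row_counts[how] = len(result)

print(row_counts)  # {'left': 5, 'right': 4, 'outer': 6, 'inner': 3}
```

The pair (K1, K0) appears once on the left and twice on the right, so it contributes two rows to every merge type; the unmatched pairs are what make the four counts differ.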

Let's first look at a "left" merge

result = pd.merge(left, right, how="left", on=["key1", "key2"])
result

output

Next, a "right" merge

result = pd.merge(left, right, how="right", on=["key1", "key2"])
result

output

An "outer" merge

result = pd.merge(left, right, how="outer", on=["key1", "key2"])
result

output

An "inner" merge

result = pd.merge(left, right, how="inner", on=["key1", "key2"])
result

output

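A brief introduction to the `join()` method

join() combines the columns of two tables that have different column labels, matching rows by their index. Let's start with a simple example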

left = pd.DataFrame(
    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"]
)
right = pd.DataFrame(
    {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"]
)

result = left.join(right)

output

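join() also has a how parameter that defines the type of merge. It works much as it does in merge(), except that the default is "left", so I won't go over it again here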
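
For readers who want to see the options side by side anyway, here is a brief sketch with one-column versions of the two tables (trimmed for this example):

```python
import pandas as pd

left = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=["K0", "K1", "K2"])
right = pd.DataFrame({"C": ["C0", "C2", "C3"]}, index=["K0", "K2", "K3"])

default_join = left.join(right)             # how="left": keeps every row label of left
inner_join = left.join(right, how="inner")  # only labels present in both
outer_join = left.join(right, how="outer")  # labels from either table

print(list(default_join.index))
print(list(inner_join.index))
print(sorted(outer_join.index))
```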

When a multi-level row index meets `join()`

Sometimes one of the tables has a multi-level row index, for example

left = pd.DataFrame(
    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]},
    index=pd.Index(["K0", "K1", "K2"], name="key"),
)
index = pd.MultiIndex.from_tuples(
    [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2"), ("K2", "Y3")],
    names=["key", "Y"],
)
right = pd.DataFrame(
    {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]},
    index=index,
)
result = left.join(right, how="inner")

output
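
A trimmed-down sketch of the same join (one column per table, to keep it short) shows what to expect: the single index named "key" is matched against the level of the same name, and the result carries the two-level index.

```python
import pandas as pd

left = pd.DataFrame(
    {"A": ["A0", "A1", "A2"]},
    index=pd.Index(["K0", "K1", "K2"], name="key"),
)
right = pd.DataFrame(
    {"C": ["C0", "C1", "C2", "C3"]},
    index=pd.MultiIndex.from_tuples(
        [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2"), ("K2", "Y3")],
        names=["key", "Y"],
    ),
)

# The shared name "key" is what the join aligns on.
result = left.join(right, how="inner")
print(result)
print(list(result.index.names))
```

Because "K2" appears twice in the right table, the left row for "K2" is repeated once for each of its "Y" values.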

What if both tables have a multi-level row index?

leftindex = pd.MultiIndex.from_product(
    [list("abc"), list("xy"), [1, 2]], names=["abc", "xy", "num"]
)
left = pd.DataFrame({"v1": range(12)}, index=leftindex)

output

            v1
abc xy num    
a   x  1     0
       2     1
    y  1     2
       2     3
b   x  1     4
       2     5
    y  1     6
       2     7
c   x  1     8
       2     9
    y  1    10
       2    11

The second table is as follows

rightindex = pd.MultiIndex.from_product(
    [list("abc"), list("xy")], names=["abc", "xy"]
)
right = pd.DataFrame({"v2": [100 * i for i in range(1, 7)]}, index=rightindex)

output

         v2
abc xy     
a   x   100
    y   200
b   x   300
    y   400
c   x   500
    y   600

Now join the two tables above

left.join(right, on=["abc", "xy"], how="inner")

output

            v1   v2
abc xy num         
a   x  1     0  100
       2     1  100
    y  1     2  200
       2     3  200
b   x  1     4  300
       2     5  300
    y  1     6  400
       2     7  400
c   x  1     8  500
       2     9  500
    y  1    10  600
       2    11  600

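Renaming columns

If the two tables have columns with the same name, those columns are renamed in the merged result (by default with the suffixes _x and _y), for example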

left = pd.DataFrame({"k": ["K0", "K1", "K2"], "v": [1, 2, 3]})
right = pd.DataFrame({"k": ["K0", "K0", "K3"], "v": [4, 5, 6]})
result = pd.merge(left, right, on="k")

output

This is where the suffixes parameter comes in: it lets you choose the suffixes given to the overlapping columns, for example

result = pd.merge(left, right, on="k", suffixes=("_l", "_r"))

output
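
Putting the two calls next to each other, as a short sketch on the same `left` and `right` tables, makes the effect on the column names easy to compare:

```python
import pandas as pd

left = pd.DataFrame({"k": ["K0", "K1", "K2"], "v": [1, 2, 3]})
right = pd.DataFrame({"k": ["K0", "K0", "K3"], "v": [4, 5, 6]})

default = pd.merge(left, right, on="k")                         # default suffixes
custom = pd.merge(left, right, on="k", suffixes=("_l", "_r"))   # custom suffixes

print(list(default.columns))  # ['k', 'v_x', 'v_y']
print(list(custom.columns))   # ['k', 'v_l', 'v_r']
print(custom)
```

Only "K0" is present in both tables, and it occurs twice on the right, so the merged result has two rows.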

A brief introduction to the `combine_first()` method

If one of the two tables contains missing values, you can use combine_first()

df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})
df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
df1.combine_first(df2)

output

     A    B
0  1.0  3.0
1  0.0  4.0

The missing values in the first table are replaced by the non-missing values at the same positions in the other table
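
Two details are worth a small sketch: values already present in the calling table take priority, and the result covers the union of both tables' rows and columns. The tables `primary` and `backup` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# primary has one gap; backup has an extra row and an extra column.
primary = pd.DataFrame({"A": [np.nan, 0.0]}, index=[0, 1])
backup = pd.DataFrame({"A": [1.0, 9.0, 7.0], "B": [3.0, 3.0, 3.0]}, index=[0, 1, 2])

filled = primary.combine_first(backup)
print(filled)
```

The gap at row 0 is taken from `backup`, the existing 0.0 at row 1 is kept rather than overwritten by 9.0, and row 2 and column "B" come entirely from `backup`.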

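A brief introduction to the `combine()` method

combine() merges two tables column by column, but unlike the methods above it also needs a function passed in, which receives each pair of matching columns and decides what to keep. Let's look at a simple example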

df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})
df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
take_smaller = lambda s1, s2: s1 if s1.sum() < s2.sum() else s2

We have defined a simple function that, for each pair of columns, keeps the column whose sum is smaller

df1.combine(df2, take_smaller)

output

   A  B
0  0  3
1  0  3

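If the tables contain missing values, combine() has a fill_value parameter that fills them in before the columns are passed to the function

df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})
df2 = pd.DataFrame({'A': [2, 2], 'B': [3, 3]})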
df1.combine(df2, take_smaller, fill_value=-5)

output

   A    B
0  0 -5.0
1  0  4.0
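
The function passed to `combine()` does not have to return one of its two inputs whole; it can also build a new column. As a sketch, passing NumPy's `np.minimum` yields an element-wise minimum. The tables `df_a` and `df_b` are invented for this example.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({"A": [5, 0], "B": [2, 4]})
df_b = pd.DataFrame({"A": [1, 1], "B": [3, 3]})

# np.minimum receives one pair of columns at a time and returns
# a new column holding the smaller value at each position.
elementwise_min = df_a.combine(df_b, np.minimum)
print(elementwise_min)
```

Compared with `take_smaller`, which keeps an entire column from one table, this version mixes values from both tables within the same column.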

Reference link

https://mp.weixin.qq.com/s/Y7ccJ8TuVh_dCac3EWIhrw
