Python Pandas: join two tables, removing duplicates without changing the first table

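I need to:

Join table 1 and table 2
Eliminate duplicates
Keep the original rows of table 1 unchanged
Produce a dictionary that says which id a row had in the old table and which id it has now

Example: the output would look like the tables below.

PS: table1 comes from a database that is already in production, and the ids it contains are referenced by many other tables. So I cannot change anything that already exists; I can only add data that is not there yet. I also need to record what the new id of each added row is.

Table 1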

id   name        birthdate     
1    Goku        1997-12-15 
2    Freeza      2000-10-03
3    Vegeta      2003-08-19

Table 2

id    name        birthdate
1     Krillin     1983-02-28
2     Roshi       1960-06-07
3     Goku        1997-12-15
4     Freeza      1998-10-10

So from these I need to generate the following:

Resulting table 1

id    name        birthdate     
1     Goku        1997-12-15 
2     Freeza      2000-10-03
3     Vegeta      2003-08-19
4     Krillin     1983-02-28
5     Roshi       1960-06-07
6     Freeza      1998-10-10

I also need a table that maps each person's id in the old tables to their new id, which would look like this:

from_to_table

id   origin      new_id
1    table_1     1
2    table_1     2
3    table_1     3
1    table_2     4
2    table_2     5
3    table_2     1
4    table_2     6

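I have tried many approaches. The only one that works so far is inserting row by row and checking both fields (name and birthdate) each time, but that takes so long that it is not feasible.

The best approach I have found so far is basically: concatenate the two tables -> group the data and generate a new id column -> join the grouped table back to the concatenated tables to create from_to_table. The problem is that this approach changes ids that must not change, and I don't know how to preserve them.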

Answer from a uj5u.com user:

For resulting_table1, I suggest using merge with an outer join on the name and birthdate columns, and then recreating the id column:

resulting_table1 = pd.merge(table1, table2, on=['name','birthdate'], how='outer')[['name','birthdate']]
resulting_table1['id'] = range(1, len(resulting_table1) + 1)
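One caveat with the outer merge above: pandas documents how='outer' as sorting the join keys lexicographically, so depending on the pandas version the regenerated ids may not line up with the ids table1 already has. As a hedged alternative (not the answerer's method), the sketch below appends only the rows of table2 that are missing from table1 and numbers them after table1's highest id. The sample frames are rebuilt from the question; the names keys, flagged and new_rows are illustrative.

```python
import pandas as pd

table1 = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Goku", "Freeza", "Vegeta"],
    "birthdate": ["1997-12-15", "2000-10-03", "2003-08-19"],
})
table2 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Krillin", "Roshi", "Goku", "Freeza"],
    "birthdate": ["1983-02-28", "1960-06-07", "1997-12-15", "1998-10-10"],
})

keys = ["name", "birthdate"]

# Left-merge table2's keys against table1's keys; the indicator column
# (_merge) is 'left_only' for rows that do not exist in table1 yet.
flagged = table2[keys].drop_duplicates().merge(
    table1[keys], on=keys, how="left", indicator=True
)
new_rows = flagged.loc[flagged["_merge"] == "left_only", keys].reset_index(drop=True)

# Number the new rows after the highest id already used in table1.
start = int(table1["id"].max()) + 1
new_rows.insert(0, "id", range(start, start + len(new_rows)))

# table1 is concatenated as-is, so its existing ids are never touched.
resulting_table1 = pd.concat([table1, new_rows], ignore_index=True)
print(resulting_table1)
```

Because table1 is never renumbered, this also works when its ids have gaps, which is common in a production table.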

For from_to_table, you can use another outer join (this time on all columns) with the indicator flag to keep track of which source table each row came from:

from_to_table = pd.merge(table1, table2, how='outer', indicator='origin').replace({'origin':{'left_only':'table_1', 'right_only':'table_2'}})
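To make the indicator step concrete, here is a small sketch on the sample data from the question. The indicator column is categorical, so this version relabels it with cat.rename_categories instead of replace; that substitution is my own choice, made because some pandas versions warn when replace is applied to a categorical column. Note also that a row identical in both tables (same id, name and birthdate) would be tagged 'both' and appear only once.

```python
import pandas as pd

table1 = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Goku", "Freeza", "Vegeta"],
    "birthdate": ["1997-12-15", "2000-10-03", "2003-08-19"],
})
table2 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Krillin", "Roshi", "Goku", "Freeza"],
    "birthdate": ["1983-02-28", "1960-06-07", "1997-12-15", "1998-10-10"],
})

# With no `on=` argument, merge joins on every shared column:
# id, name and birthdate.
from_to_table = pd.merge(table1, table2, how="outer", indicator="origin")

# The indicator column holds the categories 'left_only', 'right_only'
# and 'both'; relabel the first two and leave 'both' as it is.
from_to_table["origin"] = from_to_table["origin"].cat.rename_categories(
    {"left_only": "table_1", "right_only": "table_2"}
)

print(from_to_table["origin"].value_counts())
```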

Finally, do a left join with resulting_table1 to pick up the new ids:

from_to_table = from_to_table.merge(resulting_table1, on=['name','birthdate'], how="left")

Answer from a uj5u.com user:

I'm assuming that id is a column, not the index:

table1 =
   id    name   birthdate
0   1    Goku  1997-12-15
1   2  Freeza  2000-10-03
2   3  Vegeta  2003-08-19

Then you could try the following:

(1) Create a concatenated table_tmp with an extra column that records the source table:

table_tmp = pd.concat([table1.assign(table=1), table2.assign(table=2)])

   id     name   birthdate  table
0   1     Goku  1997-12-15      1
1   2   Freeza  2000-10-03      1
2   3   Vegeta  2003-08-19      1
0   1  Krillin  1983-02-28      2
1   2    Roshi  1960-06-07      2
2   3     Goku  1997-12-15      2
3   4   Freeza  1998-10-10      2

(2) Build resulting_table1 from it:

resulting_table1 = (
    table_tmp
    .drop_duplicates(["name", "birthdate"])
    .reset_index(drop=True)
    .assign(id=lambda df: df.index + 1)
    .drop(columns="table")
)

   id     name   birthdate
0   1     Goku  1997-12-15
1   2   Freeza  2000-10-03
2   3   Vegeta  2003-08-19
3   4  Krillin  1983-02-28
4   5    Roshi  1960-06-07
5   6   Freeza  1998-10-10
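Step (2) renumbers every row starting from 1, which reproduces table1's ids only when they already run 1..n in table order. Since the question requires that existing ids never change, a quick check like the following may be worth running before writing anything back. The helper name ids_preserved and the renumbered frame are hypothetical, introduced only for this illustration.

```python
import pandas as pd

def ids_preserved(original: pd.DataFrame, result: pd.DataFrame) -> bool:
    """Return True if every (id, name, birthdate) row of `original`
    also appears, unchanged, in `result`."""
    cols = ["id", "name", "birthdate"]
    check = original[cols].merge(result[cols], on=cols, how="left", indicator=True)
    return bool((check["_merge"] == "both").all())

table1 = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Goku", "Freeza", "Vegeta"],
    "birthdate": ["1997-12-15", "2000-10-03", "2003-08-19"],
})

# The combined table as produced in step (2).
resulting_table1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "name": ["Goku", "Freeza", "Vegeta", "Krillin", "Roshi", "Freeza"],
    "birthdate": ["1997-12-15", "2000-10-03", "2003-08-19",
                  "1983-02-28", "1960-06-07", "1998-10-10"],
})

# A deliberately bad variant: sorted by key and renumbered, so Goku
# is no longer id 1.
renumbered = (
    resulting_table1
    .sort_values(["name", "birthdate"])
    .reset_index(drop=True)
    .assign(id=lambda df: df.index + 1)
)

print(ids_preserved(table1, resulting_table1))
print(ids_preserved(table1, renumbered))
```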

(3) Then use both to create from_to_table:

from_to_table = (
    table_tmp
    .merge(resulting_table1, on=["name", "birthdate"], how="left")
    .drop(columns=["name", "birthdate"])
    .rename(columns={"id_x": "id", "id_y": "id_new"})
)

   id  table  id_new
0   1      1       1
1   2      1       2
2   3      1       3
3   1      2       4
4   2      2       5
5   3      2       1
6   4      2       6
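The question asks for a dictionary of old ids to new ids. One way to get that from the from_to_table shown above is sketched below, keyed by (table, old id). The names id_map and table2_map are illustrative and not part of the original answer; the frame is typed in by hand from the printed output.

```python
import pandas as pd

# from_to_table exactly as printed in step (3).
from_to_table = pd.DataFrame({
    "id": [1, 2, 3, 1, 2, 3, 4],
    "table": [1, 1, 1, 2, 2, 2, 2],
    "id_new": [1, 2, 3, 4, 5, 1, 6],
})

# Map (source table, old id) -> new id, using plain Python ints as keys.
id_map = {
    (int(table), int(old_id)): int(new_id)
    for table, old_id, new_id in zip(
        from_to_table["table"], from_to_table["id"], from_to_table["id_new"]
    )
}

# Narrow it to table 2 only, e.g. to rewrite foreign keys that still
# point at table 2's old ids.
table2_map = {old_id: new_id for (table, old_id), new_id in id_map.items() if table == 2}

print(id_map[(2, 3)])
print(table2_map)
```

A per-table dictionary like table2_map can then be passed to Series.map on any column that still references table 2's old ids.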

