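I'm trying to join certain columns from several .CSV files on a key column. Where the join produces duplicates, however, I'd like to append the extra information to the existing row's fields instead of getting another row.
[Image: example DF data and the actual/desired results]
Does anyone know a way to do this?
Here is a sample of my current code. It selects the specified columns and joins the data, but it produces duplicate entries: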
import pandas as pd

# Read every column as a string (dtype=object) so IDs keep leading zeros
DF1 = pd.read_csv('facilities.csv', header=0, dtype=object)
DF2 = pd.read_csv('permits.csv', header=0, dtype=object)
DF3 = pd.read_csv('programs.csv', header=0, dtype=object)

# Select only the necessary columns from the CSVs
DF1_reduc = DF1[['ID', 'FACILITY_TYPE_CODE', 'FACILITY_NAME', 'LOCATION_ADDRESS']]
DF2_reduc = DF2[['ID', 'ACTIVITY_ID', 'PERMIT_NAME', 'PERMIT_STATUS_CODE']]
DF3_reduc = DF3[['ID', 'PROG_CODE']]

# Join all tables together on ID
joined_tables = [DF1_reduc, DF2_reduc, DF3_reduc]
joined_tables = [table.set_index('ID') for table in joined_tables]
joined_tables = joined_tables[0].join(joined_tables[1:])
Answer from a uj5u.com user:
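import pandas as pd

# Assumption: `df` is the joined result from the question with ID as a
# regular column, e.g. df = joined_tables.reset_index()
PERMIT_NAME = []
PERMIT_STATUS_CODE = []
ACTIVITY_ID = []

# Make sure ACTIVITY_ID holds strings so the values can be joined below
df['ACTIVITY_ID'] = df['ACTIVITY_ID'].astype(str)

for ID in df.ID.unique():
    subset = df[df["ID"] == ID]
    # Collapse each ID's distinct values into one comma-separated string
    PERMIT_NAME.append(", ".join(subset['PERMIT_NAME'].unique()))
    PERMIT_STATUS_CODE.append(", ".join(subset['PERMIT_STATUS_CODE'].unique()))
    ACTIVITY_ID.append(", ".join(subset['ACTIVITY_ID'].unique()))

# Keep one row per ID, then attach the combined columns
zz = df.drop(['PERMIT_NAME', 'PERMIT_STATUS_CODE', 'ACTIVITY_ID'], axis=1).drop_duplicates()
zz['PERMIT_NAME'] = PERMIT_NAME
zz['PERMIT_STATUS_CODE'] = PERMIT_STATUS_CODE
zz['ACTIVITY_ID'] = ACTIVITY_ID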
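The idea here is to take your final output and loop over the subset of rows for each ID, joining the unique codes into a string so that they collapse into the single value you asked for. If you would rather have an array, you can remove the join.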
ID    Facility_Code  FACILITY_NAME  Location_Address    PROG_CODE  PERMIT_NAME        PERMIT_STATUS_CODE  ACTIVITY_ID
04R1  GAB            Facility 1     HIGHWAY 1 E         ABC        PERMIT 1, permit1  A, C                1111, 1234
05R2  GAB            Facility 2     1200 MOUNTAIN ROAD  ABC        PERMIT 2           B                   1111
05R7  VOR            Facility 3     500 MARSH PASS                 PERMIT 3           A, C                2000, 1234
0K09  FOP            Facility 4     67 SEA LANE                    permit4            C                   1111
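If you want the array form mentioned above rather than joined strings, the same loop can keep the unique values as lists. This is only a minimal sketch: the small `df` below is a hypothetical stand-in for the joined output (two of the permit columns, trimmed for brevity), and the names `permit_names`, `activity_ids`, and `result` are illustrative.

```python
import pandas as pd

# Hypothetical joined result: one row per (facility, permit) pair.
df = pd.DataFrame({
    "ID": ["04R1", "04R1", "05R2"],
    "FACILITY_NAME": ["Facility 1", "Facility 1", "Facility 2"],
    "PERMIT_NAME": ["PERMIT 1", "permit1", "PERMIT 2"],
    "ACTIVITY_ID": ["1111", "1234", "1111"],
})

permit_names = []
activity_ids = []
for facility_id in df["ID"].unique():
    subset = df[df["ID"] == facility_id]
    # Keep the unique values as a list instead of joining them into a string.
    permit_names.append(list(subset["PERMIT_NAME"].unique()))
    activity_ids.append(list(subset["ACTIVITY_ID"].unique()))

# One row per ID, in the same order as df["ID"].unique().
result = df.drop(["PERMIT_NAME", "ACTIVITY_ID"], axis=1).drop_duplicates()
result["PERMIT_NAME"] = permit_names
result["ACTIVITY_ID"] = activity_ids
print(result)
```

Lists keep the individual values available for later filtering, at the cost of a column that is harder to write back to CSV cleanly.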
Answer from a uj5u.com user:
Group by the columns that uniquely identify each row and use agg() to combine the rows:
df = df1.join([df2, df3])

# Group on the columns that are constant per facility; every remaining
# column is collapsed to a comma-separated string of its unique values.
df = (
    df.groupby(['ID',
                'FACILITY_TYPE_CODE',
                'FACILITY_NAME',
                'LOCATION_ADDRESS'])
      .agg(lambda s: ', '.join(s.fillna('').unique().astype('str')))
)

# Drop index for concise output.
print(df.reset_index(drop=True))
#   ACTIVITY_ID  PERMIT_NAME        PERMIT_STATUS_CODE  PROG_CODE
# 0 1111, 1234   PERMIT 1, permit1  A, C                ABC
# 1 1111         PERMIT 2           B                   ABC
# 2 2000, 1234   PERMIT 3           A, C
# 3 1111         permit4            C
Or, even simpler, if you want the values grouped into sets:
df = df1.join([df2, df3])

# Same grouping as before, but collect each column's values into a set
df = (
    df.groupby(['ID',
                'FACILITY_TYPE_CODE',
                'FACILITY_NAME',
                'LOCATION_ADDRESS'])
      .agg(set)
)

# Drop index for concise output.
print(df.reset_index(drop=True))
#   ACTIVITY_ID   PERMIT_NAME          PERMIT_STATUS_CODE  PROG_CODE
# 0 {1234, 1111}  {PERMIT 1, permit1}  {A, C}              {ABC}
# 1 {1111}        {PERMIT 2}           {B}                 {ABC}
# 2 {2000, 1234}  {PERMIT 3}           {A, C}              {nan}
# 3 {1111}        {permit4}            {C}                 {nan}
Further reading: https://pandas.pydata.org/docs/user_guide/groupby.html
Sample data:
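The set version leaves {nan} for facilities that have no program code. If an empty value is preferable, one option is to drop missing values before combining. The following is a sketch under assumptions, not part of the original answer: the helper name `combine_unique` and the small inline frame are made up for illustration, and the values are kept as strings, as they would be when reading with dtype=object.

```python
import numpy as np
import pandas as pd


def combine_unique(values: pd.Series) -> str:
    """Join the distinct non-missing values of one group into a string."""
    return ", ".join(values.dropna().astype(str).unique())


# Hypothetical joined data; PROG_CODE is missing for facility 05R7.
df = pd.DataFrame({
    "ID": ["05R7", "05R7", "04R1"],
    "FACILITY_NAME": ["Facility 3", "Facility 3", "Facility 1"],
    "ACTIVITY_ID": ["2000", "1234", "1111"],
    "PERMIT_NAME": ["PERMIT 3", "PERMIT 3", "PERMIT 1"],
    "PROG_CODE": [np.nan, np.nan, "ABC"],
})

combined = (
    df.groupby(["ID", "FACILITY_NAME"], sort=False)  # keep first-seen order
      .agg(combine_unique)
      .reset_index()
)
print(combined)
```

Here a group whose values are all missing collapses to an empty string, so the column stays a plain string column throughout.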
import io
import pandas as pd
facilities = io.StringIO("""
ID,FACILITY_TYPE_CODE,FACILITY_NAME,LOCATION_ADDRESS
04R1,GAB,Facility 1,HIGHWAY 1 E
05R2,GAB,Facility 2,1200 MOUNTAIN ROAD
05R7,VOR,Facility 3,500 MARSH PASS
0K09,FOP,Facility 4,67 SEA LANE
""")
permits = io.StringIO("""
ID,ACTIVITY_ID,PERMIT_NAME,PERMIT_STATUS_CODE
04R1,1111,PERMIT 1,A
04R1,1234,permit1,C
05R2,1111,PERMIT 2,B
05R7,2000,PERMIT 3,A
05R7,1234,PERMIT 3,C
0K09,1111,permit4,C
""")
programs = io.StringIO("""
ID,PROG_CODE
04R1,ABC
05R2,ABC
05R7,
0K09,
""")
df1 = pd.read_csv(facilities, index_col='ID')
df2 = pd.read_csv(permits, index_col='ID')
df3 = pd.read_csv(programs, index_col='ID')
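A different ordering of the same steps may also be worth trying: collapse the one-to-many permits table to one row per ID first, and only then join, so the join cannot duplicate facility rows in the first place. This is a hedged sketch rather than anything from the answers above; it uses a cut-down copy of the sample data, and the names `permits_per_id` and `result` are illustrative.

```python
import io

import pandas as pd

# Cut-down copy of the sample data (fewer rows and columns).
facilities = io.StringIO("""ID,FACILITY_NAME
04R1,Facility 1
05R2,Facility 2
""")
permits = io.StringIO("""ID,ACTIVITY_ID,PERMIT_NAME
04R1,1111,PERMIT 1
04R1,1234,permit1
05R2,1111,PERMIT 2
""")

# Read everything as strings so the values can be joined directly.
df1 = pd.read_csv(facilities, index_col="ID", dtype=str)
df2 = pd.read_csv(permits, index_col="ID", dtype=str)

# Collapse the permits to one row per ID before joining.
permits_per_id = df2.groupby(level="ID").agg(lambda s: ", ".join(s.unique()))

# Both frames now have a unique ID index, so the join keeps one row per facility.
result = df1.join(permits_per_id)
print(result)
```

Aggregating before the join keeps the facility columns out of the groupby keys, which may be convenient when there are many of them.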
