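I have several sets of very large CSV files that I need to merge on a unique ID. I build this unique ID by concatenating my Origin and Destination columns and set it as the index.
DataFrame 1:
| | Origin | Destination | Value |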
|---|---|---|---|
| 70478 | 70 | 478 | 0.002779 |
| 70479 | 70 | 479 | 0.001673 |
| 70480 | 70 | 480 | 0.000427 |
| 70481 | 70 | 481 | 0.001503 |
| 70482 | 70 | 482 | 0.01215 |
| 70483 | 70 | 483 | 0.004507 |
| 70484 | 70 | 484 | 0.001871 |
| 70485 | 70 | 485 | 0.006522 |
| 70486 | 70 | 486 | 0.004786 |
| 70487 | 70 | 487 | 0.026566 |
DataFrame 2:
| | Origin | Destination | Value |
|---|---|---|---|
| 70478 | 70 | 478 | 135.974365 |
| 70479 | 70 | 479 | 130.936752 |
| 70480 | 70 | 480 | 111.191734 |
| 70481 | 70 | 481 | 98.170746 |
| 70482 | 70 | 482 | 88.257645 |
| 70483 | 70 | 483 | 102.095566 |
| 70484 | 70 | 484 | 103.585373 |
| 70485 | 70 | 485 | 114.298431 |
| 70486 | 70 | 486 | 97.331055 |
| 70487 | 70 | 487 | 85.754776 |
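My final table should look like this (Demand = Value from df1; Time = Value from df2; Demand_Time = Demand * Time, as in the code sample below):
| | Origin | Destination | Demand | Time | Demand_Time |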
|---|---|---|---|---|---|
| 0 | 70 | 478 | 0.002779 | 135.974365 | 0.377858 |
| 1 | 70 | 479 | 0.001673 | 130.936752 | 0.219041 |
| 2 | 70 | 480 | 0.000427 | 111.191734 | 0.047494 |
| 3 | 70 | 481 | 0.001503 | 98.170746 | 0.147536 |
| 4 | 70 | 482 | 0.01215 | 88.257645 | 1.072321 |
| 5 | 70 | 483 | 0.004507 | 102.095566 | 0.460115 |
| 6 | 70 | 484 | 0.001871 | 103.585373 | 0.193806 |
| 7 | 70 | 485 | 0.006522 | 114.298431 | 0.74551 |
| 8 | 70 | 486 | 0.004786 | 97.331055 | 0.465854 |
| 9 | 70 | 487 | 0.026566 | 85.754776 | 2.278125 |
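I ran .compare between df1 and df2, which produced the following new DataFrame (self = df1, other = df2):
| | Origin (self) | Origin (other) | Destination (self) | Destination (other) | Value (self) | Value (other) |
|---|---|---|---|---|---|---|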
| 70478 | 70 | 70 | 478 | 478 | 0.002779 | 135.974365 |
| 70479 | 70 | 70 | 479 | 479 | 0.001673 | 130.936752 |
| 70480 | 70 | 70 | 480 | 480 | 0.000427 | 111.191734 |
| 70481 | 70 | 70 | 481 | 481 | 0.001503 | 98.170746 |
| 70482 | 70 | 70 | 482 | 482 | 0.01215 | 88.257645 |
| 70483 | 70 | 70 | 483 | 483 | 0.004507 | 102.095566 |
| 70484 | 70 | 70 | 484 | 484 | 0.001871 | 103.585373 |
| 70485 | 70 | 70 | 485 | 485 | 0.006522 | 114.298431 |
| 70486 | 70 | 70 | 486 | 486 | 0.004786 | 97.331055 |
| 70487 | 70 | 70 | 487 | 487 | 0.026566 | 85.754776 |
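I then create a new, final pd.DataFrame, iterate over the comparison table above, and .append each row to that new DataFrame.
This last part, iterating and appending, takes a very long time on very large tables (a few hundred thousand records each): about 1.5 hours per run.
Is there a more efficient way to do this last part?
Thank you.
Code sample: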
import pandas as pd
# Replicating sample df1 (.read_csv from csv file 1)
df_1_data = [[70, 478, 0.0027788935694843],
             [70, 479, 0.0016728754853829],
             [70, 480, 0.0004271405050531],
             [70, 481, 0.0015028485795482],
             [70, 482, 0.0121498983353376],
             [70, 483, 0.0045067127794027],
             [70, 484, 0.0018709792057052],
             [70, 485, 0.0065224897116422],
             [70, 486, 0.0047862790524959],
             [70, 487, 0.0265655759721994]]
df_1 = pd.DataFrame(df_1_data, columns=['Origin', 'Destination', 'Value'])
# Unique ID index: Origin and Destination concatenated as strings
df_1 = df_1.set_index(df_1['Origin'].astype(str) + df_1['Destination'].astype(str))
print(df_1)
# Replicating sample df2 (.read_csv from csv file 2)
df_2_data = [[70, 478, 135.9743652],
             [70, 479, 130.9367523],
             [70, 480, 111.1917343],
             [70, 481, 98.17074585],
             [70, 482, 88.25764465],
             [70, 483, 102.0955658],
             [70, 484, 103.5853729],
             [70, 485, 114.2984314],
             [70, 486, 97.33105469],
             [70, 487, 85.754776]]
df_2 = pd.DataFrame(df_2_data, columns=['Origin', 'Destination', 'Value'])
# Same unique ID index as df_1
df_2 = df_2.set_index(df_2['Origin'].astype(str) + df_2['Destination'].astype(str))
print(df_2)
# keep_shape and keep_equal retain every row and column, including equal values
df_compare = df_1.compare(df_2, keep_shape=True, keep_equal=True)
print(df_compare)
df_out = pd.DataFrame(columns=['Origin', 'Destination', 'Demand', 'Time', 'Demand_Time'])
# Slow part: one .append per row (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0)
for index, row in df_compare.iterrows():
    df_out = df_out.append({'Origin': int(row['Origin']['self']), 'Destination': int(row['Destination']['self']),
                            'Demand': row['Value']['self'], 'Time': row['Value']['other'],
                            'Demand_Time': row['Value']['self'] * row['Value']['other']}, ignore_index=True)
print(df_out)
print('\nCOMPLETED')
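A note on why the loop above is slow: each DataFrame.append call returns a new DataFrame and copies every row accumulated so far, so the total work grows roughly quadratically with the number of rows. If a row-wise loop has to stay, one option is to collect plain dicts in a list and build the DataFrame once at the end. The following is only a minimal sketch on a two-row stand-in for the data; `demand_df`, `time_df` and `rows` are hypothetical names, not part of the question.

```python
import pandas as pd

# Two-row stand-ins for df_1 and df_2 (values copied from the first rows of the tables above)
demand_df = pd.DataFrame({'Origin': [70, 70], 'Destination': [478, 479],
                          'Value': [0.002779, 0.001673]})
time_df = pd.DataFrame({'Origin': [70, 70], 'Destination': [478, 479],
                        'Value': [135.974365, 130.936752]})

rows = []  # hypothetical accumulator: one dict per output row
for d, t in zip(demand_df.itertuples(index=False), time_df.itertuples(index=False)):
    rows.append({'Origin': d.Origin, 'Destination': d.Destination,
                 'Demand': d.Value, 'Time': t.Value,
                 'Demand_Time': d.Value * t.Value})

# Build the DataFrame once instead of growing it row by row
df_out = pd.DataFrame(rows, columns=['Origin', 'Destination', 'Demand', 'Time', 'Demand_Time'])
print(df_out)
```

This sketch pairs rows by position, so it assumes both tables are already in the same order; the vectorized answers below avoid the Python-level loop entirely.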
A helpful uj5u.com user replied:
IIUC, you can use:
# df1 and df2 are the two DataFrames from the question (df_1 and df_2 in its code sample)
out = (df1.rename(columns={'Value': 'Demand'})
          .assign(Time=df2['Value'], Demand_Time=df2['Value'] * df1['Value'])
          .reset_index(drop=True))
print(out)
# Output (the Demand_Time values match products of the rounded figures shown in the tables,
# so they differ slightly from the question's expected table)
   Origin  Destination    Demand        Time  Demand_Time
0      70          478  0.002779  135.974365     0.377873
1      70          479  0.001673  130.936752     0.219057
2      70          480  0.000427  111.191734     0.047479
3      70          481  0.001503   98.170746     0.147551
4      70          482  0.012150   88.257645     1.072330
5      70          483  0.004507  102.095566     0.460145
6      70          484  0.001871  103.585373     0.193808
7      70          485  0.006522  114.298431     0.745454
8      70          486  0.004786   97.331055     0.465826
9      70          487  0.026566   85.754776     2.278161
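The one-liner above works because pandas aligns the two DataFrames on their index labels before assigning and multiplying. As a hedged sketch of the same idea, the version below indexes on the Origin and Destination columns themselves rather than on the concatenated string. That change is my assumption, not something the answer proposes: concatenating without a separator can map different pairs to the same ID, for example (70, 478) and (704, 78) both give '70478'. The sample values are the first three rows of the question's tables.

```python
import pandas as pd

keys = ['Origin', 'Destination']
df1 = pd.DataFrame({'Origin': [70, 70, 70], 'Destination': [478, 479, 480],
                    'Value': [0.002779, 0.001673, 0.000427]}).set_index(keys)
# df2 is deliberately in a different row order to show that alignment is by label, not position
df2 = pd.DataFrame({'Origin': [70, 70, 70], 'Destination': [480, 478, 479],
                    'Value': [111.191734, 135.974365, 130.936752]}).set_index(keys)

out = (df1.rename(columns={'Value': 'Demand'})
          .assign(Time=df2['Value'])  # aligned on the (Origin, Destination) index
          .assign(Demand_Time=lambda d: d['Demand'] * d['Time'])
          .reset_index())
print(out)
```

Because everything here is a column-level operation, the cost scales with the number of rows rather than with repeated copies of the growing result.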
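A helpful uj5u.com user replied:
If I understand the request correctly, I would use a combination of pandas and NumPy to get the result you want in a reasonable time: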
import numpy as np
import pandas as pd
# Replicating sample df1 (.read_csv from csv file 1)
df_1_data = [[70, 478, 0.0027788935694843],
             [70, 479, 0.0016728754853829],
             [70, 480, 0.0004271405050531],
             [70, 481, 0.0015028485795482],
             [70, 482, 0.0121498983353376],
             [70, 483, 0.0045067127794027],
             [70, 484, 0.0018709792057052],
             [70, 485, 0.0065224897116422],
             [70, 486, 0.0047862790524959],
             [70, 487, 0.0265655759721994]]
df_1 = pd.DataFrame(df_1_data, columns=['Origin', 'Destination', 'Value'])
df_1 = df_1.set_index(df_1['Origin'].astype(str) + df_1['Destination'].astype(str))
# Replicating sample df2 (.read_csv from csv file 2)
df_2_data = [[70, 478, 135.9743652],
             [70, 479, 130.9367523],
             [70, 480, 111.1917343],
             [70, 481, 98.17074585],
             [70, 482, 88.25764465],
             [70, 483, 102.0955658],
             [70, 484, 103.5853729],
             [70, 485, 114.2984314],
             [70, 486, 97.33105469],
             [70, 487, 85.754776]]
df_2 = pd.DataFrame(df_2_data, columns=['Origin', 'Destination', 'Value'])
df_2 = df_2.set_index(df_2['Origin'].astype(str) + df_2['Destination'].astype(str))
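# Rename the value columns using flat lists (nested lists would create MultiIndex columns)
df_1.columns = ['Origin', 'Destination', 'Demand']
df_2.columns = ['Origin', 'Destination', 'Time']
# Merge on the shared key columns
df_merge = df_1.merge(df_2, how='inner', on=['Origin', 'Destination'])
# Vectorized product on the underlying NumPy arrays; the question's code and
# expected table multiply Demand by Time
df_merge['Demand_Time'] = np.multiply(df_merge['Demand'].to_numpy(), df_merge['Time'].to_numpy())
print(df_merge)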
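When the two files are not guaranteed to contain exactly the same Origin/Destination pairs, a merge on explicit keys can also make mismatches visible. The sketch below is an illustration under that assumption and not part of the answer: it reads two tiny in-memory CSVs (stand-ins for the real files, whose paths the question does not give) and uses the `validate` and `indicator` parameters of `DataFrame.merge` to reject duplicate keys and to flag rows found in only one file.

```python
import io
import pandas as pd

# In-memory stand-ins for the two CSV files
csv_1 = io.StringIO("Origin,Destination,Value\n70,478,0.002779\n70,479,0.001673\n70,480,0.000427\n")
csv_2 = io.StringIO("Origin,Destination,Value\n70,478,135.974365\n70,479,130.936752\n70,481,98.170746\n")

demand = pd.read_csv(csv_1).rename(columns={'Value': 'Demand'})
time_df = pd.read_csv(csv_2).rename(columns={'Value': 'Time'})

# validate raises MergeError if a key pair is duplicated on either side;
# indicator adds a '_merge' column with 'both', 'left_only' or 'right_only'
merged = demand.merge(time_df, on=['Origin', 'Destination'], how='outer',
                      validate='one_to_one', indicator=True)

unmatched = merged[merged['_merge'] != 'both']  # pairs present in only one file
matched = merged[merged['_merge'] == 'both'].drop(columns='_merge').copy()
matched['Demand_Time'] = matched['Demand'] * matched['Time']
print(matched)
print(unmatched)
```

With `how='inner'` the unmatched pairs would simply be dropped without notice, which may or may not be what is wanted for the real data.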
