I have a dataset with 6.2 million records. When I split it with groupby, about 1.2 million records go missing. Here is part of the dataset:
VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count
1 | 2020-01-01 00:28:15 | 2020-01-01 00:33:03 | 1.0
1 | 2020-01-01 00:35:39 | 2020-01-01 00:43:04 | 1.0
.. |.................... | ................... | ...
1 | 2020-01-31 00:47:41 | 2020-01-31 00:53:52 | 1.0
1 | 2020-01-31 00:55:23 | 2020-01-31 01:00:14 | 1.0
2 | 2020-01-31 00:01:58 | 2020-01-31 00:04:16 | 1.0
I need to split it by day using the tpep_dropoff_datetime column. This is the code I use to do that, but as I mentioned above, it doesn't work correctly.
for date, g in df.groupby(pd.to_datetime(df['tpep_dropoff_datetime']).dt.normalize().astype(str)):
    g.to_csv(f'{date}.csv', index=False)
Any ideas on how to split the DataFrame?
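The thread does not establish where the rows go missing, so a first step could be to measure it. The sketch below uses a small hypothetical sample (one row has a deliberately missing drop-off time) and compares the total row count with the number of rows that land in a daily group. The variable names `dropoff`, `day`, `n_missing` and `n_grouped` are illustrative, not taken from the original code.

```python
import pandas as pd

# Hypothetical sample; the last row has a missing drop-off time on purpose.
df = pd.DataFrame({
    "VendorID": [1, 1, 2, 2],
    "tpep_dropoff_datetime": [
        "2020-01-01 00:33:03",
        "2020-01-01 00:43:04",
        "2020-01-31 00:04:16",
        None,
    ],
})

# errors="coerce" turns values that cannot be parsed into NaT instead of raising.
dropoff = pd.to_datetime(df["tpep_dropoff_datetime"], errors="coerce")
day = dropoff.dt.normalize()

# Rows without a usable drop-off day.
n_missing = int(day.isna().sum())

# groupby skips NaT keys by default (dropna=True),
# so this sum can be smaller than len(df).
n_grouped = int(df.groupby(day).size().sum())

print(f"total rows: {len(df)}")
print(f"rows in some daily group: {n_grouped}")
print(f"rows with no drop-off day: {n_missing}")
```

If the two counts add up to the total on the real data, the gap comes from missing or unparseable timestamps; if they do not, the loss is likely happening somewhere other than the grouping step.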
Answer from a uj5u.com user:
You can try this, although I don't think it is the best approach (pandas may have a better way to do it).
import pandas as pd

cols = ["VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime", "passenger_count"]
df = pd.DataFrame(
    [
        [1, "2020-01-01 00:28:15", "2020-01-01 00:33:03", 1.0],
        [1, "2020-01-01 00:35:39", "2020-01-01 00:43:04", 1.0],
        [1, "2020-01-31 00:47:41", "2020-01-31 00:53:52", 1.0],
        [1, "2020-01-31 00:55:23", "2020-01-31 01:00:14", 1.0],
    ],
    columns=cols,
)
# In this example the dates are strings, so I convert them to datetime.
# This might not be necessary, depending on your data.
df["tpep_dropoff_datetime"] = pd.to_datetime(df['tpep_dropoff_datetime'], format="%Y-%m-%d %H:%M:%S")
# Create a new column named "my_date" that
# contains the date part of the column "tpep_dropoff_datetime"
df["my_date"] = df["tpep_dropoff_datetime"].dt.date
# Now group all the rows by date and copy each group's rows using their index labels
for date, indexes in df.groupby('my_date').groups.items():
    print(f"date: {date}")
    print(f"indexes: {indexes}")
    # Copying the rows I want according to the index
    aux_df = df.loc[indexes]
    print(aux_df)
    # Exporting to csv only the columns I want
    aux_df.to_csv(f"{date}.csv", columns=cols, index=False)
The output is the CSV files, plus this in the console:
date: 2020-01-01
indexes: Int64Index([0, 1], dtype='int64')
   VendorID  tpep_pickup_datetime  tpep_dropoff_datetime  passenger_count     my_date
0         1   2020-01-01 00:28:15    2020-01-01 00:33:03              1.0  2020-01-01
1         1   2020-01-01 00:35:39    2020-01-01 00:43:04              1.0  2020-01-01
date: 2020-01-31
indexes: Int64Index([2, 3], dtype='int64')
   VendorID  tpep_pickup_datetime  tpep_dropoff_datetime  passenger_count     my_date
2         1   2020-01-31 00:47:41    2020-01-31 00:53:52              1.0  2020-01-31
3         1   2020-01-31 00:55:23    2020-01-31 01:00:14              1.0  2020-01-31
At least this way I can be sure the dates are correct, although it may not be the most efficient approach.
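As a possible shortcut, the loop can iterate over the groupby object directly, which yields each day's rows without the intermediate index lookup or the helper column. The following is only a sketch under a few assumptions: `out_dir` is a hypothetical output folder (a temporary directory here), the day key is built with `dt.strftime`, and `dropna=False` (available since pandas 1.1) is used so that rows with a missing drop-off time would still be written, to a file named `nan.csv`, rather than skipped.

```python
import tempfile
from pathlib import Path

import pandas as pd

cols = ["VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime", "passenger_count"]
df = pd.DataFrame(
    [
        [1, "2020-01-01 00:28:15", "2020-01-01 00:33:03", 1.0],
        [1, "2020-01-01 00:35:39", "2020-01-01 00:43:04", 1.0],
        [1, "2020-01-31 00:47:41", "2020-01-31 00:53:52", 1.0],
        [1, "2020-01-31 00:55:23", "2020-01-31 01:00:14", 1.0],
    ],
    columns=cols,
)

# Hypothetical output folder; a temporary directory keeps the sketch self-contained.
out_dir = Path(tempfile.mkdtemp())

# Group key: the drop-off day as a "YYYY-MM-DD" string, kept outside df
# so no helper column ends up in the CSV files.
day = pd.to_datetime(df["tpep_dropoff_datetime"]).dt.strftime("%Y-%m-%d")

rows_written = 0
# dropna=False keeps rows with a missing key in a group of their own.
for date, group in df.groupby(day, dropna=False):
    group.to_csv(out_dir / f"{date}.csv", index=False)
    rows_written += len(group)

print(f"{rows_written} of {len(df)} rows written to {out_dir}")
```

Comparing `rows_written` with `len(df)` at the end gives a quick check that no records were lost during the split.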
