Given the following Python pandas DataFrame:
| num_ID | start_date | end_date | time |
| ------ | ----------- | ---------- | ----------------- |
| 1 | 2022-02-14 | 2022-02-15 | 0 days 09:23:00 |
| 2 | 2022-02-12 | 2022-02-15 | 2 days 10:23:00 |
| 2 | 2022-02-05 | 2022-02-27 | 22 days 02:35:00 |
| 3 | 2022-02-04 | 2022-02-06 | 1 days 19:55:00 |
The DataFrame below contains consecutive dates, each with its holiday flag in the column is_holiday.
| date | is_holiday | name | other |
| ---------- | ---------- | ---- | ----- |
| 2022-01-01 | True | ABC | red |
| 2022-01-02 | False | CNA | blue |
...
# we assume in this case that the omitted rows have the value False in column
| 2022-02-15 | True | OOO | red |
| 2022-02-16 | True | POO | red |
| 2022-02-17 | False | KTY | blue |
...
| 2023-12-30 | False | TTE | white |
| 2023-12-31 | True | VVV | red |
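I want to add a new column total_days to the initial DataFrame. For each row, it should give the total number of holidays (rows marked True in the second DataFrame) that fall between the two dates start_date and end_date.
Example of the expected output: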
| num_ID | start_date | end_date | time | total_days |
| ------ | ----------- | ---------- | ----------------- | -------------- |
| 1 | 2022-02-14 | 2022-02-15 | 0 days 09:23:00 | 1 |
| 2 | 2022-02-12 | 2022-02-15 | 2 days 10:23:00 | 1 |
| 2 | 2022-02-05 | 2022-02-27 | 22 days 02:35:00 | 2 |
| 3 | 2022-02-04 | 2022-02-06 | 1 days 19:55:00 | 0 |
A uj5u.com user replied:
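Use DataFrame.merge with a cross join against only the rows where the holiday column is True, filter the pairs with Series.between, count them with GroupBy.size, and finally add the new column with DataFrame.join: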
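# df holds the intervals; df1 is the calendar with the boolean is_holiday column
# cross join every interval with every holiday date
df2 = df.merge(df1.loc[df1['is_holiday'], ['date']], how='cross')
# keep only the holiday dates inside each interval, then count per interval
s = (df2[df2['date'].between(df2['start_date'], df2['end_date'])]
       .groupby(['start_date', 'end_date']).size())
df = df.join(s.rename('total_holidays'), on=['start_date', 'end_date'])
# intervals without any holiday come back as NaN; set them to 0 and restore the int dtype
df['total_holidays'] = df['total_holidays'].fillna(0).astype(int)
print(df)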
num_ID start_date end_date total_time total_holidays
0 1 2022-02-14 2022-02-15 0 days 09:23:00 1
1 2 2022-02-12 2022-02-15 2 days 10:23:00 1
2 2 2022-02-05 2022-02-27 22 days 02:35:00 2
3 3 2022-02-04 2022-02-06 1 days 19:55:00 0
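To try the cross-join approach end to end, the question's two tables can be rebuilt roughly as follows. This is only a sketch under stated assumptions: the calendar is generated with pd.date_range, only the four dates shown as True in the question are treated as holidays (every omitted row is assumed to be False), the time column is left out because the count does not need it, and the names known_holidays, pairs, inside, counts and result are made up for illustration.

```python
import pandas as pd

# Intervals from the question (the time column is omitted; the count does not use it).
df = pd.DataFrame({
    "num_ID": [1, 2, 2, 3],
    "start_date": pd.to_datetime(["2022-02-14", "2022-02-12", "2022-02-05", "2022-02-04"]),
    "end_date": pd.to_datetime(["2022-02-15", "2022-02-15", "2022-02-27", "2022-02-06"]),
})

# Calendar: one row per day. Assumption: only the dates shown as True
# in the question are holidays; every omitted row is False.
df1 = pd.DataFrame({"date": pd.date_range("2022-01-01", "2023-12-31", freq="D")})
known_holidays = pd.to_datetime(["2022-01-01", "2022-02-15", "2022-02-16", "2023-12-31"])
df1["is_holiday"] = df1["date"].isin(known_holidays)

# Cross join each interval with each holiday date,
# keep the dates that fall inside the interval (both ends inclusive), then count.
pairs = df.merge(df1.loc[df1["is_holiday"], ["date"]], how="cross")
inside = pairs[pairs["date"].between(pairs["start_date"], pairs["end_date"])]
counts = inside.groupby(["start_date", "end_date"]).size().rename("total_holidays")

# Attach the counts; intervals without holidays get NaN, which becomes 0.
result = df.join(counts, on=["start_date", "end_date"])
result["total_holidays"] = result["total_holidays"].fillna(0).astype(int)
print(result)
```

Note that the intermediate frame has one row per (interval, holiday) pair, so its size grows with the product of the two tables.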
A uj5u.com user replied:
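If your data is small, a Cartesian join is fine. As the data grows it becomes inefficient, because you are comparing every row of one DataFrame against every row of the other. A better approach is to use some form of binary search to find the matches. conditional_join from pyjanitor provides an efficient way to do non-equi joins: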
# pip install pyjanitor
# you can install the dev version for the latest improvements
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor  # importing janitor registers conditional_join on DataFrame

# in this answer df holds the intervals and df2 is the calendar
df.start_date = pd.to_datetime(df.start_date)
df.end_date = pd.to_datetime(df.end_date)
df2.date = pd.to_datetime(df2.date)

# relevant columns
cols = [*df.columns, 'is_holiday']

out = (df
       .conditional_join(
           # keep only the holiday rows; is_holiday is boolean in the question
           # (compare with == "True" instead if your column holds strings)
           df2.loc[df2.is_holiday],
           ('start_date', 'date', '<='),
           ('end_date', 'date', '>='),
           how='inner')
       .loc(axis=1)[cols]
       .groupby(cols[:-1])
       .size()
       .rename('total_days')
      )
Merge back into the original DataFrame to get the final output:
(df
 .merge(out, how='left', on=cols[:-1])
 # fillna is faster on a Series
 # rows without any holiday are NaN after the left merge; set them to 0 as integers
 .assign(total_days=lambda df: df.total_days.fillna(0).astype(int))
)
num_ID start_date end_date time total_days
0 1 2022-02-14 2022-02-15 0 days 09:23:00 1
1 2 2022-02-12 2022-02-15 2 days 10:23:00 1
2 2 2022-02-05 2022-02-27 22 days 02:35:00 2
3 3 2022-02-04 2022-02-06 1 days 19:55:00 0
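With the dev version, you can pre-select the columns and also avoid merging back into the original DataFrame. Either way, for performance, avoid a cross join if you can.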
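To illustrate the binary-search idea mentioned in this answer without any join, here is a minimal sketch using NumPy's searchsorted on the sorted holiday dates. This is not how conditional_join is implemented, only one way the same count could be obtained. The sample data is rebuilt from the question under the assumption that every omitted calendar row is False, and the names known_holidays, holidays, lo and hi are made up for illustration.

```python
import numpy as np
import pandas as pd

# Intervals from the question (the time column is omitted; the count does not use it).
df = pd.DataFrame({
    "num_ID": [1, 2, 2, 3],
    "start_date": pd.to_datetime(["2022-02-14", "2022-02-12", "2022-02-05", "2022-02-04"]),
    "end_date": pd.to_datetime(["2022-02-15", "2022-02-15", "2022-02-27", "2022-02-06"]),
})

# Calendar: assumption is that only the dates shown as True in the question are holidays.
df2 = pd.DataFrame({"date": pd.date_range("2022-01-01", "2023-12-31", freq="D")})
known_holidays = pd.to_datetime(["2022-01-01", "2022-02-15", "2022-02-16", "2023-12-31"])
df2["is_holiday"] = df2["date"].isin(known_holidays)

# Sorted array containing only the holiday dates.
holidays = np.sort(df2.loc[df2["is_holiday"], "date"].to_numpy())

# Binary-search both endpoints of every interval in the holiday array.
# side="left" for the start and side="right" for the end keep both endpoints inclusive.
lo = np.searchsorted(holidays, df["start_date"].to_numpy(), side="left")
hi = np.searchsorted(holidays, df["end_date"].to_numpy(), side="right")

# The number of holidays inside each interval is the distance between the two positions.
df["total_days"] = hi - lo
print(df)
```

Each interval costs two binary searches over the holiday array, so no intermediate table of pairs is built and no merge back is needed.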
