Take this sample DataFrame as an example:
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3],
    'Date_min': ["2021-01-01", "2021-01-20", "2021-01-28", "2021-01-01", "2021-01-02"],
    'Date_max': ["2021-01-23", "2021-12-01", "2021-09-01", "2021-01-15", "2021-01-09"],
})
# pd.to_datetime replaces the unit-less astype('datetime64'), which pandas 2.x rejects
df["Date_min"] = pd.to_datetime(df["Date_min"])
df["Date_max"] = pd.to_datetime(df["Date_max"])
   ID   Date_min   Date_max
0   1 2021-01-01 2021-01-23
1   1 2021-01-20 2021-12-01
2   2 2021-01-28 2021-09-01
3   2 2021-01-01 2021-01-15
4   3 2021-01-02 2021-01-09
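For each ID, I want to check whether any of its date ranges overlap. I can do it with a loop like the one below, but it filters the whole DataFrame once per row, so it is slow on really large DataFrames: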
L_output = []
for index, row in df.iterrows():
    if len(df[(df["ID"] == row["ID"]) & (df["Date_min"] <= row["Date_min"]) &
              (df["Date_max"] >= row["Date_min"])].index) > 1:
        print("overlapping date ranges for ID %d" % row["ID"])
        L_output.append(row["ID"])
Output:
overlapping date ranges for ID 1
Do you know a better way to check that ID 1 has overlapping date ranges?
Expected output:
[1]
Reply from a uj5u.com user:
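You can convert the datetime objects to timestamps, then construct pd.Interval objects and iterate over a generator of all possible pairs of intervals for each ID: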
from itertools import combinations
import pandas as pd

def group_has_overlap(group):
    timestamps = group[["Date_min", "Date_max"]].values.tolist()
    for t1, t2 in combinations(timestamps, 2):
        i1 = pd.Interval(t1[0], t1[1])
        i2 = pd.Interval(t2[0], t2[1])
        if i1.overlaps(i2):
            return True
    return False

for ID, group in df.groupby("ID"):
    print(ID, group_has_overlap(group))
The output is:
1 True
2 False
3 False
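Checking every pair grows quadratically with the number of ranges per ID. As a rough sketch of an alternative (not part of the answer above), you could sort each ID's ranges by start date and compare every start with the latest end seen so far. The function name ids_with_overlap is made up for illustration, and the sketch assumes ranges that merely touch at an endpoint count as overlapping, like the <= / >= test in the question's loop:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3],
    "Date_min": pd.to_datetime(["2021-01-01", "2021-01-20", "2021-01-28", "2021-01-01", "2021-01-02"]),
    "Date_max": pd.to_datetime(["2021-01-23", "2021-12-01", "2021-09-01", "2021-01-15", "2021-01-09"]),
})

def ids_with_overlap(frame):
    # Hypothetical helper: returns the IDs that have at least one overlapping pair of ranges.
    ordered = frame.sort_values(["ID", "Date_min"])
    # Latest end date among the earlier rows of the same ID (NaT for the first row of each ID).
    prev_max_end = ordered.groupby("ID")["Date_max"].transform(lambda s: s.cummax().shift())
    # A range overlaps an earlier one if it starts on or before that latest end date.
    overlaps = ordered["Date_min"] <= prev_max_end
    return ordered.loc[overlaps, "ID"].unique().tolist()

overlapping_ids = ids_with_overlap(df)
print(overlapping_ids)
```

On the sample data this should give the list asked for in the question. Comparisons against NaT evaluate to False, so the first range of each ID is never flagged.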
Reply from a uj5u.com user:
Try:
- create a "Dates" column that holds, for each row, the list of dates from "Date_min" to "Date_max"
- explode the "Dates" column
- get the duplicated rows
df["Dates"] = df.apply(lambda row: pd.date_range(row["Date_min"], row["Date_max"]), axis=1)
df = df.explode("Dates").drop(["Date_min", "Date_max"], axis=1)
#if you want all the ID and Dates that are duplicated/overlap
>>> df[df.duplicated()]
   ID      Dates
1   1 2021-01-20
1   1 2021-01-21
1   1 2021-01-22
1   1 2021-01-23
#if you just want a count of overlapping dates per ID
>>> df.groupby("ID").agg(lambda x: x.duplicated().sum())
    Dates
ID
1       4
2       0
3       0
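To get from the duplicated rows to the plain list of IDs the question expects, something along these lines might work. It is only a sketch: it builds the one-row-per-day frame with a list comprehension rather than apply/explode (my substitution, same idea), and the names exploded, dup_ids and dup_counts are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3],
    "Date_min": pd.to_datetime(["2021-01-01", "2021-01-20", "2021-01-28", "2021-01-01", "2021-01-02"]),
    "Date_max": pd.to_datetime(["2021-01-23", "2021-12-01", "2021-09-01", "2021-01-15", "2021-01-09"]),
})

# One (ID, day) row for every calendar day covered by each range.
rows = [
    (id_, day)
    for id_, start, end in zip(df["ID"], df["Date_min"], df["Date_max"])
    for day in pd.date_range(start, end)
]
exploded = pd.DataFrame(rows, columns=["ID", "Dates"])

# A repeated (ID, day) pair means two ranges of that ID cover the same day.
is_dup = exploded.duplicated()
dup_ids = exploded.loc[is_dup, "ID"].unique().tolist()
dup_counts = is_dup.groupby(exploded["ID"]).sum()

print(dup_ids)
print(dup_counts)
```

Keep in mind that this creates one row per day in every range, so memory use grows with the length of the ranges, not just with the number of rows in the original DataFrame.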
