Pandas: intersection of a DatetimeIndex and an IntervalIndex

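I have a very long time series, and I need to set the values that fall within certain event intervals to np.nan. measures is a datetime-indexed DataFrame, and events is a separate DatetimeIndex.

measures looks like this: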

| index               | measure  |
|---------------------|----------|
| 1970-01-01 00:00:15 | 0.471331 |
| 1970-01-01 00:02:37 | 0.069177 |
| 1970-01-01 00:03:59 | 0.955357 |
| 1970-01-01 00:06:17 | 0.107815 |
| 1970-01-01 00:06:24 | 0.046558 |
| 1970-01-01 00:06:25 | 0.056558 |
| 1970-01-01 00:08:12 | 0.837405 |

For example, if there were a single event at timestamp 1970-01-01 00:06:21 and the interval for removing values were +/- 5 seconds, the output would be:

| index               | measure  |
|---------------------|----------|
| 1970-01-01 00:00:15 | 0.471331 |
| 1970-01-01 00:02:37 | 0.069177 |
| 1970-01-01 00:03:59 | 0.955357 |
| 1970-01-01 00:06:17 | np.nan   |
| 1970-01-01 00:06:24 | np.nan   |
| 1970-01-01 00:06:25 | np.nan   |
| 1970-01-01 00:08:12 | 0.837405 |

Currently I iterate over the events using .loc:

for i in range(events.shape[0]):
    measures.loc[events[i] - pd.Timedelta("4min"):
                 events[i] + pd.Timedelta("1min")
        ] = np.nan

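This works, but it takes too long because both frames are large (events: 10k rows, measures: 1.5m rows). For the same reason, I cannot construct a boolean index like this: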

measure_index = measures.index.to_numpy()
left_bounds = (events - pd.Timedelta("4min")).to_numpy()
right_bounds = (events + pd.Timedelta("1min")).to_numpy()
# The following product wouldn't fit in memory even with boolean dtype.
left_bool_array = measure_index >= left_bounds.reshape((-1, 1))
right_bool_array = measure_index <= right_bounds.reshape((-1, 1))
# Both arrays have shape (n_events, n_measures); collapse over the events axis.
mask = (left_bool_array & right_bool_array).any(axis=0)
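The memory problem noted in the comment could in principle be bounded by comparing only a slice of the events at a time. The following is a minimal sketch of that idea, not a recommendation: the function name chunked_interval_mask and the chunk_size parameter are made up for illustration, and the total number of comparisons is unchanged, so it would likely still be slow at 10k x 1.5m.

```python
import numpy as np
import pandas as pd


def chunked_interval_mask(index, left_bounds, right_bounds, chunk_size=64):
    """Return a boolean array: True where index[i] lies in any [left, right]."""
    values = index.to_numpy()
    left = np.asarray(left_bounds)
    right = np.asarray(right_bounds)
    mask = np.zeros(len(values), dtype=bool)
    for start in range(0, len(left), chunk_size):
        lo = left[start:start + chunk_size].reshape(-1, 1)
        hi = right[start:start + chunk_size].reshape(-1, 1)
        # The temporary arrays have at most chunk_size rows, not n_events rows.
        mask |= ((values >= lo) & (values <= hi)).any(axis=0)
    return mask


index = pd.DatetimeIndex([
    "1970-01-01 00:00:15", "1970-01-01 00:02:37", "1970-01-01 00:03:59",
    "1970-01-01 00:06:17", "1970-01-01 00:06:24", "1970-01-01 00:06:25",
    "1970-01-01 00:08:12",
])
events = pd.DatetimeIndex(["1970-01-01 00:06:21"])
delta = pd.Timedelta("5s")
mask = chunked_interval_mask(
    index, (events - delta).to_numpy(), (events + delta).to_numpy()
)
```

Here the +/- 5 second window from the example is used instead of the 4 min / 1 min bounds, so mask marks the three timestamps around 00:06:21.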

Left-joining events onto measures or reindexing events is not feasible either, because both take too long.

Then I came across pd.IntervalIndex:

left_bound = events - pd.Timedelta("4min")
right_bound = events + pd.Timedelta("1min")
interval_index=pd.IntervalIndex.from_arrays(left_bound,right_bound)

IntervalIndex has a .contains() method, which takes a scalar and returns "a boolean mask whether the value is contained in the Intervals". For my use case, however, I would need to loop through the measures frame and sum the boolean array for each row. I'm looking for a method like this:

pandas.IntervalIndex.intersect(input: array_like) -> boolean_array (same shape as input)

With each element in the output representing whether the corresponding input value is in any of the intervals.
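A function with roughly that signature can be sketched with np.searchsorted instead of a pandas method. This is only an illustration under stated assumptions: in_any_interval is a hypothetical helper (not part of pandas), intervals are treated as closed on both sides, and every left bound is assumed to be less than or equal to its right bound. The idea is to count, for each value, how many intervals have started and how many have already ended; the value is inside some interval exactly when the first count is larger.

```python
import numpy as np
import pandas as pd


def in_any_interval(values, left_bounds, right_bounds):
    """Boolean array, same shape as values: True where a value is in any closed interval."""
    values = np.asarray(values)
    left_sorted = np.sort(np.asarray(left_bounds))
    right_sorted = np.sort(np.asarray(right_bounds))
    # Number of intervals that have started at or before each value (left <= value).
    started = np.searchsorted(left_sorted, values, side="right")
    # Number of intervals that ended strictly before each value (right < value).
    ended = np.searchsorted(right_sorted, values, side="left")
    return started > ended


measures = pd.Series(
    [0.471331, 0.069177, 0.955357, 0.107815, 0.046558, 0.056558, 0.837405],
    index=pd.DatetimeIndex([
        "1970-01-01 00:00:15", "1970-01-01 00:02:37", "1970-01-01 00:03:59",
        "1970-01-01 00:06:17", "1970-01-01 00:06:24", "1970-01-01 00:06:25",
        "1970-01-01 00:08:12",
    ]),
)
events = pd.DatetimeIndex(["1970-01-01 00:06:21"])
delta = pd.Timedelta("5s")

mask = in_any_interval(measures.index, events - delta, events + delta)
masked = measures.mask(mask)  # values inside any interval become NaN
```

Neither the values nor the intervals need to be sorted beforehand, and overlapping intervals are handled because only counts are compared. The cost is dominated by sorting the bounds and two binary searches per value.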

Similar but different questions:

Interval lookup with interval index: Fastest way to merge pandas dataframe on ranges
Quite similar, but the suggested solutions (merges) are not applicable: Match IntervalIndex as part of a MultiIndex
If only I had the same indexes and a single interval per row to look up: Best way to join / merge by range in pandas

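Reply from a uj5u.com user:

If your measures data is already sorted (or if sorting it once is not too time-consuming), you could consider using bisect.

Here is a ~~approximate~~ more complete solution:

Check where each element of events could be "inserted" into measures
Check whether the timestamps on either side of this "insertion point" are within 5 seconds
If so, set them to NaN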

import bisect
import numpy as np
import pandas as pd

def bisect_loop(window=pd.Timedelta("5s")):
    for event in events:
        # Position at which the event would be inserted into the sorted index.
        bisect_point = bisect.bisect(measures.index, event)
        # Walk left while timestamps are strictly within the window before the event.
        lower = bisect_point - 1
        while lower >= 0 and event - measures.index[lower] < window:
            measures.iloc[lower] = np.nan
            lower -= 1
        # Walk right while timestamps are strictly within the window after the event.
        higher = bisect_point
        while higher < len(measures.index) and measures.index[higher] - event < window:
            measures.iloc[higher] = np.nan
            higher += 1
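The same bisect idea can also be written without a Python-level loop, since np.searchsorted performs the binary search for all events at once. The sketch below is one possible vectorized variant, not the answer's own code: nan_near_events is a hypothetical helper, it assumes measures.index is sorted ascending, and it treats the window as closed on both sides (unlike the strict comparison in the loop).

```python
import numpy as np
import pandas as pd


def nan_near_events(measures, events, window):
    """Return a copy of measures with rows within +/- window of any event set to NaN."""
    index_values = measures.index.to_numpy()
    left = (events - window).to_numpy()
    right = (events + window).to_numpy()
    # First position inside each window, and first position past its end.
    starts = np.searchsorted(index_values, left, side="left")
    stops = np.searchsorted(index_values, right, side="right")
    # Difference array: +1 where a window opens, -1 where it closes.
    coverage = np.zeros(len(index_values) + 1, dtype=np.int64)
    np.add.at(coverage, starts, 1)
    np.add.at(coverage, stops, -1)
    # A row is covered when at least one window is open at its position.
    covered = np.cumsum(coverage[:-1]) > 0
    result = measures.copy()
    result.loc[covered] = np.nan
    return result


measures = pd.DataFrame(
    {"measure": [0.471331, 0.069177, 0.955357, 0.107815, 0.046558, 0.056558, 0.837405]},
    index=pd.DatetimeIndex([
        "1970-01-01 00:00:15", "1970-01-01 00:02:37", "1970-01-01 00:03:59",
        "1970-01-01 00:06:17", "1970-01-01 00:06:24", "1970-01-01 00:06:25",
        "1970-01-01 00:08:12",
    ]),
)
events = pd.DatetimeIndex(["1970-01-01 00:06:21"])
result = nan_near_events(measures, events, pd.Timedelta("5s"))
```

The running sum over the difference array is what makes overlapping windows safe: a row stays covered until every window that opened before it has closed.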

Here are some timings on a dummy dataset with 150 measures and 10 events:

df = pd.DataFrame({'year': [2000]*150, 'month': [2]*150, 'day': [12]*150, 'hour': np.random.choice(range(1), 150), 'minute': np.random.choice(range(60), 150), 'second': np.random.choice(range(60), 150)})

timestamps = pd.to_datetime(df)
measures = pd.concat([timestamps, pd.Series(np.random.rand(150))], axis=1)
measures = measures.set_index(0)
measures = measures.sort_index()

df = pd.DataFrame({'year': [2000]*150, 'month': [2]*150, 'day': [12]*150, 'hour': np.random.choice(range(24), 150), 'minute': np.random.choice(range(60), 150), 'second': np.random.choice(range(60), 150)})
events = pd.to_datetime(df).sample(10).reset_index(drop=True)

%timeit op_loop()  # This is your loc-based approach that is working
8.74 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit bisect_loop()
3.22 ms ± 45.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Reply from a uj5u.com user:

This can be solved with staircase, a package built on pandas and numpy for working with (mathematical) step functions.

Setup:

measures = pd.Series(
    [0.471331, 0.069177, 0.955357, 0.107815, 0.046558, 0.056558, 0.837405],
    index = pd.DatetimeIndex([
        pd.Timestamp("1970-1-1 00:00:15"),
        pd.Timestamp("1970-1-1 00:02:37"),
        pd.Timestamp("1970-1-1 00:03:59"),
        pd.Timestamp("1970-1-1 00:06:17"),
        pd.Timestamp("1970-1-1 00:06:24"),
        pd.Timestamp("1970-1-1 00:06:25"),
        pd.Timestamp("1970-1-1 00:08:12"),
    ])
)

events = pd.DatetimeIndex(['1970-01-01 00:06:21'])

Solution:

import pandas as pd
import staircase as sc

sf = sc.Stairs(start=measures.index, end = measures.index[1:], value=measures.values)
mask = sc.Stairs(start=events - pd.Timedelta('5 seconds'), end=events + pd.Timedelta('5 seconds'))
masked = sf.mask(mask)
result = masked.sample(measures.index, include_index=True)

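Why it works

Line 1: creates a step function made of intervals whose endpoints are the index values of measures. The last interval, starting at 1970-01-01 00:08:12, has no end point and is infinitely long.

Line 2: creates a step function in which the times in events are the centers of the intervals, with endpoints +/- 5 seconds from the center. Overlapping intervals are not a problem.

Line 3: masks the first step function with the second: wherever the second step function is non-zero, the values of the first step function are set to NaN.

Line 4: evaluates the masked step function at the timestamps in measures.index.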

Please credit the source when reposting. Link to this article: https://www.uj5u.com/net/323503.html

Tags: python pandas performance numpy datetimeindex