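What is the fastest way to repeatedly resample time series data of the same shape?
Question: I have 30 years of hourly time series data that I want to resample to yearly frequency by calendar year (resample rule "AS"). I need both the mean and the sum for each year. There are no missing hours. I then need to do this more than 10,000 times. In the script I am writing, this resampling step takes the most time and is the limiting factor in optimizing the run time. Because of leap years, we cannot resample in fixed blocks of 8760 hours, since every fourth year has 8784 hours.
Example code: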
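To make the leap-year point concrete, here is a small sketch that counts the hours in each calendar year of the example range. It assumes timezone-naive timestamps (no daylight-saving shifts), as in the example code; the name `hours_per_year` is introduced here only for illustration.

```python
import calendar

# Hours in each calendar year of the example range (2020 to 2050 inclusive).
# Assumes timezone-naive data, so every day has exactly 24 hours.
hours_per_year = {
    year: (366 if calendar.isleap(year) else 365) * 24
    for year in range(2020, 2051)
}

print(hours_per_year[2020])  # leap year: 8784
print(hours_per_year[2021])  # common year: 8760
```

Since the block lengths differ, a plain reshape into rows of 8760 hours would drift out of alignment after the first leap year.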
import pandas as pd
import numpy as np
import time
hourly_timeseries = pd.DataFrame(
    index=pd.date_range(
        pd.Timestamp(2020, 1, 1, 0, 0),
        pd.Timestamp(2050, 12, 31, 23, 30),
        freq="60min")
)
hourly_timeseries['value'] = np.random.rand(len(hourly_timeseries))
# Constraints imposed by wider problem:
# 1. each hourly_timeseries is unique
# 2. each hourly_timeseries is the same shape and has the same datetimeindex
# 3. a maximum of 10 timeseries can be grouped as columns in dataframe
start_time = time.perf_counter()
for num in range(100):  # setting as 100 so it runs faster, this is 10,000 in practice
    yearly_timeseries_mean = hourly_timeseries.resample('AS').mean()  # resample by calendar year
    yearly_timeseries_sum = hourly_timeseries.resample('AS').sum()
finish_time = time.perf_counter()
# Elapsed time is finish minus start (the reverse order prints a negative value)
print(f"Ran in {finish_time - start_time:0.4f} seconds")
>>> Ran in 3.0516 seconds
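Solutions I have explored:
- I gained some speed by aggregating multiple time series into one dataframe and resampling them at the same time; however, due to restrictions in the set-up of the wider problem I am solving, I am limited to 10 time series per dataframe. The question therefore remains: if you know the shape of the array will always be the same, is there a way to dramatically speed up the resampling of time series data?
- I also looked into using numba, but it does not make pandas functions any faster.
Possible solutions that sound reasonable but that I could not find after researching:
- Resampling a 3D array of time series data using numpy
- Caching the index that is being resampled, and then somehow doing every resample after the first one much faster
Thanks for your help :)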
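Reply from a uj5u.com user:
As I wrote in the comments, I precomputed the row indices for each year and used them to compute the yearly sums much faster.
Next, I removed the unnecessary second summation for the mean and instead computed the mean for each year as sum / length_of_indices.
For N=1000, this is about 9 times faster.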
import pandas as pd
import numpy as np
import time
hourly_timeseries = pd.DataFrame(
    index=pd.date_range(
        pd.Timestamp(2020, 1, 1, 0, 0),
        pd.Timestamp(2050, 12, 31, 23, 30),
        freq="60min")
)
hourly_timeseries['value'] = np.random.rand(len(hourly_timeseries))
# Constraints imposed by wider problem:
# 1. each hourly_timeseries is unique
# 2. each hourly_timeseries is the same shape and has the same datetimeindex
# 3. a maximum of 10 timeseries can be grouped as columns in dataframe
start_time = time.perf_counter()
for num in range(100):  # setting as 100 so it runs faster, this is 10,000 in practice
    yearly_timeseries_mean = hourly_timeseries.resample('AS').mean()  # resample by calendar year
    yearly_timeseries_sum = hourly_timeseries.resample('AS').sum()
finish_time = time.perf_counter()
print(f"Ran in {finish_time - start_time:0.4f} seconds")
start_time = time.perf_counter()
# Precompute, once, the row positions that belong to each calendar year
events_years = hourly_timeseries.index.year
unique_years = np.sort(np.unique(events_years))
indices_per_year = [np.where(events_years == year)[0] for year in unique_years]
len_indices_per_year = np.array([len(year_indices) for year_indices in indices_per_year])
for num in range(100):  # setting as 100 so it runs faster, this is 10,000 in practice
    temp = hourly_timeseries.values
    # np.sum without an axis adds up every selected element, so this assumes a single column
    yearly_timeseries_sum2 = np.array([np.sum(temp[year_indices]) for year_indices in indices_per_year])
    yearly_timeseries_mean2 = yearly_timeseries_sum2 / len_indices_per_year
finish_time = time.perf_counter()
print(f"Ran in {finish_time - start_time:0.4f} seconds")
assert np.allclose(yearly_timeseries_sum.values.flatten(), yearly_timeseries_sum2)
assert np.allclose(yearly_timeseries_mean.values.flatten(), yearly_timeseries_mean2)
Ran in 0.9950 seconds
Ran in 0.1386 seconds
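A possible variation on the same caching idea, offered as a sketch rather than a benchmarked result: because the index is sorted, each calendar year is one contiguous block of rows, so it should be enough to cache the row offset where each year starts and let `np.add.reduceat` sum the blocks along the time axis. This also keeps columns separate, whereas `np.sum` without an axis collapses them. The names `year_starts`, `hours_in_year` and `yearly_sum_and_mean`, and the two-column test frame, are assumptions introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Same hourly range as the example; two columns to show the multi-column case.
index = pd.date_range("2020-01-01 00:00", "2050-12-31 23:00", freq="60min")
rng = np.random.default_rng(0)
hourly = pd.DataFrame(rng.random((len(index), 2)), index=index, columns=["a", "b"])

# Computed once and reused: the row offset where each calendar year starts.
# Assumes the index is sorted, so each year is one contiguous block of rows.
years = index.year.to_numpy()
year_starts = np.flatnonzero(np.r_[True, years[1:] != years[:-1]])
hours_in_year = np.diff(np.r_[year_starts, len(years)])


def yearly_sum_and_mean(values):
    """Return (sums, means) per calendar year; time must be on axis 0."""
    sums = np.add.reduceat(values, year_starts, axis=0)
    # Reshape the counts so they broadcast over any trailing axes.
    counts = hours_in_year.reshape((-1,) + (1,) * (values.ndim - 1))
    return sums, sums / counts


sums, means = yearly_sum_and_mean(hourly.to_numpy())
print(sums.shape)  # one row per year, one column per series
```

Because the function only requires time on axis 0, a 3D stack such as `np.stack(list_of_arrays, axis=1)` can be passed in one call, which is one way to approach the "3D array" idea from the question. Whether this beats the per-year index lists on a given machine would need to be timed.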
