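I tried the following code to find the ranges of one dataframe that are not within the ranges of another dataframe. However, it takes more than a day on large files, because the last two for loops compare every row against every row. Each of my 24 dataframes has around 10^8 rows. Is there an efficient alternative to the approach below?
See this thread for a better understanding of my I/O: Return the range of a dataframe not within a range of another dataframe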
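My approach: I first created tuple pairs from (df1['first.start'], df1['first.end']) and (df2['first.start'], df2['first.end']) so that I could apply the range() function. After that, I set a condition for whether df1_ranges lies within df2_ranges. The edge case here is df1['first.start'] = df1['first.end']. I collected the filtered indices from the iteration and then passed them to df1.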
# Flatten the start/end columns of each dataframe into one list
df2_lst = []
for i, j in zip(temp_df2['first.start'], temp_df2['first.end']):
    df2_lst.append(i)
    df2_lst.append(j)
df1_lst = []
for i, j in zip(df1['first.start'], df1['first.end']):
    df1_lst.append(i)
    df1_lst.append(j)
def range_subset(range1, range2):
    """Whether range1 is a subset of range2."""
    if not range1:
        return True  # empty range is a subset of anything
    if not range2:
        return False  # non-empty range can't be a subset of empty range
    if len(range1) > 1 and range1.step % range2.step:
        return False  # must have a single value or integer multiple step
    return range1.start in range2 and range1[-1] in range2
##### FUNCTION FOR CREATING CHUNKS OF LISTS ####
def chunks(lst, n):
    """Yield the first two items of each successive n-sized chunk of lst."""
    for i in range(0, len(lst), n):
        yield lst[i], lst[i + 1]
df1_lst2 = list(chunks(df1_lst, 2))
df2_lst2 = list(chunks(df2_lst, 2))
indices = []
for idx, i in enumerate(df1_lst2):  # main list
    x, y = i
    for j in df2_lst2:  # filter list
        m, n = j
        # check whether the main range lies within the filter range
        if (x != y) & (range_subset(range(x, y), range(m, n))):
            indices.append(idx)  # collect the filtered indices
df1.iloc[indices]
Reply from a uj5u.com user:
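If n and m are the numbers of rows in df1 and df2, any algorithm that checks every range in df1 against every range in df2 needs n * m comparisons. The problems with the code you posted are that (a) it has too many intermediate steps and (b) it uses slow Python loops. If you switch to numpy broadcasting, which runs highly optimized C loops under the hood, it will be much faster.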
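To make the broadcasting idea concrete, here is a minimal sketch on toy data. The four small arrays are hypothetical values chosen only for illustration, and the sketch assumes every range satisfies start <= end, in which case two comparisons per pair are enough to test containment.

```python
import numpy as np

# Hypothetical toy ranges, not taken from the question's data
s1 = np.array([1, 5, 20])          # starts of three "df1" ranges
e1 = np.array([3, 12, 25])         # ends of three "df1" ranges
s2 = np.array([0, 18])[:, None]    # starts of two "df2" ranges, shape (2, 1)
e2 = np.array([10, 30])[:, None]   # ends of two "df2" ranges, shape (2, 1)

# Broadcasting (2, 1) against (3,) gives a (2, 3) boolean matrix:
# inside[j, i] is True when df1 range i lies within df2 range j.
inside = (s2 <= s1) & (e1 <= e2)

# A df1 range counts as contained if any df2 range surrounds it.
contained = inside.any(axis=0)
print(contained)
```

No Python-level loop appears here: each comparison is evaluated for all pairs at once inside numpy.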
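The downside of numpy broadcasting is memory: it creates a comparison matrix of n * m bytes, and at the size of your problem that may exhaust your computer's memory. We can mitigate this by processing df1 in chunks to lower the memory usage.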
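As a rough back-of-the-envelope check, the sketch below estimates the size of one boolean comparison matrix, using the n, m and chunk_size values from this answer's sample code. It is only an estimate: the combined comparison expression creates several temporaries of this size, so peak usage is likely a few times higher.

```python
import numpy as np

n, m = 10_000_000, 1000    # row counts from the sample data in this answer
chunk_size = 100_000       # df1 rows processed per step in this answer

bytes_per_bool = np.dtype(bool).itemsize   # numpy stores one bool per byte

# One full n x m comparison matrix versus one chunk_size x m matrix
full_matrix_gb = n * m * bytes_per_bool / 1e9
chunk_matrix_mb = chunk_size * m * bytes_per_bool / 1e6

print(f"full matrix: {full_matrix_gb:.0f} GB, per chunk: {chunk_matrix_mb:.0f} MB")
```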
import numpy as np
import pandas as pd

# Sample data
def random_dataframe(size):
    # Cumulative sum of positive steps gives increasing start/end values
    a = np.random.randint(1, 100, 2*size).cumsum()
    return pd.DataFrame({
        'first.start': a[::2],
        'first.end': a[1::2]
    })
n, m = 10_000_000, 1000
np.random.seed(42)
df1 = random_dataframe(n)
df2 = random_dataframe(m)
# ---------------------------
# Prepare the Start and End time of df2 for comparison
# [:, None] adds a second dimension to the array, which is necessary
# for array broadcasting
s2 = df2['first.start'].to_numpy()[:, None]
e2 = df2['first.end'].to_numpy()[:, None]
# A chunk_size that is too small or too big will lower performance.
# Experiment to find a sweet spot
chunk_size = 100_000
offset = 0
mask = []
while offset < len(df1):
    s1 = df1['first.start'].to_numpy()[offset:offset + chunk_size]
    e1 = df1['first.end'].to_numpy()[offset:offset + chunk_size]
    mask.append(
        ((s2 <= s1) & (s1 <= e2) & (s2 <= e1) & (e1 <= e2)).any(axis=0)
    )
    offset += chunk_size  # advance to the next chunk of df1
mask = np.hstack(mask)
The code above took 30 seconds on my computer. Result:
df1[mask] # ranges in df1 that are completely surrounded by a range in df2
df1[~mask] # ranges in df1 that are NOT completely surrounded by any range in df2
By adjusting the comparison, you can also check for overlapping ranges.
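One possible way to adjust the comparison for overlaps is sketched below. The helper name overlaps_any and the toy arrays are hypothetical and not part of the original answer. The sketch assumes closed ranges with start <= end, so that two ranges overlap exactly when each one starts no later than the other ends.

```python
import numpy as np

def overlaps_any(s1, e1, s2, e2):
    """Return a mask over the (s1, e1) ranges: True where a range overlaps
    at least one (s2, e2) range. Hypothetical helper, closed ranges assumed."""
    s1, e1 = np.asarray(s1), np.asarray(e1)
    s2 = np.asarray(s2)[:, None]   # shape (m, 1) for broadcasting
    e2 = np.asarray(e2)[:, None]
    # Two closed ranges overlap when each starts before the other ends
    return ((s1 <= e2) & (s2 <= e1)).any(axis=0)

# Toy data chosen only for illustration
overlap_mask = overlaps_any(
    s1=[1, 12, 40], e1=[3, 20, 45],
    s2=[0, 18], e2=[10, 30],
)
print(overlap_mask)
```

For large inputs, the same comparison could replace the containment test inside the chunked loop so that memory stays bounded.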
