Interpolating and appending the output to a list in a for loop takes forever

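I have attached a test Excel file named "df2.xlsx" here: https://docs.google.com/spreadsheets/d/1U55lXyZSYguiQUH0AOB_v8yhKcbMGNQs/edit?usp=sharing&ouid=102781316443126205856&rtpof=true&sd=true It has 58673 rows. I have imported it as a DataFrame and use the code below to compute "D50" by linear interpolation with interp1d. D50 is the 50th percentile value, which is why I need to interpolate. The columns I interpolate over are con13c, con12c, con2c, con3c, ......, con11c, con14c. The column indices of con13c and con14c are 17 and 29. I use append() to store the output in an empty list. However, the code is very slow.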

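The main Excel/text file will have 4928526 rows rather than the 58673 rows of the attached file, and computing D50 for the main file takes more than 20 minutes. Please let me know whether there is a way to make this faster by reading the DataFrame chunk by chunk and running it on multiple processors. The main Excel file will contain 100 different TS values, each with 58673 rows. In the test file "df2.xlsx", all of the data belongs to a single TS. Thanks.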

import pandas as pd
import numpy as np
from scipy.interpolate import interp1d


dt=pd.read_excel('df2.xlsx', index_col=0) 

# check column index
dt.columns.get_loc("con14c")
x=[0.00001, 0.00004675, 0.000088,   0.000177,   0.000354,   0.000707,   0.001414,   0.002828,   
                   0.005657,    0.011314,   0.022627,   0.045254,   0.6096]
x=np.array(x)
xx=np.log(x)
dfs =[]
for i in range(0, len(dt)): # loop through the rows of dt
    y1=dt.iloc[i,17:30]

    y1=np.array(y1,dtype=float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
    f = interp1d( y1,xx, kind='linear', bounds_error=False, fill_value=np.log(y1[0])) #fill_value='extrapolate'
    x_new=np.exp(f(.5))
    print(np.exp(x_new))
    dfs.append(x_new)
dt['D50']=dfs

Answer from a uj5u.com user:

I ran a test on my PC, and a couple of simple changes cut the run time to under 30% of the original: it took about 9 seconds before and only about 2 seconds now.

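Remove the print call.
Don't index the pd.DataFrame inside the loop, because that is very slow. Convert it to a NumPy array first and index into that instead: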

# outside the for loop
dt_arr = dt.values

# ... other code

# inside the for loop
y1 = dt_arr[i,17:30]

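Since you are only interested in the value at 0.5, you can also skip interp1d and do the linear interpolation by hand. The loop below computes both versions (dfs1 with interp1d, dfs2 by hand) so you can compare them: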

dfs1 = []
dfs2 = []
for i in range(0, len(dt)): # loop through the rows of dt
    y1=dt_arr[i, 17:30]

    y1=np.array(y1,dtype=float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
    f = interp1d( y1,xx, kind='linear', bounds_error=False, fill_value=np.log(y1[0])) #fill_value='extrapolate'
    x_new=np.exp(f(.5))
    dfs1.append(x_new)
    
    # I don't know if your data is sorted, if so you can ignore this part
    sort_idx = np.argsort(y1)
    xx_sorted = xx[sort_idx]
    y1_sorted = y1[sort_idx]
    # I think your fill value is a bit weird as you are using same values for both ends. You might want to check that
    if y1_sorted[-1] < 0.5 or y1_sorted[0] > 0.5:
        dfs2.append(y1[0])
    else:
        idx = np.argmax(y1_sorted > 0.5)
        x0 = xx_sorted[idx-1]
        x1 = xx_sorted[idx]
        z0 = y1_sorted[idx-1]
        z1 = y1_sorted[idx]
        dfs2.append(np.exp(x0 + (0.5-z0)*(x1-x0)/(z1-z0)))
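
The manual interpolation above still visits one row per Python iteration. As a rough sketch (not part of the original answer), the same arithmetic can be applied to all rows at once with NumPy array operations. The function names d50_vectorized and d50_in_chunks and the chunk_rows parameter are made up for illustration. The sketch assumes the 13 columns contain no NaN values, and it keeps the fill value y1[0] from the loop version for rows where 0.5 falls outside the data range.

```python
import numpy as np


def d50_vectorized(y, xx, target=0.5):
    """Interpolate the size at `target` for every row of y at once.

    y  : (n_rows, n_cols) array, e.g. dt.values[:, 17:30]
    xx : (n_cols,) array of log sizes, e.g. np.log(x)
    Rows whose values do not bracket `target` get y[:, 0] as the fill
    value, which is what np.exp(np.log(y1[0])) reduces to in the loop.
    Assumption: no NaN handling is attempted here.
    """
    y = np.asarray(y, dtype=float)
    xx = np.asarray(xx, dtype=float)
    n_rows, n_cols = y.shape

    # Sort each row and carry the matching log sizes along with it.
    order = np.argsort(y, axis=1, kind="stable")
    y_sorted = np.take_along_axis(y, order, axis=1)
    xx_sorted = xx[order]

    # First column whose value reaches the target, kept >= 1 so idx - 1 is valid.
    idx = np.clip((y_sorted >= target).argmax(axis=1), 1, n_cols - 1)
    rows = np.arange(n_rows)
    z0, z1 = y_sorted[rows, idx - 1], y_sorted[rows, idx]
    x0, x1 = xx_sorted[rows, idx - 1], xx_sorted[rows, idx]

    # Linear interpolation weight, guarded against z1 == z0.
    denom = z1 - z0
    safe_denom = np.where(denom > 0, denom, 1.0)
    frac = np.where(denom > 0, (target - z0) / safe_denom, 0.0)
    interpolated = np.exp(x0 + frac * (x1 - x0))

    # Use the interpolated value only where the row brackets the target.
    in_range = (y_sorted[:, 0] <= target) & (y_sorted[:, -1] >= target)
    return np.where(in_range, interpolated, y[:, 0])


def d50_in_chunks(y, xx, chunk_rows=100_000):
    """Apply d50_vectorized block by block to bound peak memory use."""
    parts = [d50_vectorized(y[i:i + chunk_rows], xx)
             for i in range(0, len(y), chunk_rows)]
    return np.concatenate(parts) if parts else np.empty(0)
```

With the variables from the question, this would be used as dt['D50'] = d50_in_chunks(dt.values[:, 17:30], xx). Because each block is independent, the blocks could also be handed to worker processes (for example with concurrent.futures.ProcessPoolExecutor), though it is worth timing the single-process version first, since the per-row Python loop is probably the main cost.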
