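This is my first question in a long time; I recently picked Python back up at work. I've been using pandas to clean/prepare some data, and I've found that one particular part of my code takes a long time to run (about 8 minutes) when the function is applied to a smaller sample (500,000 rows) of the total data (~30,000,000 rows). My thinking is that I've written something that works but isn't well suited to what I'm trying to do, and that it will become a very long process when applied to the whole dataset. I'm not entirely sure, but I think a program like Alteryx would run this kind of thing much faster, so I figure I must be doing something wrong. Any help or ideas to make this faster would be much appreciated!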
Example DataFrame:
import pandas as pd
po_data = pd.DataFrame({'Order Quantity Received Type': ['Order Cancelled - None Received', 'Order Partially Fulfilled'], 'Order Quantity Change Type': ['Order Cancelled', 'Increased'], 'Received Quantity': [0, 3], 'Current Order Quantity': [0, 5]})
Function:
def order_quantity_received(df, output_col, cancelled, received_quant, ordered_quant):
    # With apply(..., axis=1), df is a single row (a Series), so every test below is a scalar comparison
    if (df[cancelled] == "Order Cancelled") and (df[received_quant] == 0):
        df[output_col] = "Order Cancelled - None Received"
    # Cancelled, but some items had already arrived
    elif (df[cancelled] == "Order Cancelled") and (df[received_quant] > 0):
        df[output_col] = "Order Cancelled - Items Received"
    elif df[received_quant] > df[ordered_quant]:
        df[output_col] = "Order Over Fufilled"
    elif (df[received_quant] < df[ordered_quant]) and (df[received_quant] > 0):
        df[output_col] = "Order Partially Fufilled"
    elif df[received_quant] == df[ordered_quant]:
        df[output_col] = "Order Fully Fufilled"
    elif (df[received_quant] == 0) and (df[ordered_quant] > 0):
        df[output_col] = "Order Not Fufilled"
    else:
        df[output_col] = "Error"
    return df
Function call:
po_data = po_data.apply(lambda row: order_quantity_received(row, 'Order Quantity Received Type', 'Order Quantity Change Type', 'Received Quantity', 'Current Order Quantity'), axis=1)
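Reply from a uj5u.com user:
The fastest way with pandas and NumPy is to vectorize the function. Running a function element by element along an array or Series with a for loop, a list comprehension, or apply() is slow, because every element passes through the Python interpreter instead of a single array operation.
I'll just give an example for the "order cancelled" case: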
def order_cancelled(a, b):
    # Define your function logic however you want
    return a - b
Then vectorize your function:
import numpy as np
df['output_col'] = np.vectorize(order_cancelled)(df['cancelled'], df['received_quant'])
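One caveat worth adding: NumPy's documentation describes np.vectorize as a convenience that is essentially a for loop, so it may not give the speed-up hoped for. A rough sketch of a column-wise alternative using np.select follows. The function name order_quantity_received_vectorized is made up for illustration, the labels are copied verbatim from the question's function, and it assumes the "Items Received" branch is meant to test for a received quantity greater than zero.

```python
import numpy as np
import pandas as pd


def order_quantity_received_vectorized(df, output_col, cancelled,
                                       received_quant, ordered_quant):
    """Column-wise sketch of the question's row-wise function (hypothetical name)."""
    is_cancelled = df[cancelled] == "Order Cancelled"
    received = df[received_quant]
    ordered = df[ordered_quant]

    # np.select takes the first condition that is True for each row,
    # which mirrors the if/elif order of the row-wise function.
    conditions = [
        is_cancelled & (received == 0),
        is_cancelled & (received > 0),  # assumption: "> 0" is the intended test
        received > ordered,
        (received < ordered) & (received > 0),
        received == ordered,
        (received == 0) & (ordered > 0),
    ]
    labels = [
        "Order Cancelled - None Received",
        "Order Cancelled - Items Received",
        "Order Over Fufilled",
        "Order Partially Fufilled",
        "Order Fully Fufilled",
        "Order Not Fufilled",
    ]

    out = df.copy()  # leave the caller's frame untouched
    out[output_col] = np.select(conditions, labels, default="Error")
    return out


# Small frame shaped like the question's example data
po_data = pd.DataFrame({
    "Order Quantity Change Type": ["Order Cancelled", "Increased"],
    "Received Quantity": [0, 3],
    "Current Order Quantity": [0, 5],
})

po_data = order_quantity_received_vectorized(
    po_data,
    "Order Quantity Received Type",
    "Order Quantity Change Type",
    "Received Quantity",
    "Current Order Quantity",
)
print(po_data["Order Quantity Received Type"].tolist())
```

Each condition here is evaluated once over whole columns rather than once per row, which is where the speed-up over apply(axis=1) would be expected to come from. Timings on the full 30,000,000 rows are not something this sketch can promise.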
