Multiprocessing on different rows of a matrix

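I have a very large matrix (more than 100k x 100k) with a computation in which each row can be computed independently of the other rows.

I want to use multiprocessing to reduce the computation time (splitting the matrix into 3 slices, each with 1/3 of the rows). However, multiprocessing seems to take longer than a single call that computes all the rows. I am changing a different part of the matrix in each process; is that the problem?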

import multiprocessing, os
import time, pandas as pd, numpy as np

def mat_proc(df):
    print("ID of process running worker1: {}".format(os.getpid()))
    return(df * 3)  # simplified version of process (operator lost in the source; * assumed)
    print('done processing')
          
count=5000

df = pd.DataFrame(np.random.randint(0,10,size=(3*count,3*count)),dtype='int8')
slice1=df.iloc[0:count,]
slice2=df.iloc[count:2*count,]
slice3=df.iloc[2*count:3*count,]

p1=multiprocessing.Process(target=mat_proc,args=(slice1,))
p2=multiprocessing.Process(target=mat_proc,args=(slice2,))
p3=multiprocessing.Process(target=mat_proc,args=(slice3,))

start=time.time()
print('started now')
# this is to compare the multiprocess with a single call to full matrix
#mat_proc(df)

if __name__ == '__main__':   
    p1.start()
    p2.start()
    p3.start()
    p1.join()
    p2.join()
    p3.join()
    
finish=time.time()
print(f'total time taken {round(finish-start,2)}')

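uj5u.com user reply:

When using multiprocessing, move all script-level code into the if __name__ == '__main__' block. Under the spawn start method (the default on Windows and macOS), each new process re-runs your main script, so every process has to recreate the DataFrame, the slices, and so on.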

import multiprocessing, os
import time, pandas as pd, numpy as np


def mat_proc(df):
    print("ID of process running worker1: {}".format(os.getpid()))
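    # simplified version of process (operator lost in the source; * assumed)
    result = df * 3
    print('done processing')  # moved before the return so it actually runs
    return result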


if __name__ == '__main__':
    count = 5000

    df = pd.DataFrame(np.random.randint(0, 10, size=(3 * count, 3 * count)), dtype='int8')
    slice1 = df.iloc[0:count, ]
    slice2 = df.iloc[count:2 * count, ]
    slice3 = df.iloc[2 * count:3 * count, ]

    p1 = multiprocessing.Process(target=mat_proc, args=(slice1,))
    p2 = multiprocessing.Process(target=mat_proc, args=(slice2,))
    p3 = multiprocessing.Process(target=mat_proc, args=(slice3,))

    start = time.time()
    print('started now')
    # this is to compare the multiprocess with a single call to full matrix
    # mat_proc(df)

    p1.start()
    p2.start()
    p3.start()
    p1.join()
    p2.join()
    p3.join()

    finish = time.time()
    print(f'total time taken {round(finish - start, 2)}')

Also consider using multiprocessing.Pool: it lets you choose how many processes to spawn by changing a single number.
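As a rough sketch of that suggestion, the version below splits the rows into as many blocks as there are workers and hands them to a Pool. The helper names split_rows and process_in_chunks are made up for this illustration, and mat_proc is assumed to be a plain multiplication, as in the simplified example.

```python
import multiprocessing

import numpy as np
import pandas as pd


def mat_proc(df):
    # Stand-in for the real per-row computation (assumed to be a multiply).
    return df * 3


def split_rows(df, n_chunks):
    # Hypothetical helper: cut the rows into n_chunks roughly equal blocks.
    bounds = np.linspace(0, len(df), n_chunks + 1, dtype=int)
    return [df.iloc[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])]


def process_in_chunks(df, n_workers):
    # Hypothetical helper: one chunk per worker; n_workers is the only knob.
    with multiprocessing.Pool(processes=n_workers) as pool:
        parts = pool.map(mat_proc, split_rows(df, n_workers))
    # pool.map keeps the input order, so concat restores the original rows.
    return pd.concat(parts)


if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 10, size=(300, 300)), dtype='int8')
    result = process_in_chunks(df, n_workers=3)
    print(result.shape)
```

Changing n_workers is then enough to try 2, 4, or 8 processes without touching the rest of the script.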

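Second, if the computation is cheap (as in the simplified version you provided), sending the data to the workers (pickling and unpickling the DataFrame) takes longer than the computation itself, so multiprocessing ends up slower.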
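One way to check this on your own data is to time a pickle round trip against the computation itself. This is only an approximation of what multiprocessing does when it ships arguments and results between processes; the function names time_call and transfer_vs_compute are invented for this sketch, and the computation is again assumed to be a simple multiply.

```python
import pickle
import time

import numpy as np
import pandas as pd


def time_call(fn):
    # Return (result, elapsed seconds) for a zero-argument callable.
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def transfer_vs_compute(df):
    # Approximate the transfer cost: pickle the DataFrame, then unpickle it.
    restored, transfer = time_call(lambda: pickle.loads(pickle.dumps(df)))
    # The "work" is the simplified computation (assumed to be a multiply).
    _, compute = time_call(lambda: df * 3)
    return restored, transfer, compute


if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 10, size=(2000, 2000)), dtype='int8')
    _, transfer, compute = transfer_vs_compute(df)
    print(f'pickle round trip: {transfer:.4f}s, computation: {compute:.4f}s')
```

If the round trip is not clearly smaller than the computation time, splitting the work across processes is unlikely to pay off.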

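uj5u.com user reply:

Spawning a process is an expensive operation. Unless the work done in the new process is large enough to make the spawn time negligible by comparison, you are better off sticking with a single process.

Another option is multithreading, which has lower overhead than multiprocessing. You have to decide which one to use based on the size of your data and the total processing time.

This article explains the differences and costs well. Check it out!

Also, using multiprocessing.pool.Pool and multiprocessing.pool.ThreadPool would be cleaner. Check the example below and the official documentation for how to use them.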

from multiprocessing.pool import Pool, ThreadPool


def run_parallel(kls):
    # mat_proc, df and count are defined as in the question's script
    slices = [df.iloc[0:count, ], df.iloc[count:2 * count, ], df.iloc[2 * count:3 * count, ]]
    with kls() as pool:
        return pool.map(mat_proc, slices)


if __name__ == '__main__':
    run_parallel(Pool)        # Run with multiprocessing
    run_parallel(ThreadPool)  # Run with multithreading
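To decide between the two for a given workload, it can help to time the threaded version against a plain single call and confirm both give the same result. The sketch below does that with ThreadPool only; run_serial and run_threaded are names invented here, and mat_proc is assumed to be a simple multiply as in the simplified example.

```python
import time
from multiprocessing.pool import ThreadPool

import numpy as np
import pandas as pd


def mat_proc(df):
    # Stand-in for the real per-row computation (assumed to be a multiply).
    return df * 3


def run_serial(df):
    # Baseline: one call over the full matrix.
    return mat_proc(df)


def run_threaded(df, n_threads=3):
    # Split the rows into n_threads blocks and process them in threads.
    bounds = np.linspace(0, len(df), n_threads + 1, dtype=int)
    chunks = [df.iloc[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])]
    with ThreadPool(n_threads) as pool:
        return pd.concat(pool.map(mat_proc, chunks))


if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 10, size=(3000, 3000)), dtype='int8')
    for name, fn in [('serial', run_serial), ('threaded', run_threaded)]:
        start = time.perf_counter()
        fn(df)
        print(f'{name}: {time.perf_counter() - start:.4f}s')
```

Threads share memory, so nothing is pickled here; whether they actually run faster depends on how much of the real computation releases the GIL.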

Please credit the source when reposting. Link to this article: https://www.uj5u.com/gongcheng/350780.html
