Why does using multiple threads make execution slower?

I am using a 2020 MacBook Air M1 (Apple M1 with a 7-core GPU, 8 GB of RAM).

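The problem: I am comparing pairs of arrays, which takes about 11 minutes when run sequentially. Strangely, the more threads I throw at it, the longer it takes to finish (even without the mutex). So far I have tried running it with 2 and with 4 threads.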

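What could the problem be? I assumed that 4 threads would be more efficient, since I have 7 cores available and the run time seems (to me) long enough to make up for the overhead of managing multiple threads.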

Here is the part of the code that I think is relevant to this question:

int const xylen = 1024;
static uint8_t voxelGroups[321536][xylen];
int threadCount = 4;

bool areVoxelGroupsIdentical(uint8_t firstArray[xylen], uint8_t secondArray[xylen]){
    return memcmp(firstArray, secondArray, xylen*sizeof(uint8_t)) == 0;
}

void* getIdenticalVoxelGroupsCount(void* threadNumber){

    for(int i = (int)threadNumber-1; i < 321536-1; i += threadCount){
        for(int j = i + 1; j < 321536; j++){
            if(areVoxelGroupsIdentical(voxelGroups[i], voxelGroups[j])){
                pthread_mutex_lock(&mutex);
                identicalVoxelGroupsCount++;
                pthread_mutex_unlock(&mutex);
             }
        }
    }
    return 0;
}

int main(){
    // some code
    pthread_create(&thread1, NULL, getIdenticalVoxelGroupsCount, (void *)1);
    pthread_create(&thread2, NULL, getIdenticalVoxelGroupsCount, (void *)2);
    pthread_create(&thread3, NULL, getIdenticalVoxelGroupsCount, (void *)3);
    pthread_create(&thread4, NULL, getIdenticalVoxelGroupsCount, (void *)4);

    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
    pthread_join(thread3, NULL);
    pthread_join(thread4, NULL);
    // some code
}

Reply from a uj5u.com user:

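First of all, the lock serializes every increment of identicalVoxelGroupsCount, so adding threads cannot speed up that part. On the contrary, it gets slower because of cache-line bouncing: the cache lines holding the lock and the incremented variable keep moving from one core to another (see: cache coherence protocols). This is usually much slower than doing all the work sequentially, because moving a cache line between cores introduces considerable latency. You do not need a lock. Instead, increment a local variable and perform a single final reduction (for example, by updating an atomic variable at the end of getIdenticalVoxelGroupsCount).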
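The local-counter idea can be sketched roughly as follows. This is a minimal illustration rather than the asker's actual program: the CountTask struct, the countWorker and countIdenticalPairs names, the flat row pointer, and the MAX_THREADS cap are all assumptions made for the example. Each thread counts into a private variable and touches the shared total exactly once.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

enum { XYLEN = 1024 };      /* row length, as in the question */
enum { MAX_THREADS = 16 };  /* arbitrary cap chosen for this sketch */

/* Hypothetical per-thread argument; not part of the original code. */
typedef struct {
    const uint8_t *rows;   /* groupCount rows of XYLEN bytes, stored contiguously */
    int groupCount;        /* number of rows */
    int threadIndex;       /* 0-based index of this thread */
    int threadCount;       /* total number of threads */
    atomic_long *total;    /* shared result, updated once per thread */
} CountTask;

static void *countWorker(void *arg) {
    const CountTask *task = arg;
    long localCount = 0;   /* private to this thread: no lock, no sharing */

    /* Same interleaved split of i as in the question. */
    for (int i = task->threadIndex; i < task->groupCount - 1; i += task->threadCount) {
        const uint8_t *rowI = task->rows + (size_t)i * XYLEN;
        for (int j = i + 1; j < task->groupCount; j++) {
            if (memcmp(rowI, task->rows + (size_t)j * XYLEN, XYLEN) == 0) {
                localCount++;
            }
        }
    }
    /* Final reduction: a single atomic update per thread. */
    atomic_fetch_add(task->total, localCount);
    return NULL;
}

/* Counts identical row pairs using threadCount threads. */
long countIdenticalPairs(const uint8_t *rows, int groupCount, int threadCount) {
    assert(threadCount >= 1 && threadCount <= MAX_THREADS);
    pthread_t threads[MAX_THREADS];
    int started[MAX_THREADS];
    CountTask tasks[MAX_THREADS];
    atomic_long total;
    atomic_init(&total, 0);

    for (int t = 0; t < threadCount; t++) {
        tasks[t] = (CountTask){ rows, groupCount, t, threadCount, &total };
        started[t] = pthread_create(&threads[t], NULL, countWorker, &tasks[t]) == 0;
        if (!started[t]) {
            countWorker(&tasks[t]);  /* thread creation failed: run this share inline */
        }
    }
    for (int t = 0; t < threadCount; t++) {
        if (started[t]) {
            pthread_join(threads[t], NULL);
        }
    }
    return atomic_load(&total);
}
```

With the array from the question, the call would look like countIdenticalPairs(&voxelGroups[0][0], 321536, 4).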

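Moreover, interleaving the loop iterations is inefficient, because most of the cache lines holding voxelGroups end up shared between threads. This matters less than the first point, since the threads only read those cache lines. Still, it can increase the required memory throughput and create a bottleneck. A more efficient approach is to split the iterations into relatively large contiguous blocks. It may be better still to split the blocks into medium-grained tiles so the cache is used more effectively (although that optimization is independent of the parallelization strategy).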
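One possible shape for the block split and the tiling is sketched below. The blockRange and countPairsInBlock helpers and the TILE_ROWS value are assumptions for illustration, not something taken from the question, and the tile size would need to be tuned by measurement. Because row i is only compared with the rows after it, early blocks contain more work than late ones, so using more blocks than threads is probably needed to balance the load.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { XYLEN = 1024 };    /* row length, as in the question */
enum { TILE_ROWS = 64 };  /* assumed tile height; tune by measurement */

/* Computes the half-open range [*begin, *end) of block blockIndex when
   `total` iterations are split into blockCount contiguous blocks. */
void blockRange(int total, int blockCount, int blockIndex, int *begin, int *end) {
    assert(blockCount > 0 && blockIndex >= 0 && blockIndex < blockCount);
    int base = total / blockCount;
    int extra = total % blockCount;  /* the first `extra` blocks get one more */
    *begin = blockIndex * base + (blockIndex < extra ? blockIndex : extra);
    *end = *begin + base + (blockIndex < extra ? 1 : 0);
}

/* Counts identical pairs (i, j) with begin <= i < end and i < j < groupCount.
   The j range is walked in tiles, so each tile of rows is reused for every
   i of the block before the next tile is loaded. */
long countPairsInBlock(const uint8_t *rows, int groupCount, int begin, int end) {
    long count = 0;
    for (int tileStart = begin + 1; tileStart < groupCount; tileStart += TILE_ROWS) {
        int tileEnd = tileStart + TILE_ROWS < groupCount ? tileStart + TILE_ROWS
                                                         : groupCount;
        /* Rows with i + 1 >= tileEnd have no partner in this tile. */
        for (int i = begin; i < end && i < tileEnd - 1; i++) {
            const uint8_t *rowI = rows + (size_t)i * XYLEN;
            int jStart = i + 1 > tileStart ? i + 1 : tileStart;
            for (int j = jStart; j < tileEnd; j++) {
                if (memcmp(rowI, rows + (size_t)j * XYLEN, XYLEN) == 0) {
                    count++;
                }
            }
        }
    }
    return count;
}
```

Each thread would call countPairsInBlock on its own block or blocks and keep the result in a local variable, so the partial counts can be summed once at the end.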

Note that OpenMP lets you do this kind of operation easily and efficiently in C.
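A rough OpenMP version might look like the sketch below. The function name, the flat row pointer, and the chunk size of 256 are assumptions for illustration. The reduction clause gives each thread a private counter and sums them when the loop ends, and schedule(dynamic, 256) hands out contiguous chunks of i on demand, which helps with the uneven amount of work per i.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { XYLEN = 1024 };  /* row length, as in the question */

/* Counts identical row pairs; rows holds groupCount rows of XYLEN bytes. */
long countIdenticalPairsOmp(const uint8_t *rows, int groupCount) {
    assert(rows != NULL || groupCount == 0);
    long count = 0;

    /* Each thread accumulates into its own copy of count; OpenMP adds the
       copies together after the loop. The chunk size is an assumed value. */
    #pragma omp parallel for reduction(+:count) schedule(dynamic, 256)
    for (int i = 0; i < groupCount - 1; i++) {
        const uint8_t *rowI = rows + (size_t)i * XYLEN;
        for (int j = i + 1; j < groupCount; j++) {
            if (memcmp(rowI, rows + (size_t)j * XYLEN, XYLEN) == 0) {
                count++;
            }
        }
    }
    return count;
}
```

The pragma only takes effect when OpenMP is enabled at compile time (for example with -fopenmp on GCC). Without it, the compiler ignores the pragma and the loop runs sequentially with the same result.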
