Why can't Parallel.For speed up heap-intensive operations?

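For some operations, Parallel scales well with the number of CPUs, but for others it does not.

Consider the code below: function1 gets a 10x improvement, while function2 gets only a 3x improvement. Is this due to memory allocation, or to the GC?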

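using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// CPU-bound: no heap allocation inside the loop
void function1(int v) {
    for (int i = 0; i < 100000000; i++) {
        var q = Math.Sqrt(v);
    }
}
// Allocation-heavy: the dictionary grows to 10 million entries
void function2(int v) {
    Dictionary<int, int> dict = new Dictionary<int, int>();
    for (int i = 0; i < 10000000; i++) {
        dict.Add(i, v);
    }
}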
var sw = new System.Diagnostics.Stopwatch();

var iterations = 100;

sw.Restart();
for (int v = 0; v < iterations; v++) function1(v);
sw.Stop();
Console.WriteLine("function1 no parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));

sw.Restart();
Parallel.For(0, iterations, function1);
sw.Stop();
Console.WriteLine("function1 with parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));

sw.Restart();
for (int v = 0; v < iterations; v++) function2(v);
sw.Stop();
Console.WriteLine("function2 no parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));

sw.Restart();
Parallel.For(0, iterations, function2);
sw.Stop();
Console.WriteLine("function2 parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));

Output on my machine:

function1   no parallel:  2 059,4 ms
function1 with parallel:    213,7 ms
function2   no parallel: 14 192,8 ms
function2      parallel:  4 491,1 ms

Environment:
Win 11, .NET 6.0, Release build
12th-gen i9, 16 cores, 24 logical processors, 32 GB DDR5

After more testing, it seems that memory allocation does not scale well across multiple threads. For example, if I change function2 to:

void function2(int v) {
    Dictionary<int, int> dict = new Dictionary<int, int>(10000000);
}

The result is:

function2   no parallel:   124,0 ms
function2      parallel:   402,4 ms

Is the conclusion that memory allocation does not scale well with multiple threads? ...

Answer from a uj5u.com user:

The first function works in registers. More cores = more registers.

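The second function works on memory. More cores = only more L1 cache, while RAM is shared. A dataset of 10 million elements is necessarily served from RAM, because even L3 is not big enough to hold it. This assumes the language's JIT optimizes the allocation into a reused buffer. If it does not, there is allocation overhead as well. So you should reuse the dictionary on each new iteration instead of recreating it.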
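One way the "reuse instead of recreate" advice could look is sketched below. It relies on the `Parallel.For` overload that takes `localInit` and `localFinally` delegates, so each worker task builds one dictionary and clears it between iterations. The constants `Items` and `Iterations` and the two counters are illustrative names, not part of the question, and the sizes are reduced so the sketch runs quickly. Whether this actually helps on a given machine would need to be measured.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

const int Items = 200_000;     // assumption: much smaller than the question's 10,000,000
const int Iterations = 32;     // assumption: fewer outer iterations than the question's 100

int dictionariesCreated = 0;   // how many times localInit ran
long totalEntries = 0;         // total number of entries written across all iterations

Parallel.For<Dictionary<int, int>>(
    0, Iterations,
    // localInit: runs once per worker task, not once per iteration
    () =>
    {
        Interlocked.Increment(ref dictionariesCreated);
        return new Dictionary<int, int>(Items);
    },
    // body: reuse the worker's dictionary; Clear() keeps the already allocated storage
    (v, loopState, dict) =>
    {
        dict.Clear();
        for (int i = 0; i < Items; i++) dict.Add(i, v);
        Interlocked.Add(ref totalEntries, dict.Count);
        return dict;
    },
    // localFinally: nothing to merge in this sketch
    dict => { });

Console.WriteLine($"dictionaries created: {dictionariesCreated}, entries written: {totalEntries}");
```

The number of dictionaries created depends on how many worker tasks the scheduler starts, so it varies between runs and machines.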

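You are also storing the data under incrementing integer keys. A plain array would work here, and it can of course be reused between iterations. It should take less memory than a dictionary.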
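A minimal sketch of the array variant follows, assuming the keys are always 0..N-1 as in the question. It rents buffers from `ArrayPool<int>.Shared` so that arrays are recycled across iterations rather than allocated each time. `FillBuffer`, `Items`, `Iterations`, and `checksum` are hypothetical names used only for illustration, and the sizes are reduced.

```csharp
using System;
using System.Buffers;
using System.Threading;
using System.Threading.Tasks;

const int Items = 200_000;   // assumption: reduced from the question's 10,000,000
const int Iterations = 32;

// Fills the first `count` slots; the array index plays the role of the dictionary key.
static void FillBuffer(int[] buffer, int count, int v)
{
    for (int i = 0; i < count; i++) buffer[i] = v;
}

long checksum = 0;
Parallel.For(0, Iterations, v =>
{
    // Rent may return an array longer than requested, so only the first Items slots are used.
    int[] buffer = ArrayPool<int>.Shared.Rent(Items);
    try
    {
        FillBuffer(buffer, Items, v);
        Interlocked.Add(ref checksum, buffer[Items - 1]);
    }
    finally
    {
        // Hand the array back so a later iteration can reuse it.
        ArrayPool<int>.Shared.Return(buffer);
    }
});

Console.WriteLine($"checksum: {checksum}");
```

An `int[]` of N elements stores 4 bytes per element, while a `Dictionary<int, int>` also stores hash codes, chain links, and a bucket table, which is why the array should be the smaller of the two.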

Answer from a uj5u.com user:

tl;dr: heap allocation contention.
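To check how much the garbage collector is involved, one option is to count collections around each workload. The sketch below uses `GC.CollectionCount` and `GCSettings.IsServerGC`; the helper `CountCollections` and the workload sizes are assumptions made for illustration. Arrays of 85,000 bytes or more are placed on the large object heap, which is collected together with generation 2, so a large pre-sized dictionary may show up in the gen 2 count.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime;
using System.Threading.Tasks;

// Hypothetical helper: returns how many gen 0, gen 1 and gen 2 collections
// happened while the workload ran.
static int[] CountCollections(Action workload)
{
    int[] before = { GC.CollectionCount(0), GC.CollectionCount(1), GC.CollectionCount(2) };
    workload();
    return new[]
    {
        GC.CollectionCount(0) - before[0],
        GC.CollectionCount(1) - before[1],
        GC.CollectionCount(2) - before[2],
    };
}

Console.WriteLine($"Server GC: {GCSettings.IsServerGC}, processors: {Environment.ProcessorCount}");

// CPU-bound workload, similar in spirit to function1 (no allocations in the loop).
int[] cpuBound = CountCollections(() =>
    Parallel.For(0, 16, v =>
    {
        double q = 0;
        for (int i = 0; i < 1_000_000; i++) q += Math.Sqrt(v + i);
    }));

// Allocation-heavy workload, similar in spirit to the pre-sized dictionary in the question.
int[] allocating = CountCollections(() =>
    Parallel.For(0, 16, v =>
    {
        var d = new Dictionary<int, int>(1_000_000);
        d.Add(0, v);
    }));

Console.WriteLine($"cpu-bound   gen0/1/2: {cpuBound[0]}/{cpuBound[1]}/{cpuBound[2]}");
Console.WriteLine($"allocating  gen0/1/2: {allocating[0]}/{allocating[1]}/{allocating[2]}");
```

The counts differ between machines and GC modes, so the numbers are only useful when compared across runs on the same setup.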

Your first function is ...

[Figure: CPU usage. Red: 25%, Orange: 56%, Green: 75%, Blue: 100%]

With task parallelism we achieved over 20x performance using 100% of the CPU threads (in this example; it is not always like that).

With read-only data parallelism and some computation, we ran nearly 6.5x faster at 56% CPU usage (with less computation the difference would be smaller).

But when trying to implement "real parallelism" on data that is being written, performance is more than twice as slow, and the CPU cannot reach its full potential: it sits at only 25% usage due to synchronization.
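The cost of synchronized writes can be illustrated with a small sketch, which is not the benchmark behind the numbers above. The first loop takes one shared lock on every iteration; the second accumulates per worker and merges once per task through `localFinally`. The names `N`, `gate`, `lockedSum`, and `localSum` are illustrative.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

const int N = 1_000_000;

// Variant 1: every iteration takes the same lock to write shared state,
// so the worker threads spend much of their time waiting for each other.
long lockedSum = 0;
object gate = new object();
Parallel.For(0, N, i =>
{
    lock (gate) { lockedSum += i; }
});

// Variant 2: each worker keeps a private subtotal and publishes it once at the end.
long localSum = 0;
Parallel.For<long>(0, N,
    () => 0L,                                               // per-worker starting value
    (i, state, subtotal) => subtotal + i,                   // no shared writes in the hot path
    subtotal => Interlocked.Add(ref localSum, subtotal));   // one synchronized merge per worker

Console.WriteLine($"locked: {lockedSum}, local: {localSum}");
```

Both variants compute the same sum; timing each one with a Stopwatch on the target machine would show how much the shared lock costs there.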

Conclusion: using Parallel.For does not guarantee that your code will actually run in parallel, nor that it will run faster. It requires prior preparation of the code and data, careful analysis, benchmarking, and tuning.

See also this Microsoft documentation on potential pitfalls in parallel code:

https://docs.microsoft.com/pt-br/dotnet/standard/parallel-programming/potential-pitfalls-in-data-and-task-parallelism
