Measuring OpenMP fork/join latency

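Since MPI-3 comes with shared-memory parallelism, and it seems to be a perfect match for my application, I am seriously considering rewriting my hybrid OpenMP-MPI code as a pure MPI implementation.

To drive the last nail into the coffin, I decided to run a small program that tests the latency of the OpenMP fork/join mechanism. Here is the code (written for the Intel compiler):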

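#include <cmath>
#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <thread>
#include <vector>

#include <omp.h>

void action1(std::vector<double>& t1, std::vector<double>& t2)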
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = std::sin(t2.data()[index]) * std::cos(t2.data()[index]);
    }
}

void action2(std::vector<double>& t1, std::vector<double>& t2)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = t2.data()[index] * std::sin(t2.data()[index]);
    }
}

void action3(std::vector<double>& t1, std::vector<double>& t2)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = t2.data()[index] * t2.data()[index];
    }
}

void action4(std::vector<double>& t1, std::vector<double>& t2)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = std::sqrt(t2.data()[index]);
    }
}

void action5(std::vector<double>& t1, std::vector<double>& t2)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = t2.data()[index] * 2.0;
    }
}

void all_actions(std::vector<double>& t1, std::vector<double>& t2)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = std::sin(t2.data()[index]) * std::cos(t2.data()[index]);
        t1.data()[index] = t2.data()[index] * std::sin(t2.data()[index]);
        t1.data()[index] = t2.data()[index] * t2.data()[index];
        t1.data()[index] = std::sqrt(t2.data()[index]);
        t1.data()[index] = t2.data()[index] * 2.0;
    }
}


int main()
{
    // decide the process parameters
    const auto n = std::size_t{8000000};
    const auto test_count = std::size_t{500};
    
    // garbage data...
    auto t1 = std::vector<double>(n);
    auto t2 = std::vector<double>(n);
    
    // /////////////////
    // perform actions one after the other
    // /////////////////
    
    const auto sp = timer::spot_timer();
    const auto dur1 = sp.duration_in_us();
    for (auto index = std::size_t{}; index < test_count; ++index)
    {
        #pragma noinline
        action1(t1, t2);
        #pragma noinline
        action2(t1, t2);
        #pragma noinline
        action3(t1, t2);
        #pragma noinline
        action4(t1, t2);
        #pragma noinline
        action5(t1, t2);
    }
    const auto dur2 = sp.duration_in_us();
    
    // /////////////////
    // perform all actions at once
    // /////////////////
    const auto dur3 = sp.duration_in_us();
    for (auto index = std::size_t{}; index < test_count; ++index)
    {
        #pragma noinline
        all_actions(t1, t2);
    }
    const auto dur4 = sp.duration_in_us();
    
    const auto a = dur2 - dur1;
    const auto b = dur4 - dur3;
    if (a < b)
    {
        throw std::logic_error("negative_latency_error");
    }
    const auto fork_join_latency = (a - b) / (test_count * 4);
    
    // report
    std::cout << "Ran the program with " << omp_get_max_threads() << ", the calculated fork/join latency is: " << fork_join_latency << " us" << std::endl;
    
    return 0;
}

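As you can see, the idea is to perform a set of actions separately (each in its own OpenMP loop) and measure the average duration, and then to perform all of these actions together (in a single OpenMP loop) and measure the average duration again. This gives a linear system of two equations in two unknowns, one of which is the latency of the fork/join mechanism, and the system can be solved for that value.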
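Written out, the model behind the measurement looks roughly like this. The symbols are introduced here for illustration only: N is test_count, W is the pure compute time of the five loop bodies per iteration, and F is the cost of one fork/join. The model assumes that W is identical in both variants, which is exactly the assumption the answers below call into question.

```latex
\begin{aligned}
a &= N\,(W + 5F) && \text{five separate parallel loops per iteration} \\
b &= N\,(W + F)  && \text{one merged parallel loop per iteration} \\
F &= \frac{a - b}{4N}
\end{aligned}
```

This matches the expression `(a - b) / (test_count * 4)` in the code above.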

Questions:

Am I overlooking something?
Currently, I am using "-O0" to prevent smarty-pants compiler from doing its funny business. Which compiler optimizations should I use, would these also have an effect on the latency itself etc etc?
On my Coffee Lake processor with 6 cores, I measured a latency of ~850 us. Does this sound about right?

Edit 3

1) I've included a warm-up calculation in the beginning upon @paleonix's suggestion,
2) I've reduced the number of actions for simplicity, and,
3) I've switched to 'omp_get_wtime' to make it universally understandable.

I am now running the following code with flag -O3:

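#include <cmath>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

#include <omp.h>

void action1(std::vector<double>& t1)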
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = std::sin(t1.data()[index]);
    }
}

void action2(std::vector<double>& t1)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] =  std::cos(t1.data()[index]);
    }
}

void action3(std::vector<double>& t1)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = std::atan(t1.data()[index]);
    }
}

void all_actions(std::vector<double>& t1, std::vector<double>& t2, std::vector<double>& t3)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        #pragma optimize("", off)
        t1.data()[index] = std::sin(t1.data()[index]);
        t2.data()[index] = std::cos(t2.data()[index]);
        t3.data()[index] = std::atan(t3.data()[index]);
        #pragma optimize("", on)
    }
}


int main()
{
    // decide the process parameters
    const auto n = std::size_t{1500000}; // 12 MB (way too big for any cache)
    const auto experiment_count = std::size_t{1000};
    
    // garbage data...
    auto t1 = std::vector<double>(n);
    auto t2 = std::vector<double>(n);
    auto t3 = std::vector<double>(n);
    auto t4 = std::vector<double>(n);
    auto t5 = std::vector<double>(n);
    auto t6 = std::vector<double>(n);
    auto t7 = std::vector<double>(n);
    auto t8 = std::vector<double>(n);
    auto t9 = std::vector<double>(n);
    
    // /////////////////
    // warm-up, initialization of threads etc.
    // /////////////////
    for (auto index = std::size_t{}; index < experiment_count / 10; ++index)
    {
        all_actions(t1, t2, t3);
    }
    
    // /////////////////
    // perform actions (part A)
    // /////////////////
    
    const auto dur1 = omp_get_wtime();
    for (auto index = std::size_t{}; index < experiment_count; ++index)
    {
        action1(t4);
        action2(t5);
        action3(t6);
    }
    const auto dur2 = omp_get_wtime();
    
    // /////////////////
    // perform all actions at once (part B)
    // /////////////////

    const auto dur3 = omp_get_wtime();
    #pragma nofusion
    for (auto index = std::size_t{}; index < experiment_count; ++index)
    {
        all_actions(t7, t8, t9);
    }
    const auto dur4 = omp_get_wtime();
    
    const auto a = dur2 - dur1;
    const auto b = dur4 - dur3;
    const auto fork_join_latency = (a - b) / (experiment_count * 2);
    
    // report
    std::cout << "Ran the program with " << omp_get_max_threads() << ", the calculated fork/join latency is: "
        << fork_join_latency * 1E6 << " us" << std::endl;
    
    return 0;
}

With this, the measured latency is now 115 us. What's puzzling me now is that this value changes when the actions are changed. According to my logic, since I'm doing the same action in both parts A and B, there should actually be no change. Why is this happening?

Answer:

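TL;DR: cores do not run at exactly the same speed because of dynamic frequency scaling, and a lot of noise can affect the execution, which results in expensive synchronizations. Your benchmark mostly measures this synchronization overhead. Using a single parallel section should solve the problem.

The benchmark is rather flawed. This code does not actually measure the "latency" of OpenMP fork/join sections. It measures a combination of many overheads, including:

Load balancing and synchronization: the split loops synchronize more often than the big merged loop (5 times more). Synchronizations are expensive, not because of the communication overhead, but because of the actual synchronization between cores that are inherently not synchronized. Indeed, a slight work imbalance between threads makes the other threads wait for the slowest one to finish. You might think this should not happen with static scheduling, but context switches and dynamic frequency scaling cause some threads to be slower than others. Context switches matter especially if threads are not bound to cores or if the OS schedules other programs during the computation. Dynamic frequency scaling (e.g. Intel Turbo Boost) causes some threads (or groups of threads) to run faster depending on the workload, the temperature of each core and of the overall package, the number of active cores, the estimated power consumption, etc. The higher the number of cores, the higher the synchronization overhead. Note that this overhead depends on the time taken by the loops. For more information, please read the analysis below.

Performance of loop splitting: merging the 5 loops into a single one changes the generated assembly code (fewer instructions are needed) and also changes the loads/stores in the cache (the memory access pattern is a bit different). Not to mention that it could theoretically affect vectorization, although ICC does not vectorize this specific code. That being said, this does not appear to be the main issue on my machine: with Clang I cannot reproduce the problem when running the program sequentially, while I can with many threads.

To solve this problem, you can use a single parallel section. The omp for loops must use the nowait clause so as not to introduce synchronizations. Alternatively, a task-based construct such as taskloop with the nogroup clause can achieve the same thing. In both cases you should be careful about dependencies, since multiple for-loops/task-loops can run in parallel. This is fine in your current code.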
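As a minimal sketch of the first suggestion (one parallel region, `nowait` on every worksharing loop), something along these lines could replace the three separate `action` calls. The function name `three_loops_one_region` and its parameters are hypothetical and not from the question. The sketch relies on the three vectors being independent, since with `nowait` different threads may be working on different loops at the same time.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): applies three independent
// element-wise operations inside ONE parallel region.
void three_loops_one_region(std::vector<double>& a,
                            std::vector<double>& b,
                            std::vector<double>& c)
{
    #pragma omp parallel
    {
        // nowait drops the implicit barrier after each worksharing loop, so a
        // thread that finishes its chunk of `a` moves straight on to `b`.
        #pragma omp for nowait schedule(static)
        for (std::size_t i = 0; i < a.size(); ++i)
        {
            a[i] = std::sin(a[i]);
        }

        #pragma omp for nowait schedule(static)
        for (std::size_t i = 0; i < b.size(); ++i)
        {
            b[i] = std::cos(b[i]);
        }

        #pragma omp for nowait schedule(static)
        for (std::size_t i = 0; i < c.size(); ++i)
        {
            c[i] = std::atan(c[i]);
        }
    } // the only barrier left is the implicit one that ends the parallel region
}
```

Called as `three_loops_one_region(t4, t5, t6);`, this pays for one fork/join and one barrier per iteration instead of three. Built without `-fopenmp` the pragmas are ignored and the function simply runs the three loops serially.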

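Analysis

Analyzing the effect of short synchronizations caused by execution noise (context switches, frequency scaling, cache effects, OS interruptions, etc.) is pretty hard, because in your case the slowest thread at a given synchronization is probably never the same one (the work is well balanced between threads, but their speeds are not perfectly equal).

That being said, if this hypothesis is true, fork_join_latency should depend on n, so increasing n should also increase fork_join_latency. Here is what I get with Clang 13 and IOMP (using -fopenmp -O3) on my 6-core i5-9600KF processor: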

n=   80'000    fork_join_latency<0.000001
n=  800'000    fork_join_latency=0.000036
n= 8'000'000   fork_join_latency=0.000288
n=80'000'000   fork_join_latency=0.003236

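Note that the fork_join_latency timings are not very stable in practice, but the behavior is clearly visible: the measured overhead depends on n.

A better solution is to measure the synchronization time directly, by timing the loop on each thread and accumulating the difference between the minimum and maximum times. Here is a code example: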

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

#include <omp.h>

double totalSyncTime = 0.0;

void action1(std::vector<double>& t1)
{
    constexpr int threadCount = 6;
    double timePerThread[threadCount] = {0};

    #pragma omp parallel
    {
        const double start = omp_get_wtime();
        #pragma omp for nowait schedule(static) //num_threads(std::thread::hardware_concurrency())
        #pragma nounroll
        for (auto index = std::size_t{}; index < t1.size(); ++index)
        {
            t1.data()[index] = std::sin(t1.data()[index]);
        }
        const double stop = omp_get_wtime();
        const double threadLoopTime = (stop - start);
        timePerThread[omp_get_thread_num()] = threadLoopTime;
    }

    const double mini = *std::min_element(timePerThread, timePerThread + threadCount);
    const double maxi = *std::max_element(timePerThread, timePerThread + threadCount);
    const double syncTime = maxi - mini;
    totalSyncTime += syncTime;
}

You can then divide totalSyncTime the same way you did for fork_join_latency and print the result. I get 0.000284 with fork_join_latency=0.000398 (with n=8'000'000) which almost proves that a major part of the overhead is due to synchronizations and more especially due to a slightly different thread execution velocity. Note that this overhead does not include the implicit barrier at the end of the OpenMP parallel section.
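The code above hard-codes `threadCount = 6`. A variant that sizes the per-thread array at run time might look like the following sketch. The function name `loop_imbalance` is hypothetical, `std::chrono::steady_clock` is swapped in for `omp_get_wtime`, and the serial fallbacks under `#else` are only there so the sketch also builds without `-fopenmp`.

```cpp
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks (assumption: lets the sketch build without -fopenmp).
inline int omp_get_max_threads() { return 1; }
inline int omp_get_thread_num() { return 0; }
#endif

// Hypothetical helper: runs one parallel loop and returns the spread, in
// seconds, between the fastest and the slowest thread of the team.
double loop_imbalance(std::vector<double>& v)
{
    using clock = std::chrono::steady_clock;
    // -1 marks slots of threads that never joined the team.
    std::vector<double> seconds(
        static_cast<std::size_t>(omp_get_max_threads()), -1.0);

    #pragma omp parallel
    {
        const auto start = clock::now();
        #pragma omp for nowait schedule(static)
        for (std::size_t i = 0; i < v.size(); ++i)
        {
            v[i] = std::sin(v[i]);
        }
        // Taken before the region's final barrier thanks to nowait.
        const std::chrono::duration<double> elapsed = clock::now() - start;
        seconds[static_cast<std::size_t>(omp_get_thread_num())] = elapsed.count();
    }

    double fastest = std::numeric_limits<double>::max();
    double slowest = 0.0;
    for (const double s : seconds)
    {
        if (s < 0.0) continue; // unused slot
        fastest = std::min(fastest, s);
        slowest = std::max(slowest, s);
    }
    return slowest - fastest;
}
```

Accumulating the return value over all calls gives the same quantity as `totalSyncTime` above. With a single thread the spread is zero by construction.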

Answer:

Here is my attempt at measuring the fork-join overhead:

#include <iostream>
#include <string>

#include <omp.h>

constexpr int n_warmup = 10'000;
constexpr int n_measurement = 100'000;
constexpr int n_spins = 1'000;

void spin() {
    volatile bool flag = false;
    for (int i = 0; i < n_spins; ++i) {
        if (flag) {
            break;
        }
    }
}

void bench_fork_join(int num_threads) {
    omp_set_num_threads(num_threads);

    // create threads, warmup
    for (int i = 0; i < n_warmup; ++i) {
        #pragma omp parallel
        spin();
    }

    double const start = omp_get_wtime();
    for (int i = 0; i < n_measurement; ++i) {
        #pragma omp parallel
        spin();
    }
    double const stop = omp_get_wtime();
    double const ptime = (stop - start) * 1e6 / n_measurement;

    // warmup
    for (int i = 0; i < n_warmup; ++i) {
        spin();
    }
    double const sstart = omp_get_wtime();
    for (int i = 0; i < n_measurement; ++i) {
        spin();
    }
    double const sstop = omp_get_wtime();
    double const stime = (sstop - sstart) * 1e6 / n_measurement;

    std::cout << ptime << " us\t- " << stime << " us\t= " << ptime - stime << " us\n";
}

int main(int argc, char **argv) {
    auto const params = argc - 1;
    std::cout << "parallel\t- sequential\t= overhead\n";

    for (int j = 0; j < params; ++j) {
        auto num_threads = std::stoi(argv[1 + j]);
        std::cout << "---------------- num_threads = " << num_threads << " ----------------\n";
        bench_fork_join(num_threads);
    }

    return 0;
}

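You can call it with several different thread counts. To get reasonable results, none of them should be higher than the number of cores on your machine. On my machine with 6 cores, compiling with gcc 11.2, I get

$ g++ -fopenmp -O3 -DNDEBUG -o bench-omp-fork-join bench-omp-fork-join.cpp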
$ ./bench-omp-fork-join 6 4 2 1
parallel        - sequential    = overhead
---------------- num_threads = 6 ----------------
1.51439 us      - 0.273195 us   = 1.24119 us
---------------- num_threads = 4 ----------------
1.24683 us      - 0.276122 us   = 0.970708 us
---------------- num_threads = 2 ----------------
1.10637 us      - 0.270865 us   = 0.835501 us
---------------- num_threads = 1 ----------------
0.708679 us     - 0.269508 us   = 0.439171 us

In each line the first number is the average (over 100'000 iterations) with threads and the second number is the average without threads. The last number is the difference between the first two and should be an upper bound on the fork-join overhead.

Make sure that the numbers in the middle column (no threads) are approximately the same in every row, as they should be independent of the number of threads. If they aren't, make sure there is nothing else running on the computer and/or increase the number of measurements and/or warmup runs.

In regard to exchanging OpenMP for MPI, keep in mind that MPI is still multiprocessing and not multithreading. You might pay a lot of memory overhead because processes tend to be much bigger than threads.

EDIT:

Revised benchmark to use spinning on a volatile flag instead of sleeping (Thanks @Jérôme Richard). As Jérôme Richard mentioned in his answer, the measured overhead grows with n_spins. Setting n_spins below 1000 didn't significantly change the measurement for me, so that is where I measured. As one can see above, the measured overhead is way lower than what the earlier version of the benchmark measured.

The inaccuracy of sleeping is a problem especially because one will always measure the thread that sleeps the longest and therefore get a bias to longer times, even if sleep times themselves would be distributed symmetrically around the input time.
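The bias described above can be illustrated with a small simulation. This is a hypothetical sketch and not part of the benchmark: the function name `mean_join_time`, the normal distribution, and its parameters (mean 100 us, standard deviation 10 us) are all made-up choices for illustration.

```cpp
#include <algorithm>
#include <random>

// Hypothetical simulation: each of `threads` workers "sleeps" for a duration
// drawn from a symmetric distribution. The join has to wait for the slowest
// worker, so the per-trial maximum is what a timer around the region sees.
double mean_join_time(int threads, int trials, unsigned seed)
{
    std::mt19937 rng(seed);
    std::normal_distribution<double> sleep_us(100.0, 10.0); // made-up values

    double sum = 0.0;
    for (int t = 0; t < trials; ++t)
    {
        double slowest = sleep_us(rng);
        for (int k = 1; k < threads; ++k)
        {
            slowest = std::max(slowest, sleep_us(rng));
        }
        sum += slowest;
    }
    return sum / trials;
}
```

With one worker the average stays close to the nominal 100 us. With six workers it comes out clearly higher, even though no individual sleep time is biased, because only the maximum of each group is observed.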

Please credit the source when reposting. Link to this article: https://www.uj5u.com/ruanti/424579.html

Tags: c++ multithreading openmp performance-testing latency
