OpenMP atomic substantially slower than critical for an array

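The examples I have seen of OpenMP's omp atomic generally involve updating a scalar, and they usually report that it is faster than omp critical. In my application I want to update elements of an allocated array, with some overlap between the elements that different threads update, and I find that atomic is substantially slower than critical. Does it make a difference that it is an array, and am I using it correctly?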

#include <stdlib.h>
#include <assert.h>
#include <omp.h>

#define N_EACH 10000000
#define N_OVERLAP 100000

#if !defined(OMP_CRITICAL) && !defined(OMP_ATOMIC)
#error Must define OMP_CRITICAL or OMP_ATOMIC
#endif
#if defined(OMP_CRITICAL) && defined(OMP_ATOMIC)
#error Must define only one of either OMP_CRITICAL or OMP_ATOMIC
#endif

int main(void) {

  int const n = omp_get_max_threads() * N_EACH -
                (omp_get_max_threads() - 1) * N_OVERLAP;
  int *const a = (int *)calloc(n, sizeof(int));

#pragma omp parallel
  {
    int const thread_idx = omp_get_thread_num();
    int i;
#ifdef OMP_CRITICAL
#pragma omp critical
#endif /* OMP_CRITICAL */
    for (i = 0; i < N_EACH; i++) {
#ifdef OMP_ATOMIC
#pragma omp atomic update
#endif /* OMP_ATOMIC */
      a[thread_idx * (N_EACH - N_OVERLAP) + i] += i;
    }
  }

/* Check result is correct */
#ifndef NDEBUG
  {
    int *const b = (int *)calloc(n, sizeof(int));
    int thread_idx;
    int i;
    for (thread_idx = 0; thread_idx < omp_get_max_threads(); thread_idx++) {
      for (i = 0; i < N_EACH; i++) {
        b[thread_idx * (N_EACH - N_OVERLAP) + i] += i;
      }
    }
    for (i = 0; i < n; i++) {
      assert(a[i] == b[i]);
    }
    free(b);
  }
#endif /* NDEBUG */

  free(a);
}

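Note that in this simplified example we could determine ahead of time which elements will overlap, so it would be more efficient to apply atomic/critical only when updating those elements, but in my real application this is not possible.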
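For reference, the "only synchronize the overlapping elements" idea mentioned above could look roughly like the sketch below. It is not from the original post: update_selective and matches_serial are hypothetical helpers, and the sketch loops over blocks with parallel for instead of indexing by omp_get_thread_num(), so that the result does not depend on the thread count (it also builds without -fopenmp, in which case the pragmas are ignored). It assumes n_each >= 2 * n_overlap, so that only neighbouring blocks share elements.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper, not part of the original post.
   Block b covers n_each elements starting at b * (n_each - n_overlap),
   so neighbouring blocks share n_overlap elements. */
static void update_selective(int *a, int n_blocks, int n_each, int n_overlap) {
  int b;
  /* Assumption: only direct neighbours overlap. */
  assert(n_overlap >= 0 && n_each >= 2 * n_overlap);
#pragma omp parallel for schedule(static, 1)
  for (b = 0; b < n_blocks; b++) {
    int *const mine = a + (size_t)b * (size_t)(n_each - n_overlap);
    int i;
    /* Leading overlap, shared with block b-1. For the first block the
       atomic is unnecessary but harmless. */
    for (i = 0; i < n_overlap; i++) {
#pragma omp atomic update
      mine[i] += i;
    }
    /* Interior: touched by this block only, so no synchronization. */
    for (i = n_overlap; i < n_each - n_overlap; i++) {
      mine[i] += i;
    }
    /* Trailing overlap, shared with block b+1. For the last block the
       atomic is unnecessary but harmless. */
    for (i = n_each - n_overlap; i < n_each; i++) {
#pragma omp atomic update
      mine[i] += i;
    }
  }
}

/* Recomputes the result serially and returns 1 if `a` matches, else 0. */
static int matches_serial(const int *a, int n_blocks, int n_each,
                          int n_overlap) {
  size_t const stride = (size_t)(n_each - n_overlap);
  size_t const n = (size_t)n_blocks * stride + (size_t)n_overlap;
  int *const ref = (int *)calloc(n, sizeof(int));
  size_t k;
  int b, i, ok = 1;
  if (ref == NULL) {
    return 0;
  }
  for (b = 0; b < n_blocks; b++) {
    for (i = 0; i < n_each; i++) {
      ref[(size_t)b * stride + (size_t)i] += i;
    }
  }
  for (k = 0; k < n; k++) {
    if (a[k] != ref[k]) {
      ok = 0;
    }
  }
  free(ref);
  return ok;
}
```

A caller would allocate n_blocks * n_each - (n_blocks - 1) * n_overlap zeroed ints, call update_selective, and then check the result with matches_serial. The interior loop carries no directive, so the compiler remains free to vectorize it.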

When I compile this using:

gcc -O2 atomic_vs_critical.c -DOMP_CRITICAL -DNDEBUG -fopenmp -o critical
gcc -O2 atomic_vs_critical.c -DOMP_ATOMIC -DNDEBUG -fopenmp -o atomic

and run time ./critical, I get:

real 0m0.110s
user 0m0.086s
sys  0m0.058s

and with time ./atomic, I get:

real 0m0.205s
user 0m0.742s
sys  0m0.032s

So the critical version takes about half the wall-clock time of the atomic version (and I get the same timings when I repeat the runs).

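There is another post claiming that critical is slower than atomic, but it uses a scalar, and when I run the code provided there, the atomic result is indeed slightly faster than the critical one.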

Answer:

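Your comparison is not fair: #pragma omp critical is placed before the for loop, so each thread takes the lock once for its whole loop and the compiler can still vectorize the loop body, whereas #pragma omp atomic update is inside the loop, which prevents vectorization. This difference in vectorization accounts for the surprising run times. For a fair comparison, place both directives inside the loop: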

for (i = 0; i < N_EACH; i++) {
#ifdef OMP_CRITICAL
#pragma omp critical
#endif /* OMP_CRITICAL */
#ifdef OMP_ATOMIC
#pragma omp atomic update
#endif /* OMP_ATOMIC */
  a[thread_idx * (N_EACH - N_OVERLAP) + i] += i;
}

Because of this vectorization issue, your real program will most likely run fastest if you use only a single thread.
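To try the fair comparison from a single binary, something like the following sketch could be used. It is an illustration, not the original poster's code: timed_update, now_seconds and enum sync_mode are hypothetical names, the work is split into blocks with parallel for instead of by omp_get_thread_num(), and without -fopenmp the pragmas are ignored and the timer falls back to clock(), which reports CPU time and not wall-clock time.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical selector for the directive under test. */
enum sync_mode { SYNC_CRITICAL, SYNC_ATOMIC };

/* Wall-clock seconds with OpenMP, CPU seconds in a serial build. */
static double now_seconds(void) {
#ifdef _OPENMP
  return omp_get_wtime();
#else
  return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Runs the overlapping update with the chosen directive placed INSIDE
   the loop in both cases, and returns the elapsed time in seconds. */
static double timed_update(int *a, int n_blocks, int n_each, int n_overlap,
                           enum sync_mode mode) {
  double start;
  int b;
  assert(n_blocks > 0 && n_overlap >= 0 && n_each > n_overlap);
  start = now_seconds();
#pragma omp parallel for schedule(static, 1)
  for (b = 0; b < n_blocks; b++) {
    int *const mine = a + (size_t)b * (size_t)(n_each - n_overlap);
    int i;
    if (mode == SYNC_CRITICAL) {
      for (i = 0; i < n_each; i++) {
        /* One lock acquisition per element. */
#pragma omp critical
        mine[i] += i;
      }
    } else {
      for (i = 0; i < n_each; i++) {
        /* One atomic read-modify-write per element. */
#pragma omp atomic update
        mine[i] += i;
      }
    }
  }
  return now_seconds() - start;
}
```

Calling timed_update once with SYNC_CRITICAL and once with SYNC_ATOMIC on two separately zeroed arrays of n_blocks * n_each - (n_blocks - 1) * n_overlap ints, and printing both return values, gives the two timings under identical conditions. Timing an unsynchronized single-threaded loop over the same data would provide the baseline the answer refers to.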
