Poor memcpy performance

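I am trying to optimize some code for speed, and it spends a lot of time doing memcpys. I decided to write a simple test program that measures memcpy on its own to see how fast my memory transfers are, and they seem pretty slow to me. I am wondering what might cause this. Here is my test code: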

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <stdlib.h>

#define MEMBYTES 1000000000

int main() {
  clock_t begin, end;
  double time_spent[2];
  int i;

  // Allocate memory                                                                                                                                    

  float *src = malloc(MEMBYTES);
  float *dst = malloc(MEMBYTES);


  // Fill the src array with some numbers                                                                                                               
  begin = clock();
  for(i=0;i<250000000;i++)
    src[i]=(float) i;
  end = clock();
  time_spent[0] = (double)(end - begin) / CLOCKS_PER_SEC;


  // Do the memcpy                                                                                                                                      
  begin = clock();
  memcpy(dst, src, MEMBYTES);
  end = clock();
  time_spent[1] = (double)(end - begin) / CLOCKS_PER_SEC;

  //Print results                                                                                                                                       
  printf("Time spent in fill: %1.10f\n", time_spent[0]);
  printf("Time spent in memcpy: %1.10f\n", time_spent[1]);
  printf("dst[200]: %f\n", dst[400]);
  printf("dst[200000000]: %f\n", dst[200000000]);

  //Free memory                                                                                                                                         
  free(src);
  free(dst);
}

/*

  gcc -O3 -o mct memcpy_test.c

*/

When I run it, I get the following output:

Time spent in fill: 0.4263950000
Time spent in memcpy: 0.6350150000
dst[200]: 400.000000
dst[200000000]: 200000000.000000

I think the theoretical memory bandwidth for modern machines is tens of GB/s or perhaps over 100 GB/s. I know in practice one cannot expect to hit the theoretical limits, and that for large memory transfers things can be slow, but I have seen people reporting measured speeds for large transfers of ~20GB/s (e.g. here). My results suggest I am getting 3.14GB/s (edit: I originally had 1.57, but stark pointed out in a comment that I need to count both read and write). I am wondering if anyone has ideas that might help or ideas of why the performance I am seeing is so low.

My machine has two CPUs with 12 physical cores each (Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz). There is 192GB of RAM (I believe it's 12x16GB DDR4-2666). The OS is Ubuntu 16.04.6 LTS.

My compiler is: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609

Update

Thanks to all the valuable feedback I am now using a threaded implementation and getting much better performance. Thank you!

I had tried threading before posting with poor results (I thought), but as pointed out below I should have ensured I was using wall time. Now my results with 24 threads are as follows:

Time spent in fill: 0.4229530000
Time spent in memcpy (clock): 1.2897100000
Time spent in memcpy (gettimeofday): 0.0589750000

I am also using asmlib's A_memcpy with a large SetMemcpyCacheLimit value.

Answer:

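Saturating the RAM is not as simple as it seems.

First of all, at first glance, here is the apparent throughput we can compute from the provided numbers: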

Fill: 1 / 0.4263950000 = 2.34 GB/s (1 GB is written);
Memcpy: 2 / 0.6350150000 = 3.15 GB/s (1 GB is read and 1 GB is written).

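The problem is that the pages allocated by malloc are not mapped in physical memory on Linux systems. Indeed, malloc only reserves some space in virtual memory; the pages are mapped in physical memory when the first touch is performed, which causes expensive page faults. In the test program, the fill loop pays this cost for src and the memcpy pays it for dst, so both timings include page faults. AFAIK, the only way to speed up this process is to use multiple cores or to prefill the buffers and reuse them later.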
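To keep first-touch page faults out of the timed region, the buffers can be written once before the measurement starts. The following is only a minimal sketch: alloc_touched is a hypothetical helper name, not something from the question, and the non-zero fill value is an assumption made so that a compiler is less likely to fold the malloc plus memset pair into a calloc-style allocation that leaves pages untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocate `bytes` and write every byte once, so the
   kernel maps all pages here rather than inside a timed region. */
static void *alloc_touched(size_t bytes) {
    assert(bytes > 0);
    void *buf = malloc(bytes);
    if (buf != NULL) {
        /* First touch: the page faults are paid here.
           Assumption: a non-zero value avoids a malloc+memset(0) -> calloc
           transformation that would skip touching the pages. */
        memset(buf, 1, bytes);
    }
    return buf;
}
```

In the test program, calling something like alloc_touched(MEMBYTES) for dst before the timed memcpy should make the memcpy number reflect the copy itself rather than page mapping.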

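Additionally, due to architectural limitations (i.e. latency), one core of a Xeon processor cannot saturate the RAM. Again, the only way to fix that is to use multiple cores.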

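If you try to use multiple cores, the results provided by the benchmark will be surprising, since clock does not measure wall-clock time but CPU time (which is the sum of the time spent in all threads). You need to use another function. In C, you can use gettimeofday (which is not perfect as it is not monotonic, but certainly good enough for your benchmark) (related post: How can I measure CPU time and wall clock time on both Linux/Windows?). In C++, you should use std::steady_clock (which is monotonic, as opposed to std::system_clock).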
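The difference between the two kinds of time can be sketched with two small helpers. Note that this sketch swaps in POSIX clock_gettime with CLOCK_MONOTONIC rather than the gettimeofday call mentioned above, because it is monotonic; the helper names cpu_seconds and wall_seconds are hypothetical.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* CPU time consumed by the process so far. With N busy threads this
   advances roughly N times faster than real time. */
static double cpu_seconds(void) {
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Elapsed wall-clock time read from a monotonic clock (POSIX). */
static double wall_seconds(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc; /* silence unused warning when NDEBUG is defined */
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}
```

Taking the difference of two wall_seconds() readings around the memcpy gives the elapsed time, which is the quantity needed for a bandwidth figure. On older glibc versions, linking with -lrt may be required for clock_gettime.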

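Moreover, the write-allocate cache policy on x86-64 platforms forces cache lines to be read when they are written. This means that to write 1 GB, you actually need to read 1 GB! That being said, x86-64 processors provide non-temporal store instructions that do not cause this issue (assuming your array is correctly aligned and big enough). Compilers can use them, but GCC and Clang generally do not. memcpy is already optimized to use non-temporal stores on most machines. For more information, please read How do non temporal instructions work?.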
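As an illustration of non-temporal stores, here is a hedged sketch of the fill loop written with the SSE intrinsic _mm_stream_ps. The function name fill_stream is hypothetical, the code assumes a 16-byte aligned destination and a length that is a multiple of 4 floats, and it falls back to a plain loop when SSE is not available. Whether it is actually faster depends on the machine and buffer size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: dst[i] = (float)i for i in [0, n).
   Assumes dst is 16-byte aligned and n is a multiple of 4. */
static void fill_stream(float *dst, size_t n) {
    assert(((uintptr_t)dst % 16) == 0);
    assert(n % 4 == 0);
#if defined(__SSE__)
    for (size_t i = 0; i < n; i += 4) {
        /* _mm_set_ps takes the highest lane first, so lane 0 holds i. */
        __m128 v = _mm_set_ps((float)(i + 3), (float)(i + 2),
                              (float)(i + 1), (float)i);
        _mm_stream_ps(dst + i, v); /* store that bypasses the cache */
    }
    _mm_sfence(); /* order the streaming stores before later accesses */
#else
    for (size_t i = 0; i < n; i++)
        dst[i] = (float)i;
#endif
}
```

A buffer from malloc is not guaranteed to meet the alignment this sketch assumes for all sizes and platforms; aligned_alloc(16, bytes) from C11 is one way to obtain a suitable buffer.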

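Finally, you can easily parallelize the benchmark using OpenMP with a simple #pragma omp parallel for directive on the loops. Note that it also provides a user-friendly function to compute the wall-clock time correctly: omp_get_wtime. For memcpy, the best is certainly to write a loop performing memcpy by (relatively big) chunks in parallel.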
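The chunked parallel copy described above might look like the following sketch. The name parallel_memcpy and the chunk parameter are assumptions made for illustration; a chunk of a few megabytes is a plausible starting point but should be tuned. The code needs -fopenmp to run in parallel, and without that flag the pragma is ignored and the loop runs serially.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy `bytes` from src to dst in pieces of `chunk`
   bytes, letting OpenMP distribute the pieces across threads. */
static void parallel_memcpy(void *dst, const void *src,
                            size_t bytes, size_t chunk) {
    assert(chunk > 0);
    if (bytes == 0)
        return;
    long long nchunks = (long long)((bytes + chunk - 1) / chunk);

    #pragma omp parallel for schedule(static)
    for (long long c = 0; c < nchunks; c++) {
        size_t off = (size_t)c * chunk;
        size_t left = bytes - off;
        size_t len = left < chunk ? left : chunk; /* last chunk may be short */
        memcpy((char *)dst + off, (const char *)src + off, len);
    }
}
```

For the test program this would be called as, for example, parallel_memcpy(dst, src, MEMBYTES, (size_t)1 << 22), with the elapsed time measured by a wall-clock function such as omp_get_wtime rather than clock.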

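For more information on this topic, I advise you to read the famous document: What Every Programmer Should Know About Memory. Since the document is a bit old, you can check the updated information about this here. The document also describes other important things that help explain why you may still not succeed in saturating the RAM with the above information. One critical topic is NUMA.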

