使用NumPy對uint16與uint64陣列求和時沒有加速？-有解無憂

我必須對相對較小的整數進行大量操作（加法），并且我開始考慮哪種資料型別可以在 64 位機器上提供最佳性能。

我確信將 4 加起來與 1 相加uint16所需的時間相同uint64，因為 ALU 可以uint16僅使用 1 個uint64加法器進行 4 次加法。（進位傳播意味著對于單個 64 位加法器來說這并不容易，但這就是整數 SIMD 指令的作業方式。）

顯然情況并非如此：

In [3]: data = np.random.rand(10000)

In [4]: int16 = data.astype(np.uint16)

In [5]: int64 = data.astype(np.uint64)

In [6]: int32 = data.astype(np.uint32)

In [7]: float32 = data.astype(np.float32)

In [8]: float64 = data.astype(np.float64)

In [9]: %timeit int16.sum()
13.4 μs ± 43.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [10]: %timeit int32.sum()
13.9 μs ± 347 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [11]: %timeit int64.sum()
9.33 μs ± 47.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [12]: %timeit float32.sum()
5.79 μs ± 6.51 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [13]: %timeit float64.sum()
6 μs ± 3.54 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

所有 int 操作都需要相同的時間（比浮點操作更多），但沒有加速。這是因為 numpy 的 C 實作沒有被完全優化還是有一些我不知道的硬體限制？

uj5u.com熱心網友回復：

TL;DR：我對 Numpy 1.21.1 進行了實驗分析。實驗結果表明，np.sum沒有（真的）使用 SIMD 指令：沒有 SIMD 指令用于整數，而標量 SIMD 指令用于浮點數！此外，默認情況下，Numpy會將整數轉換為較小整數型別的64 位值，以避免溢位！

請注意，這可能無法反映所有 Numpy 版本，因為正在進行為常用函式提供 SIMD 支持的作業（尚未發布的Numpy 1.22.0rc1版本繼續這項長期作業）。此外，所使用的編譯器或處理器可能會顯著影響結果。以下實驗是使用從帶有 i5-9600KF 處理器的 Debian Linux 上的 pip 檢索到的 Numpy 完成的。

在引擎蓋下 `np.sum`

對于浮點數，Numpy 使用成對演算法，眾所周知，該演算法在數值上非常穩定，但速度相對較快。這可以在代碼中看到，但也可以簡單地使用分析器：TYPE_pairwise_sum是呼叫 C 函式以在運行時計算總和（其中TYPE是DOUBLE或FLOAT）。

對于整數，Numpy 使用經典的樸素歸約。呼叫的 C 函式ULONG_add_avx2在 AVX2 兼容的機器上。如果型別不是np.int64.

下面是DOUBLE_pairwise_sum函式執行的匯編代碼的熱點部分

  3,65 │ a0:┌─→add        $0x8,%rcx                  ; Loop iterator
  3,60 │    │  prefetcht0 (%r8,%rax,1)               ; Prefetch data
  9,46 │    │  vaddsd     (%rax,%rbp,1),%xmm1,%xmm1  ; Accumulate an item in xmm1[0]
  4,65 │    │  vaddsd     (%rax),%xmm0,%xmm0         ; Same for xmm0[0]
  6,91 │    │  vaddsd     (%rax,%rdx,1),%xmm4,%xmm4  ; Etc.
  7,77 │    │  vaddsd     (%rax,%rdx,2),%xmm7,%xmm7
  7,41 │    │  vaddsd     (%rax,%r10,1),%xmm2,%xmm2
  7,27 │    │  vaddsd     (%rax,%rdx,4),%xmm6,%xmm6
  6,80 │    │  vaddsd     (%rax,%r11,1),%xmm3,%xmm3
  7,46 │    │  vaddsd     (%rax,%rbx,1),%xmm5,%xmm5
  3,46 │    │  add        %r12,%rax                  ; Switch to the next items (x8)
  0,13 │    ├──cmp        %rcx,%r9                   ; Should the loop continue?
  3,27 │    └──jg         a0                         ; Jump to the beginning if so

Numpy 編譯后的代碼清楚地使用了標量 SIMD 指令 vaddsd（僅計算一個雙精度項），盡管它成功地展開了8 次回圈。為FLOAT_pairwise_sum:生成相同的代碼被vaddss呼叫 8 次。

對于np.uint32，這里是生成的匯編代碼的熱點部分：

  2,37 │160:┌─→add          $0x1,%rax     ; Loop iterator
 95,95 │    │  add          (%rdi),%rdx   ; Accumulate the values in %rdx
  0,06 │    │  add          %r10,%rdi     ; Switch to the next item
       │    ├──cmp          %rsi,%rax     ; Should the loop continue?
  1,08 │    └──jne          160           ; Jump to the beginning if so

Numpy 顯然不使用 SIMD 指令進行np.uint32型別。它甚至沒有展開回圈。add (%rdi),%rdx由于資料依賴于累加器，執行累加的指令是這里的瓶頸。對于 np.uint64 (despite the name of the function isULONG_add_avx2`可以看到相同的回圈。

但是，該np.uint32版本呼叫 C 函式_aligned_contig_cast_uint_to_ulong以將整數項轉換為更寬的型別。Numpy 這樣做是為了避免整數溢位。對于型別np.uint8和np.uint16（盡管函式的名稱不同），可以看到相同的事情。希望此函式使用 SIMD 指令 (SSE)，但仍會占用很大一部分執行時間（約 30% 的np.sum時間）。

編輯：正如@user2357112supportsMonica 所指出的dtype，np.sum可以明確指定引數。當它與dtype輸入陣列的匹配時，則不執行轉換。這會以可能的溢位為代價縮短執行時間。

基準測驗結果

這是我機器上的結果：

uint16: 7.17 μs ± 80 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)
uint32: 7.11 μs ± 12.3 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)
uint64: 5.05 μs ± 8.57 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)
float32: 2.88 μs ± 9.27 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)
float64: 3.06 μs ± 10.6 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)

首先，并不是說結果與問題中提供的結果非常相似，這意味著在我的機器上看到的行為可以在其他機器上成功重現。因此，解釋也應該是一致的。

As you can see, the 64-bit version is faster than the other integer-based versions. This is due to the overhead of the conversion. The two first are equally fast because of the scalar loop and the add instruction being equally-fast for 8-bit, 16-bit and 32-bit integers (this should be true for most 64-bit mainstream platforms). The integer implementations are slower than the floating-point ones because of the lack of (proper) loop unrolling.

The floating-point implementations are equally-fast due to the scalar SIMD instructions. Indeed, the instructions vaddss (for np.float32) and vaddsd (for np.float64) have the same latency and throughput (at least on all modern Intel processors). Thus the throughput of the two implementation is the same since the loop of the two implementation is similar (same unrolling).

Additional notes

This is sad np.sum does not fully make use of SIMD instructions as this would speed up a lot the computations using it (especially small integers).

[UPDATE] Looking at the Numpy code, I discovered that the code is not vectorized because the array stride is a runtime value and the compiler do not generate a specialized version where the stride is 1. In fact, this can be partially seen in the previous assembly code: the compiler used the instruction add %r10, %rdi because %r10 (the stride of the target array) is not known at compile-time. There is currently no optimization for this specific case of reduction in the Numpy code yet (the functions are relatively generic). This may change in a near future.

除了步幅問題之外，還有一個很大的問題使編譯器很難自動對代碼進行矢量化：浮點加法不是可交換的，也不是關聯的（除非-ffast-math使用了類似的標志）。

轉載請註明出處，本文鏈接：https://www.uj5u.com/gongcheng/369825.html

標籤：Python 麻木的表现编译器优化 simd

上一篇：當我在Numpy中使用切片運算子時，什么時候會創建一個新陣列？

下一篇：為什么使用np.ix_出現交叉索引錯誤

使用NumPy對uint16與uint64陣列求和時沒有加速？

在引擎蓋下 np.sum

基準測驗結果

Additional notes

在引擎蓋下 `np.sum`