Understanding Numba performance differences

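I am trying to understand the performance differences I see between several numba implementations of the same algorithm. In particular, I expected func1d below to be the fastest implementation, since it is the only one that does not copy the data, but according to my timings func1b is the fastest.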

import numpy
import numba


def func1a(data, a, b, c):
    # pure numpy
    return a * (1 + numpy.tanh((data / b) - c))


@numba.njit(fastmath=True)
def func1b(data, a, b, c):
    new_data = a * (1 + numpy.tanh((data / b) - c))
    return new_data


@numba.njit(fastmath=True)
def func1c(data, a, b, c):
    new_data = numpy.empty(data.shape)
    for i in range(new_data.shape[0]):
        for j in range(new_data.shape[1]):
            new_data[i, j] = a * (1 + numpy.tanh((data[i, j] / b) - c))
    return new_data


@numba.njit(fastmath=True)
def func1d(data, a, b, c):
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            data[i, j] = a * (1 + numpy.tanh((data[i, j] / b) - c))
    return data

Helper functions for testing whether memory is copied

def get_data_base(arr):
    """For a given NumPy array, find the base array
    that owns the actual data.
    
    https://ipython-books.github.io/45-understanding-the-internals-of-numpy-to-avoid-unnecessary-array-copying/
    """
    base = arr
    while isinstance(base.base, numpy.ndarray):
        base = base.base
    return base


def arrays_share_data(x, y):
    return get_data_base(x) is get_data_base(y)


def test_share(func):
    data = numpy.random.randn(100, 3)
    print(arrays_share_data(data, func(data, 0.5, 2.5, 2.5)))
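As a cross-check on the helper above, NumPy also provides numpy.shares_memory, which tests directly whether two arrays overlap in memory. A minimal sketch follows; `copying` and `in_place` are hypothetical stand-ins written in plain NumPy, not the numba functions from the question.

```python
import numpy


def copying(data, a, b, c):
    # hypothetical stand-in: returns a freshly allocated array, like func1a
    return a * (1 + numpy.tanh((data / b) - c))


def in_place(data, a, b, c):
    # hypothetical stand-in: writes the result back into the input, like func1d
    data[...] = a * (1 + numpy.tanh((data / b) - c))
    return data


data = numpy.random.randn(100, 3)
copy_shares = numpy.shares_memory(data, copying(data, 0.5, 2.5, 2.5))
in_place_shares = numpy.shares_memory(data, in_place(data, 0.5, 2.5, 2.5))
print(copy_shares, in_place_shares)
```

Note that `in_place` still builds temporaries while evaluating the right-hand side; it only shares memory because the final result is written back into `data`.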

Timings

# force compiling
data = numpy.random.randn(10_000, 300)
_ = func1a(data, 0.5, 2.5, 2.5)
_ = func1b(data, 0.5, 2.5, 2.5)
_ = func1c(data, 0.5, 2.5, 2.5)
_ = func1d(data, 0.5, 2.5, 2.5)

data = numpy.random.randn(10_000, 300)
%timeit func1a(data, 0.5, 2.5, 2.5)
%timeit func1b(data, 0.5, 2.5, 2.5)
%timeit func1c(data, 0.5, 2.5, 2.5)
%timeit func1d(data, 0.5, 2.5, 2.5)

67.2 ms ± 230 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
13 ms ± 10.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
69.8 ms ± 60.4 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
69.8 ms ± 105 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Testing which implementations copy memory

test_share(func1a)
test_share(func1b)
test_share(func1c)
test_share(func1d)

False
False
False
True

Answer:

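The performance difference is NOT in the evaluation of the tanh function

I have to disagree with @ead. Let us assume for a moment that

the main performance difference is in the evaluation of the tanh function

Then one would expect that running tanh alone, once with numpy and once with numba using fastmath, would show that speed difference.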

import numpy as np
import numba as nb

def func_a(data):
    return np.tanh(data)

@nb.njit(fastmath=True)
def func_b(data):
    new_data = np.tanh(data)
    return new_data

data = np.random.randn(10_000, 300)
%timeit func_a(data)
%timeit func_b(data)

However, on my machine the code above shows almost no difference in performance.

15.7 ms ± 129 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.8 ms ± 82 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

A short detour on `NumExpr`

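I tried a NumExpr version of your code. But before being amazed that it runs almost 7 times faster, keep in mind that it uses all 10 cores available on my machine. After allowing numba to run in parallel as well and optimizing it a little, the performance advantage is small but still there: 2.56 ms vs. 3.87 ms. See the code below.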

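import numexpr as ne

# a, b, c are read as module-level constants here;
# the values are the ones passed in the question
a, b, c = 0.5, 2.5, 2.5

@nb.njit(fastmath=True)
def func_a(data):
    new_data = a * (1 + np.tanh((data / b) - c))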
    return new_data

@nb.njit(fastmath=True, parallel=True)
def func_b(data):
    new_data = a * (1 + np.tanh((data / b) - c))
    return new_data

@nb.njit(fastmath=True, parallel=True)
def func_c(data):
    for i in nb.prange(data.shape[0]):
        for j in range(data.shape[1]):
            data[i, j] = a * (1 + np.tanh((data[i, j] / b) - c))
    return data

def func_d(data):
    return ne.evaluate('a * (1 + tanh((data / b) - c))')

data = np.random.randn(10_000, 300)
%timeit func_a(data)
%timeit func_b(data)
%timeit func_c(data)
%timeit func_d(data)

17.4 ms ± 146 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.31 ms ± 193 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.87 ms ± 152 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.56 ms ± 104 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The actual explanation

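The ~34% of time that NumExpr saves compared to numba is nice, but what is even nicer is that its authors have a concise explanation of why it is faster than numpy. I am pretty sure this applies to numba too.

From the NumExpr GitHub page:

The main reason why NumExpr achieves better performance than NumPy is that it avoids allocating memory for intermediate results. This results in better cache utilization and reduces memory access in general.

Hence

a * (1 + numpy.tanh((data / b) - c))

is slower because it performs many steps, each of which produces an intermediate result.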
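To make the intermediate results visible, the expression can be written out one step at a time: every line of `with_temporaries` below allocates a new array the size of `data`. The second function is a sketch of one way to avoid those allocations in plain NumPy, by reusing a single output buffer through the ufuncs' `out=` argument. Both function names are made up for illustration, and this is not a description of what numba or NumExpr do internally.

```python
import numpy


def with_temporaries(data, a, b, c):
    # each step allocates a full-size intermediate array
    t1 = data / b
    t2 = t1 - c
    t3 = numpy.tanh(t2)
    t4 = 1 + t3
    return a * t4


def single_buffer(data, a, b, c):
    # one allocation; every later step overwrites the same buffer
    out = numpy.empty_like(data)
    numpy.divide(data, b, out=out)
    numpy.subtract(out, c, out=out)
    numpy.tanh(out, out=out)
    numpy.add(out, 1, out=out)
    numpy.multiply(out, a, out=out)
    return out


data = numpy.random.randn(1_000, 300)
expected = with_temporaries(data, 0.5, 2.5, 2.5)
result = single_buffer(data, 0.5, 2.5, 2.5)
print(numpy.allclose(expected, result))
```

The single-buffer version still makes five passes over memory, so it only removes the allocations, not the repeated traversals.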

Answer:

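Here, copying the data does not play a big role: the bottleneck is how fast the tanh function is evaluated. There are many algorithms for it: some are faster, some slower, some more precise, some less.

Different numpy distributions use different implementations of the tanh function, e.g. it could be the one from mkl/vml or the one from the gnu-math-library.

Depending on the numba version, either the mkl/svml implementation or the gnu-math-library is used as well.

The easiest way to look inside is to use a profiler, for example perf.

For the numpy version on my machine I get: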

>>> perf record python run.py
>>> perf report
Overhead  Command  Shared Object                                      Symbol                                  
  46,73%  python   libm-2.23.so                                       [.] __expm1
  24,24%  python   libm-2.23.so                                       [.] __tanh
   4,89%  python   _multiarray_umath.cpython-37m-x86_64-linux-gnu.so  [.] sse2_binary_scalar2_divide_DOUBLE
   3,59%  python   [unknown]                                          [k] 0xffffffff8140290c

As one can see, numpy uses the slow gnu-math-library (libm) functionality.

For the numba function I get:

  53,98%  python   libsvml.so                                         [.] __svml_tanh4_e9
   3,60%  python   [unknown]                                          [k] 0xffffffff81831c57
   2,79%  python   python3.7                                          [.] _PyEval_EvalFrameDefault

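This means that the fast mkl/svml functionality is used.

That is (almost) all there is to it.

As @user2640045 rightly pointed out, numpy's performance is additionally hurt by extra cache misses caused by the creation of temporary arrays.

However, cache misses do not play as important a role as the computation of tanh itself: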

%timeit func1a(data, 0.5, 2.5, 2.5)  # 91.5 ms ± 2.88 ms per loop 
%timeit numpy.tanh(data)             # 76.1 ms ± 539 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

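That is, the creation of temporary objects (together with the remaining arithmetic) accounts for about 20% of the running time.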
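The figure of about 20% can be reproduced from the two timings above. The short calculation below is only a sketch of that arithmetic; note that the difference also includes the divide, subtract, add and multiply steps, not only the temporaries.

```python
t_full = 91.5  # ms, func1a (whole expression, pure numpy)
t_tanh = 76.1  # ms, numpy.tanh alone

overhead = t_full - t_tanh
print(f"extra time beyond tanh: {overhead:.1f} ms")
print(f"relative to tanh alone: {overhead / t_tanh:.0%}")
print(f"share of total func1a time: {overhead / t_full:.0%}")
```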

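FWIW, my numba version (0.50.1) is able to vectorize and call the mkl/svml functionality for the versions with hand-written loops as well. If that does not happen with another version, numba falls back to the gnu-math-library functionality, which seems to be what is happening on your machine.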
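One way to check, without perf, which tanh implementation a compiled numba function ends up calling is to search its generated assembly for SVML symbols such as the `__svml_tanh4_e9` seen in the profile above. A minimal sketch, assuming the symbol names keep the `__svml_` prefix; `mentions_svml` and `tanh_loop` are hypothetical names, and the numba part is skipped when numba is not installed.

```python
def mentions_svml(asm_text):
    """Return True if the assembly text references an SVML symbol."""
    return "__svml_" in asm_text


try:
    import numba
    import numpy
except ImportError:
    numba = None

if numba is not None:
    @numba.njit(fastmath=True)
    def tanh_loop(data):
        # hand-written loop, comparable to the inner loop of func1c
        out = numpy.empty_like(data)
        for i in range(data.shape[0]):
            out[i] = numpy.tanh(data[i])
        return out

    tanh_loop(numpy.random.randn(1_000))  # trigger compilation

    # inspect_asm() returns a dict mapping each compiled signature to its assembly
    for signature, asm in tanh_loop.inspect_asm().items():
        print(signature, "uses SVML:", mentions_svml(asm))
```

If this prints False, the loop was most likely compiled against the regular math library instead, which would match the slower timings in the question.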

Listing of run.py:

import numpy

# TODO: define func1b for checking numba
def func1a(data, a, b, c):
    # pure numpy
    return a * (1 + numpy.tanh((data / b) - c))


data = numpy.random.randn(10_000, 300)

for _ in range(100):
    func1a(data, 0.5, 2.5, 2.5)
