Why is float64 tf.matmul in TensorFlow 2 noticeably slower on CPU than NumPy matmul, even in graph mode?

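I am comparing the single-threaded performance of matrix-matrix products in TensorFlow 2 and NumPy, separately for single precision (float32) and double precision (float64). NumPy performs almost on par with the Intel MKL C implementation (my baseline for matrix multiplication) in both single and double precision (SGEMM and DGEMM). In TensorFlow, however, only single precision (float32) matches MKL; double precision (float64) is noticeably slower. Why is TensorFlow slower with double-precision data?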

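Sample scripts:

The following example reproduces my observation. Consider the matrix multiplication:

C = AB, where A and B are each of size 3000x3000

The TensorFlow 2 and NumPy code is given below:

TensorFlow 2 code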

import tensorflow as tf


#Check if MKL is enabled (private API; its location may differ between TensorFlow versions)
from tensorflow.python.framework import test_util
print("MKL Enabled : ", test_util.IsMklEnabled())


#Set threads
tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(1)

#Problem size
N = 3000
REPS = 20
DTYPE = tf.float64
#DTYPE = tf.float32


@tf.function
def gemm_implicit_noup(A, B):
    #C = A @ B
    start = tf.timestamp()
    with tf.control_dependencies([start]):
        C = tf.matmul(A,B)
    with tf.control_dependencies([C]):
        end = tf.timestamp()
    tf.print(end-start)
    return C

tf.config.run_functions_eagerly(False)

A = tf.random.normal([N, N], dtype=DTYPE)
B = tf.random.normal([N, N], dtype=DTYPE)


#Building Trace
C = gemm_implicit_noup(A,B)

for i in range(REPS):
   C = gemm_implicit_noup(A,B)

NumPy code

import os
os.environ["OMP_NUM_THREADS"] = "1"
import numpy as np
import time

N = 3000
REPS = 20
DTYPE = np.float64
#DTYPE = np.float32

def gemm_implicit_noup(A, B):
    #C = A @ B
    C = np.matmul(A,B)
    return C



A = np.random.randn(N,N).astype(DTYPE)
B = np.random.randn(N,N).astype(DTYPE)

for i in range(REPS):
   start = time.perf_counter()
   C = gemm_implicit_noup(A,B)
   end = time.perf_counter()
   print(end-start)
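
Individual timings from either script can fluctuate from run to run, so it may help to discard a few warm-up iterations and compare a robust summary such as the minimum or median. A minimal sketch using only the standard library; `time_calls` and `summarize` are hypothetical helper names, not part of the original scripts:

```python
import statistics
import time


def time_calls(fn, reps=20, warmup=2):
    """Call fn() warmup + reps times and return the timings of the last reps calls."""
    for _ in range(warmup):
        fn()  # warm-up runs (caches, lazy initialisation) are not recorded
    timings = []
    for _ in range(reps):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return (minimum, median) of a list of timings in seconds."""
    return min(timings), statistics.median(timings)
```

For the NumPy script this could be used as `summarize(time_calls(lambda: np.matmul(A, B)))`.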

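System and installation settings:

Performance was compared on an Intel Xeon Skylake 2.1 GHz with CentOS 7 and on a MacBook Pro 2018 with Big Sur, using TensorFlow 2.7 and 2.8 built with Intel MKL. Python 3.9.7 and 3.7.4 were checked. I compare single-threaded performance so that the results can be reproduced reliably. I observe similar performance numbers in all settings:

Single-precision performance is as expected:

Intel MKL C SGEMM ~ 0.5s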
NumPy float32 ~ 0.5s
TensorFlow float32 ~ 0.5s

But double-precision performance:

Intel MKL C DGEMM ~ 0.9s
NumPy float64 ~ 1s
TensorFlow float64 > 2.5s (much slower!)
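
To put these timings in perspective, a dense N x N matrix product takes roughly 2N^3 floating-point operations, so each time converts to an approximate throughput. A small sketch; the helper name `gemm_gflops` is made up for illustration, and the 2N^3 count is the usual textbook approximation:

```python
def gemm_gflops(n, seconds):
    """Approximate GFLOP/s for an n x n by n x n matrix product taking `seconds`."""
    flops = 2.0 * n ** 3  # n^3 multiplications plus about n^3 additions
    return flops / seconds / 1e9


# Double-precision timings reported above for N = 3000
for name, seconds in [("MKL DGEMM", 0.9), ("NumPy float64", 1.0), ("TensorFlow float64", 2.5)]:
    print(f"{name}: about {gemm_gflops(3000, seconds):.0f} GFLOP/s")
```

Since the TensorFlow float64 time is reported as more than 2.5s, its figure here is an upper bound on the throughput actually reached.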

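Answer:

Assuming you are using a processor that supports the Intel® AVX-512 instruction set, try installing the Intel® Optimization for TensorFlow wheel built specifically for AVX-512 via pip. These packages are available as *.whl files on the Intel® website for specific Python versions, or can be installed with the following command for Python 3.7, 3.8, and 3.9 (Linux only).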

pip install intel-tensorflow-avx512==2.7.0
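
Before installing the AVX-512 wheel, it may be worth confirming that the CPU actually reports AVX-512 support. The sketch below assumes a Linux x86 system where `/proc/cpuinfo` lists CPU features on a `flags` line and uses `avx512f` (the AVX-512 Foundation flag) as the indicator; the helper names are hypothetical:

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no "flags" line found (e.g. non-x86 or non-Linux input)


def has_avx512f(cpuinfo_text):
    """True if the AVX-512 Foundation flag is listed."""
    return "avx512f" in parse_cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # assumption: only present on Linux
    if cpuinfo.exists():
        print("AVX-512F reported:", has_avx512f(cpuinfo.read_text()))
    else:
        print("/proc/cpuinfo not found; check CPU features another way")
```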

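This is documented on the official Intel® website, in the sections linked below:

Intel® Optimization for TensorFlow: Installation Guide

Intel® Optimization for TensorFlow: Install the Intel® Optimization for TensorFlow Wheel via PIP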

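AVX-512 is a single instruction, multiple data (SIMD) instruction set with 512-bit registers, so a single instruction operates on eight float64 (or sixteen float32) values at once. To take full advantage of Intel® architecture and get the best performance, the TensorFlow framework has been optimized with oneAPI Deep Neural Network Library (oneDNN) primitives, a popular performance library for deep learning applications. As an additional optimization step, you can also try setting the environment variable TF_ENABLE_ONEDNN_OPTS to 1 in a Linux terminal before running the TensorFlow code: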

export TF_ENABLE_ONEDNN_OPTS=1
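
If exporting the variable in the shell is inconvenient, the same setting can be made from Python. The sketch below assumes the variable needs to be in place before `import tensorflow` runs, hence the check on `sys.modules`; the function name `set_onednn_opts` is hypothetical:

```python
import os
import sys


def set_onednn_opts(enabled=True):
    """Set TF_ENABLE_ONEDNN_OPTS and report whether it was set early enough.

    Returns False if TensorFlow has already been imported in this process,
    in which case the new value may not take effect (assumption).
    """
    os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1" if enabled else "0"
    return "tensorflow" not in sys.modules


if not set_onednn_opts(True):
    print("Warning: TensorFlow was imported before TF_ENABLE_ONEDNN_OPTS was set")

# import tensorflow as tf  # import TensorFlow only after the variable is set
```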

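Below is the single-threaded performance obtained for the double-precision matrix-matrix product using the code you provided. The test was run on an Intel® Xeon® Platinum 8260M CPU @ 2.40GHz with Python 3.8 and TensorFlow 2.7 optimized with Intel® MKL and AVX-512.

NumPy float64 ~ 1.44s
TensorFlow float64 (MKL enabled) ~ 2.77s
TensorFlow float64 (MKL enabled, AVX-512 optimized, oneDNN optimizations enabled) ~ 1.19s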

