Understanding Java 17 Vector API slowness and performance with the pow operator

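I have a question about the pow() function in the new Vector API in Java 17. I am trying to implement the Black-Scholes formula in a vectorized manner, but I am having difficulty getting the same performance as the scalar implementation.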

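The code works as follows:

I create an array of doubles (currently all 5.0).
I loop over the elements of that array (the loop syntax differs between scalar and vector).
I create DoubleVectors from the double arrays and do the calculations (or just do the calculations directly for the scalar version). I am trying to compute e^(value), and I believe this is where the problem lies.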
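For reference, the e^(value) step can be written in scalar Java in two ways. The sketch below is not the code from the question: the class name, method names, and input values are hypothetical. It only illustrates that raising e to a power with the general `Math.pow` and calling the dedicated `Math.exp` should agree to within rounding.

```java
// Sketch only: class, method names and inputs are illustrative assumptions.
final class DiscountFactor {

    // e^(-rate * time) using the general power function with base e.
    static double viaPow(double rate, double time) {
        return Math.pow(Math.E, -rate * time);
    }

    // The same quantity using the dedicated exponential function.
    static double viaExp(double rate, double time) {
        return Math.exp(-rate * time);
    }

    public static void main(String[] args) {
        double rate = 0.05; // hypothetical interest rate
        double time = 2.0;  // hypothetical time to maturity
        System.out.println("pow: " + viaPow(rate, time));
        System.out.println("exp: " + viaExp(rate, time));
    }
}
```

The two results can differ in the last few bits because `Math.E` is itself a rounded value, so a comparison should use a small tolerance rather than `==`.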

Here are some code snippets:

    public static double[] createArray(int arrayLength)
    {
        double[] array0 = new double[arrayLength];
        for(int i=0;i<arrayLength;i++)
        {
            array0[i] = 2.0;
        }
        return array0;
    }

    @Param({"256000"})
    int arraySize;
    public static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;
    DoubleVector vectorTwo =  DoubleVector.broadcast(SPECIES,2);
    DoubleVector vectorHundred =  DoubleVector.broadcast(SPECIES,100);

    double[] scalarTwo = new double[]{2,2,2,2};
    double[] scalarHundred  = new double[]{100,100,100,100};

    @Setup
    public void Setup()
    {
        javaSIMD = new JavaSIMD();
        javaScalar = new JavaScalar();
        spotPrices = createArray(arraySize);
        timeToMaturity = createArray(arraySize);
        strikePrice = createArray(arraySize);
        interestRate = createArray(arraySize);
        volatility = createArray(arraySize);
        e = new double[arraySize];
        for(int i=0;i<arraySize;i++)
        {
            e[i] = Math.exp(1);
        }
        upperBound = SPECIES.loopBound(spotPrices.length);
    }
    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    public void testVectorPerformance(Blackhole bh) {
        var upperBound = SPECIES.loopBound(spotPrices.length);
        for (var i=0;i<upperBound; i += SPECIES.length())
        {
            bh.consume(javaSIMD.calculateBlackScholesSingleCalc(spotPrices,timeToMaturity,strikePrice,
                    interestRate,volatility,e, i));
        }
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    public void testScalarPerformance(Blackhole bh) {
        for(int i=0;i<arraySize;i++)
        {
            bh.consume(javaScalar.calculateBlackScholesSingleCycle(spotPrices,timeToMaturity,strikePrice,
                    interestRate,volatility, i,normDist));
        }
    }

    public DoubleVector calculateBlackScholesSingleCalc(double[] spotPrices, double[] timeToMaturity, double[] strikePrice,
                                                        double[] interestRate, double[] volatility, double[] e,int i){
...(skip lines)
        DoubleVector vSpot = DoubleVector.fromArray(SPECIES, spotPrices, i);
...(skip lines)
        DoubleVector powerOperand = vRateScaled
                .mul(vTime)
                .neg();
        DoubleVector call  = (vSpot
                .mul(CDFVectorizedExcelOptimized(d1,vE)))
                .sub(vStrike
                .mul(vE
                        .pow(powerOperand))
                .mul(CDFVectorizedExcelOptimized(d2,vE)));
        return call;

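Here are some JMH benchmarks (2 forks, 2 warmups, 2 iterations) run on a Ryzen 5800X under WSL. Overall, the vector version seems to be about 2x slower than the scalar version. I also ran a simple before/after timing of the method separately, without JMH, and the result seems to be in line with this.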

Result "blackScholes.TestJavaPerf.testScalarPerformance":
  0.116 ±(99.9%) 0.002 ops/ms [Average]
       89873915287      cycles:u                  #    4.238 GHz                      (40.43%)
      242060738532      instructions:u            #    2.69  insn per cycle   

      
Result "blackScholes.TestJavaPerf.testVectorPerformance":
  0.071 ±(99.9%) 0.001 ops/ms [Average]
       90878787665      cycles:u                  #    4.072 GHz                      (39.25%)
      254117779312      instructions:u            #    2.80  insn per cycle
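
To put the figures above on a common scale, the arithmetic can be sketched as follows. This assumes one benchmark operation processes the whole 256000-element array, as the `@Param` and the loops in the snippets suggest. The class and method names are hypothetical, and the perf counters may cover more than the measured iterations, so the instruction delta is only indicative.

```java
// Sketch only: names are illustrative; inputs are the figures reported above.
final class BenchmarkRatio {

    // How many times faster the scalar benchmark is than the vector one.
    static double speedRatio(double scalarOpsPerMs, double vectorOpsPerMs) {
        return scalarOpsPerMs / vectorOpsPerMs;
    }

    // Array elements processed per millisecond, given ops/ms and array size.
    static double elementsPerMs(double opsPerMs, int arraySize) {
        return opsPerMs * arraySize;
    }

    // Difference between the two instruction counters.
    static long instructionDelta(long vectorInstructions, long scalarInstructions) {
        return vectorInstructions - scalarInstructions;
    }

    public static void main(String[] args) {
        int arraySize = 256_000;
        System.out.println("scalar/vector throughput ratio: " + speedRatio(0.116, 0.071));
        System.out.println("scalar elements per ms: " + elementsPerMs(0.116, arraySize));
        System.out.println("vector elements per ms: " + elementsPerMs(0.071, arraySize));
        System.out.println("extra instructions (vector): "
                + instructionDelta(254_117_779_312L, 242_060_738_532L));
    }
}
```

By these numbers the throughput ratio works out to roughly 1.6, and the vector run retires about 12 billion more instructions than the scalar run.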

I also enabled diagnostic options for the JVM, and I see the following:

"-XX:+UnlockDiagnosticVMOptions", "-XX:+PrintIntrinsics","-XX:+PrintAssembly"

  0x00007fe451959413:   call   0x00007fe451239f00           ; ImmutableOopMap {rsi=Oop }
                                                            ;*synchronization entry
                                                            ; - jdk.incubator.vector.DoubleVector::arrayAddress@-1 (line 3283)
                                                            ;   {runtime_call counter_overflow Runtime1 stub}
  0x00007fe451959418:   jmp    0x00007fe4519593ce
  0x00007fe45195941a:   movabs $0x7fe4519593ee,%r10         ;   {internal_word}
  0x00007fe451959424:   mov    %r10,0x358(%r15)
  0x00007fe45195942b:   jmp    0x00007fe451193100           ;   {runtime_call SafepointBlob}
  0x00007fe451959430:   nop
  0x00007fe451959431:   nop
  0x00007fe451959432:   mov    0x3d0(%r15),%rax
  0x00007fe451959439:   movq   $0x0,0x3d0(%r15)
  0x00007fe451959444:   movq   $0x0,0x3d8(%r15)
  0x00007fe45195944f:   add    $0x40,%rsp
  0x00007fe451959453:   pop    %rbp
  0x00007fe451959454:   jmp    0x00007fe451231e80           ;   {runtime_call unwind_exception Runtime1 stub}
  0x00007fe451959459:   hlt    
<More halts cut off>   
[Exception Handler]
  0x00007fe451959460:   call   0x00007fe451234580           ;   {no_reloc}
  0x00007fe451959465:   movabs $0x7fe46e76df9a,%rdi         ;   {external_word}
  0x00007fe45195946f:   and    $0xfffffffffffffff0,%rsp
  0x00007fe451959473:   call   0x00007fe46e283d40           ;   {runtime_call}
  0x00007fe451959478:   hlt    
[Deopt Handler Code]
  0x00007fe451959479:   movabs $0x7fe451959479,%r10         ;   {section_word}
  0x00007fe451959483:   push   %r10
  0x00007fe451959485:   jmp    0x00007fe4511923a0           ;   {runtime_call DeoptimizationBlob}
  0x00007fe45195948a:   hlt    
<More halts cut off>
--------------------------------------------------------------------------------

============================= C2-compiled nmethod ==============================
  ** svml call failed for double_pow_32
                                            @ 3   jdk.internal.misc.Unsafe::loadFence (0 bytes)   (intrinsic)
                                            @ 3   jdk.internal.misc.Unsafe::loadFence (0 bytes)   (intrinsic)
                                          @ 2   java.lang.Math::pow (6 bytes)   (intrinsic)

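Investigations/Questions:

I am writing a different implementation of the formula, so it is not 1:1. Could this be the cause? Going by the instruction counts reported with the JMH results, there is a difference of about 12 billion instructions. With vectorization, the processor also runs at a lower clock rate.
Is the choice of input numbers a problem? I have also tried i+10/(array.Length).
Is there a reason why I see the SVML call fail for double_pow_32? Incidentally, I did not see this issue for smaller input array sizes.
I changed the pow to a mul (for both versions; obviously the equation is now very different), and the result is much faster, with the scalar vs. vector comparison coming out as expected.

Note: I believe it is using 256-bit-wide vectors (checked while debugging).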

Answer from a uj5u.com user:

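This may be related to JDK-8262275, "Math vector stubs are not called for double64 vectors":

For Double64Vector, the svml math vector stubs intrinsification is failing and they are not being called from jitted code.
But we do have svml double64 vectors.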

You could try an alternative operation. For example, since vE is a vector of e, instead of vE.pow(powerOperand) you could compute e^x on all lanes with powerOperand.lanewise(VectorOperators.EXP).
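
A minimal sketch of that suggestion is shown below. It is not the asker's code: the class and method names and the choice to write results into an output array are assumptions made for illustration. It needs the incubator module enabled at compile time and run time (`--add-modules jdk.incubator.vector`). Whether it actually avoids the slow path depends on the JDK build and the hardware, so it should be benchmarked rather than assumed faster.

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Sketch only: class and method names are illustrative assumptions.
final class VectorExpSketch {
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    // Computes out[i] = e^(-rate[i] * time[i]) for every element.
    static double[] discountFactors(double[] rate, double[] time) {
        double[] out = new double[rate.length];
        int upperBound = SPECIES.loopBound(rate.length);
        int i = 0;
        // Main loop: one full vector of lanes per iteration.
        for (; i < upperBound; i += SPECIES.length()) {
            DoubleVector vRate = DoubleVector.fromArray(SPECIES, rate, i);
            DoubleVector vTime = DoubleVector.fromArray(SPECIES, time, i);
            DoubleVector exponent = vRate.mul(vTime).neg();
            // e^x on all lanes, instead of broadcasting e and calling pow.
            exponent.lanewise(VectorOperators.EXP).intoArray(out, i);
        }
        // Scalar tail for elements that do not fill a whole vector.
        for (; i < rate.length; i++) {
            out[i] = Math.exp(-rate[i] * time[i]);
        }
        return out;
    }
}
```

The scalar tail loop matters only when the array length is not a multiple of the lane count; with 256000 elements and 256-bit vectors (4 doubles per vector) it would not run.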

Keep in mind that this API is still in incubator status...

Please credit the source when reposting. Link to this article: https://www.uj5u.com/qiye/513597.html

Tags: java, performance, vectorization, simd, java-17
