JMH: How to correctly benchmark thread pools?

Please read the latest edit of this question.

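Problem: I need to write a correct benchmark that compares work executed through different thread pool implementations (including ones from external libraries) against the same work executed through other thread pool implementations, and against the same work done without any threading.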

For example, I have 24 tasks to complete, and 10000 random strings in the benchmark state:

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3)
@Measurement(iterations = 3)
@State(Scope.Benchmark)
public class ThreadPoolSamples {
    @Param({"24"})
    int amountOfTasks;
    private static final int tts = Runtime.getRuntime().availableProcessors() * 2;
    private String[] strs = new String[10000];

    @Setup
    public void setup() {
        for (int i = 0; i < strs.length; i++) {
            strs[i] = String.valueOf(Math.random());
        }
    }
}

And two states as inner classes, representing the work (string concatenation) and the ExecutorService setup and shutdown:

@State(Scope.Thread)
public static class Work {
    public String doWork(String[] strs) {
        StringBuilder conc = new StringBuilder();
        for (String str : strs) {
            conc.append(str);
        }
        return conc.toString();
    }
}

@State(Scope.Benchmark)
public static class ExecutorServiceState {
    ExecutorService service;

    @Setup(Level.Iteration)
    public void setupMethod() {
        service = Executors.newFixedThreadPool(tts);
    }

    @TearDown(Level.Iteration)
    public void downMethod() {
        service.shutdownNow();
        service = null;
    }
}

A stricter version of the question: how do I write a correct benchmark that measures the average time of doWork(), first without any threading, second using the .execute() method, and third using the .submit() method and collecting the results of the futures later? The implementation I tried to write:

@Benchmark
public void noThreading(Work w, Blackhole bh) {
    for (int i = 0; i < amountOfTasks; i++) {
        bh.consume(w.doWork(strs));
    }
}

@Benchmark
public void executorService(ExecutorServiceState e, Work w, Blackhole bh) {
    for (int i = 0; i < amountOfTasks; i++) {
         e.service.execute(() -> bh.consume(w.doWork(strs)));
    }
}

@Benchmark
public void noThreadingResult(Work w, Blackhole bh) {
    String[] strss = new String[amountOfTasks];
    for (int i = 0; i < amountOfTasks; i++) {
        strss[i] = w.doWork(strs);
    }
    bh.consume(strss);
}

@Benchmark
public void executorServiceResult(ExecutorServiceState e, Work w, Blackhole bh) throws ExecutionException, InterruptedException {
    Future[] strss = new Future[amountOfTasks];
    for (int i = 0; i < amountOfTasks; i++) {
        strss[i] = e.service.submit(() -> {return w.doWork(strs);});
    }
    for (Future future : strss) {
        bh.consume(future.get());
    }
}

After benchmarking this implementation on my PC (2 Cores, 4 threads) I got:

Benchmark                              (amountOfTasks)  Mode  Cnt         Score         Error  Units
ThreadPoolSamples.executorService                     24  avgt    3    255102,966 ± 4460279,056  ns/op
ThreadPoolSamples.executorServiceResult               24  avgt    3  19790020,180 ± 7676762,394  ns/op
ThreadPoolSamples.noThreading                         24  avgt    3  18881360,497 ±  340778,773  ns/op
ThreadPoolSamples.noThreadingResult                   24  avgt    3  19283976,445 ±  471788,642  ns/op

noThreading and executorService may be correct (but I am still unsure), while noThreadingResult and executorServiceResult don't look correct at all.

EDIT:

I found out some new details, but I think the result is still incorrect. As user17280749 pointed out in this answer, the thread pool wasn't waiting for the submitted tasks to complete. That wasn't the only issue, though: javac also somehow optimises the doWork() method in the Work class (probably the result of that operation was predictable by the JVM). So for simplicity I used Thread.sleep() as "work", and I also gave amountOfTasks two new params, "1" and "128", to demonstrate that with 1 task threading will be slower than noThreading, while with 24 and 128 tasks it will be approx. four times faster than noThreading. Also, for correctness of the measurement, I moved thread pool startup and shutdown into the benchmark:

package io.denery;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.util.concurrent.*;

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3)
@Measurement(iterations = 3)
@State(Scope.Benchmark)
public class ThreadPoolSamples {
    @Param({"1", "24", "128"})
    int amountOfTasks;
    private static final int tts = Runtime.getRuntime().availableProcessors() * 2;

    @State(Scope.Thread)
    public static class Work {
        public void doWork() {
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }

    @Benchmark
    public void noThreading(Work w) {
        for (int i = 0; i < amountOfTasks; i++) {
            w.doWork();
        }
    }

    @Benchmark
    public void fixedThreadPool(Work w)
            throws ExecutionException, InterruptedException {
        ExecutorService service = Executors.newFixedThreadPool(tts);
        Future[] futures = new Future[amountOfTasks];
        for (int i = 0; i < amountOfTasks; i++) {
            futures[i] = service.submit(w::doWork);
        }
        for (Future future : futures) {
            future.get();
        }

        service.shutdown();
    }

    @Benchmark
    public void cachedThreadPool(Work w)
            throws ExecutionException, InterruptedException {
        ExecutorService service = Executors.newCachedThreadPool();
        Future[] futures = new Future[amountOfTasks];
        for (int i = 0; i < amountOfTasks; i++) {
            futures[i] = service.submit(() -> {
                w.doWork();
            });
        }
        for (Future future : futures) {
            future.get();
        }

        service.shutdown();
    }
}

And the result of this benchmark is:

Benchmark                         (amountOfTasks)  Mode  Cnt          Score         Error  Units
ThreadPoolSamples.cachedThreadPool                1  avgt    3    1169075,866 ±   47607,783  ns/op
ThreadPoolSamples.cachedThreadPool               24  avgt    3    5208437,498 ± 4516260,543  ns/op
ThreadPoolSamples.cachedThreadPool              128  avgt    3   13112351,066 ± 1905089,389  ns/op
ThreadPoolSamples.fixedThreadPool                 1  avgt    3    1166087,665 ±   61193,085  ns/op
ThreadPoolSamples.fixedThreadPool                24  avgt    3    4721503,799 ±  313206,519  ns/op
ThreadPoolSamples.fixedThreadPool               128  avgt    3   18337097,997 ± 5781847,191  ns/op
ThreadPoolSamples.noThreading                     1  avgt    3    1066035,522 ±   83736,346  ns/op
ThreadPoolSamples.noThreading                    24  avgt    3   25525744,055 ±   45422,015  ns/op
ThreadPoolSamples.noThreading                   128  avgt    3  136126357,514 ±  200461,808  ns/op

We see that the error isn't really huge, and that the thread pools with 1 task are slower than noThreading. But if you compare 25525744,055 and 4721503,799, the speedup is 5.406, which is somehow faster than the expected ~4, and if you compare 136126357,514 and 18337097,997, the speedup is 7.4. This fake speedup grows with amountOfTasks, and I think it is still incorrect. I plan to look at this using PrintAssembly to find out whether there are any JVM optimisations.

EDIT:

As user17294549 mentioned in this answer, I used Thread.sleep() as an imitation of real work, and that isn't correct because:

for real work: only 2 tasks can run simultaneously on a 2-core system
for Thread.sleep(): any number of tasks can run simultaneously on a 2-core system

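I remembered the JMH method Blackhole.consumeCPU(long tokens), which "burns cycles" and imitates work; there are JMH samples and documentation for it. So I changed the work to: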

@State(Scope.Thread)
public static class Work {
    public void doWork() {
        Blackhole.consumeCPU(4096);
    }
}

And the benchmark results for this change:

Benchmark                         (amountOfTasks)  Mode  Cnt         Score          Error  Units
ThreadPoolSamples.cachedThreadPool                1  avgt    3    301187,897 ±    95819,153  ns/op
ThreadPoolSamples.cachedThreadPool               24  avgt    3   2421815,991 ±   545978,808  ns/op
ThreadPoolSamples.cachedThreadPool              128  avgt    3   6648647,025 ±    30442,510  ns/op
ThreadPoolSamples.cachedThreadPool             2048  avgt    3  60229404,756 ± 21537786,512  ns/op
ThreadPoolSamples.fixedThreadPool                 1  avgt    3    293364,540 ±    10709,841  ns/op
ThreadPoolSamples.fixedThreadPool                24  avgt    3   1459852,773 ±   160912,520  ns/op
ThreadPoolSamples.fixedThreadPool               128  avgt    3   2846790,222 ±    78929,182  ns/op
ThreadPoolSamples.fixedThreadPool              2048  avgt    3  25102603,592 ±  1825740,124  ns/op
ThreadPoolSamples.noThreading                     1  avgt    3     10071,049 ±      407,519  ns/op
ThreadPoolSamples.noThreading                    24  avgt    3    241561,416 ±    15326,274  ns/op
ThreadPoolSamples.noThreading                   128  avgt    3   1300241,347 ±   148051,168  ns/op
ThreadPoolSamples.noThreading                  2048  avgt    3  20683253,408 ±  1433365,542  ns/op

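We see that fixedThreadPool is somehow slower than the example without threading, and that the difference between the fixedThreadPool and noThreading examples gets smaller as amountOfTasks grows. What is happening inside? I saw the same phenomenon with string concatenation at the beginning of this question, but I didn't report it. (By the way, thanks to whoever read this novel and is trying to answer the question, you really help me.)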

Answer:

Here is what I got on my machine (maybe this can help you understand where the problem is).

This is the benchmark (I modified it slightly):

package io.denery;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.Main;
import java.util.concurrent.*;

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@Threads(1)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@State(Scope.Benchmark)
public class ThreadPoolSamples {
  @Param({"1", "24", "128"})
  int amountOfTasks;
  private static final int tts = Runtime.getRuntime().availableProcessors() * 2;

  private static void doWork() {
    Blackhole.consumeCPU(4096);
  }

  public static void main(String[] args) throws Exception {
    Main.main(args);
  }

  @Benchmark
  public void noThreading() {
    for (int i = 0; i < amountOfTasks; i++) {
      doWork();
    }
  }

  @Benchmark
  public void fixedThreadPool(Blackhole bh) throws Exception {
    runInThreadPool(amountOfTasks, bh, Executors.newFixedThreadPool(tts));
  }

  @Benchmark
  public void cachedThreadPool(Blackhole bh) throws Exception {
    runInThreadPool(amountOfTasks, bh, Executors.newCachedThreadPool());
  }

  private static void runInThreadPool(int amountOfTasks, Blackhole bh, ExecutorService threadPool)
      throws Exception {
    Future<?>[] futures = new Future[amountOfTasks];
    for (int i = 0; i < amountOfTasks; i++) {
      futures[i] = threadPool.submit(ThreadPoolSamples::doWork);
    }
    for (Future<?> future : futures) {
      bh.consume(future.get());
    }

    threadPool.shutdownNow();
    threadPool.awaitTermination(5, TimeUnit.MINUTES);
  }
}

Specs and versions:

JMH version: 1.33  
VM version: JDK 17.0.1, OpenJDK 64-Bit Server
Linux 5.14.14
CPU: Intel(R) Core(TM) i5-2320 CPU @ 3.00GHz, 4 Cores, No Hyper-Threading

Results:

Benchmark                           (amountOfTasks)  Mode  Cnt        Score        Error  Units
ThreadPoolSamples.cachedThreadPool                1  avgt    5    92968.252 ±   2853.687  ns/op
ThreadPoolSamples.cachedThreadPool               24  avgt    5   547558.977 ±  88937.441  ns/op
ThreadPoolSamples.cachedThreadPool              128  avgt    5  1502909.128 ±  40698.141  ns/op
ThreadPoolSamples.fixedThreadPool                 1  avgt    5    97945.026 ±    435.458  ns/op
ThreadPoolSamples.fixedThreadPool                24  avgt    5   643453.028 ± 135859.966  ns/op
ThreadPoolSamples.fixedThreadPool               128  avgt    5   998425.118 ± 126463.792  ns/op
ThreadPoolSamples.noThreading                     1  avgt    5    10165.462 ±     78.008  ns/op
ThreadPoolSamples.noThreading                    24  avgt    5   245942.867 ±  10594.808  ns/op
ThreadPoolSamples.noThreading                   128  avgt    5  1302173.090 ±   5482.655  ns/op
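One thing these numbers seem to show is a roughly constant cost of around 90 µs in the single-task pool runs, against roughly 10 µs for the work itself. A plausible reading, which is an assumption and not something measured above, is that creating the pool, starting its threads, and terminating it all happen inside the @Benchmark method and are therefore part of the timed region. The plain-Java sketch below gives a rough feel for that cost outside JMH. The class name PoolLifecycleCost and its helper are hypothetical, and an unwarmed loop like this is only an illustration, not a substitute for a JMH measurement.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical helper class, for illustration only.
final class PoolLifecycleCost {

    // Nanoseconds spent creating a fixed pool, running one no-op task,
    // and waiting until the pool has fully terminated.
    static long pooledNanos(int threads) {
        long start = System.nanoTime();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            pool.submit(() -> { }).get();               // one empty task
            pool.shutdownNow();                         // request termination
            pool.awaitTermination(1, TimeUnit.MINUTES); // wait for the workers to exit
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int rounds = 1_000;
        long total = 0;
        for (int i = 0; i < rounds; i++) {
            total += pooledNanos(8);
        }
        System.out.println("avg ns per pool lifecycle: " + total / rounds);
    }
}
```

If the printed average is in the same range as the gap between the pool benchmarks and noThreading for one task, the pool lifecycle is a likely contributor to that gap.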

Answer:

See the answers to this question to learn how to write a benchmark in Java.

... executorService may be correct (but I am still unsure) ...

Benchmark                              (amountOfTasks)  Mode  Cnt         Score         Error  Units
ThreadPoolSamples.executorService                     24  avgt    3    255102,966 ± 4460279,056  ns/op

This does not look like a correct result: the error 4460279,056 is about 17 times larger than the base value 255102,966.

You also have a bug:

@Benchmark
public void executorService(ExecutorServiceState e, Work w, Blackhole bh) {
    for (int i = 0; i < amountOfTasks; i++) {
         e.service.execute(() -> bh.consume(w.doWork(strs)));
    }
}

You submit the tasks to the ExecutorService, but you do not wait for them to complete.
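One way to wait for tasks started with execute(), which returns no Future, is a CountDownLatch. The sketch below is plain Java so it can run without JMH; the class AwaitExecuteTasks and its methods are hypothetical names, and the pool size of 4 is an arbitrary assumption. In a benchmark, the await() call would have to sit inside the @Benchmark method so that the measured time covers task completion.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper class, for illustration only.
final class AwaitExecuteTasks {

    // Starts `tasks` copies of `work` with execute() and blocks until all have finished.
    static void runAndWait(ExecutorService pool, int tasks, Runnable work) {
        CountDownLatch done = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    work.run();
                } finally {
                    done.countDown(); // count down even if the task throws
                }
            });
        }
        try {
            done.await(); // returns only after every task has counted down
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Demo: returns how many tasks had completed by the time runAndWait returned.
    static int completedTasks(int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // arbitrary size
        AtomicInteger counter = new AtomicInteger();
        try {
            runAndWait(pool, tasks, counter::incrementAndGet);
            return counter.get();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(completedTasks(24));
    }
}
```

Without the await() call, the loop would return as soon as the tasks are queued, which is consistent with the very low and very noisy score reported for executorService.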

Answer:

Look at this code:

    @TearDown(Level.Iteration)
    public void downMethod() {
        service.shutdownNow();
        service = null;
    }

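You do not wait for the threads to stop. Read the documentation for details.
So some of your benchmarks may run in parallel with another 128 threads spawned by cachedThreadPool in a previous benchmark.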
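A minimal sketch of a teardown that does wait, using only java.util.concurrent: shutdownNow() returns immediately, so awaitTermination() is what actually blocks until the workers have exited. The class PoolTeardown, the helper stopAndWait, and the 5-second timeout are assumptions made for this example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical helper class, for illustration only.
final class PoolTeardown {

    // Stops the pool and blocks until its tasks have finished or the timeout elapses.
    // Returns true if the pool terminated in time.
    static boolean stopAndWait(ExecutorService pool, long timeoutSeconds) {
        pool.shutdownNow(); // interrupts running tasks, does not wait for them
        try {
            return pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Demo: start long sleeping tasks, then check that the pool really terminates.
    static boolean demo() {
        ExecutorService pool = Executors.newCachedThreadPool();
        for (int i = 0; i < 8; i++) {
            pool.execute(() -> {
                try {
                    Thread.sleep(10_000);
                } catch (InterruptedException e) {
                    // interrupted by shutdownNow(): let the task end
                }
            });
        }
        return stopAndWait(pool, 5) && pool.isTerminated();
    }

    public static void main(String[] args) {
        System.out.println("terminated: " + demo());
    }
}
```

In the question's ExecutorServiceState, the equivalent change would be an awaitTermination call after service.shutdownNow() in downMethod().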

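so for simplicity I used Thread.sleep() as "work"

Are you sure?
There is a big difference between real work and Thread.sleep():

for real work: only 2 tasks can run simultaneously on a 2-core system
for Thread.sleep(): any number of tasks can run simultaneously on a 2-core system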
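The difference can be seen with a small plain-Java experiment: a sleeping thread does not occupy a core, so many sleeps overlap regardless of how many cores the machine has. The class SleepIsNotWork and the chosen numbers (16 tasks, 100 ms) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper class, for illustration only.
final class SleepIsNotWork {

    // Runs `tasks` sleeping tasks on `tasks` threads and returns the wall time in ms.
    static long elapsedMillis(int tasks, long sleepMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        long start = System.nanoTime();
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                futures.add(pool.submit(() -> {
                    Thread.sleep(sleepMillis); // "work" that needs no CPU
                    return null;
                }));
            }
            for (Future<?> future : futures) {
                future.get(); // wait for every task
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Sequentially, 16 sleeps of 100 ms would take about 1600 ms.
        System.out.println("16 sleeping tasks took " + elapsedMillis(16, 100) + " ms");
    }
}
```

With sleeping tasks the speedup is therefore limited by the number of pool threads, not by the number of cores. Assuming availableProcessors() returned 4 on the asker's machine (so tts = 8), 128 sleeps of about 1.07 ms each would need 16 rounds, roughly 17 ms, which is close to the 18,3 ms measured for fixedThreadPool.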
