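Assembly newbie here. I wrote a benchmark to measure a machine's floating-point performance when computing a transposed matrix-tensor product.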
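Given that my machine has 32GiB of RAM (bandwidth ~37GiB/s) and an Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz (Turbo 4.0GHz), I estimate the maximum performance (with pipelining and the data in registers) to be 6 cores x 4.0GHz = 24GFLOP/s. However, when I run my benchmark, I measure 127GFLOP/s, which is obviously a wrong measurement.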
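To make the arithmetic behind that estimate explicit, here is a minimal sketch. The function names `peak_gflops` and `implied_flops_per_cycle` are hypothetical helpers, not part of the benchmark, and the one-FLOP-per-cycle-per-core figure is the assumption built into the 24GFLOP/s estimate rather than a verified property of the CPU.

```cpp
#include <cassert>
#include <cmath>

// Theoretical peak in GFLOP/s = cores * clock (GHz) * FLOPs per cycle per core.
// flops_per_cycle is an input assumption of the estimate, not a measured value.
constexpr double peak_gflops(double cores, double clock_ghz, double flops_per_cycle)
{
    return cores * clock_ghz * flops_per_cycle;
}

// FLOPs per cycle per core that a measured rate would imply
// for the given core count and clock frequency.
constexpr double implied_flops_per_cycle(double measured_gflops, double cores, double clock_ghz)
{
    return measured_gflops / (cores * clock_ghz);
}
```

With 6 cores at 4.0GHz and 1 FLOP per cycle, `peak_gflops` gives the 24GFLOP/s quoted above. Read the other way, a measurement of 127GFLOP/s would correspond to roughly 5.3 FLOPs per cycle per core. That is only possible if a core can retire more than one FLOP per cycle, for example through SIMD or fused multiply-add instructions; whether that applies here is an assumption that the numbers alone do not settle.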
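Note: to measure the FP performance, I count the operations, n*n*n*n*6 (n^3 for a matrix-matrix multiplication, performed on n slices of complex data points, i.e. assuming 6 FLOPs for 1 complex-complex multiplication), and divide that count by the average time taken per run.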
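The conversion from operation count to GFLOP/s can be sketched as follows. The helper names `flop_count` and `gflops` are hypothetical, and the sketch assumes the average duration is expressed in microseconds, as in the timing code of this post.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Operation count used in this post: n^3 complex multiplications per slice,
// n slices, 6 FLOPs per complex-complex multiplication => 6 * n^4.
constexpr double flop_count(std::size_t n)
{
    const auto dn = static_cast<double>(n);
    return 6.0 * dn * dn * dn * dn;
}

// GFLOP/s for an average run time given in microseconds (assumed unit).
// FLOPs / (us * 1e-6 s/us) / 1e9 simplifies to FLOPs / (us * 1e3).
constexpr double gflops(std::size_t n, double avg_dur_us)
{
    return flop_count(n) / (avg_dur_us * 1.0e3);
}
```

For example, `gflops(n, avg_dur)` with the averaged duration from the benchmark loop yields the reported rate. A unit mix-up here (seconds versus microseconds) would shift the result by a factor of 10^6, so it is worth checking this step before trusting the figure.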
Code snippet from the main function:
// benchmark runs
auto avg_dur = 0.0;
for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
{
    #pragma noinline
    do_timed_run(n, avg_dur);
}
avg_dur /= static_cast<double>(experiment_count);
Code snippet of do_timed_run:
void do_timed_run(const std::size_t& n, double& avg_dur)
{
    // create the data and lay first touch
    auto operand0 = matrix<double>(n, n);
    auto operand1 = tensor<double>(n, n, n);
    auto result = tensor<double>(n, n, n);
    // first touch
    #pragma omp parallel
    {
        set_first_touch(operand1);
        set_first_touch(result);
    }
    // do the experiment (wall-clock time in microseconds)
    const auto dur1 = omp_get_wtime() * 1E+6;
    #pragma omp parallel firstprivate(operand0)
    {
        #pragma noinline
        transp_matrix_tensor_mult(operand0, operand1, result);
    }
    const auto dur2 = omp_get_wtime() * 1E+6;
    // accumulate; the caller divides by experiment_count
    avg_dur += dur2 - dur1;
}
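Notes:
- At this point, I am not providing the code for the function transp_matrix_tensor_mult because I don't think it is relevant.
- The #pragma noinline is a debugging construct I am using to better understand the disassembler output.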
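For reference, the accumulate-then-average timing pattern used by the benchmark loop can be sketched in a self-contained form. This is not the benchmark itself: `average_duration_us` is a hypothetical helper, `std::chrono::steady_clock` is swapped in for `omp_get_wtime()` so the sketch compiles without OpenMP, and `experiment_count` is assumed to be greater than zero.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Runs `work` experiment_count times and returns the average
// wall-clock time per run in microseconds.
// Assumption: experiment_count > 0 (otherwise the division yields NaN).
template <typename Work>
double average_duration_us(std::size_t experiment_count, Work&& work)
{
    using clock = std::chrono::steady_clock;
    auto total_us = 0.0;
    for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
    {
        const auto start = clock::now();
        work();
        const auto stop = clock::now();
        // accumulate the duration of this run
        total_us += std::chrono::duration<double, std::micro>(stop - start).count();
    }
    return total_us / static_cast<double>(experiment_count);
}
```

A call such as `average_duration_us(10, [&] { /* one timed run */ })` returns the mean over ten runs. Keeping the accumulation and the division in one place makes it harder to overwrite the running total by accident.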
And now the disassembly of the function do_timed_run:
0000000000403a20 <_Z12do_timed_runRKmRd>:
403a20: 48 81 ec d8 00 00 00 sub $0xd8,%rsp
403a27: 48 89 ac 24 c8 00 00 mov %rbp,0xc8(%rsp)
403a2e: 00
403a2f: 48 89 fd mov %rdi,%rbp
403a32: 48 89 9c 24 c0 00 00 mov %rbx,0xc0(%rsp)
403a39: 00
403a3a: 48 89 f3 mov %rsi,%rbx
403a3d: 48 89 ee mov %rbp,%rsi
403a40: 48 8d 7c 24 78 lea 0x78(%rsp),%rdi
403a45: 48 89 ea mov %rbp,%rdx
403a48: 4c 89 bc 24 a0 00 00 mov %r15,0xa0(%rsp)
403a4f: 00
403a50: 4c 89 b4 24 a8 00 00 mov %r14,0xa8(%rsp)
403a57: 00
403a58: 4c 89 ac 24 b0 00 00 mov %r13,0xb0(%rsp)
403a5f: 00
403a60: 4c 89 a4 24 b8 00 00 mov %r12,0xb8(%rsp)
403a67: 00
403a68: e8 03 f8 ff ff callq 403270 <_ZN5s3dft6matrixIdEC1ERKmS3_@plt>
403a6d: 48 89 ee mov %rbp,%rsi
403a70: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
403a75: 48 89 ea mov %rbp,%rdx
403a78: 48 89 e9 mov %rbp,%rcx
403a7b: e8 80 f8 ff ff callq 403300 <_ZN5s3dft6tensorIdEC1ERKmS3_S3_@plt>
403a80: 48 89 ee mov %rbp,%rsi
403a83: 48 8d 7c 24 40 lea 0x40(%rsp),%rdi
403a88: 48 89 ea mov %rbp,%rdx
403a8b: 48 89 e9 mov %rbp,%rcx
403a8e: e8 6d f8 ff ff callq 403300 <_ZN5s3dft6tensorIdEC1ERKmS3_S3_@plt>
403a93: bf 88 f3 44 00 mov $0x44f388,
Tags: assembly openmp performance-testing icc microbenchmark
