如何解釋uops.info？-有解無憂

我在 uops.info 上查找了指令如何解釋 uops.info？

uj5u.com熱心網友回復：

如果僅運行該指令的大塊，則吞吐量是倒數的吞吐量。（或者對于諸如由于隱式暫存器輸入/輸出，尤其是 FLAGS 而無法使背靠背執行不具有資料依賴性的情況的依賴關系破壞指令）adc。div這0.5意味著它可以每 0.5 個周期運行一次，即 2 個/時鐘，正如我們所知道的具有 2 個負載埠的 CPU 所預期的那樣。

為什么有時會有兩個數字表示延遲，例如 [≤10;≤11]？

另請參閱多個值或范圍作為單個指令的延遲意味著什么？以 load ALU ALU 指令為例。（我忘記了副本有多接近，直到我寫完這個答案的其余部分才去尋找它。）

通常這表明從不同輸入到輸出的延遲可能不同。例如，合并屏蔽加載必須合并到目標中，這是一個輸入，加載地址是另一個輸入（通過整數暫存器）。最近存盤在記憶體中的資料是第三個輸入（存盤轉發延遲）。

對于向量加載使用延遲等情況，其中加載結果與地址暫存器位于不同的域中，uops.info 創建一個依賴鏈，其指令序列涉及movd或vmovq rax, xmm0將加載結果耦合回地址以進行另一個加載。很難為每個部分分別建立延遲，因此 IIRC 他們假設鏈中的每個其他指令至少為 1 個周期，并將被測指令的延遲顯示為<= N，其中 N dep 鏈的其余部分加起來為測驗代碼每次迭代的總周期。

查看其中一個結果的詳細資訊頁面，顯示用于測量它的測驗序列。表中的每個數字也是一個鏈接。這些詳細資訊頁面告訴您哪個運算元是哪個，并分解從每個輸入到每個輸出的延遲。讓我們看看他們在asmvmovdqa64中VMOVDQA64_Z (ZMM, K, M512)使用vmovdqa64 zmm0{k1}{z},ZMMWORD PTR [r14]. 列出的延遲是[1;≤9].

他們將運算元編號為

1（只寫）：ZMM 目標。
2（只讀）：k0..7屏蔽暫存器
3（只讀）：記憶體（后來分解為地址與實際記憶體內容）

周期1延遲部分是從掩碼暫存器到結果的延遲，“延遲運算元 2 → 1:1”。因此，在加載單元獲取資料之前，掩碼不必準備好。

這<=9是從地址基址或索引暫存器到準備好最終 ZMM 結果的延遲。

顯然，在存盤/重新加載情況下，存盤轉發延遲成為瓶頸，“延遲運算元 3 → 1（記憶體）：≤6”。他們用這個序列進行了測驗，描述為“鏈延遲：≥6”。 vshufpd zmm已知有 1 個周期延遲，我猜他們只是將商店計算為有 1 個周期延遲？就像我說的，他們只是假設一切都是 1 個周期，即使將任何延遲都分配給商店有點可疑。

Code:
   0:   62 d1 fd c9 6f 06       vmovdqa64 zmm0{k1}{z},ZMMWORD PTR [r14]
   6:   62 71 fd 48 c6 e8 00    vshufpd zmm13,zmm0,zmm0,0x0
   d:   62 51 95 48 c6 ed 00    vshufpd zmm13,zmm13,zmm13,0x0
  14:   62 51 95 48 c6 ed 00    vshufpd zmm13,zmm13,zmm13,0x0
  1b:   62 51 95 48 c6 ed 00    vshufpd zmm13,zmm13,zmm13,0x0
  22:   62 51 95 48 c6 ed 00    vshufpd zmm13,zmm13,zmm13,0x0
  29:   62 51 fd 48 11 2e       vmovupd ZMMWORD PTR [r14],zmm13

（對于吞吐量測驗，他們多次重復該塊以創建一個展開的回圈。但對于延遲測驗，他們可能只是圍繞它包裹一個正常的回圈。nanobench 是開源的，因此您可以檢查。）

For the "Latency operand 3 → 1 (address, base register): ≤9" measurement, they say "Chain latency: ≥5". We know a vmovq r,x / vmovq x,r round-trip is more than 2 cycle latency, so the vmovq part of the chain here is probably more than a single cycle. That's why they over-estimate the load-use latency, with a conservative upper bound of 9 cycles.

   0:   62 d1 fd c9 6f 06       vmovdqa64 zmm0{k1}{z},ZMMWORD PTR [r14]
   6:   c4 c1 f9 7e c4          vmovq  r12,xmm0
   b:   4d 31 e6                xor    r14,r12
   e:   4d 31 e6                xor    r14,r12
  11:   4d 31 e6                xor    r14,r12
  14:   4d 31 e6                xor    r14,r12

they measure:

Instructions retired: 6.0

Core cycles: 14.0

Reference cycles: 10.81

UOPS_EXECUTED.THREAD: 7.0

14 cycles per iteration total, so they compute 14-5 = 9 cycles accounted for by the masked load. (Or fewer if the chain latency is actually longer than 5. The vmovq is probably actually 3 or 4 cycles, so 7 or 6 cycle SIMD load latency sound right. We know integer load-use latency is 5 cycles, and IIRC Intel's optimization manual says something about SIMD loads being 6 or 7 cycles. But this conservative upper bound of 9 is all we can really say for sure based purely on measurement, without extrapolation / guesswork.)

AVX-512 instruction naming.

"A64" is part of the AVX-512 vmovdqa64 instruction mnemonic, of course: check Intel's asm manual: https://www.felixcloutier.com/x86/movdqa:vmovdqa32:vmovdqa64. Remember that AVX-512 supports per-element merge- or zero-masking on (nearly) every instruction, so even movdqa and bitwise operations need an element size. That's also why AVX-512 bitwise booleans are vpord / q instead of just vpor (They could have used b/w/d/q naming for movdqa element sizes, but then we'd have vmovdqad or vmovdqaq, but I think we can be glad they didn't.)

Fortunately a32 vs. a64 doesn't make any performance difference, and only has any difference on the result when you're using masking, e.g. via _mm512_maskz_load_epi32( __mmask16 k, void * sa) vs. epi64 only taking a __mmask8. Or for smaller vector widths, only using fewer than 8 bits of a mask.

Zero-Masking vs. Merge Masking

op Z (ZMM, K, ZMM) vs op (ZMM, K, ZMM) is zero-masking vs. merge-masking. If you don't know how AVX-512 masking works, go read about it. e.g. Kirill Yukhin's presentation slides have an overview: https://en.wikichip.org/w/images/d/d5/Intel_Advanced_Vector_Extensions_2015-2016_Support_in_GNU_Compiler_Collection.pdf

Reg-reg vmovdqa without masking (no k register) can be 0 latency (mov-elimination), but with masking it's always 1.

Fun fact: register-renaming for k0..k7 uses the same physical register file space as MMX/x87: https://travisdowns.github.io/blog/2020/05/26/kreg2.html

轉載請註明出處，本文鏈接：https://www.uj5u.com/net/443208.html

標籤：集会 x86 x86-64 simd avx512

上一篇：如何在M1Mac(ARM)上編譯x86_64asm代碼

下一篇：指令集如何標準化？