I'm using Godbolt to compile the following program to assembly:
#include <stdio.h>
volatile int a = 5;
volatile int res = 0;
int main() {
    res = a * 36;
    return 1;
}
If I use -Os optimization, the generated code is natural:
mov eax, DWORD PTR a[rip]
imul eax, eax, 36
mov DWORD PTR res[rip], eax
But if I use -O2, the generated code is this:
mov eax, DWORD PTR a[rip]
lea eax, [rax+rax*8]
sal eax, 2
mov DWORD PTR res[rip], eax
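So instead of multiplying 5*36, it does 5 -> 5 + 5*8 = 45 -> 45*4 = 180. I assume this is because 1 imul is slower than 1 lea + 1 shift left.
But the lea instruction needs to calculate rax + rax*8, which contains 1 addition + 1 multiplication. So why is it still faster than a single imul? Is it because the memory addressing inside lea is free?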
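As a quick sanity check of the arithmetic above, here is a minimal Python sketch (not part of the original question) that models both instruction sequences on a 32-bit register. The helper names `mul36_imul` and `mul36_lea_shl` are made up for illustration, and the `0xFFFFFFFF` mask is an assumption used to mimic 32-bit wraparound.

```python
MASK32 = 0xFFFFFFFF  # model a 32-bit register such as eax

def mul36_imul(x):
    """Model of `imul eax, eax, 36`: one multiply, truncated to 32 bits."""
    return (x * 36) & MASK32

def mul36_lea_shl(x):
    """Model of `lea eax, [rax+rax*8]` followed by `sal eax, 2`."""
    t = (x + x * 8) & MASK32   # lea: base + index*scale = 9*x
    return (t << 2) & MASK32   # shift left by 2: times 4, so 9*4 = 36

# Both sequences agree, including when the product overflows 32 bits.
for x in (0, 5, 12345, 0x7FFFFFFF, 0xFFFFFFFF):
    assert mul36_imul(x) == mul36_lea_shl(x)

print(mul36_lea_shl(5))  # 180
```

The two forms are interchangeable only because 36 factors as 9 * 4, where 9 fits the base + index*8 pattern and 4 is a power of two.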
Edit 1: Also, how does [rax+rax*8] get translated into machine code? Is it compiled into 2 additional instructions (shl rbx, rax, 3; add rax, rax, rbx;), or something else?
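Edit 2: Surprising results below. I wrote a loop, generated the code with -O2, then copied the file and replaced the segment above with the code from -Os. So the 2 assembly files are identical everywhere except for the instructions being benchmarked. Running on Windows, the commands are: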
gcc mul.c -O2 -S -masm=intel -o mulo2.s
gcc mulo2.s -o mulo2
// replace line of code in mulo2.s, save as muls.s
gcc muls.s -o muls
cmd /v:on /c "echo !time! & START "TestAgente" /W mulo2 & echo !time!"
cmd /v:on /c "echo !time! & START "TestAgente" /W muls & echo !time!"
#include <stdio.h>
volatile int a = 5;
volatile int res = 0;
int main() {
    size_t LOOP = 1000 * 1000 * 1000;
    LOOP = LOOP * 10;
    size_t i = 0;
    while (i < LOOP) {
        i++;
        res = a * 36;
    }
    return 0;
}
; mulo2.s
.file "mul.c"
.intel_syntax noprefix
.text
.def __main; .scl 2; .type 32; .endef
.section .text.startup,"x"
.p2align 4
.globl main
.def main; .scl 2; .type 32; .endef
.seh_proc main
main:
sub rsp, 40
.seh_stackalloc 40
.seh_endprologue
call __main
movabs rdx, 10000000000
.p2align 4,,10
.p2align 3
.L2:
mov eax, DWORD PTR a[rip]
lea eax, [rax+rax*8] ; replace these 2 lines with
sal eax, 2 ; imul eax, eax, 36
mov DWORD PTR res[rip], eax
sub rdx, 1
jne .L2
xor eax, eax
add rsp, 40
ret
.seh_endproc
.globl res
.bss
.align 4
res:
.space 4
.globl a
.data
.align 4
a:
.long 5
.ident "GCC: (GNU) 9.3.0"
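Surprisingly, the -Os version is consistently faster than -O2 (4.1 s vs 5 s on average, Intel 8750H CPU, each .exe file run several times). So in this case, the compiler optimized wrongly. Given this benchmark, could someone provide a new explanation?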
Edit 3: To measure the effect of instruction cache lines, here's a Python script that generates different addresses for the main loop by adding nop instructions to the program right before the main loop. It's for Windows; for Linux it just needs to be modified a bit.
#cd "D:\Learning\temp"
import os

# Read the -O2 assembly once; nops are inserted right before the main loop.
with open("mulo2.s", "r") as f:
    lines = f.readlines()

def addNop(cnt, outputname):
    """Write a copy of mulo2.s with `cnt` nop instructions before the main loop."""
    with open(outputname, "w") as f:
        f.writelines(lines[:17])   # everything before the main loop
        f.write("\tnop\n" * cnt)   # padding that shifts the loop's address
        f.writelines(lines[17:])   # the main loop and the rest of the file

if not os.path.isdir("nop_files"):
    os.mkdir("nop_files")

MAXN = 100
for t in range(MAXN + 1):
    sourceFile = "nop_files\\mulo2_" + str(t) + ".s"  # change \\ to / on Linux
    exeFile = "nop_files\\mulo2_" + str(t)
    if not os.path.isfile(sourceFile):
        addNop(t, sourceFile)
        os.system("gcc " + sourceFile + " -o " + exeFile)
    runtime = os.popen("timecmd " + exeFile).read()  # use time
    print(str(t) + " nop: " + str(runtime))
Result:
0 nop: command took 0:0:4.96 (4.96s total)
1 nop: command took 0:0:4.94 (4.94s total)
2 nop: command took 0:0:4.90 (4.90s total)
3 nop: command took 0:0:4.90 (4.90s total)
4 nop: command took 0:0:5.26 (5.26s total)
5 nop: command took 0:0:4.94 (4.94s total)
6 nop: command took 0:0:4.92 (4.92s total)
7 nop: command took 0:0:4.98 (4.98s total)
8 nop: command took 0:0:5.02 (5.02s total)
9 nop: command took 0:0:4.97 (4.97s total)
10 nop: command took 0:0:5.12 (5.12s total)
11 nop: command took 0:0:5.01 (5.01s total)
12 nop: command took 0:0:5.01 (5.01s total)
13 nop: command took 0:0:5.07 (5.07s total)
14 nop: command took 0:0:5.08 (5.08s total)
15 nop: command took 0:0:5.07 (5.07s total)
16 nop: command took 0:0:5.09 (5.09s total)
17 nop: command took 0:0:7.96 (7.96s total) # slow 17
18 nop: command took 0:0:7.93 (7.93s total)
19 nop: command took 0:0:7.88 (7.88s total)
20 nop: command took 0:0:7.88 (7.88s total)
21 nop: command took 0:0:7.94 (7.94s total)
22 nop: command took 0:0:7.90 (7.90s total)
23 nop: command took 0:0:7.92 (7.92s total)
24 nop: command took 0:0:7.99 (7.99s total)
25 nop: command took 0:0:7.89 (7.89s total)
26 nop: command took 0:0:7.88 (7.88s total)
27 nop: command took 0:0:7.88 (7.88s total)
28 nop: command took 0:0:7.84 (7.84s total)
29 nop: command took 0:0:7.84 (7.84s total)
30 nop: command took 0:0:7.88 (7.88s total)
31 nop: command took 0:0:7.91 (7.91s total)
32 nop: command took 0:0:7.89 (7.89s total)
33 nop: command took 0:0:7.88 (7.88s total)
34 nop: command took 0:0:7.94 (7.94s total)
35 nop: command took 0:0:7.81 (7.81s total)
36 nop: command took 0:0:7.89 (7.89s total)
37 nop: command took 0:0:7.90 (7.90s total)
38 nop: command took 0:0:7.92 (7.92s total)
39 nop: command took 0:0:7.83 (7.83s total)
40 nop: command took 0:0:4.95 (4.95s total) # fast 40
41 nop: command took 0:0:4.91 (4.91s total)
42 nop: command took 0:0:4.97 (4.97s total)
43 nop: command took 0:0:4.97 (4.97s total)
44 nop: command took 0:0:4.97 (4.97s total)
45 nop: command took 0:0:5.11 (5.11s total)
46 nop: command took 0:0:5.13 (5.13s total)
47 nop: command took 0:0:5.01 (5.01s total)
48 nop: command took 0:0:5.01 (5.01s total)
49 nop: command took 0:0:4.97 (4.97s total)
50 nop: command took 0:0:5.03 (5.03s total)
51 nop: command took 0:0:5.32 (5.32s total)
52 nop: command took 0:0:4.95 (4.95s total)
53 nop: command took 0:0:4.97 (4.97s total)
54 nop: command took 0:0:4.94 (4.94s total)
55 nop: command took 0:0:4.99 (4.99s total)
56 nop: command took 0:0:4.99 (4.99s total)
57 nop: command took 0:0:5.04 (5.04s total)
58 nop: command took 0:0:4.97 (4.97s total)
59 nop: command took 0:0:4.97 (4.97s total)
60 nop: command took 0:0:4.95 (4.95s total)
61 nop: command took 0:0:4.99 (4.99s total)
62 nop: command took 0:0:4.94 (4.94s total)
63 nop: command took 0:0:4.94 (4.94s total)
64 nop: command took 0:0:4.92 (4.92s total)
65 nop: command took 0:0:4.91 (4.91s total)
66 nop: command took 0:0:4.98 (4.98s total)
67 nop: command took 0:0:4.93 (4.93s total)
68 nop: command took 0:0:4.95 (4.95s total)
69 nop: command took 0:0:4.92 (4.92s total)
70 nop: command took 0:0:4.93 (4.93s total)
71 nop: command took 0:0:4.97 (4.97s total)
72 nop: command took 0:0:4.93 (4.93s total)
73 nop: command took 0:0:4.94 (4.94s total)
74 nop: command took 0:0:4.96 (4.96s total)
75 nop: command took 0:0:4.91 (4.91s total)
76 nop: command took 0:0:4.92 (4.92s total)
77 nop: command took 0:0:4.91 (4.91s total)
78 nop: command took 0:0:5.03 (5.03s total)
79 nop: command took 0:0:4.96 (4.96s total)
80 nop: command took 0:0:5.20 (5.20s total)
81 nop: command took 0:0:7.93 (7.93s total) # slow 81
82 nop: command took 0:0:7.88 (7.88s total)
83 nop: command took 0:0:7.85 (7.85s total)
84 nop: command took 0:0:7.91 (7.91s total)
85 nop: command took 0:0:7.93 (7.93s total)
86 nop: command took 0:0:8.06 (8.06s total)
87 nop: command took 0:0:8.03 (8.03s total)
88 nop: command took 0:0:7.85 (7.85s total)
89 nop: command took 0:0:7.88 (7.88s total)
90 nop: command took 0:0:7.91 (7.91s total)
91 nop: command took 0:0:7.86 (7.86s total)
92 nop: command took 0:0:7.99 (7.99s total)
93 nop: command took 0:0:7.86 (7.86s total)
94 nop: command took 0:0:7.91 (7.91s total)
95 nop: command took 0:0:8.12 (8.12s total)
96 nop: command took 0:0:7.88 (7.88s total)
97 nop: command took 0:0:7.81 (7.81s total)
98 nop: command took 0:0:7.88 (7.88s total)
99 nop: command took 0:0:7.85 (7.85s total)
100 nop: command took 0:0:7.90 (7.90s total)
101 nop: command took 0:0:7.93 (7.93s total)
102 nop: command took 0:0:7.85 (7.85s total)
103 nop: command took 0:0:7.88 (7.88s total)
104 nop: command took 0:0:5.00 (5.00s total) # fast 104
105 nop: command took 0:0:5.03 (5.03s total)
106 nop: command took 0:0:4.97 (4.97s total)
107 nop: command took 0:0:5.06 (5.06s total)
108 nop: command took 0:0:5.01 (5.01s total)
109 nop: command took 0:0:5.00 (5.00s total)
110 nop: command took 0:0:4.95 (4.95s total)
111 nop: command took 0:0:4.91 (4.91s total)
112 nop: command took 0:0:4.94 (4.94s total)
113 nop: command took 0:0:4.93 (4.93s total)
114 nop: command took 0:0:4.92 (4.92s total)
115 nop: command took 0:0:4.92 (4.92s total)
116 nop: command took 0:0:4.92 (4.92s total)
117 nop: command took 0:0:5.13 (5.13s total)
118 nop: command took 0:0:4.94 (4.94s total)
119 nop: command took 0:0:4.97 (4.97s total)
120 nop: command took 0:0:5.14 (5.14s total)
121 nop: command took 0:0:4.94 (4.94s total)
122 nop: command took 0:0:5.17 (5.17s total)
123 nop: command took 0:0:4.95 (4.95s total)
124 nop: command took 0:0:4.97 (4.97s total)
125 nop: command took 0:0:4.99 (4.99s total)
126 nop: command took 0:0:5.20 (5.20s total)
127 nop: command took 0:0:5.23 (5.23s total)
128 nop: command took 0:0:5.19 (5.19s total)
129 nop: command took 0:0:5.21 (5.21s total)
130 nop: command took 0:0:5.33 (5.33s total)
131 nop: command took 0:0:4.92 (4.92s total)
132 nop: command took 0:0:5.02 (5.02s total)
133 nop: command took 0:0:4.90 (4.90s total)
134 nop: command took 0:0:4.93 (4.93s total)
135 nop: command took 0:0:4.99 (4.99s total)
136 nop: command took 0:0:5.08 (5.08s total)
137 nop: command took 0:0:5.02 (5.02s total)
138 nop: command took 0:0:5.15 (5.15s total)
139 nop: command took 0:0:5.07 (5.07s total)
140 nop: command took 0:0:5.03 (5.03s total)
141 nop: command took 0:0:4.94 (4.94s total)
142 nop: command took 0:0:4.92 (4.92s total)
143 nop: command took 0:0:4.96 (4.96s total)
144 nop: command took 0:0:4.92 (4.92s total)
145 nop: command took 0:0:7.86 (7.86s total) # slow 145
146 nop: command took 0:0:7.87 (7.87s total)
147 nop: command took 0:0:7.83 (7.83s total)
148 nop: command took 0:0:7.83 (7.83s total)
149 nop: command took 0:0:7.84 (7.84s total)
150 nop: command took 0:0:7.87 (7.87s total)
151 nop: command took 0:0:7.84 (7.84s total)
152 nop: command took 0:0:7.88 (7.88s total)
153 nop: command took 0:0:7.87 (7.87s total)
154 nop: command took 0:0:7.83 (7.83s total)
155 nop: command took 0:0:7.85 (7.85s total)
156 nop: command took 0:0:7.91 (7.91s total)
157 nop: command took 0:0:8.18 (8.18s total)
158 nop: command took 0:0:7.94 (7.94s total)
159 nop: command took 0:0:7.92 (7.92s total)
160 nop: command took 0:0:7.92 (7.92s total)
161 nop: command took 0:0:7.97 (7.97s total)
162 nop: command took 0:0:8.12 (8.12s total)
163 nop: command took 0:0:7.89 (7.89s total)
164 nop: command took 0:0:7.92 (7.92s total)
165 nop: command took 0:0:7.88 (7.88s total)
166 nop: command took 0:0:7.80 (7.80s total)
167 nop: command took 0:0:7.82 (7.82s total)
168 nop: command took 0:0:4.97 (4.97s total) # fast
169 nop: command took 0:0:4.97 (4.97s total)
170 nop: command took 0:0:4.95 (4.95s total)
171 nop: command took 0:0:5.00 (5.00s total)
172 nop: command took 0:0:4.95 (4.95s total)
173 nop: command took 0:0:4.93 (4.93s total)
174 nop: command took 0:0:4.91 (4.91s total)
175 nop: command took 0:0:4.92 (4.92s total)
The points where the program switches from fast to slow (and from slow to fast) are: 17S-40F-81S-104F-145S-168F. The distance from slow to fast code is 23 nops, and the distance from fast to slow code is 41 nops. objdump shows that the main loop occupies 24 bytes; that means if we place it at the start of a cache line (address mod 64 == 0), inserting 41 bytes will cause the main loop to cross the cache-line boundary, causing a slowdown. So in the default code (no nop added), the main loop already sits inside a single cache line.
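The cache-line reasoning above can be checked with a small model. The sketch below is illustrative only: the 64-byte line size and the 24-byte loop come from the text, while `BASE_OFFSET = 24` (the loop's start address modulo 64 when no nop is added) is an assumption inferred from the measured transition points, not something read from objdump. Each `nop` is taken to be one byte.

```python
CACHE_LINE = 64   # bytes per instruction-cache line
LOOP_SIZE = 24    # size of the main loop in bytes, as reported by objdump
BASE_OFFSET = 24  # ASSUMPTION: loop start (mod 64) with 0 nops, inferred from the data

def crosses_cache_line(nops):
    """True if the loop spans two cache lines after inserting `nops` one-byte nops."""
    start = (BASE_OFFSET + nops) % CACHE_LINE
    return start + LOOP_SIZE > CACHE_LINE

# Nop counts at which the predicted behaviour flips between fast and slow.
transitions = [n for n in range(1, 176)
               if crosses_cache_line(n) != crosses_cache_line(n - 1)]
print(transitions)  # [17, 40, 81, 104, 145, 168]
```

Under that assumption the model reproduces the measured switch points, and it also predicts that the unmodified program (0 nops) keeps the loop within one cache line.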
So we know that the -O2 version being slower is not because of instruction address alignment. The only culprit left seemed to be instruction decoding speed, but we found a new culprit, as in @Jérôme Richard's answer.
Edit 4: Skylake decodes 16 bytes per cycle. However, the sizes of the -Os and -O2 versions are 21 and 24 bytes respectively, so both require 2 cycles to read the main loop. So where does the speed difference come from?
Conclusion: while the compiler is theoretically correct (lea + sal are 2 super cheap instructions, and addressing inside lea is free since it uses a separate hardware circuit), in practice a single expensive imul instruction might be faster due to some extremely complex details of the CPU architecture, including instruction decoding speed, the number of micro-operations (uops), and CPU ports.
Reply from a uj5u.com user:
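You can look up the cost of instructions on most mainstream architectures in publicly available instruction tables. Based on those, and assuming you use, for example, an Intel Skylake processor, you can see that one 32-bit imul instruction can be computed per cycle, but with a latency of 3 cycles. In the optimized code, 2 lea instructions (which are very cheap) can be executed per cycle with a latency of 1 cycle. The same applies to the sal instruction (2 per cycle and 1 cycle of latency).
This means the optimized version can execute with a latency of only 2 cycles, while the first one takes 3 cycles of latency (not counting the load/store instructions, which are the same in both). Moreover, the second version can be better pipelined, since the two instructions can be executed in parallel on two different input values thanks to superscalar out-of-order execution. Note that two loads can also be executed in parallel, although only one store can be executed per cycle. This means the execution is bounded by the throughput of the store instructions. Overall, only 1 value can be computed per cycle. AFAIK, recent Intel Icelake processors can execute two stores in parallel, like new AMD Ryzen processors. On the chosen use case (Intel Skylake processors), the second version is expected to be as fast or possibly faster. It should be significantly faster on very recent x86-64 processors.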
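Note that the lea instruction is very fast because the multiply-add is done on a dedicated CPU unit (hard-wired shifters), and it only supports some specific constants for the multiplication (the supported factors are 1, 2, 4 and 8, which means lea can be used to multiply an integer by the constants 2, 3, 4, 5, 8 and 9). This is why lea is faster than imul/mul.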
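To see where the list 2, 3, 4, 5, 8 and 9 comes from, here is a small illustrative enumeration. It assumes a single source register `x` used as the index (scaled by 1, 2, 4 or 8) and optionally also as the base; the function name `lea_multipliers` is made up for this sketch.

```python
SCALES = (1, 2, 4, 8)  # the only scale factors an x86 address can encode

def lea_multipliers():
    """Constants k such that a single lea can compute k*x from one register x."""
    ks = set()
    for s in SCALES:
        ks.add(s)      # lea r, [x*s]      (index only)
        ks.add(1 + s)  # lea r, [x + x*s]  (base + index)
    ks.discard(1)      # multiplying by 1 is just a copy
    return sorted(ks)

print(lea_multipliers())  # [2, 3, 4, 5, 8, 9]
```

36 is not in that list, which is why the -O2 code needs a second instruction: 9 from the lea, then a shift by 2 for the remaining factor of 4.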
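UPDATE (v2):
I can reproduce the slower execution with -O2 using GCC 11.2 (on Linux with an i5-9600KF processor).
The main source of the slowdown is the higher number of micro-operations (uops) to be executed in the -O2 version, certainly combined with the saturation of some execution ports, most likely due to poor micro-operation scheduling.
Here is the assembly of the loop with -Os: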
1049: 8b 15 d9 2f 00 00 mov edx,DWORD PTR [rip+0x2fd9] # 4028 <a>
104f: 6b d2 24 imul edx,edx,0x24
1052: 89 15 d8 2f 00 00 mov DWORD PTR [rip+0x2fd8],edx # 4030 <res>
1058: 48 ff c8 dec rax
105b: 75 ec jne 1049 <main+0x9>
Here is the assembly of the loop with -O2:
1050: 8b 05 d2 2f 00 00 mov eax,DWORD PTR [rip+0x2fd2] # 4028 <a>
1056: 8d 04 c0 lea eax,[rax+rax*8]
1059: c1 e0 02 shl eax,0x2
105c: 89 05 ce 2f 00 00 mov DWORD PTR [rip+0x2fce],eax # 4030 <res>
1062: 48 83 ea 01 sub rdx,0x1
1066: 75 e8 jne 1050 <main+0x10>
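The bytes `8d 04 c0` in the listing also answer the question about how the address expression is encoded: it is a single instruction, with the base, index and scale packed into the ModRM and SIB bytes rather than expanded into a separate shift and add. The decoder below is a deliberately tiny sketch that handles only this one addressing form (no prefixes, no displacement); the function name `decode_lea_sib` is hypothetical.

```python
REGS32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]

def decode_lea_sib(code):
    """Decode the 3-byte form of lea (opcode 8d) with mod=00 and rm=100 (SIB follows)."""
    opcode, modrm, sib = code
    assert opcode == 0x8D, "not a lea"
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    assert mod == 0 and rm == 4, "only the [base + index*scale] form is handled"
    scale, index, base = 1 << (sib >> 6), (sib >> 3) & 7, sib & 7
    assert index != 4 and base != 5, "special encodings are not handled"
    return f"lea {REGS32[reg]},[{REGS64[base]}+{REGS64[index]}*{scale}]"

print(decode_lea_sib(bytes([0x8D, 0x04, 0xC0])))  # lea eax,[rax+rax*8]
```

The top two bits of the SIB byte hold the scale as a shift count (0 to 3), which is why only the factors 1, 2, 4 and 8 are available.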
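Modern x86-64 processors decode (variable-sized) instructions and then translate them into (simpler, fixed-size) micro-operations, which are finally executed (usually in parallel) on several execution ports. Skylake can macro-fuse multiple instructions into a single micro-operation. In this case, the dec+jne and sub+jne instruction pairs are each fused into a single uop. This means the -Os version executes 4 uops/iteration while the -O2 version executes 5 uops/iteration.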
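The uop counts can be reproduced with a simplified model. In the sketch below, every instruction is assumed to count as one uop at issue, and a dec or sub directly followed by jne is assumed to macro-fuse into a single uop; this is an illustration of the counting, not a description of the real decoder.

```python
# Instructions of each loop body, in program order (mnemonics from the listings).
LOOP_OS = ["mov", "imul", "mov", "dec", "jne"]
LOOP_O2 = ["mov", "lea", "shl", "mov", "sub", "jne"]

# ASSUMPTION: simplified model where each instruction is one uop at issue and
# a flag-setting dec/sub directly followed by jne macro-fuses into one uop.
FUSIBLE_BEFORE_JNE = {"dec", "sub"}

def fused_uops(loop):
    """Count issued uops per iteration under the simplified model."""
    uops = 0
    i = 0
    while i < len(loop):
        if (loop[i] in FUSIBLE_BEFORE_JNE
                and i + 1 < len(loop) and loop[i + 1] == "jne"):
            i += 2  # the pair counts as a single uop
        else:
            i += 1
        uops += 1
    return uops

print(fused_uops(LOOP_OS), fused_uops(LOOP_O2))  # 4 5
```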
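The uops are stored in a uop cache called the Decoded Stream Buffer (DSB), so the processor does not need to decode/translate the instructions of the (small) loop again. The cached uops to be executed are sent to a queue called the Instruction Decode Queue (IDQ). Up to 6 uops/cycle can be sent from the DSB to the IDQ. For the -Os version, only 4 uops are sent from the DSB to the IDQ every cycle (probably because the loop is bounded by the saturated store port). For the -O2 version, only 5 uops are sent from the DSB to the IDQ every cycle, but only 4 times out of 5 (on average)! This means 1 cycle of latency is added every 4 cycles, resulting in a 25% slower execution. The cause of this effect is unclear and seems to be related to uop scheduling.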
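The 25% figure follows directly from the delivery pattern described above. The sketch below works it out with exact fractions; the function name and its parameters are illustrative, and the inputs are the per-cycle delivery figures quoted in the text.

```python
from fractions import Fraction

def cycles_per_iteration(uops_per_iter, uops_per_delivery,
                         delivering_cycles, total_cycles):
    """Average cycles per loop iteration given how often the DSB delivers uops."""
    uops_per_cycle = Fraction(uops_per_delivery * delivering_cycles, total_cycles)
    return Fraction(uops_per_iter) / uops_per_cycle

os_cpi = cycles_per_iteration(4, 4, 1, 1)  # -Os: 4 uops delivered every cycle
o2_cpi = cycles_per_iteration(5, 5, 4, 5)  # -O2: 5 uops delivered 4 cycles out of 5

print(float(os_cpi), float(o2_cpi))  # 1.0 1.25
print(float(o2_cpi / os_cpi - 1))    # 0.25
```

For comparison, the timings reported in the question (about 4.1 s versus 5 s) give a ratio of roughly 1.22, measured on a different CPU, which is in the same range as the 1.25 predicted here.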
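The uops are then sent to the Resource Allocation Table (RAT) and issued to the Reservation Station (RS). The RS dispatches the uops to the ports that execute them. Then the uops are retired (i.e. committed). The number of uops indirectly transmitted from the DSB to the RS is constant for both versions. The same number of uops is retired. However, in both versions, 1 additional ghost uop is dispatched by the RS (and executed by the ports) every cycle. This is probably a uop used to compute the store address (since the store port does not have its own dedicated AGU).
Here are the per-iteration statistics gathered from hardware counters (using perf):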
version | instruction | issued-uops | executed-uops | retired-uops | cycles
"-Os" | 5 | 4 | 5 | 4 | 1.00
"-O2" | 6 | 5 | 6 | 5 | 1.25
Here are the statistics of the overall port utilization:
port | type | "-Os" | "-O2"
-----------------------------------------
0 | ALU/BR | 0% | 60%
1 | ALU/MUL/LEA | 100% | 38%
2 | LOAD/AGU | 65% | 60%
3 | LOAD/AGU | 73% | 60%
4 | STORE | 100% | 80%
5 | ALU/LEA | 0% | 42%
6 | ALU/BR | 100% | 100%
7 | AGU | 62% | 40%
-----------------------------------------
total | | 500% | 480%
Port 6 is the only fully saturated port in the -O2 version, which is unexpected, and this certainly explains why an additional cycle is needed every 5 cycles. Note that only the uops associated with the shl and sub+jne instructions use ports 0 and 6 (simultaneously), and no other ports.
Note that the total of 480% is a scheduling artifact due to the stalling cycle. Indeed, 6*4=24 uops should be executed every 5 cycles (24/5*100=480). Note also that the store port is not needed 1 out of 5 cycles (4 iterations are executed every 5 cycles on average, and so 4 store uops), hence its 80% usage.
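The two tables above can be cross-checked against each other. The following sketch copies the port percentages and the per-iteration counters from the tables and verifies that the column totals equal the executed uops per cycle expressed in percent; the dictionary names are arbitrary.

```python
# Port utilization in percent, copied from the table: port -> (-Os, -O2).
PORT_USAGE = {
    0: (0, 60), 1: (100, 38), 2: (65, 60), 3: (73, 60),
    4: (100, 80), 5: (0, 42), 6: (100, 100), 7: (62, 40),
}
# Per-iteration counters from the perf table: (executed uops, cycles).
COUNTERS = {"-Os": (5, 1.00), "-O2": (6, 1.25)}

totals = {}
for col, name in enumerate(("-Os", "-O2")):
    measured = sum(usage[col] for usage in PORT_USAGE.values())
    executed, cycles = COUNTERS[name]
    expected = executed * 100 / cycles  # executed uops per cycle, in percent
    totals[name] = (measured, expected)

print(totals)  # {'-Os': (500, 500.0), '-O2': (480, 480.0)}
```

The same arithmetic explains the store port: one store per iteration at 1.25 cycles per iteration gives 1 / 1.25 = 80% usage of port 4 in the -O2 version.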
Related:
- What's the purpose of the LEA instruction?
- Why is my loop much faster when it is contained in one cache line?
- How many ways-superscalar are modern Intel processors?
- https://en.wikipedia.org/wiki/Superscalar_processor
Reply from a uj5u.com user:
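tl;dr: Because LEA doesn't do full-fledged multiplication.
While @JeromeRichard's answer is correct, the underlying kernel of truth is hidden in its last sentence: with LEA, you can only multiply by a specific constant, namely a power of 2. So instead of needing a large dedicated circuit for multiplication, it only needs a small sub-circuit that shifts one of its operands by a fixed amount.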
