帶PCLMULQDQ*NOT*反映的快速CRC-有解無憂

我正在嘗試撰寫一個PCLMULQDQ 優化的 CRC-32實作。特定的 CRC-32 變體適用于我不擁有但正在嘗試以庫形式支持的變體。在crcany 模型形式中，它具有以下引數：

width=32 poly=0xaf init=0xffffffff refin=false refout=false xorout=0x00000000 check=0xa5fd3138 （省略了我認為是0x00000000但老實說不知道它是什么的殘留物）

演算法的基本非基于表/按位實作（由生成crcany）是：

uint32_t crc32byond_bit(uint32_t crc, void const *mem, size_t len) {
    unsigned char const *data = mem;
    if (data == NULL)
        return 0xffffffff;
    for (size_t i = 0; i < len; i  ) {
        crc ^= (uint32_t)data[i] << 24;
        for (unsigned k = 0; k < 8; k  ) {
            crc = crc & 0x80000000 ? (crc << 1) ^ 0xaf : crc << 1;
        }
    }
    return crc;
}

首先，我從根本上不理解描述演算法的論文。我不知道類似的東西是什么K1 = [x^(512 64) mod P(x)]意思。（什么是x？它是從哪里來的？我不知道。）我是一名專業的軟體工程師；不是學者。坦率地說，我根本不懂這個符號，維基百科的文章對我沒有多大幫助。也許是我在線性代數方面的弱點再次困擾著我。

我知道一些我希望借鑒的公共實作：

嗚嗚
蠢事
crc32fast

但：

他們使用位反射演算法，我不確定如何創建非反射變體。
他們似乎沒有遵循論文？例如，wuffs 和 crc32fast 特別指出他們不使用 K6，但論文使 K6 看起來很有必要。

I found an Intel implementation in soft-crc but it doesn't appear to use the same constants (K4-K6? μ?)

/**
 * PCLMULQDQ CRC computation context structure
 */
struct crc_pclmulqdq_ctx {
        /**
         * K1 = reminder X^128 / P(X) : 0-63
         * K2 = reminder X^192 / P(X) : 64-127
         */
        uint64_t k1;
        uint64_t k2;

        /**
         * K3 = reminder X^64 / P(X) : 0-63
         * q  = quotient X^64 / P(X) : 64-127
         */
        uint64_t k3;
        uint64_t q;

        /**
         * p   = polynomial / P(X) : 0-63
         * res = reserved : 64-127
         */
        uint64_t p;
        uint64_t res;
};

I believe the constants I need for poly 0xAF are:

Px  = 0x1_0000_00AF
k1  = 0x0_5B5A_E0C7
k2  = 0x0_1BD8_1099
k3  = 0x0_1157_936A
k4  = 0x0_1010_1111
k5  = 0x0_0029_5F23
k6  = 0x0_0000_4455
μ   = 0x1_0000_00AF

(source: print-crc32-x86-sse42-magic-numbers.go with const px = "100000000000000000000000010101111", or 0xaf | (1 << 32))

I'm hoping for help in either

Understanding the notation and formulas used in the article (and why implementations seem to diverge from the article?),
Converting an existing implementation for the non-reflected variant, or maybe
Some pseudo-code?

uj5u.com熱心網友回復：

我有 6 組 16、32、64 位 crc 的代碼，這里沒有反射和反射。該代碼是為 Visual Studio 設定的。注釋已添加到英特爾 github 站點中缺少的常量。

https://github.com/jeffareid/crc

32位非反射在這里：

https://github.com/jeffareid/crc/tree/master/crc32f

您需要更改 crc32fg.cpp 中的多項式，以生成常量。你想要的多項式實際上是：

0x00000001000000afull

我對 crc32fg.cpp 進行了此更改以在此答案的末尾生成 rk.. 常量。

x是什么？

CRC 使用具有 1 位系數的多項式。例如 0x0B 實際上是 x^3 x 1。

XMM 暫存器可以讀|寫 16 個位元組 | 一次 128 位，但 PCLMULQDQ 只能將兩個 64 位運算元無進位相乘以產生 128 位乘積。所以 128 位被分成兩個 64 位運算元，然后每個運算元乘以一個常數來“折疊”它向前。由于 XMM 暫存器可以在一定程度上并行操作，因此使用 8 個 XMM 暫存器來讀取 | 寫 128 個位元組 | 一次 1024 位。每個折疊步驟“前進”16 個位元組 | 128位資料轉發128位元組| 1024 位，乘以常數。低 64 位乘以 x^(1024) mod poly 以創建一個 128 位乘積，該乘積“高級”了 1024 位。高 64 位乘以 x^(1024 64) mod poly 得到 128 位乘積，即高級位元組 1024 64 位（需要 64 是因為乘積是基于 128 位的高 64 位）資料）。將兩個 128 位乘積異或在一起，然后再與資料 128 位元組進行異或運算 | 1024 位之后在緩沖區中。

請注意，英特爾檔案中的示例使用 4 個 XMM 暫存器將資料推進 64 個位元組 | 一次 512 位，但我見過的 github 示例和我在 github 存盤庫中使用的示例使用 8 個 XMM 暫存器并提前 128 位元組 | 一次 1024 位。對于支持AVX512的處理器數量相對較少| ZMM暫存器，有提前256位元組的例子| 一次 2048 位。我沒有配備 AVX512 的計算機，所以我沒有任何代碼。

由于 XMM 讀取|寫入是小端序，因此 PSHUFB 用于在每次讀取后反轉位元組。

該代碼主要基于使用 65 位多項式的 64 位 CRC，但對于 32 位 CRC，可以通過將低 32 位設定為零來處理。對于 32 位 CRC，大多數常量只有 32 位，并左移 32 位以簡化與 PCLMULQDQ 的使用，并進行調整以補償移位，因此不是 x^(a) mod poly，而是 (x^( a-32) 模聚)<<32。實際 33 位多項式的常數和它的“逆” x^64 / 多項式是 33 位，不會左移。

折疊程序不會產生 CRC，它只是轉換資料以推進它。最終代碼將到達帶有標簽 _128_done: 的點。此時有 128 位資料，并且由于代碼基于 64 位 CRC 邏輯，邏輯上附加了 64 位的零，因此實際上它是一個 192 位的值，其中虛構的低 64 位全為零。高 64 位向前折疊 128 位，從而在 XMM7 中生成一個 128 位的值，準備計算 64 位 CRC，但由于這是 32 位 CRC，XMM7 有 96 位資料左移 32 位。高 32 位向前折疊 64 位（在這種情況下，96 位值和 rk06 都左移 32 位，因此 rk06 在這種情況下折疊 64 位（x^64 mod poly）并左移 32 位. 結果是在 XMM7 中左移 32 位的 64 位值。

The quotient of a 64 bit number divided by a 33 bit polynomial only needs the upper 32 bits of the 64 bit number, so the left shifted 64 bit value has the upper 32 bits where it will be convenient to calculate a quotient. The divide is really a multiply by x^64 / polynomial, PCLMULQDQ will just specify to use the upper 64 bit part of XMM7 to use the upper 32 bits of the left shifted 64 bit number. The actual CRC calculation is based on this logic:

quotient = upper 32 bits of 64 bit value / polynomial
product  = quotient * polynomial
CRC      = 64 bit value XOR product

The division is done by multiplying by the inverse: x^64 / poly. Since the poly and it's inverse are 33 bits, they can't be shifted left 32 bits, so the code shifts the product left 4 bytes after each multiply. The CRC ends up in bits 32 to 63 of XMM7 and pextrd eax,xmm7,1 is used to get those 32 bits.

I modified crc32fg.cpp to use CRC polynomial 0x1000000af, and this is what I get. Note there's no apparent reasoning behind the ordering of the constants. I only added comments to explain what they represent.

rk01    dq      000295f2300000000h      ; x^(32* 3) mod P(x) << 32
rk02    dq      0fafa517900000000h      ; x^(32* 5) mod P(x) << 32
rk03    dq      05cd86bb500000000h      ; x^(32*31) mod P(x) << 32
rk04    dq      0af6f37a300000000h      ; x^(32*33) mod P(x) << 32
rk05    dq      000295f2300000000h      ; x^(32* 3) mod P(x) << 32
rk06    dq      00000445500000000h      ; x^(32* 2) mod P(x) << 32
rk07    dq      000000001000000afh      ; floor(2^64/P(x))
rk08    dq      000000001000000afh      ; P(x)
rk09    dq      09bd57b5d00000000h      ; x^(32*27) mod P(x) << 32
rk10    dq      0b7a4d76400000000h      ; x^(32*29) mod P(x) << 32
rk11    dq      01ae0004200000000h      ; x^(32*23) mod P(x) << 32
rk12    dq      0e7720be600000000h      ; x^(32*25) mod P(x) << 32
rk13    dq      09c7fc8fe00000000h      ; x^(32*19) mod P(x) << 32
rk14    dq      03885faf800000000h      ; x^(32*21) mod P(x) << 32
rk15    dq      0b477ad7100000000h      ; x^(32*15) mod P(x) << 32
rk16    dq      00ac2ae3d00000000h      ; x^(32*17) mod P(x) << 32
rk17    dq      05eae9dbe00000000h      ; x^(32*11) mod P(x) << 32
rk18    dq      0784a483800000000h      ; x^(32*13) mod P(x) << 32
rk19    dq      07d21bf2000000000h      ; x^(32* 7) mod P(x) << 32
rk20    dq      0faebd3d300000000h      ; x^(32* 9) mod P(x) << 32

轉載請註明出處，本文鏈接：https://www.uj5u.com/shujuku/437323.html

標籤：assembly sse crc crc32

上一篇：在x86匯編中運行總程式

下一篇：為什么mod2^n對有符號數使用CLTD指令