Optimizing image resize (nearest-neighbor method) with SIMD

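I know that the nearest-neighbor image resize method is the fastest one. Nevertheless, I am looking for ways to speed it up. The obvious step is to pre-calculate the indices: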

void CalcIndex(int sizeS, int sizeD, int colors, int* idx)
{
    float scale = (float)sizeS / sizeD;
    for (size_t i = 0; i < sizeD; ++i)
    {
        int index = (int)::floor((i + 0.5f) * scale);
        idx[i] = Min(Max(index, 0), sizeS - 1) * colors;
    }
}

template<int colors> inline void CopyPixel(const uint8_t* src, uint8_t* dst)
{
    for (int i = 0; i < colors; ++i)
        dst[i] = src[i];
}

template<int colors> void Resize(const uint8_t* src, int srcW, int srcH, 
    uint8_t* dst, int dstW, int dstH)
{
    int idxY[dstH], idxX[dstW];//pre-calculated indices (see CalcIndex).
    for (int dy = 0; dy < dstH; dy++)
    {
        const uint8_t * srcY = src + idxY[dy] * srcW * colors;
        for (int dx = 0, offset = 0; dx < dstW; dx++, offset += colors)
            CopyPixel<colors>(srcY + idxX[dx], dst + offset);
        dst += dstW * colors;
    }
}
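For reference, the fragments above can be assembled into a complete scalar baseline. This is only a sketch under a few assumptions: the image is tightly packed (no row padding), `std::min`/`std::max` stand in for the `Min`/`Max` helpers, and the names `calcIndex` and `resizeNearest` are hypothetical. Unlike the snippet above, `idxY` here stores the byte offset of each source row directly.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Byte offset of the nearest source element for every destination position.
// `stride` is the size in bytes of one source element (a pixel or a whole row).
std::vector<int> calcIndex(int sizeS, int sizeD, int stride)
{
    assert(sizeS > 0 && sizeD > 0);
    std::vector<int> idx(sizeD);
    const float scale = (float)sizeS / sizeD;
    for (int i = 0; i < sizeD; ++i)
    {
        const int index = (int)std::floor((i + 0.5f) * scale);
        idx[i] = std::min(std::max(index, 0), sizeS - 1) * stride;
    }
    return idx;
}

// Scalar nearest-neighbor resize of a tightly packed image, `colors` bytes per pixel.
template<int colors>
void resizeNearest(const uint8_t* src, int srcW, int srcH,
                   uint8_t* dst, int dstW, int dstH)
{
    const std::vector<int> idxY = calcIndex(srcH, dstH, srcW * colors); // row offsets
    const std::vector<int> idxX = calcIndex(srcW, dstW, colors);        // offsets inside a row
    for (int dy = 0; dy < dstH; ++dy)
    {
        const uint8_t* srcRow = src + idxY[dy];
        for (int dx = 0; dx < dstW; ++dx, dst += colors)
            for (int c = 0; c < colors; ++c)
                dst[c] = srcRow[idxX[dx] + c];
    }
}
```

A baseline like this is also handy as a correctness check for any SIMD variant: both should produce byte-identical output.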

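Is there a next optimization step, for example using SIMD or some other optimization technique?

P.S. I am particularly interested in optimizing RGB (colors = 3). With the current code, ARGB images (colors = 4) are processed about 50% faster than RGB, even though they are about 30% larger.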

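Reply from a uj5u.com user:

The speed problem in (SIMD-based) resize algorithms comes from the mismatch between indexing input and output elements. For example, with a resize factor of 6/5, one needs to consume 6 pixels and write 5. On the other hand, a 16-byte SIMD register maps to 16 grayscale elements, 4 RGBA elements, or 5.33 RGB elements.

In my experience, sufficiently good performance (maybe not optimal, but often beating OpenCV and other freely available implementations) comes from writing 2-4 SIMD registers' worth of data at a time, reading the required number of linear bytes from the input plus some extra, and using pshufb (x86 SSSE3) or vtbl (Neon) to gather from registers, never from memory. Of course, one needs a fast mechanism to either calculate the LUT indices inline or precalculate indices that are shared between output rows.

One should be prepared to have several inner kernels, depending on the input/output ratio of the (horizontal) resolution.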
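One way to decide between kernels is to measure how many contiguous source bytes each 16-byte output block needs. The following is a hedged sketch, not part of the answer: `Kernel` and `chooseKernel` are hypothetical names, `idxX` is assumed to hold per-pixel source byte offsets within a row, and the 16/32 thresholds simply mirror the two kernels shown further down (16 out of 16, 16 out of 32).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum class Kernel { Lut16, Lut32, Scalar };

// idxX[p] is the source byte offset of destination pixel p within a row.
Kernel chooseKernel(const std::vector<int>& idxX, int channels)
{
    assert(channels > 0);
    const int dstBytes = (int)idxX.size() * channels;
    int maxSpan = 0;
    for (int start = 0; start < dstBytes; start += 16)
    {
        const int end = std::min(start + 16, dstBytes);
        int lo = idxX[start / channels] + start % channels;
        int hi = lo;
        // Scan every output byte of the block: with upsampling, the source
        // byte index is not monotonic across repeated pixels.
        for (int b = start; b < end; ++b)
        {
            const int srcByte = idxX[b / channels] + b % channels;
            lo = std::min(lo, srcByte);
            hi = std::max(hi, srcByte);
        }
        maxSpan = std::max(maxSpan, hi - lo + 1);
    }
    if (maxSpan <= 16) return Kernel::Lut16;  // one register of input is enough
    if (maxSpan <= 32) return Kernel::Lut32;  // needs two registers of input
    return Kernel::Scalar;                    // ratio too large for these two kernels
}
```

Since the horizontal indices are the same for every row, this decision only has to be made once per image.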

RGBrgbRGBrgbRGBr|gbRGBrgb ....  <- input
                         ^ where to load next 32 bytes of input
RGBRGBrgbRGBrgbr|gbRGBrgbRGBRGBrg| <- 32 output bytes, from 

0000000000000000|0000001111111111| <- high bit of index
0120123456789ab9|abcdef0123423456| <- low 4 bits of index

Note that all channel counts can be handled with the LUT method.

// inner kernel for downsampling between 1x and almost 2x*
// - we need to read max 32 elements and write 16
void process_row_ds(uint8_t const *input, uint8_t const *indices,
                 int const *advances, uint8_t *output, int out_width) {
    do {
       auto a = load16_bytes(input);
       auto b = load16_bytes(input + 16);
       auto c = load16_bytes(indices);
       a = lut32(a,b,c);      // get 16 bytes out of 32
       store16_bytes(output, a);
       output += 16;
       input += *advances++;
    } while (out_width--);  // multiples of 16...
}

// inner kernel for upsampling between 1x and inf
void process_row_us(uint8_t const *input, uint8_t const *indices,
                 int const *advances, uint8_t *output, int out_width) {
    do {
       auto a = load16_bytes(input);
       auto c = load16_bytes(indices);
       a = lut16(a, c);      // get 16 bytes out of 16
       store16_bytes(output, a);
       output += 16;
       input += *advances++;
    } while (out_width--);
}
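The helpers `load16_bytes`, `lut16` and `lut32` are not defined in the answer. On SSSE3, `lut16` would presumably be a single `_mm_shuffle_epi8`. Below is one possible way to build the 32-byte lookup from two shuffles; it is a sketch under the assumption that every index is in the range 0..31, written against memory pointers rather than registers, with a scalar fallback so that it also compiles without SSSE3.

```cpp
#include <cstdint>
#if defined(__SSSE3__)
#include <tmmintrin.h>
#endif

// Picks 16 bytes out of the 32-byte window at `input`; each index must be in 0..31.
void lut32(const uint8_t* input, const uint8_t* indices, uint8_t* output)
{
#if defined(__SSSE3__)
    const __m128i lo = _mm_loadu_si128((const __m128i*)input);
    const __m128i hi = _mm_loadu_si128((const __m128i*)(input + 16));
    const __m128i c = _mm_loadu_si128((const __m128i*)indices);
    // pshufb writes zero when bit 7 of the index byte is set,
    // otherwise it selects by the low 4 bits.
    // 0..15 -> 0x70..0x7F (selects from lo), 16..31 -> 0x80..0x8F (zero)
    const __m128i selLo = _mm_adds_epu8(c, _mm_set1_epi8(0x70));
    // 16..31 -> 0..15 (selects from hi), 0..15 -> 0xF0..0xFF (zero)
    const __m128i selHi = _mm_sub_epi8(c, _mm_set1_epi8(16));
    const __m128i r = _mm_or_si128(_mm_shuffle_epi8(lo, selLo),
                                   _mm_shuffle_epi8(hi, selHi));
    _mm_storeu_si128((__m128i*)output, r);
#else
    // Scalar fallback with the same result for indices in 0..31.
    for (int i = 0; i < 16; ++i)
        output[i] = input[indices[i] & 31];
#endif
}
```

On AArch64 the same lookup maps to a single two-register table instruction, which is one reason the LUT approach ports well between the two architectures.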

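In addition to (at least) bilinear interpolation, I would also encourage using some elementary filtering for downsampling, such as Gaussian binomial kernels (1 1, 1 2 1, 1 3 3 1, 1 4 6 4 1, ...), along with hierarchical downsampling. It is of course possible that the application tolerates aliasing artifacts. As far as I know, the cost is often not that large, especially given that the algorithm would otherwise be memory bound.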
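As an illustration of the filtering idea, here is a minimal scalar sketch of one hierarchical step: halving a grayscale row with the 1 2 1 binomial kernel. The function name, the edge clamping, and the rounding are assumptions made for this example; a real implementation would likely be vectorized and applied in both directions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Halves a grayscale row, weighting each kept sample with its neighbors as 1-2-1.
std::vector<uint8_t> downsample2x121(const std::vector<uint8_t>& row)
{
    const std::size_t n = row.size() / 2;
    std::vector<uint8_t> out(n);
    for (std::size_t i = 0; i < n; ++i)
    {
        const std::size_t c = 2 * i;                              // center sample
        const std::size_t l = (c == 0) ? 0 : c - 1;               // clamp at the left edge
        const std::size_t r = std::min(c + 1, row.size() - 1);    // clamp at the right edge
        // Weights sum to 4; adding 2 rounds to nearest.
        out[i] = (uint8_t)((row[l] + 2 * row[c] + row[r] + 2) / 4);
    }
    return out;
}
```

Applying this step repeatedly gives the hierarchical (pyramid-style) reduction, after which a nearest or bilinear pass covers the remaining non-power-of-two factor.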

Reply from a uj5u.com user:

I think that using _mm256_i32gather_epi32 (AVX2) can give some performance gain for resizing in the case of 32-bit pixels:

inline void Gather32bit(const uint8_t * src, const int* idx, uint8_t* dst)
{
    __m256i _idx = _mm256_loadu_si256((__m256i*)idx);
    __m256i val = _mm256_i32gather_epi32((int*)src, _idx, 1);
    _mm256_storeu_si256((__m256i*)dst, val);
}

template<> void Resize<4>(const uint8_t* src, int srcW, int srcH, 
    uint8_t* dst, int dstW, int dstH)
{
    int idxY[dstH], idxX[dstW]; // pre-calculated indices.
    int dstW8 = dstW & ~(8 - 1); // largest multiple of 8 not exceeding dstW
    for (int dy = 0; dy < dstH; dy++)
    {
        const uint8_t * srcY = src + idxY[dy] * srcW * 4;
        int dx = 0, offset = 0;
        for (; dx < dstW8; dx += 8, offset += 8*4)
            Gather32bit(srcY, idxX + dx, dst + offset);
        for (; dx < dstW; dx++, offset += 4)
            CopyPixel<4>(srcY + idxX[dx], dst + offset);
        dst += dstW * 4;
    }
}
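To make the data flow explicit, here is a scalar model of what one `Gather32bit` call computes, together with the width split used by the loop. It is only an illustration: `gather32bitScalar` and `alignDown8` are hypothetical names, and the indices are assumed to be byte offsets, which matches the gather scale of 1 used above.

```cpp
#include <cstdint>
#include <cstring>

// Scalar model of Gather32bit: copies eight 4-byte pixels located at
// byte offsets idx[0..7] from src into 32 consecutive bytes of dst.
void gather32bitScalar(const uint8_t* src, const int* idx, uint8_t* dst)
{
    for (int k = 0; k < 8; ++k)
        std::memcpy(dst + 4 * k, src + idx[k], 4);
}

// Number of leading pixels that can be processed in groups of 8;
// the remaining (width - alignDown8(width)) pixels go through the scalar tail.
inline int alignDown8(int width)
{
    return width & ~7;
}
```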

P.S. With some modifications, this method can be applied to RGB24:

const __m256i K8_SHUFFLE = _mm256_setr_epi8(
    0x0, 0x1, 0x2, 0x4, 0x5, 0x6, 0x8, 0x9, 0xA, 0xC, 0xD, 0xE, -1, -1, -1, -1,
    0x0, 0x1, 0x2, 0x4, 0x5, 0x6, 0x8, 0x9, 0xA, 0xC, 0xD, 0xE, -1, -1, -1, -1);
const __m256i K32_PERMUTE = _mm256_setr_epi32(0x0, 0x1, 0x2, 0x4, 0x5, 0x6, -1, -1);


inline void Gather24bit(const uint8_t * src, const int* idx, uint8_t* dst)
{
    __m256i _idx = _mm256_loadu_si256((__m256i*)idx);
    __m256i bgrx = _mm256_i32gather_epi32((int*)src, _idx, 1);
    __m256i bgr = _mm256_permutevar8x32_epi32(
        _mm256_shuffle_epi8(bgrx, K8_SHUFFLE), K32_PERMUTE);
    _mm256_storeu_si256((__m256i*)dst, bgr);
}

template<> void Resize<3>(const uint8_t* src, int srcW, int srcH, 
    uint8_t* dst, int dstW, int dstH)
{
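    int idxY[dstH], idxX[dstW]; // pre-calculated indices.
    int dstW8 = dstW & ~(8 - 1); // largest multiple of 8 not exceeding dstW
    // Note: Gather24bit stores 32 bytes of which only 24 are valid,
    // so dst needs at least 8 bytes of slack after the last vectorized group.
    for (int dy = 0; dy < dstH; dy++)
    {
        const uint8_t * srcY = src + idxY[dy] * srcW * 3;
        int dx = 0, offset = 0;
        for (; dx < dstW8; dx += 8, offset += 8*3)
            Gather24bit(srcY, idxX + dx, dst + offset);
        for (; dx < dstW; dx++, offset += 3)
            CopyPixel<3>(srcY + idxX[dx], dst + offset);
        dst += dstW * 3;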
    }
}
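The shuffle and permute constants can be checked with a scalar model of the two instructions. The sketch below assumes the documented behavior of `_mm256_shuffle_epi8` (independent 128-bit lanes, zero when the index byte has bit 7 set) and `_mm256_permutevar8x32_epi32` (only the low 3 bits of each index are used); `packBgrxToBgr` is a hypothetical name.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Scalar model of the K8_SHUFFLE + K32_PERMUTE step in Gather24bit:
// turns 8 gathered 4-byte values into 24 packed bytes followed by 8 padding bytes.
std::array<uint8_t, 32> packBgrxToBgr(const std::array<uint8_t, 32>& bgrx)
{
    static const int8_t shuffle[16] =
        { 0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, -1, -1, -1, -1 };
    static const int permute[8] = { 0, 1, 2, 4, 5, 6, -1, -1 };

    // Byte shuffle, applied to each 16-byte lane separately.
    std::array<uint8_t, 32> shuffled{};
    for (int lane = 0; lane < 2; ++lane)
        for (int i = 0; i < 16; ++i)
        {
            const int8_t s = shuffle[i];
            shuffled[lane * 16 + i] = (s < 0) ? 0 : bgrx[lane * 16 + (s & 15)];
        }

    // 32-bit permute across the whole register; indices are taken modulo 8.
    std::array<uint8_t, 32> out{};
    for (int d = 0; d < 8; ++d)
    {
        const int s = permute[d] & 7;
        std::memcpy(&out[d * 4], &shuffled[s * 4], 4);
    }
    return out;
}
```

The model shows that each 128-bit lane first packs its four pixels into 12 bytes, and the permute then closes the 4-byte gap between the two lanes.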

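Note that if srcW < dstW, the method of @Aki-Suihkonen is faster.

Reply from a uj5u.com user:

Using SIMD is possible, and I'm pretty sure it will help; unfortunately, it is relatively hard. Below is a simplified example that only supports image enlargement, not shrinking.

Still, I hope it is useful as a starting point.

Both MSVC and GCC compile the hot loop in the LineResize::apply method into 11 instructions. I think 11 instructions per 16 bytes should be faster than your version.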

#include <stdint.h>
#include <emmintrin.h>
#include <tmmintrin.h>
#include <vector>
#include <array>
#include <assert.h>
#include <stdio.h>

// Implements nearest neighbor resize method for RGB24 or BGR24 bitmaps
class LineResize
{
    // Each mask produces up to 16 output bytes.
    // For enlargement exactly 16, for shrinking up to 16, possibly even 0.
    std::vector<__m128i> masks;

    // Length is the same as masks.
    // For enlargement, the values contain source pointer offsets in bytes.
    // For shrinking, the values contain destination pointer offsets in bytes.
    std::vector<uint8_t> offsets;

    // True if this class will enlarge images, false if it will shrink the width of the images.
    bool enlargement;

    void resizeFields( size_t vectors )
    {
        masks.resize( vectors, _mm_set1_epi32( -1 ) );
        offsets.resize( vectors, 0 );
    }

public:

    // Compile the shuffle table. The arguments are line widths in pixels.
    LineResize( size_t source, size_t dest );

    // Apply the algorithm to a single line of the image.
    void apply( uint8_t* rdi, const uint8_t* rsi ) const;
};

LineResize::LineResize( size_t source, size_t dest )
{
    const size_t sourceBytes = source * 3;
    const size_t destBytes = dest * 3;
    assert( sourceBytes >= 16 );
    assert( destBytes >= 16 );

    // Possible to do much faster without any integer divides.
    // Optimizing this sample for simplicity.
    if( sourceBytes < destBytes )
    {
        // Enlarging the image, each SIMD vector consumes <16 input bytes, produces exactly 16 output bytes
        enlargement = true;
        resizeFields( ( destBytes + 15 ) / 16 );

        int8_t* pMasks = (int8_t*)masks.data();
        uint8_t* const pOffsets = offsets.data();

        int sourceOffset = 0;
        const size_t countVectors = masks.size();
        for( size_t i = 0; i < countVectors; i++ )
        {
            const int destSlice = (int)i * 16;
            std::array<int, 16> lanes;
            int lane;
            for( lane = 0; lane < 16; lane++ )
            {
                const int destByte = destSlice + lane;  // output byte index
                const int destPixel = destByte / 3; // output pixel index
                const int channel = destByte % 3;   // output byte within pixel
                const int sourcePixel = destPixel * (int)source / (int)dest; // input pixel
                const int sourceByte = sourcePixel * 3 + channel;   // input byte

                if( destByte < (int)destBytes )
                    lanes[ lane ] = sourceByte;
                else
                {
                    // Destination offset out of range, i.e. the last SIMD vector
                    break;
                }
            }

            // Produce the offset
            if( i == 0 )
                assert( lanes[ 0 ] == 0 );
            else
            {
                const int off = lanes[ 0 ] - sourceOffset;
                assert( off >= 0 && off <= 16 );
                pOffsets[ i - 1 ] = (uint8_t)off;
                sourceOffset = lanes[ 0 ];
            }

            // Produce the masks
            for( int j = 0; j < lane; j++ )
                pMasks[ j ] = (int8_t)( lanes[ j ] - sourceOffset );
            // The masks are initialized with _mm_set1_epi32( -1 ) = all bits set,
            // no need to handle remainder for the last vector.
            pMasks += 16;
        }
    }
    else
    {
        // Shrinking the image, each SIMD vector consumes 16 input bytes, produces <16 output bytes
        enlargement = false;
        resizeFields( ( sourceBytes + 15 ) / 16 );

        // Not implemented, but the same idea works fine for this too.
        // The only difference, instead of using offsets bytes for source offsets, use it for destination offsets.
        assert( false );
    }
}

void LineResize::apply( uint8_t * rdi, const uint8_t * rsi ) const
{
    const __m128i* pm = masks.data();
    const __m128i* const pmEnd = pm + masks.size();
    const uint8_t* po = offsets.data();
    __m128i mask, source;

    if( enlargement )
    {
        // One iteration of the loop produces 16 output bytes
        // In MSVC results in 11 instructions for 16 output bytes.
        while( pm < pmEnd )
        {
            mask = _mm_load_si128( pm );
            pm++;

            source = _mm_loadu_si128( ( const __m128i * )( rsi ) );
            rsi += *po;
            po++;

            _mm_storeu_si128( ( __m128i * )rdi, _mm_shuffle_epi8( source, mask ) );
            rdi += 16;
        }
    }
    else
    {
        // One iteration of the loop consumes 16 input bytes
        while( pm < pmEnd )
        {
            mask = _mm_load_si128( pm );
            pm++;

            source = _mm_loadu_si128( ( const __m128i * )( rsi ) );
            rsi += 16;

            _mm_storeu_si128( ( __m128i * )rdi, _mm_shuffle_epi8( source, mask ) );
            rdi += *po;
            po++;
        }
    }
}

// Utility method to print RGB pixel values from the vector
static void printPixels( const std::vector<uint8_t>&vec )
{
    assert( !vec.empty() );
    assert( 0 == ( vec.size() % 3 ) );

    const uint8_t* rsi = vec.data();
    const uint8_t* const rsiEnd = rsi + vec.size();
    while( rsi < rsiEnd )
    {
        const uint32_t r = rsi[ 0 ];
        const uint32_t g = rsi[ 1 ];
        const uint32_t b = rsi[ 2 ];
        rsi += 3;
        const uint32_t res = ( r << 16 ) | ( g << 8 ) | b;
        printf( "%06X ", res );
    }
    printf( "\n" );
}

// A trivial test to resize 24 pixels -> 32 pixels
int main()
{
    constexpr int sourceLength = 24;
    constexpr int destLength = 32;

    // Initialize sample input with 24 RGB pixels
    std::vector<uint8_t> input( sourceLength * 3 );
    for( size_t i = 0; i < input.size(); i++ )
        input[ i ] = (uint8_t)i;

    printf( "Input: " );
    printPixels( input );

    // That special handling of the last pixels of last line is missing from this example.
    static_assert( 0 == destLength % 16 );
    LineResize resizer( sourceLength, destLength );

    std::vector<uint8_t> result( destLength * 3 );
    resizer.apply( result.data(), input.data() );

    printf( "Output: " );
    printPixels( result );
    return 0;
}

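The code ignores alignment issues. For production, you would want a separate method for the last line of the image that does not run all the way to the end with vector code, and instead handles the last few pixels with scalar code.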
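A scalar routine for those last pixels could look like the sketch below. It uses the same pixel mapping as the shuffle-table constructor (`destPixel * source / dest`) so that its output matches the vector path; the function name and the `firstPixel` parameter are assumptions of this example.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar nearest-neighbor resize of one RGB24 line, starting at destination
// pixel `firstPixel`. `source` and `dest` are line widths in pixels.
void resizeLineScalar(uint8_t* dst, const uint8_t* src,
                      std::size_t source, std::size_t dest,
                      std::size_t firstPixel = 0)
{
    for (std::size_t d = firstPixel; d < dest; ++d)
    {
        const std::size_t s = d * source / dest;  // same mapping as LineResize
        for (std::size_t c = 0; c < 3; ++c)
            dst[d * 3 + c] = src[s * 3 + c];
    }
}
```

Called with `firstPixel = 0` it also serves as a reference to compare the SIMD output against.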

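The code contains more memory references in the hot loop. However, the two vectors in that class are not too long: for 4k images their size is about 12 KB, which should fit in the L1D cache and stay there.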
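The 12 KB figure can be reproduced with a little arithmetic, assuming "4k" means a destination line of 3840 RGB24 pixels (the answer does not state the exact width) and one 16-byte mask plus one 1-byte offset per output vector, as in the class above. `tableBytes` is a hypothetical helper.

```cpp
#include <cstddef>

// Bytes used by the masks and offsets tables for an enlarged line of
// `destPixels` RGB24 pixels.
constexpr std::size_t tableBytes(std::size_t destPixels)
{
    return ((destPixels * 3 + 15) / 16) * (16 + 1);
}

// 3840 * 3 = 11520 output bytes -> 720 vectors -> 720 * 17 = 12240 bytes.
static_assert(tableBytes(3840) == 12240, "about 12 KB for a 3840-pixel line");
```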

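If you have AVX2, things can probably be improved further. For enlarging images, use _mm256_inserti128_si256: the vinserti128 instruction can load 16 bytes from memory into the high half of the vector. Similarly, for shrinking images, use _mm256_extracti128_si256: that instruction can optionally use a memory destination.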
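A rough sketch of the enlargement idea with AVX2 follows: two independent 16-byte source loads are combined into one 256-bit register, and a single shuffle produces 32 output bytes. This is an illustration rather than the answer's code; `shuffle2x16` is a hypothetical name, and a scalar fallback with the same semantics is included so the snippet compiles without AVX2.

```cpp
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Produces 32 output bytes: the first 16 are shuffled from the 16 bytes at srcLo,
// the last 16 from the 16 bytes at srcHi. `masks` holds 32 pshufb-style indices.
void shuffle2x16(const uint8_t* srcLo, const uint8_t* srcHi,
                 const uint8_t* masks, uint8_t* dst)
{
#if defined(__AVX2__)
    __m256i v = _mm256_castsi128_si256(_mm_loadu_si128((const __m128i*)srcLo));
    // The second load can be folded into vinserti128 with a memory operand.
    v = _mm256_inserti128_si256(v, _mm_loadu_si128((const __m128i*)srcHi), 1);
    const __m256i m = _mm256_loadu_si256((const __m256i*)masks);
    // vpshufb shuffles each 128-bit half independently.
    _mm256_storeu_si256((__m256i*)dst, _mm256_shuffle_epi8(v, m));
#else
    // Scalar fallback: bit 7 set means zero, otherwise select by the low 4 bits.
    const uint8_t* halves[2] = { srcLo, srcHi };
    for (int h = 0; h < 2; ++h)
        for (int i = 0; i < 16; ++i)
        {
            const uint8_t m = masks[h * 16 + i];
            dst[h * 16 + i] = (m & 0x80) ? 0 : halves[h][m & 15];
        }
#endif
}
```

In a loop built on this, `srcLo` and `srcHi` would advance by two consecutive entries of the offsets table per iteration.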
