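I know that the "nearest" image resize method is the fastest one. Nevertheless, I am looking for ways to speed it up. The obvious step is to pre-calculate the indices: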
void CalcIndex(int sizeS, int sizeD, int colors, int* idx)
{
    float scale = (float)sizeS / sizeD;
    for (int i = 0; i < sizeD; ++i)
    {
        // Centre of destination sample i, mapped into source coordinates.
        int index = (int)::floor((i + 0.5f) * scale);
        // Clamp to the valid range and convert to an offset in bytes.
        idx[i] = Min(Max(index, 0), sizeS - 1) * colors;
    }
}
template<int colors> inline void CopyPixel(const uint8_t* src, uint8_t* dst)
{
    for (int i = 0; i < colors; ++i)
        dst[i] = src[i];
}
template<int colors> void Resize(const uint8_t* src, int srcW, int srcH,
    uint8_t* dst, int dstW, int dstH)
{
    int idxY[dstH], idxX[dstW]; // pre-calculated indices (see CalcIndex).
    for (int dy = 0; dy < dstH; dy++)
    {
        const uint8_t* srcY = src + idxY[dy] * srcW * colors;
        for (int dx = 0, offset = 0; dx < dstW; dx++, offset += colors)
            CopyPixel<colors>(srcY + idxX[dx], dst + offset);
        dst += dstW * colors;
    }
}
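Is there a next optimization step? For example, using SIMD or some other optimization technique.
P.S. I am especially interested in optimizing the RGB case (Colors = 3). With the current code, ARGB images (Colors = 4) are processed about 50% faster than RGB images, even though they are about 30% larger.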
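For reference, the scalar baseline above can be written as a self-contained sketch. The names `CalcIndexTable` and `ResizeNearest` are hypothetical, and building the tables with `std::vector` inside the function is an assumption; the question only says the indices are pre-calculated.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Byte offset of the nearest source sample for every destination index.
// `stride` is the byte distance between consecutive source samples:
// the channel count for columns, the row pitch for rows.
std::vector<int> CalcIndexTable(int sizeS, int sizeD, int stride)
{
    assert(sizeS > 0 && sizeD > 0);
    std::vector<int> idx(sizeD);
    const float scale = static_cast<float>(sizeS) / sizeD;
    for (int i = 0; i < sizeD; ++i)
    {
        // The value is non-negative, so truncation equals floor here.
        const int index = static_cast<int>((i + 0.5f) * scale);
        idx[i] = std::min(std::max(index, 0), sizeS - 1) * stride;
    }
    return idx;
}

// Nearest-neighbor resize of a tightly packed interleaved image.
template <int colors>
void ResizeNearest(const uint8_t* src, int srcW, int srcH,
                   uint8_t* dst, int dstW, int dstH)
{
    const std::vector<int> idxX = CalcIndexTable(srcW, dstW, colors);
    const std::vector<int> idxY = CalcIndexTable(srcH, dstH, srcW * colors);
    for (int dy = 0; dy < dstH; ++dy)
    {
        const uint8_t* srcRow = src + idxY[dy];
        for (int dx = 0; dx < dstW; ++dx, dst += colors)
            for (int c = 0; c < colors; ++c)
                dst[c] = srcRow[idxX[dx] + c];
    }
}
```

Both tables depend only on the image sizes, so in a real program they would be built once and reused for every frame of the same dimensions.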
uj5u.com user reply:
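The speed problem in (SIMD-based) resize algorithms comes from the mismatch between the indexing of input and output elements. For example, when the resize factor is 6/5, one needs to consume 6 pixels and write 5. On the other hand, a SIMD register width of 16 bytes maps to 16 grayscale elements, 4 RGBA elements, or 5.33 RGB elements.
My experience is that sufficiently good performance (maybe not optimal, but often beating OpenCV and other freely available implementations) comes from writing 2-4 SIMD registers' worth of data at a time, reading the required number of linear bytes from the input plus some extra, and using pshufb (x86 SSSE3) or vtbl (Neon) to gather from registers, never from memory. Of course, one needs a fast mechanism either to calculate the LUT indices inline or to pre-calculate the indices, which are shared between output rows.
One should be prepared to have several inner kernels, depending on the input/output ratio of the (horizontal) resolution.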
RGBrgbRGBrgbRGBr|gbRGBrgb .... <- input
^ where to load next 32 bytes of input
RGBRGBrgbRGBrgbr|gbRGBrgbRGBRGBrg| <- 32 output bytes, from
0000000000000000|0000001111111111| <- high bit of index
0120123456789ab9|abcdef0123423456| <- low 4 bits of index
Note that the LUT method can handle all channel counts.
// inner kernel for downsampling between 1x and almost 2x*
// - we need to read max 32 elements and write 16
// (load16_bytes, lut32/lut16 and store16_bytes are placeholder helpers)
void process_row_ds(uint8_t const *input, uint8_t const *indices,
                    int const *advances, uint8_t *output, int out_width) {
    do {
        auto a = load16_bytes(input);
        auto b = load16_bytes(input + 16);
        // NOTE: as written, the same 16 index bytes are reused for every block;
        // a full implementation would also step `indices` to the next block's table.
        auto c = load16_bytes(indices);
        a = lut32(a, b, c);    // get 16 bytes out of 32
        store16_bytes(output, a);
        output += 16;
        input += *advances++;  // input bytes consumed by this block
    } while (out_width--);     // multiples of 16...
}
// inner kernel for upsampling between 1x and inf
void process_row_us(uint8_t const *input, uint8_t const *indices,
                    int const *advances, uint8_t *output, int out_width) {
    do {
        auto a = load16_bytes(input);
        auto c = load16_bytes(indices);
        a = lut16(a, c);       // get 16 bytes out of 16
        store16_bytes(output, a);
        output += 16;
        input += *advances++;
    } while (out_width--);
}
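The helpers used by the two kernels are not defined in the answer. A minimal sketch of how they could look on x86 with SSSE3 follows; the helper names are taken from the kernels, but their bodies and the `SSSE3_FN` macro (which lets GCC/Clang compile the functions without `-mssse3`) are assumptions, not the answerer's code. The 5-bit index corresponds to the "high bit" and "low 4 bits" rows of the diagram.

```cpp
#include <cassert>
#include <cstdint>
#include <tmmintrin.h> // SSSE3: _mm_shuffle_epi8 (pshufb)

// Assumption: GCC/Clang function-level target; empty on other compilers.
#if defined(__GNUC__)
#define SSSE3_FN __attribute__((target("ssse3")))
#else
#define SSSE3_FN
#endif

static inline __m128i load16_bytes(const uint8_t* p)
{
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
}

static inline void store16_bytes(uint8_t* p, __m128i v)
{
    _mm_storeu_si128(reinterpret_cast<__m128i*>(p), v);
}

// 16 bytes out of 16: index bytes 0..15 select bytes of `a`.
SSSE3_FN static inline __m128i lut16(__m128i a, __m128i idx)
{
    return _mm_shuffle_epi8(a, idx);
}

// 16 bytes out of 32: index bytes 0..15 select from `a`, 16..31 from `b`.
SSSE3_FN static inline __m128i lut32(__m128i a, __m128i b, __m128i idx)
{
    // pshufb writes zero when bit 7 of an index byte is set and otherwise
    // uses only its low 4 bits. Adding 0x70 sets bit 7 exactly for 16..31;
    // subtracting 16 sets bit 7 exactly for 0..15. OR-ing merges the halves.
    const __m128i fromA = _mm_shuffle_epi8(a, _mm_add_epi8(idx, _mm_set1_epi8(0x70)));
    const __m128i fromB = _mm_shuffle_epi8(b, _mm_sub_epi8(idx, _mm_set1_epi8(16)));
    return _mm_or_si128(fromA, fromB);
}
```

On Neon the same role would be played by the vtbl family of table lookups, as the answer mentions; that variant is not shown here.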
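In addition to (at least) bilinear interpolation, I would also encourage using some elementary filtering for downsampling, such as Gaussian binomial kernels (1 1, 1 2 1, 1 3 3 1, 1 4 6 4 1, ...) together with hierarchical downsampling. It is of course possible that the application tolerates aliasing artifacts, but AFAIK the cost of filtering is often not that large, especially given that the algorithm will otherwise be memory bound.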
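As an illustration of the filtering suggestion, here is a minimal scalar sketch that halves one interleaved row with the (1 2 1)/4 binomial kernel. The name `HalveRow121`, the clamped edge handling, and the rounding are assumptions chosen for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Halve the width of one interleaved row using the [1 2 1] / 4 kernel
// centred on every even input pixel. Edge taps are clamped to the row.
std::vector<uint8_t> HalveRow121(const std::vector<uint8_t>& row, int width, int channels)
{
    assert(width >= 2 && channels >= 1);
    assert(row.size() == static_cast<size_t>(width) * channels);
    const int outWidth = width / 2;
    std::vector<uint8_t> out(static_cast<size_t>(outWidth) * channels);
    for (int x = 0; x < outWidth; ++x)
    {
        const int mid = 2 * x;                          // centre tap
        const int left = std::max(mid - 1, 0);          // clamped left tap
        const int right = std::min(mid + 1, width - 1); // clamped right tap
        for (int c = 0; c < channels; ++c)
        {
            const int sum = row[left * channels + c]
                          + 2 * row[mid * channels + c]
                          + row[right * channels + c];
            out[x * channels + c] = static_cast<uint8_t>((sum + 2) / 4); // rounded
        }
    }
    return out;
}
```

Hierarchical downsampling then amounts to applying such a halving step repeatedly (on rows and columns) until the size is within a factor of two of the target, and finishing with the final resampling step.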
uj5u.com user reply:
I think that using _mm256_i32gather_epi32 (AVX2) can give some performance gain for resizing in the case of 32-bit pixels:
#include <immintrin.h> // AVX2

// Gathers 8 four-byte pixels; idx holds byte offsets, hence scale = 1.
inline void Gather32bit(const uint8_t* src, const int* idx, uint8_t* dst)
{
    __m256i _idx = _mm256_loadu_si256((const __m256i*)idx);
    __m256i val = _mm256_i32gather_epi32((const int*)src, _idx, 1);
    _mm256_storeu_si256((__m256i*)dst, val);
}
template<> void Resize<4>(const uint8_t* src, int srcW, int srcH,
    uint8_t* dst, int dstW, int dstH)
{
    int idxY[dstH], idxX[dstW]; // pre-calculated indices.
    int dstW8 = dstW & ~(8 - 1); // dstW rounded down to a multiple of 8
    for (int dy = 0; dy < dstH; dy++)
    {
        const uint8_t* srcY = src + idxY[dy] * srcW * 4;
        int dx = 0, offset = 0;
        for (; dx < dstW8; dx += 8, offset += 8 * 4)
            Gather32bit(srcY, idxX + dx, dst + offset);
        for (; dx < dstW; dx++, offset += 4)
            CopyPixel<4>(srcY + idxX[dx], dst + offset);
        dst += dstW * 4;
    }
}
P.S. After some modifications this method can be applied to RGB24:
// Drops the 4th byte of every 32-bit element within each 128-bit half.
const __m256i K8_SHUFFLE = _mm256_setr_epi8(
    0x0, 0x1, 0x2, 0x4, 0x5, 0x6, 0x8, 0x9, 0xA, 0xC, 0xD, 0xE, -1, -1, -1, -1,
    0x0, 0x1, 0x2, 0x4, 0x5, 0x6, 0x8, 0x9, 0xA, 0xC, 0xD, 0xE, -1, -1, -1, -1);
// Moves the 12 valid bytes of the upper half next to those of the lower half.
const __m256i K32_PERMUTE = _mm256_setr_epi32(0x0, 0x1, 0x2, 0x4, 0x5, 0x6, -1, -1);
// Caveat: each gathered element reads 4 bytes for a 3-byte pixel, and the store
// writes 32 bytes of which only 24 are pixel data, so the last pixels of the
// buffers need padding or scalar handling.
inline void Gather24bit(const uint8_t* src, const int* idx, uint8_t* dst)
{
    __m256i _idx = _mm256_loadu_si256((const __m256i*)idx);
    __m256i bgrx = _mm256_i32gather_epi32((const int*)src, _idx, 1);
    __m256i bgr = _mm256_permutevar8x32_epi32(
        _mm256_shuffle_epi8(bgrx, K8_SHUFFLE), K32_PERMUTE);
    _mm256_storeu_si256((__m256i*)dst, bgr);
}
template<> void Resize<3>(const uint8_t* src, int srcW, int srcH,
    uint8_t* dst, int dstW, int dstH)
{
    int idxY[dstH], idxX[dstW]; // pre-calculated indices.
    int dstW8 = dstW & ~(8 - 1); // dstW rounded down to a multiple of 8
    for (int dy = 0; dy < dstH; dy++)
    {
        const uint8_t* srcY = src + idxY[dy] * srcW * 3;
        int dx = 0, offset = 0;
        for (; dx < dstW8; dx += 8, offset += 8 * 3)
            Gather24bit(srcY, idxX + dx, dst + offset);
        for (; dx < dstW; dx++, offset += 3)
            CopyPixel<3>(srcY + idxX[dx], dst + offset);
        dst += dstW * 3;
    }
}
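To check what the K8_SHUFFLE and K32_PERMUTE constants do without needing an AVX2 machine, the two instructions can be modelled in scalar code. This is only a sketch: the `...Model` function names are hypothetical, and the models cover just the behaviour used here (per-half byte shuffle with zeroing on a negative control byte, and a 32-bit permute that uses the low 3 bits of each index).

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Bytes32 = std::array<uint8_t, 32>;

// Scalar model of _mm256_shuffle_epi8: each 16-byte half is shuffled on its
// own; a control byte with the high bit set produces zero.
Bytes32 ShuffleEpi8Model(const Bytes32& v, const std::array<int8_t, 32>& ctrl)
{
    Bytes32 r{};
    for (int i = 0; i < 32; ++i)
    {
        const int half = i & 16; // 0 for the low half, 16 for the high half
        r[i] = ctrl[i] < 0 ? uint8_t(0) : v[half + (ctrl[i] & 15)];
    }
    return r;
}

// Scalar model of _mm256_permutevar8x32_epi32: only the low 3 index bits count.
Bytes32 PermuteVar8x32Model(const Bytes32& v, const std::array<int, 8>& idx)
{
    Bytes32 r{};
    for (int d = 0; d < 8; ++d)
        for (int b = 0; b < 4; ++b)
            r[d * 4 + b] = v[(idx[d] & 7) * 4 + b];
    return r;
}

// Packs 8 gathered 4-byte elements into 24 contiguous bytes plus 8 zero bytes.
Bytes32 PackBgrxToBgrModel(const Bytes32& bgrx)
{
    static const std::array<int8_t, 32> k8Shuffle = {{
        0x0, 0x1, 0x2, 0x4, 0x5, 0x6, 0x8, 0x9, 0xA, 0xC, 0xD, 0xE, -1, -1, -1, -1,
        0x0, 0x1, 0x2, 0x4, 0x5, 0x6, 0x8, 0x9, 0xA, 0xC, 0xD, 0xE, -1, -1, -1, -1}};
    static const std::array<int, 8> k32Permute = {{0, 1, 2, 4, 5, 6, -1, -1}};
    const Bytes32 packed = PermuteVar8x32Model(ShuffleEpi8Model(bgrx, k8Shuffle), k32Permute);
    for (int i = 24; i < 32; ++i)
        assert(packed[i] == 0); // the tail of the 32-byte store carries no pixel data
    return packed;
}
```

The model makes the store-width caveat visible: the last 8 bytes of every 32-byte store are zeros that the next iteration (or the next row) has to overwrite.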
Note that if srcW < dstW, then the method of @Aki-Suihkonen is faster.
uj5u.com user reply:
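It is possible to use SIMD, and I'm pretty sure it will help; unfortunately, it is relatively hard. Below is a simplified example which only supports enlarging images, not shrinking them.
Still, I hope it is useful as a starting point.
Both MSVC and GCC compile the hot loop in the LineResize::apply method into 11 instructions. I think 11 instructions per 16 bytes should be faster than your version.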
#include <stdint.h>
#include <emmintrin.h>
#include <tmmintrin.h>
#include <vector>
#include <array>
#include <assert.h>
#include <stdio.h>
// Implements nearest neighbor resize method for RGB24 or BGR24 bitmaps
class LineResize
{
    // Each mask produces up to 16 output bytes.
    // For enlargement exactly 16, for shrinking up to 16, possibly even 0.
    std::vector<__m128i> masks;
    // Length is the same as masks.
    // For enlargement, the values contain source pointer offsets in bytes.
    // For shrinking, the values contain destination pointer offsets in bytes.
    std::vector<uint8_t> offsets;
    // True if this class will enlarge images, false if it will shrink the width of the images.
    bool enlargement;
    void resizeFields( size_t vectors )
    {
        masks.resize( vectors, _mm_set1_epi32( -1 ) );
        offsets.resize( vectors, 0 );
    }
public:
    // Compile the shuffle table. The arguments are line widths in pixels.
    LineResize( size_t source, size_t dest );
    // Apply the algorithm to a single line of the image.
    void apply( uint8_t* rdi, const uint8_t* rsi ) const;
};
LineResize::LineResize( size_t source, size_t dest )
{
    const size_t sourceBytes = source * 3;
    const size_t destBytes = dest * 3;
    assert( sourceBytes >= 16 );
    assert( destBytes >= 16 );
    // Possible to do much faster without any integer divides.
    // Optimizing this sample for simplicity.
    if( sourceBytes < destBytes )
    {
        // Enlarging the image, each SIMD vector consumes <16 input bytes, produces exactly 16 output bytes
        enlargement = true;
        resizeFields( ( destBytes + 15 ) / 16 );
        int8_t* pMasks = (int8_t*)masks.data();
        uint8_t* const pOffsets = offsets.data();
        int sourceOffset = 0;
        const size_t countVectors = masks.size();
        for( size_t i = 0; i < countVectors; i++ )
        {
            const int destSlice = (int)i * 16;
            std::array<int, 16> lanes;
            int lane;
            for( lane = 0; lane < 16; lane++ )
            {
                const int destByte = destSlice + lane; // output byte index
                const int destPixel = destByte / 3; // output pixel index
                const int channel = destByte % 3; // output byte within pixel
                const int sourcePixel = destPixel * (int)source / (int)dest; // input pixel
                const int sourceByte = sourcePixel * 3 + channel; // input byte
                if( destByte < (int)destBytes )
                    lanes[ lane ] = sourceByte;
                else
                {
                    // Destination offset out of range, i.e. the last SIMD vector
                    break;
                }
            }
            // Load position of this vector: the smallest source byte it reads.
            // lanes[ 0 ] alone is not enough: when a vector starts mid-pixel and the
            // next output pixel repeats the same source pixel, a later lane is smaller
            // and its shuffle index would become negative (e.g. 8 -> 24 pixels).
            int first = lanes[ 0 ];
            for( int j = 1; j < lane; j++ )
                if( lanes[ j ] < first )
                    first = lanes[ j ];
            // Produce the offset
            if( i == 0 )
                assert( first == 0 );
            else
            {
                const int off = first - sourceOffset;
                assert( off >= 0 && off <= 16 );
                pOffsets[ i - 1 ] = (uint8_t)off;
                sourceOffset = first;
            }
            // Produce the masks
            for( int j = 0; j < lane; j++ )
                pMasks[ j ] = (int8_t)( lanes[ j ] - sourceOffset );
            // The masks are initialized with _mm_set1_epi32( -1 ) = all bits set,
            // no need to handle remainder for the last vector.
            pMasks += 16;
        }
    }
    else
    {
        // Shrinking the image, each SIMD vector consumes 16 input bytes, produces <16 output bytes
        enlargement = false;
        resizeFields( ( sourceBytes + 15 ) / 16 );
        // Not implemented, but the same idea works fine for this too.
        // The only difference, instead of using offsets bytes for source offsets, use it for destination offsets.
        assert( false );
    }
}
void LineResize::apply( uint8_t * rdi, const uint8_t * rsi ) const
{
    const __m128i* pm = masks.data();
    const __m128i* const pmEnd = pm + masks.size();
    const uint8_t* po = offsets.data();
    __m128i mask, source;
    if( enlargement )
    {
        // One iteration of the loop produces 16 output bytes
        // In MSVC results in 11 instructions for 16 output bytes.
        while( pm < pmEnd )
        {
            mask = _mm_load_si128( pm );
            pm++;
            source = _mm_loadu_si128( ( const __m128i * )( rsi ) );
            rsi += *po;
            po++;
            _mm_storeu_si128( ( __m128i * )rdi, _mm_shuffle_epi8( source, mask ) );
            rdi += 16;
        }
    }
    else
    {
        // One iteration of the loop consumes 16 input bytes
        while( pm < pmEnd )
        {
            mask = _mm_load_si128( pm );
            pm++;
            source = _mm_loadu_si128( ( const __m128i * )( rsi ) );
            rsi += 16;
            _mm_storeu_si128( ( __m128i * )rdi, _mm_shuffle_epi8( source, mask ) );
            rdi += *po;
            po++;
        }
    }
}
// Utility method to print RGB pixel values from the vector
static void printPixels( const std::vector<uint8_t>& vec )
{
    assert( !vec.empty() );
    assert( 0 == ( vec.size() % 3 ) );
    const uint8_t* rsi = vec.data();
    const uint8_t* const rsiEnd = rsi + vec.size();
    while( rsi < rsiEnd )
    {
        const uint32_t r = rsi[ 0 ];
        const uint32_t g = rsi[ 1 ];
        const uint32_t b = rsi[ 2 ];
        rsi += 3;
        const uint32_t res = ( r << 16 ) | ( g << 8 ) | b;
        printf( "%06X ", res );
    }
    printf( "\n" );
}
// A trivial test to resize 24 pixels -> 32 pixels
int main()
{
    constexpr int sourceLength = 24;
    constexpr int destLength = 32;
    // Initialize sample input with 24 RGB pixels
    std::vector<uint8_t> input( sourceLength * 3 );
    for( size_t i = 0; i < input.size(); i++ )
        input[ i ] = (uint8_t)i;
    printf( "Input: " );
    printPixels( input );
    // That special handling of the last pixels of last line is missing from this example.
    static_assert( 0 == destLength % 16 );
    LineResize resizer( sourceLength, destLength );
    std::vector<uint8_t> result( destLength * 3 );
    resizer.apply( result.data(), input.data() );
    printf( "Output: " );
    printPixels( result );
    return 0;
}
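The code ignores alignment issues. For production, you need another method to handle the last line of the image, one which does not run all the way to the end with vector code and instead handles the last few pixels with scalar code.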
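One way to structure that tail handling is sketched below, under stated assumptions: `CountSafeVectors` and `ResizeTailBytes` are hypothetical helpers that are not part of the sample, and the first one assumes access to the per-vector source advances (the private `offsets` table of `LineResize` in the enlargement case).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of leading vectors whose 16-byte load and 16-byte store both stay
// inside one row. `offsets[i]` is the source advance in bytes after vector i.
size_t CountSafeVectors(const std::vector<uint8_t>& offsets,
                        size_t sourceBytes, size_t destBytes)
{
    size_t srcPos = 0, safe = 0;
    for (size_t i = 0; i < offsets.size(); ++i)
    {
        if (srcPos + 16 > sourceBytes || (i + 1) * 16 > destBytes)
            break;
        ++safe;
        srcPos += offsets[i];
    }
    return safe;
}

// Scalar nearest-neighbor copy of the remaining RGB24 bytes of a row, starting
// at output byte `firstByte`. Uses the same integer mapping as the sample.
void ResizeTailBytes(uint8_t* dst, const uint8_t* src,
                     size_t srcW, size_t dstW, size_t firstByte)
{
    assert(srcW > 0 && dstW > 0);
    for (size_t b = firstByte; b < dstW * 3; ++b)
    {
        const size_t destPixel = b / 3;
        const size_t channel = b % 3;
        const size_t sourcePixel = destPixel * srcW / dstW;
        dst[b] = src[sourcePixel * 3 + channel];
    }
}
```

The vector loop would then run for `CountSafeVectors(...)` iterations only, and `ResizeTailBytes` would be called with `firstByte` equal to that count times 16. For rows other than the last one, the over-read lands in the next row of the same buffer, so applying this only to the final row is enough when rows are tightly packed.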
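The code contains more memory references in the hot loop. However, the two vectors in that class are not too long; for 4k images the size is about 12kb, which should fit in the L1D cache and stay there.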
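The 12kb figure can be reproduced with a short calculation, assuming "4k" means a 3840-pixel-wide RGB24 output row: one 16-byte mask and one 1-byte offset are stored per 16 output bytes. `TableBytes` is a hypothetical helper written only for this check.

```cpp
#include <cassert>
#include <cstddef>

// Bytes used by the two lookup tables for one RGB24 output row of
// `destPixels` pixels: (16-byte mask + 1-byte offset) per 16 output bytes.
constexpr std::size_t TableBytes(std::size_t destPixels)
{
    return ((destPixels * 3 + 15) / 16) * (16 + 1);
}

// 3840 * 3 = 11520 output bytes -> 720 vectors -> 720 * 17 = 12240 bytes.
static_assert(TableBytes(3840) == 12240, "about 12 KB for a 3840-pixel row");
```

That is well below the 32 KB L1D size that is common on current x86 cores, although the source and destination rows compete for the same cache.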
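If you have AVX2, it will probably improve things further. For enlarging images, use _mm256_inserti128_si256; the vinserti128 instruction can load 16 bytes from memory into the high half of the vector. Similarly, for shrinking images, use _mm256_extracti128_si256; that instruction has an option to use a memory destination.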
