SIMD - SSE/AVX equivalent for NEON vuzp
Intel's vector extensions SSE, AVX, etc. provide two unpack operations for each element size, e.g. the SSE intrinsics _mm_unpacklo_* and _mm_unpackhi_*. For 4 elements in a vector, they do this:

inputs:      (a0 a1 a2 a3) (b0 b1 b2 b3)
unpacklo/hi: (a0 b0 a1 b1) (a2 b2 a3 b3)
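For example, with 32-bit integer elements (the element size here is just my choice for illustration), the corresponding intrinsics are:

#include <immintrin.h>

static inline void unpack32(__m128i a, __m128i b, __m128i *lo, __m128i *hi) {
    *lo = _mm_unpacklo_epi32(a, b);   // (a0 b0 a1 b1)
    *hi = _mm_unpackhi_epi32(a, b);   // (a2 b2 a3 b3)
}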
The equivalent of unpack is vzip in ARM's NEON instruction set. However, NEON also provides the operation vuzp, which is the inverse of vzip. For 4 elements in a vector, it does this:

inputs: (a0 a1 a2 a3) (b0 b1 b2 b3)
vuzp:   (a0 a2 b0 b2) (a1 a3 b1 b3)
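For reference, a minimal sketch of what vuzp looks like with NEON intrinsics (assuming unsigned 32-bit elements; the function name is mine):

#include <arm_neon.h>

static inline uint32x4x2_t unzip_u32(uint32x4_t a, uint32x4_t b) {
    // .val[0] = (a0 a2 b0 b2), .val[1] = (a1 a3 b1 b3)
    return vuzpq_u32(a, b);
}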
How can vuzp be implemented efficiently using SSE or AVX intrinsics? There doesn't seem to be an instruction for it. For 4 elements, I assume it can be done using a shuffle and a subsequent unpack moving 2 elements:

inputs:        (a0 a1 a2 a3) (b0 b1 b2 b3)
shuffle:       (a0 a2 a1 a3) (b0 b2 b1 b3)
unpacklo/hi 2: (a0 a2 b0 b2) (a1 a3 b1 b3)
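For 32-bit elements, that idea might look like the following SSE2 sketch (the function name is mine):

#include <immintrin.h>

static inline void unzip32_shuffle_unpack(__m128i a, __m128i b, __m128i *even, __m128i *odd) {
    __m128i as = _mm_shuffle_epi32(a, _MM_SHUFFLE(3,1,2,0));  // (a0 a2 a1 a3)
    __m128i bs = _mm_shuffle_epi32(b, _MM_SHUFFLE(3,1,2,0));  // (b0 b2 b1 b3)
    *even = _mm_unpacklo_epi64(as, bs);                       // (a0 a2 b0 b2)
    *odd  = _mm_unpackhi_epi64(as, bs);                       // (a1 a3 b1 b3)
}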
Is there a more efficient solution using a single instruction? (Maybe SSE first - I'm aware that with AVX we may have the additional problem that shuffle and unpack don't cross lanes.)
Knowing this may be useful for writing code for data swizzling and deswizzling (it should be possible to derive deswizzling code simply by inverting the operations of swizzling code based on unpack operations).
Edit: Here is the 8-element version. This is the effect of NEON's vuzp:

input: (a0 a1 a2 a3 a4 a5 a6 a7) (b0 b1 b2 b3 b4 b5 b6 b7)
vuzp:  (a0 a2 a4 a6 b0 b2 b4 b6) (a1 a3 a5 a7 b1 b3 b5 b7)
This is my version with one shuffle and one unpack per output vector (it seems to generalize to larger element counts):

input:         (a0 a1 a2 a3 a4 a5 a6 a7) (b0 b1 b2 b3 b4 b5 b6 b7)
shuffle:       (a0 a2 a4 a6 a1 a3 a5 a7) (b0 b2 b4 b6 b1 b3 b5 b7)
unpacklo/hi 4: (a0 a2 a4 a6 b0 b2 b4 b6) (a1 a3 a5 a7 b1 b3 b5 b7)
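A concrete sketch of this for 16-bit elements might use pshufb (SSSE3) for the in-vector shuffle; the shuffle-control constant and function name are mine:

#include <immintrin.h>

static inline void unzip16_shuffle_unpack(__m128i a, __m128i b, __m128i *even, __m128i *odd) {
    // bytes 0,1,4,5,... gather the even 16-bit elements into the low half,
    // bytes 2,3,6,7,... gather the odd elements into the high half
    const __m128i ctrl = _mm_setr_epi8(0,1, 4,5, 8,9, 12,13,
                                       2,3, 6,7, 10,11, 14,15);
    __m128i as = _mm_shuffle_epi8(a, ctrl);   // (a0 a2 a4 a6 a1 a3 a5 a7)
    __m128i bs = _mm_shuffle_epi8(b, ctrl);   // (b0 b2 b4 b6 b1 b3 b5 b7)
    *even = _mm_unpacklo_epi64(as, bs);       // (a0 a2 a4 a6 b0 b2 b4 b6)
    *odd  = _mm_unpackhi_epi64(as, bs);       // (a1 a3 a5 a7 b1 b3 b5 b7)
}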
The method suggested by EOF is correct, but it would require log2(8) = 3 unpack operations for each output:

input:         (a0 a1 a2 a3 a4 a5 a6 a7) (b0 b1 b2 b3 b4 b5 b6 b7)
unpacklo/hi 1: (a0 b0 a1 b1 a2 b2 a3 b3) (a4 b4 a5 b5 a6 b6 a7 b7)
unpacklo/hi 1: (a0 a4 b0 b4 a1 a5 b1 b5) (a2 a6 b2 b6 a3 a7 b3 b7)
unpacklo/hi 1: (a0 a2 a4 a6 b0 b2 b4 b6) (a1 a3 a5 a7 b1 b3 b5 b7)
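In intrinsics, that repeated-unpack approach for 16-bit elements would be roughly (my sketch):

#include <immintrin.h>

static inline void unzip16_unpack3(__m128i a, __m128i b, __m128i *even, __m128i *odd) {
    __m128i lo1 = _mm_unpacklo_epi16(a, b);     // (a0 b0 a1 b1 a2 b2 a3 b3)
    __m128i hi1 = _mm_unpackhi_epi16(a, b);     // (a4 b4 a5 b5 a6 b6 a7 b7)
    __m128i lo2 = _mm_unpacklo_epi16(lo1, hi1); // (a0 a4 b0 b4 a1 a5 b1 b5)
    __m128i hi2 = _mm_unpackhi_epi16(lo1, hi1); // (a2 a6 b2 b6 a3 a7 b3 b7)
    *even = _mm_unpacklo_epi16(lo2, hi2);       // (a0 a2 a4 a6 b0 b2 b4 b6)
    *odd  = _mm_unpackhi_epi16(lo2, hi2);       // (a1 a3 a5 a7 b1 b3 b5 b7)
}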
"It should be possible to derive deswizzling code by inverting the operations."
Get used to being disappointed and frustrated by the non-orthogonality of Intel's vector shuffles. There is no direct inverse for punpck. The SSE/AVX pack instructions are for narrowing the element size. (So one packusdw is the inverse of punpck[lh]wd against zero, but not when used with two arbitrary vectors.) Also, pack instructions are only available for 32->16 (dword to word) and 16->8 (word to byte) element sizes. There is no packusqd (64->32).
Pack instructions are only available with saturation, not truncation (until AVX512 vpmovqd), so for this use-case we'd need to prepare 4 different input vectors for 2 pack instructions. This turns out to be horrible, much worse than your 3-shuffle solution (see unzip32_pack() in the Godbolt link below).
There is a 2-input shuffle that will do what you want for 32-bit elements, though: shufps. The low 2 elements of the result can be any 2 elements of the first vector, and the high 2 elements can be any elements of the second vector. The shuffle we want fits those constraints, so we can use it.
We can solve the whole problem in 2 instructions (plus a movdqa for the non-AVX version, because shufps destroys the left input register):

inputs: a=(a0 a1 a2 a3)  b=(b0 b1 b2 b3)
_mm_shuffle_ps(a, b, _MM_SHUFFLE(2,0,2,0));  // (a0 a2 b0 b2)
_mm_shuffle_ps(a, b, _MM_SHUFFLE(3,1,3,1));  // (a1 a3 b1 b3)
_MM_SHUFFLE() uses most-significant-element-first notation, like all of Intel's documentation. Your notation is the opposite.
The intrinsic for shufps uses __m128 / __m256 vectors (float, not integer), so you have to cast to use it. _mm_castsi128_ps is a reinterpret_cast: it compiles to zero instructions.
#include <immintrin.h>

static inline __m128i unziplo(__m128i a, __m128i b) {
    __m128 aps = _mm_castsi128_ps(a);
    __m128 bps = _mm_castsi128_ps(b);
    __m128 lo = _mm_shuffle_ps(aps, bps, _MM_SHUFFLE(2,0,2,0));
    return _mm_castps_si128(lo);
}

static inline __m128i unziphi(__m128i a, __m128i b) {
    __m128 aps = _mm_castsi128_ps(a);
    __m128 bps = _mm_castsi128_ps(b);
    __m128 hi = _mm_shuffle_ps(aps, bps, _MM_SHUFFLE(3,1,3,1));
    return _mm_castps_si128(hi);
}
GCC will inline these to a single instruction each. With the static inline removed, we can see how they'd compile as non-inline functions. I put them on the Godbolt compiler explorer:

unziplo(long long __vector(2), long long __vector(2)):
    shufps  xmm0, xmm1, 136
    ret
unziphi(long long __vector(2), long long __vector(2)):
    shufps  xmm0, xmm1, 221
    ret
Using FP shuffles on integer data is fine on recent Intel/AMD CPUs. There is no extra bypass-delay latency (see this answer, which summarizes what Agner Fog's microarch guide says about it). It does have extra latency on Intel Nehalem, but may still be the best choice there. FP loads/shuffles won't fault on or corrupt integer bit-patterns that happen to represent a NaN; only actual FP math instructions care about that.
Fun fact: on AMD Bulldozer-family CPUs (and Intel Core 2), FP shuffles like shufps still run in the ivec domain, so they actually have extra latency when used between FP instructions, but not between integer instructions!
Unlike ARM NEON / ARMv8 SIMD, x86 SSE doesn't have any 2-output-register instructions, and they're rare in x86. (They exist, e.g. mul r64, but decode to multiple uops on current CPUs.)
It's always going to take at least 2 instructions to create 2 vectors of results. It would be ideal if they didn't both need to run on the shuffle port, since recent Intel CPUs have a shuffle throughput of only 1 per clock. Instruction-level parallelism doesn't help much when all your instructions are shuffles.
For throughput, 1 shuffle + 2 non-shuffles could be more efficient than 2 shuffles, and would have the same latency. Or 2 shuffles and 2 blends could also be more efficient than 3 shuffles, depending on what the bottleneck is in the surrounding code. But I don't think we can replace 2x shufps with that few instructions.
Without shufps:
Your shuffle + unpacklo/hi is pretty good. It would be 4 shuffles total: 2 pshufd to prepare the inputs, then 2 punpckl/h. That's likely to be worse than any bypass latency, except on Nehalem in cases where latency matters but throughput doesn't.
Any other option would seem to require preparing 4 input vectors, for either a blend or packss. See @Mysticial's answer to _mm_shuffle_ps() equivalent for integer vectors (__m128i)? for the blend option. For two outputs, that would take a total of 4 shuffles to make the inputs, and then 2x pblendw (fast) or vpblendd (even faster).
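A sketch of what that blend variant could look like for 32-bit elements (the shuffle and blend constants here are my reconstruction, assuming SSE4.1 for pblendw):

#include <immintrin.h>

static inline void unzip32_blend(__m128i a, __m128i b, __m128i *even, __m128i *odd) {
    __m128i a_ev = _mm_shuffle_epi32(a, _MM_SHUFFLE(3,1,2,0)); // (a0 a2 a1 a3)
    __m128i b_ev = _mm_shuffle_epi32(b, _MM_SHUFFLE(2,0,3,1)); // (b1 b3 b0 b2)
    __m128i a_od = _mm_shuffle_epi32(a, _MM_SHUFFLE(2,0,3,1)); // (a1 a3 a0 a2)
    __m128i b_od = _mm_shuffle_epi32(b, _MM_SHUFFLE(3,1,2,0)); // (b0 b2 b1 b3)
    *even = _mm_blend_epi16(a_ev, b_ev, 0xF0);  // upper 64 bits from b_ev: (a0 a2 b0 b2)
    *odd  = _mm_blend_epi16(a_od, b_od, 0xF0);  // upper 64 bits from b_od: (a1 a3 b1 b3)
}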
Using packssdw or packsswb (to 16 or 8 bit elements) would also work. It would take 2x pand instructions to mask off the odd elements of a and b, and 2x psrld to shift the odd elements down into the even positions. That sets you up for 2x pack instructions to create the two output vectors. 6 total instructions, plus many movdqa because those all destroy their inputs (unlike pshufd, which is a copy+shuffle).
// don't use this, it's not optimal for any CPU
void unzip32_pack(__m128i &a, __m128i &b) {
    __m128i a_even = _mm_and_si128(a, _mm_setr_epi32(-1, 0, -1, 0)); // keep even elements of a
    __m128i a_odd  = _mm_srli_epi64(a, 32);                          // shift odd elements of a down
    __m128i b_even = _mm_and_si128(b, _mm_setr_epi32(-1, 0, -1, 0)); // keep even elements of b
    __m128i b_odd  = _mm_srli_epi64(b, 32);                          // shift odd elements of b down
    __m128i lo = _mm_packs_epi16(a_even, b_even);
    __m128i hi = _mm_packs_epi16(a_odd, b_odd);
    a = lo;
    b = hi;
}
On a Nehalem CPU it might be worth using something other than 2x shufps, because of its high (2c) bypass delay. It has 2 per clock shuffle throughput, and pshufd is a copy+shuffle, so 2x pshufd to prepare copies of a and b only needs 1 movdqa afterwards to get the punpcklqdq and punpckhqdq results into separate registers. (movdqa isn't free; it has 1c latency and needs a vector execution port on Nehalem. It's only cheaper than a shuffle if you're bottlenecked on shuffle throughput, rather than overall front-end bandwidth (uop throughput) or something.)
I recommend just using 2x shufps. It's good on the average CPU, and not horrible anywhere.
AVX512
AVX512 introduced a lane-crossing pack-with-truncation instruction which narrows a single vector (instead of being a 2-input shuffle). It's the inverse of pmovzx, and can narrow 64b->8b or any other combination, instead of only by a factor of 2.
For this case, __m256i _mm512_cvtepi64_epi32 (__m512i a) (vpmovqd) will take the even 32-bit elements of a vector and pack them together (i.e. the low halves of each 64-bit element). It's still not the whole building block we need, though, since you need something else to get the odd elements into place.
It also comes in signed/unsigned saturation versions, and the instructions have a memory-destination form that the intrinsics expose to let you do a masked store.
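As an illustration (my own sketch, assuming AVX512F), vpmovqd only hands you the even 32-bit elements of one vector directly; the odd elements need a shift first:

#include <immintrin.h>

static inline void split_even_odd_avx512(__m512i v, __m256i *even, __m256i *odd) {
    *even = _mm512_cvtepi64_epi32(v);                        // low 32 bits of each 64-bit lane
    *odd  = _mm512_cvtepi64_epi32(_mm512_srli_epi64(v, 32)); // high 32 bits, shifted down first
}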
But for this problem, as Mysticial points out, AVX512 provides 2-input lane-crossing shuffles which you can use like shufps to solve the whole problem in just 2 shuffles: vpermi2d/vpermt2d.
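A sketch of that 2-shuffle AVX512 solution (the index vectors are my own choice; assumes AVX512F and 32-bit elements):

#include <immintrin.h>

static inline void unzip32_avx512(__m512i a, __m512i b, __m512i *even, __m512i *odd) {
    const __m512i idx_even = _mm512_setr_epi32( 0,  2,  4,  6,  8, 10, 12, 14,
                                               16, 18, 20, 22, 24, 26, 28, 30);
    const __m512i idx_odd  = _mm512_setr_epi32( 1,  3,  5,  7,  9, 11, 13, 15,
                                               17, 19, 21, 23, 25, 27, 29, 31);
    *even = _mm512_permutex2var_epi32(a, idx_even, b); // (a0 a2 ... a14 b0 b2 ... b14)
    *odd  = _mm512_permutex2var_epi32(a, idx_odd,  b); // (a1 a3 ... a15 b1 b3 ... b15)
}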