SIMD - SSE/AVX equivalent for NEON vuzp
Intel's vector extensions SSE, AVX, etc. provide two unpack operations for each element size, e.g. the SSE intrinsics _mm_unpacklo_* and _mm_unpackhi_*. For 4 elements in a vector, they do this:

inputs:      (a0 a1 a2 a3) (b0 b1 b2 b3)
unpacklo/hi: (a0 b0 a1 b1) (a2 b2 a3 b3)
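For example, with 32-bit integer elements (the element size here is just my choice for illustration), the corresponding intrinsics are:

#include <immintrin.h>

static inline void unpack32(__m128i a, __m128i b, __m128i *lo, __m128i *hi) {
    *lo = _mm_unpacklo_epi32(a, b);   // (a0 b0 a1 b1)
    *hi = _mm_unpackhi_epi32(a, b);   // (a2 b2 a3 b3)
}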
The equivalent of unpack is vzip in ARM's NEON instruction set. However, NEON also provides the operation vuzp, which is the inverse of vzip. For 4 elements in a vector, it does this:

inputs: (a0 a1 a2 a3) (b0 b1 b2 b3)
vuzp:   (a0 a2 b0 b2) (a1 a3 b1 b3)
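For reference, a minimal sketch of what vuzp looks like with NEON intrinsics (assuming unsigned 32-bit elements; the function name is mine):

#include <arm_neon.h>

static inline uint32x4x2_t unzip_u32(uint32x4_t a, uint32x4_t b) {
    // .val[0] = (a0 a2 b0 b2), .val[1] = (a1 a3 b1 b3)
    return vuzpq_u32(a, b);
}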
How can vuzp be implemented efficiently using SSE or AVX intrinsics? There doesn't seem to be an instruction for it. For 4 elements, I assume it can be done using a shuffle and a subsequent unpack moving 2 elements:

inputs:        (a0 a1 a2 a3) (b0 b1 b2 b3)
shuffle:       (a0 a2 a1 a3) (b0 b2 b1 b3)
unpacklo/hi 2: (a0 a2 b0 b2) (a1 a3 b1 b3)
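For 32-bit elements, that idea might look like the following SSE2 sketch (the function name is mine):

#include <immintrin.h>

static inline void unzip32_shuffle_unpack(__m128i a, __m128i b, __m128i *even, __m128i *odd) {
    __m128i as = _mm_shuffle_epi32(a, _MM_SHUFFLE(3,1,2,0));  // (a0 a2 a1 a3)
    __m128i bs = _mm_shuffle_epi32(b, _MM_SHUFFLE(3,1,2,0));  // (b0 b2 b1 b3)
    *even = _mm_unpacklo_epi64(as, bs);                       // (a0 a2 b0 b2)
    *odd  = _mm_unpackhi_epi64(as, bs);                       // (a1 a3 b1 b3)
}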
Is there a more efficient solution using a single instruction? (Maybe SSE first - I'm aware that with AVX we may have the additional problem that shuffle and unpack don't cross lanes.)
Knowing this may be useful for writing code for data swizzling and deswizzling (it should be possible to derive deswizzling code simply by inverting the operations of swizzling code based on unpack operations).
Edit: Here is the 8-element version. This is the effect of NEON's vuzp:

input: (a0 a1 a2 a3 a4 a5 a6 a7) (b0 b1 b2 b3 b4 b5 b6 b7)
vuzp:  (a0 a2 a4 a6 b0 b2 b4 b6) (a1 a3 a5 a7 b1 b3 b5 b7)
This is my version with one shuffle and one unpack per output vector (it seems to generalize to larger element counts):

input:         (a0 a1 a2 a3 a4 a5 a6 a7) (b0 b1 b2 b3 b4 b5 b6 b7)
shuffle:       (a0 a2 a4 a6 a1 a3 a5 a7) (b0 b2 b4 b6 b1 b3 b5 b7)
unpacklo/hi 4: (a0 a2 a4 a6 b0 b2 b4 b6) (a1 a3 a5 a7 b1 b3 b5 b7)
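A concrete sketch of this for 16-bit elements might use pshufb (SSSE3) for the in-vector shuffle; the shuffle-control constant and function name are mine:

#include <immintrin.h>

static inline void unzip16_shuffle_unpack(__m128i a, __m128i b, __m128i *even, __m128i *odd) {
    // bytes 0,1,4,5,... gather the even 16-bit elements into the low half,
    // bytes 2,3,6,7,... gather the odd elements into the high half
    const __m128i ctrl = _mm_setr_epi8(0,1, 4,5, 8,9, 12,13,
                                       2,3, 6,7, 10,11, 14,15);
    __m128i as = _mm_shuffle_epi8(a, ctrl);   // (a0 a2 a4 a6 a1 a3 a5 a7)
    __m128i bs = _mm_shuffle_epi8(b, ctrl);   // (b0 b2 b4 b6 b1 b3 b5 b7)
    *even = _mm_unpacklo_epi64(as, bs);       // (a0 a2 a4 a6 b0 b2 b4 b6)
    *odd  = _mm_unpackhi_epi64(as, bs);       // (a1 a3 a5 a7 b1 b3 b5 b7)
}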
The method suggested by EOF is correct, but it would require log2(8) = 3 unpack operations for each output:

input:         (a0 a1 a2 a3 a4 a5 a6 a7) (b0 b1 b2 b3 b4 b5 b6 b7)
unpacklo/hi 1: (a0 b0 a1 b1 a2 b2 a3 b3) (a4 b4 a5 b5 a6 b6 a7 b7)
unpacklo/hi 1: (a0 a4 b0 b4 a1 a5 b1 b5) (a2 a6 b2 b6 a3 a7 b3 b7)
unpacklo/hi 1: (a0 a2 a4 a6 b0 b2 b4 b6) (a1 a3 a5 a7 b1 b3 b5 b7)
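In intrinsics, that repeated-unpack approach for 16-bit elements would be roughly (my sketch):

#include <immintrin.h>

static inline void unzip16_unpack3(__m128i a, __m128i b, __m128i *even, __m128i *odd) {
    __m128i lo1 = _mm_unpacklo_epi16(a, b);     // (a0 b0 a1 b1 a2 b2 a3 b3)
    __m128i hi1 = _mm_unpackhi_epi16(a, b);     // (a4 b4 a5 b5 a6 b6 a7 b7)
    __m128i lo2 = _mm_unpacklo_epi16(lo1, hi1); // (a0 a4 b0 b4 a1 a5 b1 b5)
    __m128i hi2 = _mm_unpackhi_epi16(lo1, hi1); // (a2 a6 b2 b6 a3 a7 b3 b7)
    *even = _mm_unpacklo_epi16(lo2, hi2);       // (a0 a2 a4 a6 b0 b2 b4 b6)
    *odd  = _mm_unpackhi_epi16(lo2, hi2);       // (a1 a3 a5 a7 b1 b3 b5 b7)
}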
"It should be possible to derive deswizzling code by inverting the operations."
Get used to being disappointed and frustrated by the non-orthogonality of Intel's vector shuffles. There is no direct inverse for punpck. The SSE/AVX pack instructions are for narrowing the element size. (So one packusdw is the inverse of punpck[lh]wd against zero, but not when used with two arbitrary vectors.) Also, pack instructions are only available for 32->16 (dword to word) and 16->8 (word to byte) element sizes. There is no packusqd (64->32).
Pack instructions are only available with saturation, not truncation (until AVX512 vpmovqd), so for this use-case we'd need to prepare 4 different input vectors for 2 pack instructions. This turns out to be horrible, much worse than your 3-shuffle solution (see unzip32_pack() in the Godbolt link below).
There is a 2-input shuffle that will do what you want for 32-bit elements, though: shufps. The low 2 elements of the result can be any 2 elements of the first vector, and the high 2 elements can be any elements of the second vector. The shuffle we want fits those constraints, so we can use it.
We can solve the whole problem in 2 instructions (plus a movdqa for the non-AVX version, because shufps destroys the left input register):

inputs: a=(a0 a1 a2 a3)  b=(b0 b1 b2 b3)
_mm_shuffle_ps(a, b, _MM_SHUFFLE(2,0,2,0));  // (a0 a2 b0 b2)
_mm_shuffle_ps(a, b, _MM_SHUFFLE(3,1,3,1));  // (a1 a3 b1 b3)
_MM_SHUFFLE() uses most-significant-element-first notation, like all of Intel's documentation. Your notation is the opposite.
The intrinsic for shufps uses __m128 / __m256 vectors (float, not integer), so you have to cast to use it. _mm_castsi128_ps is a reinterpret_cast: it compiles to zero instructions.
#include <immintrin.h>

static inline __m128i unziplo(__m128i a, __m128i b) {
    __m128 aps = _mm_castsi128_ps(a);
    __m128 bps = _mm_castsi128_ps(b);
    __m128 lo = _mm_shuffle_ps(aps, bps, _MM_SHUFFLE(2,0,2,0));
    return _mm_castps_si128(lo);
}

static inline __m128i unziphi(__m128i a, __m128i b) {
    __m128 aps = _mm_castsi128_ps(a);
    __m128 bps = _mm_castsi128_ps(b);
    __m128 hi = _mm_shuffle_ps(aps, bps, _MM_SHUFFLE(3,1,3,1));
    return _mm_castps_si128(hi);
}
GCC will inline these to a single instruction each. With the static inline removed, we can see how they'd compile as non-inline functions. I put them on the Godbolt compiler explorer:

unziplo(long long __vector(2), long long __vector(2)):
    shufps  xmm0, xmm1, 136
    ret
unziphi(long long __vector(2), long long __vector(2)):
    shufps  xmm0, xmm1, 221
    ret
Using FP shuffles on integer data is fine on recent Intel/AMD CPUs. There is no extra bypass-delay latency (see this answer, which summarizes what Agner Fog's microarch guide says about it). It does have extra latency on Intel Nehalem, but may still be the best choice there. FP loads/shuffles won't fault on or corrupt integer bit-patterns that happen to represent a NaN; only actual FP math instructions care about that.
Fun fact: on AMD Bulldozer-family CPUs (and Intel Core 2), FP shuffles like shufps still run in the ivec domain, so they actually have extra latency when used between FP instructions, but not between integer instructions!
Unlike ARM NEON / ARMv8 SIMD, x86 SSE doesn't have any 2-output-register instructions, and they're rare in x86. (They exist, e.g. mul r64, but decode to multiple uops on current CPUs.)
It's always going to take at least 2 instructions to create 2 vectors of results. It would be ideal if they didn't both need to run on the shuffle port, since recent Intel CPUs have a shuffle throughput of only 1 per clock. Instruction-level parallelism doesn't help much when all your instructions are shuffles.
For throughput, 1 shuffle + 2 non-shuffles could be more efficient than 2 shuffles, and would have the same latency. Or 2 shuffles and 2 blends could also be more efficient than 3 shuffles, depending on what the bottleneck is in the surrounding code. But I don't think we can replace 2x shufps with that few instructions.
Without shufps:
Your shuffle + unpacklo/hi is pretty good. It would be 4 shuffles total: 2 pshufd to prepare the inputs, then 2 punpckl/h. That's likely to be worse than any bypass latency, except on Nehalem in cases where latency matters but throughput doesn't.
Any other option would seem to require preparing 4 input vectors, for either a blend or packss. See @Mysticial's answer to _mm_shuffle_ps() equivalent for integer vectors (__m128i)? for the blend option. For two outputs, that would take a total of 4 shuffles to make the inputs, and then 2x pblendw (fast) or vpblendd (even faster).
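A sketch of what that blend variant could look like for 32-bit elements (the shuffle and blend constants here are my reconstruction, assuming SSE4.1 for pblendw):

#include <immintrin.h>

static inline void unzip32_blend(__m128i a, __m128i b, __m128i *even, __m128i *odd) {
    __m128i a_ev = _mm_shuffle_epi32(a, _MM_SHUFFLE(3,1,2,0)); // (a0 a2 a1 a3)
    __m128i b_ev = _mm_shuffle_epi32(b, _MM_SHUFFLE(2,0,3,1)); // (b1 b3 b0 b2)
    __m128i a_od = _mm_shuffle_epi32(a, _MM_SHUFFLE(2,0,3,1)); // (a1 a3 a0 a2)
    __m128i b_od = _mm_shuffle_epi32(b, _MM_SHUFFLE(3,1,2,0)); // (b0 b2 b1 b3)
    *even = _mm_blend_epi16(a_ev, b_ev, 0xF0);  // upper 64 bits from b_ev: (a0 a2 b0 b2)
    *odd  = _mm_blend_epi16(a_od, b_od, 0xF0);  // upper 64 bits from b_od: (a1 a3 b1 b3)
}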
Using packssdw or packsswb (to 16 or 8 bit elements) would also work. It would take 2x pand instructions to mask off the odd elements of a and b, and 2x psrld to shift the odd elements down into the even positions. That sets you up for 2x pack instructions to create the two output vectors. 6 total instructions, plus many movdqa because those all destroy their inputs (unlike pshufd, which is a copy+shuffle).
// don't use this, it's not optimal for any CPU
void unzip32_pack(__m128i &a, __m128i &b) {
    __m128i a_even = _mm_and_si128(a, _mm_setr_epi32(-1, 0, -1, 0)); // keep even elements of a
    __m128i a_odd  = _mm_srli_epi64(a, 32);                          // shift odd elements of a down
    __m128i b_even = _mm_and_si128(b, _mm_setr_epi32(-1, 0, -1, 0)); // keep even elements of b
    __m128i b_odd  = _mm_srli_epi64(b, 32);                          // shift odd elements of b down
    __m128i lo = _mm_packs_epi16(a_even, b_even);
    __m128i hi = _mm_packs_epi16(a_odd, b_odd);
    a = lo;
    b = hi;
}
On a Nehalem CPU it might be worth using something other than 2x shufps, because of its high (2c) bypass delay. It has 2 per clock shuffle throughput, and pshufd is a copy+shuffle, so 2x pshufd to prepare copies of a and b only needs 1 movdqa afterwards to get the punpcklqdq and punpckhqdq results into separate registers. (movdqa isn't free; it has 1c latency and needs a vector execution port on Nehalem. It's only cheaper than a shuffle if you're bottlenecked on shuffle throughput, rather than overall front-end bandwidth (uop throughput) or something.)
I recommend just using 2x shufps. It's good on the average CPU, and not horrible anywhere.
AVX512
AVX512 introduced a lane-crossing pack-with-truncation instruction which narrows a single vector (instead of being a 2-input shuffle). It's the inverse of pmovzx, and can narrow 64b->8b or any other combination, instead of only by a factor of 2.
For this case, __m256i _mm512_cvtepi64_epi32 (__m512i a) (vpmovqd) will take the even 32-bit elements of a vector and pack them together (i.e. the low halves of each 64-bit element). It's still not the whole building block we need, though, since you need something else to get the odd elements into place.
It also comes in signed/unsigned saturation versions, and the instructions have a memory-destination form that the intrinsics expose to let you do a masked store.
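As an illustration (my own sketch, assuming AVX512F), vpmovqd only hands you the even 32-bit elements of one vector directly; the odd elements need a shift first:

#include <immintrin.h>

static inline void split_even_odd_avx512(__m512i v, __m256i *even, __m256i *odd) {
    *even = _mm512_cvtepi64_epi32(v);                        // low 32 bits of each 64-bit lane
    *odd  = _mm512_cvtepi64_epi32(_mm512_srli_epi64(v, 32)); // high 32 bits, shifted down first
}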
But for this problem, as Mysticial points out, AVX512 provides 2-input lane-crossing shuffles which you can use like shufps to solve the whole problem in just 2 shuffles: vpermi2d/vpermt2d.
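A sketch of that 2-shuffle AVX512 solution (the index vectors are my own choice; assumes AVX512F and 32-bit elements):

#include <immintrin.h>

static inline void unzip32_avx512(__m512i a, __m512i b, __m512i *even, __m512i *odd) {
    const __m512i idx_even = _mm512_setr_epi32( 0,  2,  4,  6,  8, 10, 12, 14,
                                               16, 18, 20, 22, 24, 26, 28, 30);
    const __m512i idx_odd  = _mm512_setr_epi32( 1,  3,  5,  7,  9, 11, 13, 15,
                                               17, 19, 21, 23, 25, 27, 29, 31);
    *even = _mm512_permutex2var_epi32(a, idx_even, b); // (a0 a2 ... a14 b0 b2 ... b14)
    *odd  = _mm512_permutex2var_epi32(a, idx_odd,  b); // (a1 a3 ... a15 b1 b3 ... b15)
}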