c++ - Large (0,1) matrix multiplication using bitwise AND and popcount instead of actual int or float multiplies?


For multiplying large binary matrices (10k x 20k), what I usually do is convert the matrices to float ones and perform float matrix multiplication, as integer matrix multiplication is pretty slow.

This time though, I'd need to perform hundreds of thousands of these multiplications, and even a millisecond performance improvement on average matters to me.


I want an int or float matrix as the result, because the product may have elements that aren't 0 or 1. The input matrix elements are all 0 or 1, so they can be stored as single bits.

In the inner product between a row vector and a column vector (to produce one element of the output matrix), the multiplication simplifies to bitwise AND. Addition is still addition, but I can add the bits with a population-count function instead of looping over them individually.

Some other boolean/binary-matrix functions OR the bits instead of counting them, producing a bit-matrix result, but that's not what I need.


Here is sample code showing that forming the problem as std::bitset, AND and count operations is faster than matrix multiplication.

#include <iostream>
    using std::cout; using std::endl;
#include <vector>
    using std::vector;
#include <chrono>
#include <Eigen/Dense>
    using Eigen::Map; using Eigen::Matrix; using Eigen::MatrixXf;
#include <random>
    using std::random_device; using std::mt19937; using std::uniform_int_distribution;
#include <bitset>
    using std::bitset;
#include <cmath>
    using std::floor;

const int nrow = 1000;
const int ncol = 20000;

const float density = 0.4;
const float denominator = 10.0 - (10*density);

void fill_random(vector<float>& vec) {
    random_device rd;
    mt19937 eng(rd());
    uniform_int_distribution<> distr(0, 10);
    for (int i = 0; i < nrow*ncol; ++i)
        vec.push_back(floor(distr(eng)/denominator));
}

void matmul(vector<float>& vec) {
    float *p = vec.data();
    MatrixXf A = Eigen::Map<Eigen::Matrix<float, nrow, ncol, Eigen::RowMajor>>(p);
    cout << "Eigen matrix has " << A.rows() << " rows and " << A.cols() << " columns." << endl;
    cout << "Total non-zero values : " << A.sum() << endl;
    cout << "The density of non-zero values is " << A.sum() * 1.0 / (A.cols()*A.rows()) << endl;

    auto start = std::chrono::steady_clock::now();
    MatrixXf B = A.transpose() * A;
    auto end = (std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start)).count();
    cout << "Mat mul took " << end << " ms" << endl;

    // make sure the operation is not skipped by the compiler
    cout << "Eigen coo ";
    for (int i=0; i<10; ++i)
        cout << B(0,i) << " ";
    cout << endl;
}

void bitset_op(vector<float>& vec) {
    // yeah, it's not a great idea to set the size at compile time, but I have to
    vector<bitset<nrow>> col_major(ncol);

    // a "parallel for" here isn't a great idea; doing this for simplicity,
    // to profile only the second loop's timing.
    // converting the row-major float vec to a col-major bool vec
    #pragma omp parallel for
    for (int j=0; j < ncol; ++j) {
        for (int i=0; i < nrow; ++i) {
            col_major[j].set(i, vec[i*ncol + j] && 1);
        }
    }

    auto start = std::chrono::steady_clock::now();
    vector<int> coo;
    coo.assign(ncol*ncol, 0);
    #pragma omp parallel for
    for (int j=0; j < ncol; ++j) {
        for (int k=0; k<ncol; ++k) {
            coo[j*ncol + k] = (col_major[j]&col_major[k]).count();
        }
    }
    auto end = (std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start)).count();
    cout << "bitset intersection took " << end << " ms" << endl;

    // make sure the operation is not skipped by the compiler
    cout << "Biset coo ";
    for (int i=0; i<10; ++i)
        cout << coo[i] << " ";
    cout << endl;
}

int main() {
    // saving as float instead of int to speed up the matmul
    vector<float> vec;
    fill_random(vec);
    matmul(vec);
    bitset_op(vec);
}

Running with:

g++ -O3 -fopenmp -march=native -I. -std=c++11 code.cpp -o code

I get:

Eigen matrix has 1000 rows and 20000 columns.
Total non-zero values : 9.08978e+06
The density of non-zero values is 0.454489
Mat mul took 1849 ms
Eigen coo 458 206 208 201 224 205 204 199 217 210
bitset intersection took 602 ms
Biset coo 458 206 208 201 224 205 204 199 217 210

As you can see, the matmul expressed as a set of bitset operations is about 3x faster than Eigen's float matmul, which makes sense.

I want to emphasize that I need to perform this operation over 100K times (in an HPC or cloud environment), so a millisecond performance improvement on average would make a difference.

I'm not bound to a specific library, C++ standard, etc. So please feel free to answer with any solution you think is faster, other than using a GPU, which I can't use for a number of reasons.

I'm not sure how much, if anything, you can get the compiler to do for you without manually vectorizing with intrinsics or a C++ vector-class wrapper (like Agner Fog's VCL, if your project's license is compatible with the GPL). There are some non-GPLed wrappers, too.

Cache-blocking a matrix multiply is a fine art (and will be important here), and it would be really nice if you could use Eigen's existing templates but with a different class that uses bitwise AND on integers instead of multiply on floats. I'm not sure if this is possible.

I did some searching, and most of the literature about binary matrices is about producing a boolean result (including questions like this). A vector inner-product is done with AND as the multiply, but with XOR or OR as the addition, not popcount. Maybe there's a search term I'm missing that describes "normal" matrices that just happen to be (0,1) matrices, but where the product won't be.

Since every millisecond matters, yes, you're going to have to manually vectorize this.


It's not that vector-integer stuff is slow in general; it's just vector-integer multiply that's slow, compared to vector-float FMA on recent x86 hardware (especially Intel, which has FP FMA throughput of 2x 256b vectors per clock on Haswell and later).

Since you don't need an actual multiply with boolean elements, just an AND (3 vectors per clock throughput), that's not a problem for you. The efficiency gain from doing many more elements per vector should more than make up for any extra cost per vector.

Of course, this assumes an integer matmul implementation using all the same cache-blocking and other optimizations as an equivalent FP matmul, and that's where the problem lies if you don't want to (or don't know how to) write it yourself, and can't find a library to do it for you.

I'm just answering the question of how efficient it could be, with an optimal implementation. The answer to the title question is a definite yes: it's a massive waste of time to use actual multiplication, especially with 32-bit elements.


Storage format options:

One element (0/1) per byte:

  • 4x the density of float (cache footprint / memory bandwidth / elements per vector)
  • easy to transpose with byte shuffles
  • vertical ADD is easy, in case that matters (e.g. when vectorizing over an outer loop and working on multiple rows or multiple columns at once. This can be good, avoiding horizontal sums at the end, if you have your data interleaved in a way that makes it work without extra shuffling.)

4 elements per byte, packed into the low nibble:

  • 4x the density of separate bytes
  • very efficient to popcount with AVX2 vpshufb. With inputs hot in L1D cache, you could load / AND / accumulate-a-popcount with a throughput of 128 AND-result elements per clock cycle (per core), in theory. 4 fused-domain uops per clock saturates SKL/HSW front-end issue bandwidth of 4 per clock, and doesn't bottleneck on the 3 vector ALU ports, because one of the uops is a pure load. (The other load micro-fuses with vpand). When bottlenecked on L2 bandwidth (~one 32B load per cycle), this runs at 64 elements per clock. See below.
  • slower to create from an integer or packed bitmap (but not bad if you put bits into vectors in an interleaved order for efficient pack/unpack into in-order bytes, rather than forcing the bits to be in order).
  • harder to transpose (maybe worse than fully packed)

Packed bits:

  • 8x the density of separate bytes, 256 elements per AVX2 vector.
  • can be created from vectors with pmovmskb for a non-interleaved storage order. (Not very useful for creation on the fly, though, since that puts the result in an integer reg, not a vector. An interleaved bit-order is probably best, especially for unpacking during a transpose).
  • fairly efficient to popcount with AVX2: mask / shift+mask / 2x vpshufb. (9 fused-domain uops (8 vector-ALU uops) to AND + accumulate a popcount for 256 elements (from 2 row/column vectors), vs. 8 uops (6 vector-ALU uops) for the 4-per-byte strategy (from 4 row/column vectors).) ALU port bottlenecks limit this to 96 elements per clock from L1D or L2. So this has about 1.5x the inner-product throughput of the pack4 strategy when it bottlenecks on L2 bandwidth, or 3/4 the throughput for data hot in L1D, in theory, counting just the inner loop. This is only the inner-product part, not accounting for different pack/unpack costs.
  • hard to transpose (but maybe not horrible with pmovmskb to extract 1 bit from each byte and make them contiguous).

6 elements per byte, 0xxx0xxx (probably no advantage for this problem on HSW/SKL, but interesting to consider):

  • 6x the density of separate bytes
  • fairly easy to create from 0/1 bytes in an interleaved way, by shifting/ORing, same as the 4-bits-per-byte format.
  • optimized for efficient popcount with AVX2 vpshufb. No need to mask before 2x vpshufb, just one right-shift. (vpshufb zeros a byte if the high bit is set, otherwise it uses the low nibble as an index. This is why it needs the masking.) Right shifting this format by 4 (vpsrld ymm0,4) will still leave a 0 in the high bit of every byte. Load + AND -> accumulate popcount is 7 fused-domain uops per vector (vmovdqa / vpand ymm,[mem] / vpsrld ymm,4 / 2x vpshufb / 2x vpaddb), only 6 of which need ALU ports. So HSW/SKL throughput is in theory 1 vector (of 192 elements) per 2 clocks, or 96 elements per clock. This requires an average load throughput of one 256b vector per clock, so it's right up against the L2 bandwidth bottleneck.

    In theory it's the same as fully packed, but in practice it might be slightly faster or slower depending on which one schedules better (fewer AND/ADD uops stealing port 5 from shuffles, for example). Fully packed is probably more likely to come close to the theoretical speed, because more of its uops can run on multiple ports, so out-of-order scheduling imperfections are less likely.

  • the pmovmskb transpose trick doesn't work cleanly.
  • could be useful if you only needed popcount(A[]) instead of popcount(A[] & B[]), or for a different microarchitecture where ALU vs. load throughput differed.

Another variation on this: 7 elements per byte could be popcounted with a single AVX512VBMI (Cannonlake?) vpermi2b (_mm512_permutex2var_epi8), where each index byte selects one of 128 bytes from the concatenation of two other registers. A shuffle that wide will probably be slow, but hopefully it will have better throughput than an AVX512 vpshufb separate-nibble approach.

To count packed-8 with AVX512VBMI (but without AVX512VPOPCNTDQ), you could maybe use vpermi2b to count the low 7 bits, then shift+mask the top bit and just add it. (The popcount of a single bit = that bit).


uint8_t elements are easier to shuffle efficiently (since there are byte shuffles like vpshufb), so they may be worth considering if you have to transpose on the fly. Or only pack down to bits on the fly while transposing?

32-bit integers are also an option, but not a good one. Fewer elements per vector means fewer shuffle instructions in a transpose, but not by a factor of 4. The number of shuffles in a transpose may scale with log2(elements per vector).

This is also a big deal for cache footprint / memory bandwidth. A factor of 8 size difference can mean that doing a whole row or column only takes part of L1, instead of overflowing L1. So it can make cache-blocking easier / less important.

10k * 20k / 8 = 23.84MiB per matrix, using packed-bit elements. That's much larger than L2 cache (256kiB on Haswell, 1MiB on Skylake-AVX512), but will fit in L3 on many-core Xeon CPUs. But L3 is competitively shared by all cores (including other VMs in a cloud environment), and is much slower than L2. (Many-core Xeons like you'll be running on in HPC / cloud systems have lower per-core memory bandwidth than quad-core desktops, because of higher latency to L3 cache with no increase in concurrency; see the "latency-bound platforms" section of this answer. It takes more cores to drive the same amount of memory bandwidth on a Xeon, even though the total throughput is higher. But if you can have each core mostly working out of its private L2, you gain a LOT.)


Adding up the AND results: You've arranged your loops so you just need to reduce a single run of booleans to a count of the non-zeros. This is a good thing.

With 8-bit integer 0/1 elements, you can do up to 255 vpaddb before an element could overflow. This has good throughput: 2 per clock on Haswell, 3 per clock on Skylake. With multiple accumulators, that covers a lot of vectors of AND results. Use vpsadbw against an all-zero vector to horizontally add the bytes in a vector into 64-bit integers. Then combine your accumulators with vpaddq, then horizontally sum that.
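For concreteness, here's a minimal sketch of that accumulation pattern with AVX2 intrinsics. The function name dot_bytes01 and the layout assumptions (a and b are rows of 0/1 uint8_t elements, nvec 32-byte vectors long) are mine, not from the question:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

static inline uint64_t dot_bytes01(const uint8_t *a, const uint8_t *b, size_t nvec) {
    __m256i qword_sums = _mm256_setzero_si256();
    size_t i = 0;
    while (i < nvec) {
        __m256i byte_acc = _mm256_setzero_si256();
        // each AND result adds at most 1 to each byte, so 255 iterations can't overflow
        size_t chunk_end = (nvec - i < 255) ? nvec : i + 255;
        for (; i < chunk_end; ++i) {
            __m256i va = _mm256_loadu_si256((const __m256i*)(a + 32*i));
            __m256i vb = _mm256_loadu_si256((const __m256i*)(b + 32*i));
            byte_acc = _mm256_add_epi8(byte_acc, _mm256_and_si256(va, vb));   // vpand + vpaddb
        }
        // vpsadbw against zero: horizontal-add each group of 8 bytes into a qword
        qword_sums = _mm256_add_epi64(qword_sums,
                        _mm256_sad_epu8(byte_acc, _mm256_setzero_si256()));
    }
    // horizontal sum of the four 64-bit lanes
    __m128i s = _mm_add_epi64(_mm256_castsi256_si128(qword_sums),
                              _mm256_extracti128_si256(qword_sums, 1));
    return (uint64_t)_mm_cvtsi128_si64(s) + (uint64_t)_mm_extract_epi64(s, 1);
}

A tuned version would unroll with multiple byte accumulators; this only shows the vpaddb / vpsadbw / vpaddq structure.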

With packed bits, you want to popcount vectors of AND results. With AVX2 and your data already in vectors, you definitely want to use a vpshufb-based bit-slicing popcount. (See http://wm.ite.pl/articles/sse-popcount.html for an example. You'd want to write it with intrinsics, not asm, if you have to manually vectorize it.)
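As a reference point, this is the usual shape of that bit-slicing popcount with AVX2 intrinsics (a sketch of the well-known nibble-LUT technique, not code taken from that article):

#include <immintrin.h>

// per-byte popcount of a 256-bit AND result: mask / shift+mask / 2x vpshufb / vpaddb
static inline __m256i popcount_bytes(__m256i v) {
    const __m256i lut = _mm256_setr_epi8(      // popcount of each 4-bit nibble value
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4,
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
    const __m256i low_mask = _mm256_set1_epi8(0x0f);
    __m256i lo = _mm256_and_si256(v, low_mask);
    __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_mask);
    return _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                           _mm256_shuffle_epi8(lut, hi));
}

The byte counts then feed the same vpaddb / vpsadbw accumulation shown earlier.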

You could consider packing your data 4 bits per byte, in the low nibble. That would mean one vpshufb could count the bits in each byte of an AND result, without needing any shifting / masking. Inside the inner loop, you'd have 2 loads, vpand, vpshufb, vpaddb. With proper unrolling, that should keep up with L1D load bandwidth of 2x 32B per clock, and saturate all three vector execution ports (on Haswell or Skylake). Break out of that every 128 or 255 vectors or so to accumulate the bytes of your accumulator(s) with vpsadbw/vpaddq. (But with cache-blocking, you probably want to break out that often anyway and do different work). So the inner-most loop should run at 4 elements per byte * 32B per vector = 128 elements per clock cycle, if you can arrange for it to read data that's hot in L1D cache. Expect about half that bandwidth from L2 cache on Haswell/Skylake, or much worse from L3 cache.
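A minimal sketch of that inner loop, assuming the 4-elements-per-byte (low nibble) format; the helper name and the drain interval are mine, and a tuned version would unroll further with several accumulators:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// a and b: rows/columns packed 4 elements per byte (low nibble), nvec 32-byte vectors long
static inline uint64_t dot_pack4(const __m256i *a, const __m256i *b, size_t nvec) {
    const __m256i lut = _mm256_setr_epi8(      // popcount of each nibble value
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4,
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
    __m256i qword_sums = _mm256_setzero_si256();
    size_t i = 0;
    while (i < nvec) {
        __m256i byte_acc = _mm256_setzero_si256();
        // each iteration adds at most 4 to each byte, so 63 iterations stay below 256
        size_t chunk_end = (nvec - i < 63) ? nvec : i + 63;
        for (; i < chunk_end; ++i) {
            __m256i x = _mm256_and_si256(_mm256_loadu_si256(a + i),
                                         _mm256_loadu_si256(b + i));            // vpand
            byte_acc = _mm256_add_epi8(byte_acc, _mm256_shuffle_epi8(lut, x));   // vpshufb + vpaddb
        }
        qword_sums = _mm256_add_epi64(qword_sums,
                        _mm256_sad_epu8(byte_acc, _mm256_setzero_si256()));      // vpsadbw
    }
    __m128i s = _mm_add_epi64(_mm256_castsi256_si128(qword_sums),
                              _mm256_extracti128_si256(qword_sums, 1));
    return (uint64_t)_mm_cvtsi128_si64(s) + (uint64_t)_mm_extract_epi64(s, 1);
}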

With uint8_t elements of 0 or 1, you can maybe use integer multiply-add instructions. They're a bit weirdly designed, intended for different use-cases than FP FMA: they add horizontal pairs of multiply results, producing wider elements. vpmaddubsw widens from 8 to 16 bit elements, and would work on 0s and 1s. Since each result element can only be in the range 0..2, you can still horizontal-sum with vpsadbw. But if you're going to vpsadbw anyway, this gains nothing over vpand. It would only be useful if you wanted to use vpaddw with 16-bit elements in your vector accumulator, instead of breaking out of a loop to avoid byte overflow. vpmaddubsw doesn't seem useful here, because vpsadbw is the better way to horizontally add bytes.


Converting 0/1 integers to bitmaps can be done efficiently with SSE/AVX: for 32-bit integer elements, vpslld ymm0, 31 to left-shift the relevant bit to the top of each element, then vmovmskps eax, ymm0 to get an 8-bit mask of the high byte of each 32-bit element. For 8-bit integer elements, vpslld ymm0, 7 / vpmovmskb eax, ymm0 does the same thing for each byte, producing a 32-bit integer bitmap result. (Only the sign bit of each byte matters, so it's fine that there are no shift instructions with 8-bit granularity. You don't need to do anything about bits that carry into the next element.)
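For example, a sketch of the 8-bit-element case (the 32-bit case is the same idea with vpslld by 31 and vmovmskps); the helper name is mine:

#include <immintrin.h>
#include <stdint.h>

// 32 uint8_t elements holding 0 or 1  ->  32-bit bitmap
static inline uint32_t bytes01_to_bitmap(const uint8_t *p) {
    __m256i v = _mm256_loadu_si256((const __m256i*)p);
    v = _mm256_slli_epi32(v, 7);                 // move bit 0 of each byte to that byte's sign bit
    return (uint32_t)_mm256_movemask_epi8(v);    // vpmovmskb: one bit per byte
}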

This isn't a method you can use right away with vectors, because it ends up with the results in integer registers. It isn't a great format to generate and use on the fly, but it is the most compact, so it can make sense if you can keep matrices in this format long-term. (And if you'll be limited by memory bandwidth when loading them.)

Converting 32-bit integers to 8-bit: one way is 2x vpackssdw + vpacksswb. Because those operate within 128b lanes, your elements will end up reordered. But that's ok as long as it's the same ordering for every row/column. It's only a problem if you want to take a chunk of a row/column that doesn't start at a multiple of 32 elements. Another option here is to left-shift (by 8, 16, and 24) and OR the vectors together. Actually, you can get the shifting for free by using an unaligned load offset by 1, 2, or 3 bytes.

#include <immintrin.h>
#include <stdint.h>

static inline __m256i load_interleave4x32(const int32_t *input) {
  const char *p = (const char*)input;
  __m256i t0 = _mm256_loadu_si256((const __m256i*)(p));
  __m256i t1 = _mm256_loadu_si256((const __m256i*)(p + 32*1 - 1));   // the 1/0 bits are now in the 2nd byte of each 32-bit element
  __m256i t2 = _mm256_loadu_si256((const __m256i*)(p + 32*2 - 2));
  __m256i t3 = _mm256_loadu_si256((const __m256i*)(p + 32*3 - 3));
  return t0 | t1 | t2 | t3;
  // or write it out with _mm256_or_si256, if you don't have overloaded operators like GNU C provides.
  // this should compile to 1 load and 3 vpor ymm0, [rdi+31] ... instructions.
}

Converting to half-packed 4 bits per byte: we can use the same idea as above. Get 4 vectors from load_interleave4x32 (or from an array of uint8_t if you started with 8-bit elements), left-shift them by 0, 1, 2, and 3 bits, and OR them all together. The interleaved bit-order is fine when we just need to AND a row/column and popcount the whole result, because order doesn't matter. This bit-order is also fairly efficient to unpack back to in-order bytes, e.g. AND with set1_epi8(1) gets you a vector of bytes.
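A sketch of that packing step for the uint8_t-element case (a hypothetical helper, not from the question): the four loads are 32 bytes apart, so the bit-order within each byte is interleaved, which is fine for AND + popcount.

#include <immintrin.h>
#include <stdint.h>

// pack 128 uint8_t 0/1 elements into one vector with 4 elements per byte (low nibble)
static inline __m256i pack4_per_byte(const uint8_t *input) {
    __m256i t0 = _mm256_loadu_si256((const __m256i*)(input));
    __m256i t1 = _mm256_loadu_si256((const __m256i*)(input + 32));
    __m256i t2 = _mm256_loadu_si256((const __m256i*)(input + 64));
    __m256i t3 = _mm256_loadu_si256((const __m256i*)(input + 96));
    t1 = _mm256_add_epi8(t1, t1);        // <<1 by doubling the 0/1 bytes (no byte-granularity shift needed)
    t2 = _mm256_slli_epi32(t2, 2);       // safe: only bit 0 is set, so nothing crosses a byte boundary
    t3 = _mm256_slli_epi32(t3, 3);
    return _mm256_or_si256(_mm256_or_si256(t0, t1), _mm256_or_si256(t2, t3));
}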

You might use this as part of a transpose if you store whole matrices in this format, or use this format to store temporary copies for a cache-blocked transpose. A matmul touches each row/column multiple times, so it may be worth doing extra work to make a compact format the first time, when that lets you do 4x as much work per vector on subsequent passes.


With AVX512BW (Skylake-AVX512)

We want to be doing the AND and popcnt with vectors, not with scalar integers, because the vectors are twice as wide as with AVX2, so they pull further ahead of scalar popcnt. (Even though Skylake-AVX512 shuts down the vector ALUs (but not scalar) on port 1 while running 512b instructions).

@Harold points out an interesting identity that lets us do 2/3rds as many vector popcounts, at the cost of extra integer operations.

   popcnt(a) + popcnt(b) + popcnt(c)  = popcnt(a ^ b ^ c) + 2 * popcnt((a ^ b) & c | (a & b)) 

a ^ b ^ c and (a ^ b) & c | (a & b) can each be done with one vpternlogd (since each has 3 boolean inputs). The 2* is free if we use a separate pre-shifted vpshufb LUT vector. See this implementation that uses 30x vpternlogd + 1 vector popcnt to handle 16 vectors of 512b, with some cleanup at the end (only doing the 16*popcnt counts inside the loop; everything else is chained).
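A scalar sketch of how the identity gets used (the vectorized version replaces each of the two logic lines with one vpternlogd; the immediates seen in published Harley-Seal popcount code are 0x96 for the 3-input XOR and 0xE8 for the majority function, but verify those before relying on them):

#include <stdint.h>

static inline unsigned popcnt3(uint64_t a, uint64_t b, uint64_t c) {
    uint64_t ones = a ^ b ^ c;                  // "sum" bits
    uint64_t twos = ((a ^ b) & c) | (a & b);    // "carry" bits, i.e. majority(a,b,c)
    return __builtin_popcountll(ones) + 2 * __builtin_popcountll(twos);
}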

This is very likely worth it for counting fully-packed 8-bits-per-byte elements, and it makes that format a lot more attractive for AVX512, compared to less-dense formats optimized for popcounting without as much shifting/masking.

vpternlogd can also be useful as a bit-blend instruction for transposes, if byte-granularity vpblendmb zmm{k1}, zmm, zmm isn't fine-enough grained.

This might be worth it with AVX2 on some CPUs, maybe avoiding 1 out of every 4 or 5 vector popcounts rather than 1 of 3? Or it might not help at all if it just increases total execution port pressure and there wasn't a bottleneck on any specific port. It would be useful with scalar popcnt instructions (maybe on CPUs without AVX2), because those do bottleneck on a single port on Intel CPUs.


We can turn uint8_t boolean elements into non-interleaved bitmaps more efficiently than with AVX2 (and without even needing a shift), and do the reverse more efficiently too. Test-into-mask or compare-into-mask against a vector of set1_epi8(1) both do the job, producing 64 bits of mask from 64 bytes of input. Or with 32-bit integers to start with, you get 16 bits of mask at a time. You can efficiently concatenate those bits with kunpck instructions.

_mm512_test_epi8_mask (vptestmb) is interesting: AND two vectors together, and produce a mask-register result of which byte-elements are true/false. But this isn't really what we want: if we're going to pack our bits, we want to do it as a pre-processing step on the input matrices, not on the fly while doing inner products.

bitmap -> vector of 0 / -1 is fast: __m512i _mm512_movm_epi8 (__mmask64 k) (vpmovm2b) does it in one instruction. You can subtract the -1 instead of adding 1, but you'd have to mask it before you can OR multiple bits together within a byte.
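A quick sketch of both directions with AVX512BW, under the assumptions above (64 uint8_t 0/1 elements in, a 64-bit mask out, and back; the helper names are mine):

#include <immintrin.h>
#include <stdint.h>

// 64 uint8_t 0/1 elements -> 64-bit bitmap (vptestmb)
static inline __mmask64 bytes01_to_mask(const uint8_t *p) {
    __m512i v = _mm512_loadu_si512(p);
    return _mm512_test_epi8_mask(v, _mm512_set1_epi8(1));
}

// 64-bit bitmap -> vector of 0 / -1 bytes (vpmovm2b); subtract it instead of adding 1
static inline __m512i mask_to_bytes(__mmask64 k) {
    return _mm512_movm_epi8(k);
}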

Without AVX512BW or AVX512DQ (Knight's Landing Xeon Phi), you don't have 512b vpshufb, so you can't vector popcnt as efficiently. There's an AVX512 popcnt extension for vector popcnt directly, but no hardware with it has even been announced yet. (AVX2 vpshufb ymm is very slow on KNL, though: one per 12 cycles, and psadbw ymm is one per 9 cycles, so even using 256b vectors is unattractive). You might use a bithack popcnt based on 32-bit integer elements, since that's just AND/shift/add. 32-bit elements take fewer steps to popcnt than 64-bit, but are still big enough not to overflow for reasonable problem sizes (so you can defer the horizontal sum of the vector until outside a loop). A sketch of that bithack with AVX512F intrinsics follows.
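This is the classic SWAR popcount written per 32-bit lane with AVX512F-only intrinsics, not something tuned for KNL specifically; the caller accumulates the returned per-lane counts with _mm512_add_epi32 and defers the horizontal sum:

#include <immintrin.h>

static inline __m512i popcnt_epi32_avx512f(__m512i v) {
    const __m512i m1 = _mm512_set1_epi32(0x55555555);
    const __m512i m2 = _mm512_set1_epi32(0x33333333);
    const __m512i m4 = _mm512_set1_epi32(0x0F0F0F0F);
    v = _mm512_sub_epi32(v, _mm512_and_si512(_mm512_srli_epi32(v, 1), m1));
    v = _mm512_add_epi32(_mm512_and_si512(v, m2),
                         _mm512_and_si512(_mm512_srli_epi32(v, 2), m2));
    v = _mm512_and_si512(_mm512_add_epi32(v, _mm512_srli_epi32(v, 4)), m4);
    // sum the four per-byte counts within each dword
    v = _mm512_add_epi32(v, _mm512_srli_epi32(v, 8));
    v = _mm512_add_epi32(v, _mm512_srli_epi32(v, 16));
    return _mm512_and_si512(v, _mm512_set1_epi32(0xFF));
}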

Given the choice of storage format, packing multiple bits per byte might not be a good idea for KNL, but single-byte integer elements are good. vpandd zmm and vpaddd zmm are both fast and part of AVX512F, and we can use them because we don't want to let our single bytes overflow anyway. (Using a packed 32-bit add when we actually have 8-bit elements that won't carry into each other is a SWAR technique.) KNL has good memory bandwidth and poor instruction throughput relative to Skylake-AVX512, I think.


Transposing bits:

BMI2 _pdep_u64 might be useful here. It's a scalar instruction/intrinsic. If it makes the bit-transpose a lot more efficient than unpacking to bytes, you'd probably want to store a block of transpose results before reloading it with vector loads for AND + count. (Reloading a vector right away after scalar stores will cause a store-forwarding stall.)
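As an illustration of the idea, here's an 8x8 bit-block transpose built on BMI2 (this sketch uses _pext_u64, the companion extract instruction; it's only one building block, not a full matrix transpose):

#include <immintrin.h>
#include <stdint.h>

// treat x as an 8x8 bit matrix (one row per byte); returns its transpose
static inline uint64_t transpose8x8_bits(uint64_t x) {
    uint64_t out = 0;
    for (int i = 0; i < 8; ++i) {
        // gather bit i of every byte (i.e. column i) into 8 contiguous bits
        uint64_t col = _pext_u64(x, 0x0101010101010101ULL << i);
        out |= col << (8 * i);    // becomes row i of the result
    }
    return out;
}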

Another useful option is that vpmovmskb can slice 32 bits out of a 32-byte vector, one per byte. This gives you a building block for a transpose, maybe combined with byte shuffles to get the bytes into the right order for it. For more, see this blog post, and also How to transpose a binary matrix?.


Using this in a matmul

Some of your choices depend on what format your input data is in, and how often you'll reuse the same matrices. If a matrix will be used multiple times, packing it down to 4 or 8 bits per byte ahead of time makes sense (or on the fly the first time it's used). Keeping a transposed copy may also make sense, if that's the side of the multiply that needs transposing. (If you sometimes need it one way and sometimes the other, redoing it on the fly might be better for L3 cache footprint. But these are big enough that you probably won't get a lot of L3 hits, so just keeping a transposed copy is good.)

Or maybe even write out a transposed and a non-transposed version while converting from your input format.

You will definitely want to cache-block the multiplies, so the same data is reused multiple times while hot in L1. I don't have anything useful to say about this off the top of my head. The same principles apply as when cache-blocking a normal FP matmul, so go read about that. A bare-bones sketch of the blocked loop structure follows.
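Just to show the overall shape (not a tuned implementation): a blocked loop over packed column vectors, with a scalar popcnt kernel standing in for one of the vectorized AND+popcount inner loops above. The names and the blocking factor B are placeholders you'd tune:

#include <cstdint>
#include <cstddef>
#include <algorithm>
#include <vector>

static inline uint64_t dot64(const uint64_t *a, const uint64_t *b, size_t words) {
    uint64_t sum = 0;
    for (size_t w = 0; w < words; ++w)
        sum += (uint64_t)__builtin_popcountll(a[w] & b[w]);    // AND + popcount
    return sum;
}

// coo[j*ncol + k] = popcount(col j AND col k), computed block by block so a
// block of packed columns stays hot in cache while it's being reused.
void gram_blocked(const std::vector<std::vector<uint64_t>> &cols,
                  std::vector<int> &coo, size_t ncol, size_t words, size_t B) {
    for (size_t j0 = 0; j0 < ncol; j0 += B)
        for (size_t k0 = 0; k0 < ncol; k0 += B)
            for (size_t j = j0; j < std::min(j0 + B, ncol); ++j)
                for (size_t k = k0; k < std::min(k0 + B, ncol); ++k)
                    coo[j*ncol + k] = (int)dot64(cols[j].data(), cols[k].data(), words);
}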


Comments on your C++ implementation:

Using bitset & for a whole column will put the values back in memory, and then you loop over them again in .count() on the result. I doubt the compiler will optimize this into a one-pass loop that uses a vpshufb-based bit-slicing popcnt on each vector of vpand results, but that would be much better. (See http://wm.ite.pl/articles/sse-popcount.html for an example. Again, you'd want to write it with intrinsics, not asm, if you have to manually vectorize it.)

With your matrix sizes, at least that inner loop probably hits in L1D cache, but the extra load/store instructions from looping twice are more overhead, and they interfere with prefetch of the valuable data.


Getting compilers to efficiently popcnt a dynamically-sized bitmap (without manually vectorizing) is not easy. The only thing that doesn't suck is clang++ -stdlib=libc++ with vector<bool>, which compiles std::count(v.begin(), v.end(), true); to a vectorized vpshufb + vpsadbw + vpaddq loop, which is quite good. It would be faster if it only used vpaddb inside the unrolled loop and vpsadbw + vpaddq once per iteration, but it's pretty good for auto-vectorized code.

g++'s vector<bool> is also a bitmap, but std::count(v.begin(), v.end(), true); is very bad: it uses a totally naive loop that tests one bit at a time. And it doesn't even do that efficiently. The same goes for clang++ with the default libstdc++ instead of the new libc++.

boost::dynamic_bitset has a .count() member function, but it doesn't take advantage of the popcnt instruction or AVX2. It does a byte-at-a-time LUT lookup. That's a lot better than std::count(vector<bool>) without libc++, but it's not even close to good enough for HPC.

Here's test code on the Godbolt compiler explorer, with gcc and clang asm output. All of them used -march=haswell.

But unfortunately, there doesn't seem to be an efficient way to bitwise-AND two std::vector<bool>. This answer shows how to get at the underlying implementation of g++'s libstdc++ vector<bool>, but that code doesn't auto-vectorize. Doing the same thing for libc++ and tweaking it so it auto-vectorizes might let you get a good fraction of the performance possible with manual vectorization (except for the transpose), but you'd probably have to keep your whole matrix in one vector<bool>, because a vector of vectors is a bad extra level of indirection. If the transpose part of the problem is performance-critical too, using standard containers to get access to an efficient popcount is not going to solve the whole problem.

For std::bitset<1024*1024>.count(), clang makes the same efficient AVX2 popcount with or without libc++. g++ makes a scalar loop using the 64-bit popcnt instruction, which (according to this) is somewhat faster than a good AVX2 popcnt for small bitsets, but somewhat slower for large bitsets, on Haswell and Skylake.

See also: On vector<bool> — Howard Hinnant, for some commentary on the C++ standard library, why an array of bits is a useful data structure, and why vector<bool> is a bad name for it. Also, some benchmarks for properly-optimized count/find_first/etc. on a bit-vector vs. a 1-bool-per-byte bool[] array, vs. a naive vector<bool> (like you get from gcc and clang without libc++).

