performance - Fast computation of > 10^6 cosine vector similarities in R

August 15, 2014

I have a document-term matrix of ~1600 documents x ~120 words. I would like to compute the cosine similarity between all of these vectors, but we are speaking about ~1,300,000 comparisons [n * (n - 1) / 2].

I used parallel::mclapply with 8 cores, but it still takes forever.

Which other solution would you suggest?

Thanks

Here's my take on it.

If I define the cosine similarity as

coss <- function(x) {crossprod(x)/(sqrt(tcrossprod(colSums(x^2))))}
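As a quick sanity check of the definition above, the following sketch compares coss against a naive pairwise loop on a small random matrix. The helper name naive_cosine and the toy dimensions are made up for illustration. Note that coss compares the columns of x, so a document-term matrix with documents in rows would need to be transposed first.

```r
# Cosine similarity between the columns of x, as defined above.
coss <- function(x) {
  crossprod(x) / sqrt(tcrossprod(colSums(x^2)))
}

# Hypothetical reference implementation: explicit loop over column pairs.
naive_cosine <- function(x) {
  n <- ncol(x)
  res <- matrix(NA_real_, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      res[i, j] <- sum(x[, i] * x[, j]) /
        (sqrt(sum(x[, i]^2)) * sqrt(sum(x[, j]^2)))
    }
  }
  res
}

set.seed(1)
dtm <- matrix(rnorm(5 * 4), nrow = 5)  # toy data: 5 documents (rows) x 4 words
sim <- coss(t(dtm))                    # transpose so documents become columns

dim(sim)                               # one row and one column per document
all.equal(sim, naive_cosine(t(dtm)))   # should agree up to floating-point error
```

The matrix form does the same arithmetic as the loop, but hands all of it to two matrix products instead of n^2 separate R-level iterations.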

(I think that is about as fast as I can make it with base R functions and the often-overlooked crossprod, which is a little gem.) If I compare it with an Rcpp function using RcppArmadillo (slightly updated as suggested by @f-privé)

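#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix cosine_similarity(NumericMatrix x) {
  // Wrap the R matrix as an Armadillo matrix without copying the data
  arma::mat X(x.begin(), x.nrow(), x.ncol(), false);

  // Compute the crossprod
  arma::mat res = X.t() * X;
  int n = x.ncol();
  arma::vec diag(n);
  int i, j;

  // Column norms are the square roots of the diagonal of the crossprod
  for (i = 0; i < n; i++) {
    diag(i) = sqrt(res(i, i));
  }

  // Scale each dot product by the two column norms
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      res(i, j) /= diag(i) * diag(j);

  return(wrap(res));
}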

(This could possibly be optimised with some of the specialized functions in the Armadillo library; I just wanted some timing measurements.)

Comparing those yields

> xx <- matrix(rnorm(120*1600), ncol=1600)
> microbenchmark::microbenchmark(cosine_similarity(xx), coss(xx), coss2(xx), times=50)
> microbenchmark::microbenchmark(coss(x), coss2(x), cosine_similarity(x), cosine_similarity2(x), coss3(x), times=50)
Unit: milliseconds
                  expr      min       lq     mean   median       uq      max neval cld
               coss(x) 173.0975 183.0606 192.8333 187.6082 193.2885 331.9206    50
              coss2(x) 162.4193 171.3178 183.7533 178.8296 184.9762 319.7934    50   b
 cosine_similarity2(x) 169.6075 175.5601 191.4402 181.3405 186.4769 319.8792    50

Which is really not bad. The gain in computing the cosine similarity using C++ is super small (with @f-privé's solution being the fastest), so I'm guessing your timing issues are due to what you are doing to convert the text from words to numbers, and not to calculating the cosine similarity itself. Without knowing more about your specific code, it is hard to help you.
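To tie this back to the ~1,300,000 comparisons in the question: the full similarity matrix already contains every pair, and the unique ones sit in its upper triangle. Below is a minimal sketch in base R, assuming simulated data with the same shape as in the benchmark (documents as columns). The variable names are illustrative only, and timings will vary by machine.

```r
# Cosine similarity between the columns of x (same definition as above).
coss <- function(x) {
  crossprod(x) / sqrt(tcrossprod(colSums(x^2)))
}

set.seed(42)
xx <- matrix(rnorm(120 * 1600), ncol = 1600)  # 120 words x 1600 documents

# Wall-clock seconds for computing all pairwise similarities at once.
elapsed <- system.time(sim <- coss(xx))["elapsed"]

# Unique document pairs are the entries above the diagonal.
pairs <- sim[upper.tri(sim)]
length(pairs) == choose(1600, 2)  # n * (n - 1) / 2 unique comparisons
```

If a single call like this finishes quickly while the whole pipeline still takes very long, that would be consistent with the bottleneck sitting elsewhere, for instance in the text preprocessing.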
