performance - Fast computation of > 10^6 cosine vector similarities in R
I have a document-term matrix of ~1600 documents x ~120 words. I would like to compute the cosine similarity between all of these vectors, which means ~1,300,000 comparisons [n * (n - 1) / 2].
I used parallel::mclapply with 8 cores, but it still takes forever.
Which other solution would you suggest?
Thanks
Here's my take on it.
If I define the cosine similarity as

    coss <- function(x) {crossprod(x)/(sqrt(tcrossprod(colSums(x^2))))}

(I think this is about as fast as I can make it with base R functions, and the often overlooked crossprod is a little gem), and compare it with an Rcpp function using RcppArmadillo (slightly updated as suggested by @f-privé):
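    // [[Rcpp::depends(RcppArmadillo)]]
    #include <RcppArmadillo.h>
    using namespace Rcpp;

    // [[Rcpp::export]]
    NumericMatrix cosine_similarity(NumericMatrix x) {
      // Wrap R's memory in an Armadillo matrix without copying it
      arma::mat X(x.begin(), x.nrow(), x.ncol(), false);

      // Compute the crossprod
      arma::mat res = X.t() * X;
      int n = x.ncol();
      arma::vec diag(n);
      int i, j;

      // The column norms are the square roots of the diagonal
      for (i = 0; i < n; i++) {
        diag(i) = sqrt(res(i, i));
      }

      // Scale every entry by the norms of its two columns
      for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
          res(i, j) /= diag(i) * diag(j);

      return(wrap(res));
    }

(This could possibly be optimised with some of the specialized functions in the Armadillo library. I just wanted to get some timing measurements.)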
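For reference, here is what both versions compute, written out as a formula. The symbols X (the input matrix, one vector per column), s (the vector of squared column norms, i.e. what colSums(x^2) returns) and S (the resulting similarity matrix) are my own notation, not something from the question:

```latex
\cos(x_i, x_j) = \frac{x_i^{\top} x_j}{\lVert x_i \rVert_2 \, \lVert x_j \rVert_2},
\qquad
s_i = \sum_k X_{ki}^2 = \lVert x_i \rVert_2^2,
\qquad
S_{ij} = \frac{(X^{\top} X)_{ij}}{\sqrt{s_i \, s_j}}
```

In the R version, crossprod(x) is the numerator X^T X and sqrt(tcrossprod(colSums(x^2))) is the matrix of sqrt(s_i s_j) values. The C++ version takes the same norms from the square roots of the diagonal of X^T X.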
Comparing the two implementations yields

    > xx <- matrix(rnorm(120*1600), ncol=1600)
    > microbenchmark::microbenchmark(cosine_similarity(xx), coss(xx), coss2(xx), times=50)
    > microbenchmark::microbenchmark(coss(x), coss2(x), cosine_similarity(x), cosine_similarity2(x), coss3(x), times=50)
    Unit: milliseconds
                      expr      min       lq     mean   median       uq      max neval cld
                   coss(x) 173.0975 183.0606 192.8333 187.6082 193.2885 331.9206    50
                  coss2(x) 162.4193 171.3178 183.7533 178.8296 184.9762 319.7934    50   b
     cosine_similarity2(x) 169.6075 175.5601 191.4402 181.3405 186.4769 319.8792    50

which is not bad. The gain from computing the cosine similarity in C++ is super small (with @f-privé's solution being the fastest), so I'm guessing your timing issues come from whatever you are doing to convert the text from words to numbers, and not from calculating the cosine similarity itself. Without knowing more about your specific code it is hard to help you.
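One thing worth checking before timing anything: coss compares the columns of its argument, so a document-term matrix with documents in rows has to be transposed first. Below is a minimal sketch that checks coss against a naive pairwise loop. The toy matrix dtm and the names sim, naive, agree and pairs are made up for illustration and are not from the question:

```r
# Cosine similarity between the COLUMNS of x (same definition as coss above).
coss <- function(x) {
  crossprod(x) / sqrt(tcrossprod(colSums(x^2)))
}

# Hypothetical toy document-term matrix: 4 documents (rows) x 3 words (columns).
dtm <- matrix(c(1, 0, 2,
                0, 3, 1,
                2, 0, 4,
                1, 1, 1),
              nrow = 4, byrow = TRUE)

# Documents are in rows here, so transpose to put them in columns first.
sim <- coss(t(dtm))

# Naive reference: one explicit dot product per pair of documents.
n <- nrow(dtm)
naive <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    naive[i, j] <- sum(dtm[i, ] * dtm[j, ]) /
      (sqrt(sum(dtm[i, ]^2)) * sqrt(sum(dtm[j, ]^2)))
  }
}

# Both approaches should agree up to floating-point error.
agree <- isTRUE(all.equal(sim, naive))

# The n * (n - 1) / 2 unique comparisons are the upper triangle of sim.
pairs <- sim[upper.tri(sim)]
```

With the matrix version, all pairwise similarities come out of a single matrix product, so there is no need to loop over the ~1,300,000 pairs (or to parallelise that loop) at all.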