python - Large NumPy/SciPy CSR matrix, row-wise operation


I want to iterate over the rows of a CSR matrix and divide each element by the row's sum, similar to what is done here:

numpy divide row by row sum

My problem is that I'm dealing with a large matrix: (96582, 350138)

When I apply the operation from the linked post, memory usage blows up, because the returned matrix is dense.

So here is my first try:

    for row in counts:
        row = row / row.sum()

Unfortunately this doesn't affect the matrix at all, so I came up with a second idea: create a new CSR matrix and concatenate the rows using vstack:

    from scipy import sparse
    import time

    start_time = curr_time = time.time()
    mtx = sparse.csr_matrix((0, counts.shape[1]))
    for i, row in enumerate(counts):
        prob_row = row / row.sum()
        mtx = sparse.vstack([mtx, prob_row])
        if i % 1000 == 0:
            delta_time = time.time() - curr_time
            total_time = time.time() - start_time
            curr_time = time.time()
            print('step: %i, total time: %i, delta_time: %i' % (i, total_time, delta_time))

This works, but after a number of iterations it gets slower and slower:

    step: 0, total time: 0, delta_time: 0
    step: 1000, total time: 1, delta_time: 1
    step: 2000, total time: 5, delta_time: 4
    step: 3000, total time: 12, delta_time: 6
    step: 4000, total time: 23, delta_time: 11
    step: 5000, total time: 38, delta_time: 14
    step: 6000, total time: 55, delta_time: 17
    step: 7000, total time: 88, delta_time: 32
    step: 8000, total time: 136, delta_time: 47
    step: 9000, total time: 190, delta_time: 53
    step: 10000, total time: 250, delta_time: 59
    step: 11000, total time: 315, delta_time: 65
    step: 12000, total time: 386, delta_time: 70
    step: 13000, total time: 462, delta_time: 76
    step: 14000, total time: 543, delta_time: 81
    step: 15000, total time: 630, delta_time: 86
    step: 16000, total time: 722, delta_time: 92
    step: 17000, total time: 820, delta_time: 97

Any suggestions? Any idea why vstack gets slower and slower?

vstack is an O(n) operation: it needs to allocate memory for the result and copy the contents of all arrays passed as arguments into the result array. Calling it once per row on a growing result therefore makes the whole loop O(n²).
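If you do want to build the result by stacking, one workaround (a sketch of mine, not part of the original answer) is to collect the normalized rows in a Python list and call vstack once at the end, so the big copy happens a single time:

    from scipy import sparse

    # Sketch, assuming `counts` is the CSR matrix from the question:
    # accumulate normalized rows in a list and stack them once,
    # instead of copying the growing result on every iteration.
    rows = [row / row.sum() for row in counts]
    mtx = sparse.vstack(rows)

This avoids the quadratic blow-up, though the in-place approaches below are still faster since they never create per-row objects.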

Instead, you can use the multiply operation:

    >>> res = counts.multiply(1 / counts.sum(1))  # multiply by the inverse
    >>> res.todense()
    matrix([[ 0.33333333,  0.        ,  0.66666667],
            [ 0.        ,  0.        ,  1.        ],
            [ 0.26666667,  0.33333333,  0.4       ]])
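One caveat worth noting (my addition, not from the original answer): counts.sum(1) is dense, and any row that sums to zero causes a division by zero. A guarded variant might look like this, assuming empty rows should simply stay empty:

    import numpy as np

    # Sketch: replace zero row sums with 1 so empty rows stay empty
    # instead of producing inf/nan entries.
    row_sums = np.asarray(counts.sum(axis=1)).ravel()
    row_sums[row_sums == 0] = 1.0
    res = counts.multiply(1.0 / row_sums[:, np.newaxis]).tocsr()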

But it's also quite easy to use np.lib.stride_tricks.as_strided to do the operation you want (and it is relatively performant). The as_strided function allows more complicated operations on an array when there is no method or function for your case.

For example, using the example CSR matrix from the scipy documentation:

    >>> from scipy.sparse import csr_matrix
    >>> import numpy as np
    >>> row = np.array([0,0,1,2,2,2])
    >>> col = np.array([0,2,2,0,1,2])
    >>> data = np.array([1.,2,3,4,5,6])
    >>> counts = csr_matrix((data, (row, col)), shape=(3,3))
    >>> counts.todense()
    matrix([[ 1.,  0.,  2.],
            [ 0.,  0.,  3.],
            [ 4.,  5.,  6.]])

You can divide each row by its sum like this:

    >>> row_start_stop = np.lib.stride_tricks.as_strided(counts.indptr,
    ...                                                  shape=(counts.shape[0], 2),
    ...                                                  strides=2*counts.indptr.strides)
    >>> for start, stop in row_start_stop:
    ...     row = counts.data[start:stop]
    ...     row /= row.sum()
    ...
    >>> counts.todense()
    matrix([[ 0.33333333,  0.        ,  0.66666667],
            [ 0.        ,  0.        ,  1.        ],
            [ 0.26666667,  0.33333333,  0.4       ]])
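As a side note (my addition, not part of the original answer), the same in-place normalization can be written without as_strided by pairing consecutive indptr entries, which some readers may find easier to follow. This sketch assumes the `counts` matrix built above:

    # In-place row normalization using indptr directly:
    # counts.data[counts.indptr[i]:counts.indptr[i+1]] holds the
    # nonzero values of row i.
    for start, stop in zip(counts.indptr[:-1], counts.indptr[1:]):
        row = counts.data[start:stop]   # a view into counts.data
        s = row.sum()
        if s != 0:                      # leave empty rows untouched
            row /= s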
