python - Large NumPy/SciPy CSR matrix, row-wise operation -
I want to iterate over the rows of a CSR matrix and divide each element by the row's sum, similar to the approach here:
My problem is that I'm dealing with a large matrix: (96582, 350138).
When I apply the operation from the linked post, it bloats memory, since the returned matrix is dense.
So here is my first try:
for row in counts:
    row = row / row.sum()
Unfortunately this doesn't affect the matrix at all, so I came up with a second idea: create a new CSR matrix and concatenate the rows using vstack:
from scipy import sparse
import time

start_time = curr_time = time.time()
mtx = sparse.csr_matrix((0, counts.shape[1]))
for i, row in enumerate(counts):
    prob_row = row / row.sum()
    mtx = sparse.vstack([mtx, prob_row])
    if i % 1000 == 0:
        delta_time = time.time() - curr_time
        total_time = time.time() - start_time
        curr_time = time.time()
        print('step: %i, total time: %i, delta_time: %i' % (i, total_time, delta_time))
This works well at first, but after some iterations it gets slower and slower:
step: 0, total time: 0, delta_time: 0
step: 1000, total time: 1, delta_time: 1
step: 2000, total time: 5, delta_time: 4
step: 3000, total time: 12, delta_time: 6
step: 4000, total time: 23, delta_time: 11
step: 5000, total time: 38, delta_time: 14
step: 6000, total time: 55, delta_time: 17
step: 7000, total time: 88, delta_time: 32
step: 8000, total time: 136, delta_time: 47
step: 9000, total time: 190, delta_time: 53
step: 10000, total time: 250, delta_time: 59
step: 11000, total time: 315, delta_time: 65
step: 12000, total time: 386, delta_time: 70
step: 13000, total time: 462, delta_time: 76
step: 14000, total time: 543, delta_time: 81
step: 15000, total time: 630, delta_time: 86
step: 16000, total time: 722, delta_time: 92
step: 17000, total time: 820, delta_time: 97
Any suggestions? Any idea why vstack gets slower and slower?
vstack is an O(n) operation because it needs to allocate memory for the result and copy the contents of all arrays passed as arguments into the result array. Calling it repeatedly inside the loop therefore makes the total cost grow quadratically with the number of rows.
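One way to avoid the quadratic blow-up (my addition, not part of the original answer) is to collect the normalized rows in a list and call vstack once at the end, so each row is copied only once:

```python
import numpy as np
from scipy import sparse

# Small example matrix standing in for `counts` (hypothetical data).
counts = sparse.csr_matrix(np.array([[1., 0., 2.],
                                     [0., 0., 3.],
                                     [4., 5., 6.]]))

# Dividing a 1xN sparse row by a scalar keeps it sparse; stacking the
# list once at the end copies each row a single time instead of i times.
rows = [row / row.sum() for row in counts]
mtx = sparse.vstack(rows, format='csr')
```

This keeps the loop structure of the question but removes the repeated reallocation.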
You can use the multiply operation instead:
>>> res = counts.multiply(1 / counts.sum(1))  # multiply with the inverse of the row sums
>>> res.todense()
matrix([[ 0.33333333,  0.        ,  0.66666667],
        [ 0.        ,  0.        ,  1.        ],
        [ 0.26666667,  0.33333333,  0.4       ]])
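An equivalent alternative (my own sketch, not from the original answer) is to left-multiply by a sparse diagonal matrix of the inverse row sums; scaling row i by 1/row_sums[i] this way also keeps everything sparse:

```python
import numpy as np
from scipy import sparse

# Same example matrix as in the answer (hypothetical data).
counts = sparse.csr_matrix(np.array([[1., 0., 2.],
                                     [0., 0., 3.],
                                     [4., 5., 6.]]))

# counts.sum(axis=1) returns an (n, 1) matrix; flatten it to a 1-D array.
row_sums = np.asarray(counts.sum(axis=1)).ravel()

# diag(1/row_sums) @ counts scales each row by its inverse sum,
# and the product of two sparse matrices stays sparse.
normalized = sparse.diags(1.0 / row_sums) @ counts
```

This assumes every row has a nonzero sum; zero rows would need to be masked first.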
But it's also quite easy to use np.lib.stride_tricks.as_strided to do the operation you want (and it is relatively performant). The as_strided function allows more complicated operations on an array, for when there is no ready-made method or function for your case.
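As a standalone illustration of the trick (my own sketch, not from the original answer), as_strided can build overlapping views, such as consecutive pairs of an array's elements, without copying any data:

```python
import numpy as np

a = np.array([0, 2, 3, 6], dtype=np.intp)  # an indptr-like array

# View of shape (3, 2) where row i is (a[i], a[i+1]). Both strides equal
# the item size, so consecutive rows overlap by one element -- it is a
# view into `a`, not a copy.
pairs = np.lib.stride_tricks.as_strided(a, shape=(len(a) - 1, 2),
                                        strides=(a.strides[0], a.strides[0]))
```

Because the result is a view with overlapping memory, it should be treated as read-only.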
For example, using the example CSR matrix from the scipy documentation:
>>> from scipy.sparse import csr_matrix
>>> import numpy as np
>>> row = np.array([0, 0, 1, 2, 2, 2])
>>> col = np.array([0, 2, 2, 0, 1, 2])
>>> data = np.array([1., 2, 3, 4, 5, 6])
>>> counts = csr_matrix((data, (row, col)), shape=(3, 3))
>>> counts.todense()
matrix([[ 1.,  0.,  2.],
        [ 0.,  0.,  3.],
        [ 4.,  5.,  6.]])
you can divide each row by its sum like this:
>>> row_start_stop = np.lib.stride_tricks.as_strided(counts.indptr,
...                                                  shape=(counts.shape[0], 2),
...                                                  strides=2*counts.indptr.strides)
>>> for start, stop in row_start_stop:
...     row = counts.data[start:stop]
...     row /= row.sum()
>>> counts.todense()
matrix([[ 0.33333333,  0.        ,  0.66666667],
        [ 0.        ,  0.        ,  1.        ],
        [ 0.26666667,  0.33333333,  0.4       ]])
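The same in-place normalization can be written without stride tricks by indexing indptr directly (a sketch of an equivalent approach, not from the original answer): counts.data[indptr[i]:indptr[i+1]] holds the stored values of row i.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Same example matrix as in the answer.
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1., 2, 3, 4, 5, 6])
counts = csr_matrix((data, (row, col)), shape=(3, 3))

# Normalize each row's stored values in place; the sparsity structure
# (indices, indptr) is never touched, so no memory is reallocated.
for i in range(counts.shape[0]):
    start, stop = counts.indptr[i], counts.indptr[i + 1]
    counts.data[start:stop] /= counts.data[start:stop].sum()
```

This is a bit slower than the strided version for very many rows (a Python-level loop), but it is easier to read and avoids constructing an overlapping view.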