Benchmarking Theano and CNTK with a simple matrix-vector product on the GPU


I want to compare the performance of Theano and CNTK on a simple task: a matrix-vector product on the GPU. I am using Theano 0.9.0 and CNTK 2.0.

I want to measure only the time consumed by computation on the device, excluding the time used for data transfer from host to device or vice versa.

The result I got is shown in the figure (timings, Theano vs. CNTK). Here n is the number of repetitions, and d, the size of the matrix, is set to 10000.

Question 1:

It seems that the time used for preparation (compiling the computational graph?) is included in the first execution of the mat-vec product in the CNTK case. Is there a way to split preparation and execution in CNTK, as in the Theano case?

Question 2:

I am used to Theano but totally new to CNTK, so I am not quite sure whether the CNTK code is equivalent to the Theano code. In particular, I am not sure whether the operation in the loop of the CNTK code is confined to the device, since prod.eval() returns a numpy.ndarray. Am I missing something?

The code I used to measure the timings:

    import numpy as np
    import time

    # Theano
    def test_matvecdot_theano(d, n):
        import theano
        import theano.tensor as t
        a_cpu = np.random.normal(size=[d, d]).astype(np.float32)
        x_cpu = np.random.normal(size=[d]).astype(np.float32)
        # Shared variables live on the GPU when Theano is configured to use it
        a_gpu = theano.shared(a_cpu)
        x_gpu = theano.shared(x_cpu)
        b_gpu = theano.shared(x_cpu)
        b_gpu_new = t.dot(a_gpu, x_gpu)
        # No inputs or outputs: the result is written to b_gpu via updates
        fnc = theano.function(inputs=[], outputs=None,
                              updates=[(b_gpu, b_gpu_new)],
                              allow_input_downcast=True)
        tic = time.time()
        for i in range(n):
            fnc()
        toc = time.time()
        print("time_theano:", toc - tic)

    # CNTK
    def test_matvecdot_cntk(d, n):
        import cntk as c
        a_cpu = np.random.normal(size=[d, d]).astype(np.float32)
        x_cpu = np.random.normal(size=[d, 1]).astype(np.float32)
        a_c = c.parameter(init=a_cpu, dtype=np.float32)
        x_c = c.parameter(init=x_cpu, dtype=np.float32)
        b_c = c.parameter(init=x_cpu, dtype=np.float32)
        prod = c.times(a_c, x_c)
        tic = time.time()
        for i in range(n):
            b_c.value = prod.eval()  # is this operation confined to the device?
        toc = time.time()
        print("time_cntk:", toc - tic)

The short answer is no, the operation is not confined to the device. Here is what happens: when you call eval(), the call goes to C++, which performs the operation on the device if possible. On the way out of C++, CNTK checks the value of the as_numpy keyword argument, which defaults to True. When as_numpy is True, the GPU buffer is eagerly copied into a NumPy array.

If you call prod.eval(as_numpy=False), the call to eval will not convert the GPU buffer to a NumPy array. If you assign the result to a plain old variable, you can see that you get a CNTK Value object. In your code, however, you assign to the .value attribute of b_c. This assignment is handled by the setter of the value property (since this answer is getting a little technical, I'm including this link for the sake of other readers). CNTK does this assignment on the device, although it's hard to tell. That is because if you try to inspect b_c.value, you are calling the .value property getter, which gives you a NumPy array. So it looks as if the result is a NumPy array, but that is only a consequence of using b_c.value. Any other variable would let you see that you have a CNTK Value object. Again, all of this applies only when you call eval(as_numpy=False).
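To see how much of the measured time is the device-to-host copy, one option is to time the same loop with and without as_numpy. The helper below is only a sketch: time_eval is a hypothetical name, and it assumes nothing about its fn argument beyond an eval(as_numpy=...) method, which is what the answer describes for CNTK functions.

```python
import time

def time_eval(fn, n, as_numpy):
    """Call fn.eval(as_numpy=...) n times; return (elapsed seconds, last result)."""
    result = None
    tic = time.perf_counter()
    for _ in range(n):
        result = fn.eval(as_numpy=as_numpy)
    toc = time.perf_counter()
    return toc - tic, result

# Hypothetical usage with `prod` from the question's CNTK code:
#   t_copy, out_np = time_eval(prod, 100, as_numpy=True)     # includes the copy to NumPy
#   t_nocopy, out_v = time_eval(prod, 100, as_numpy=False)   # result stays a CNTK Value
#   print(type(out_np), type(out_v))
```

Printing the two result types is a quick way to confirm that the second variant never produced a NumPy array.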

Furthermore, CNTK uses timestamps, so the evaluation above happens only once on the GPU. The subsequent n-1 calls to eval() return the same Value object (the conversion to NumPy still happens each time, though, unless you specify as_numpy=False).
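The effect of timestamp-based caching on this benchmark can be pictured with a toy model. The classes below are purely illustrative and are not CNTK's implementation: Source and CachedNode are made-up names, and the only assumption carried over from the answer is that a result is recomputed only when an input has changed since the last evaluation. In the question's loop only b_c is written, and prod depends on a_c and x_c alone, so nothing invalidates the cached result.

```python
class Source:
    """Toy input whose timestamp advances every time its value is set."""
    def __init__(self, value):
        self._value = value
        self.stamp = 0

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        self._value = new_value
        self.stamp += 1


class CachedNode:
    """Toy node that recomputes only when its input's timestamp has changed."""
    def __init__(self, compute, source):
        self.compute = compute
        self.source = source
        self.seen_stamp = None   # timestamp of the input at the last computation
        self.cached = None
        self.evaluations = 0     # number of real computations performed

    def eval(self):
        if self.seen_stamp != self.source.stamp:
            self.cached = self.compute(self.source.value)
            self.seen_stamp = self.source.stamp
            self.evaluations += 1
        return self.cached


src = Source([1.0, 2.0])
node = CachedNode(sum, src)
for _ in range(5):
    node.eval()
print(node.evaluations)  # 1: only the first call did any work

src.value = [3.0, 4.0]   # changing the input invalidates the cache
node.eval()
print(node.evaluations)  # 2
```

Under this model, a loop of n evaluations with unchanged inputs measures one computation plus n-1 cache hits, which is why the loop says little about the cost of the product itself.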

Finally, I don't expect you to learn many meaningful lessons from this benchmark: both CNTK and Theano are calling the same cuDNN implementation. The advantages of CNTK are more around higher-level things, such as: (a) it comes with a high-level library, (b) the user doesn't have to worry about batch and sequence axes except for a few specialized operations, (c) efficient recurrent networks, (d) efficient I/O, and (e) easy distributed training.

And to answer your question about setup time: my understanding is that if you eval the function just once, that will compile it. CNTK actually has two kinds of compilation: if you call eval the first time, it compiles the forward pass. If you later call function.grad, it throws away the eval compilation and compiles again so that it can handle both the forward and the backward pass.
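Given that the first eval triggers compilation, a simple way to separate preparation from execution in a measurement is to time the first call on its own and the remaining calls together. This is a minimal sketch: time_first_and_rest is a hypothetical helper, and it accepts any zero-argument callable, so it makes no assumption about CNTK or Theano internals.

```python
import time

def time_first_and_rest(fn, n):
    """Time the first call to fn separately from the remaining n-1 calls."""
    tic = time.perf_counter()
    fn()                      # first call: may include one-time setup or compilation
    first = time.perf_counter() - tic

    tic = time.perf_counter()
    for _ in range(n - 1):    # remaining calls: steady-state cost
        fn()
    rest = time.perf_counter() - tic
    return first, rest

# Hypothetical usage with `prod` from the question's CNTK code:
#   first, rest = time_first_and_rest(lambda: prod.eval(as_numpy=False), 100)
#   print("first call:", first, "remaining calls:", rest)
```

Because repeated eval() calls with unchanged inputs may return a cached result, as described earlier in this answer, the second number should be read with that in mind.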

