python - sklearn CountVectorizer returning all zeros - string conversion issue?


I am trying to use sklearn's CountVectorizer with a given vocabulary. The vocabulary is:

['humanitarian crisis', 'vacations anti-cruise crowd', 'school textbook', "b'cruise vacations anti-cruise", 'budget deal', "b'public school", 'u.n. announces', 'wrong petrol', 'vacations anti-cruise', "b'cruise vacations anti-cruise crowd"] 

The input to vectorize is taken from a pandas DataFrame, read in from a CSV with pd.read_csv and encoding='utf8':

29371            b'9 quirky , brilliant paris boutiques'
20525    b'public school textbook filled muslim bi...
2871     b'congress focuses on averting shutdown, t...
29902    b'yarmouk siege: u.n. announces trip syria ...
45596    b'fracking protesters arrested gluing them...
6266         b'cruise vacations anti-cruise crowd'

After I call CountVectorizer(vocabulary=vocabulary).fit_transform(), I get a matrix of zeros:

(<6x10 sparse matrix of type '<type 'numpy.int64'>'     0 stored elements in compressed sparse row format>, <class 'scipy.sparse.csr.csr_matrix'>) 
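A minimal sketch that reproduces the all-zero result with two of the rows above. It assumes Python 3 and a recent scikit-learn, and it assumes the cells are plain strings that literally contain the b'...' wrapper. It also tries one possible explanation that the post itself does not confirm: every vocabulary entry is a multi-word phrase, while CountVectorizer by default only emits single-word tokens (ngram_range=(1, 1)) and its default token_pattern splits on punctuation such as the hyphen in "anti-cruise". The whitespace-based token_pattern below is an illustrative choice, not a recommendation.

```python
from sklearn.feature_extraction.text import CountVectorizer

vocabulary = [
    'humanitarian crisis', 'vacations anti-cruise crowd', 'school textbook',
    "b'cruise vacations anti-cruise", 'budget deal', "b'public school",
    'u.n. announces', 'wrong petrol', 'vacations anti-cruise',
    "b'cruise vacations anti-cruise crowd",
]

# Two untruncated rows from the DataFrame, assumed to be plain strings
# that literally include the b'...' wrapper.
docs = [
    "b'9 quirky , brilliant paris boutiques'",
    "b'cruise vacations anti-cruise crowd'",
]

# Default settings: only unigrams are generated, so no phrase can match.
X_default = CountVectorizer(vocabulary=vocabulary).fit_transform(docs)

# Assumed adjustment: generate 1- to 4-word n-grams and split on
# whitespace only, so hyphens, dots and quotes stay inside tokens.
X_ngrams = CountVectorizer(
    vocabulary=vocabulary,
    ngram_range=(1, 4),
    token_pattern=r"(?u)\S+",
).fit_transform(docs)

print("stored elements with defaults:", X_default.nnz)
print("stored elements with n-grams: ", X_ngrams.nnz)
```

With whitespace tokenization the trailing quote stays attached to the last word ("crowd'"), so vocabulary entries ending in "crowd" still would not match unless the wrapper is removed first.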

Is the problem caused by the string types, or by how I'm calling CountVectorizer? I'm not sure how else to convert the string types; I've tried multiple different calls to encode and decode in Python 2.7 and pandas. Any suggestions are appreciated.
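On the string-type side, here is a hedged sketch of one way to undo the b'...' wrapper, assuming Python 3 and assuming each cell is a str holding the printed form of a bytes object (the "b'" prefixes in the vocabulary suggest this, but the post does not confirm it). strip_bytes_repr is a hypothetical helper name, not part of pandas or scikit-learn.

```python
import ast


def strip_bytes_repr(text):
    """Turn the string "b'...'" back into plain text; return other input unchanged.

    Hypothetical helper: assumes the cell is a str that holds the repr of
    a bytes object encoded as UTF-8.
    """
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a Python literal at all, so treat it as ordinary text.
        return text
    if isinstance(value, bytes):
        return value.decode("utf8")
    # Some other literal (for example a number): keep the original text.
    return text


print(strip_bytes_repr("b'cruise vacations anti-cruise crowd'"))
print(strip_bytes_repr("already plain text"))
```

Such a helper could be applied to the column with Series.map before vectorizing. The vocabulary entries that start with "b'" would need the same cleanup to stay consistent with the cleaned text.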

