python - sklearn CountVectorizer returning all zeros - string conversion issue?
I am trying to use sklearn's CountVectorizer with a given vocabulary. The vocabulary is:
['humanitarian crisis', 'vacations anti-cruise crowd', 'school textbook', "b'cruise vacations anti-cruise", 'budget deal', "b'public school", 'u.n. announces', 'wrong petrol', 'vacations anti-cruise', "b'cruise vacations anti-cruise crowd"]
The input I want to vectorize is taken from a pandas DataFrame. It was read in from a CSV using pd.read_csv with encoding='utf8':
29371    b'9 quirky , brilliant paris boutiques'
20525    b'public school textbook filled muslim bi...
2871     b'congress focuses on averting shutdown, t...
29902    b'yarmouk siege: u.n. announces trip syria ...
45596    b'fracking protesters arrested gluing them...
6266     b'cruise vacations anti-cruise crowd'
After I call CountVectorizer(vocabulary=vocabulary).fit_transform() on this input, I get a matrix of zeros:
(<6x10 sparse matrix of type '<type 'numpy.int64'>' 0 stored elements in compressed sparse row format>, <class 'scipy.sparse.csr.csr_matrix'>)
Is the problem caused by the string types, or by how I'm calling CountVectorizer? I'm not sure how else to convert the string types; I've tried multiple different calls to encode and decode, both in Python 2.7 and in pandas. Any suggestions are appreciated.
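For reference, here is a minimal sketch that reproduces an all-zeros result without any byte strings involved. It assumes a recent scikit-learn on Python 3 and uses a shortened vocabulary and two made-up documents (both are illustrative, not the original data). It suggests the string type may not be the only factor: with the default ngram_range=(1, 1), CountVectorizer only produces single-word tokens, so multi-word vocabulary entries cannot match.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Shortened, illustrative vocabulary of two-word phrases (subset of the one above).
vocabulary = ['humanitarian crisis', 'school textbook', 'budget deal', 'wrong petrol']

# Hypothetical plain-text documents, already free of the b'...' prefix.
docs = ['public school textbook filled', 'congress passes budget deal']

# Default ngram_range=(1, 1): only unigrams are generated, so no phrase matches.
default_counts = CountVectorizer(vocabulary=vocabulary).fit_transform(docs)

# ngram_range=(1, 2): bigrams are generated too, so the phrases can be counted.
bigram_counts = CountVectorizer(
    vocabulary=vocabulary, ngram_range=(1, 2)
).fit_transform(docs)

print(default_counts.nnz)       # 0 stored elements
print(bigram_counts.toarray())  # one match per document
```

Under the default token_pattern, which keeps only runs of two or more word characters, entries containing punctuation such as 'anti-cruise' or the leading b' would presumably still not match, so the vocabulary itself may also need cleaning or a custom token_pattern.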