R - tm DocumentTermMatrix gives results which are unexpected given the corpus


Maybe I misinterpret how tm::DocumentTermMatrix works. I have a corpus which, after preprocessing, looks like this:

    head(description.text, 3)
    [1] "azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram"
    [2] "torin croll controsoffitt repart pediatr martin mag cartell compars sest pian ospedal martin torin ospedal tofan sol due anno riapertur"
    [3] "ospedal martin croll controsoffitt repart pediatr mag ospedal martin croll controsoffitt repart pediatr distacc intonac avven nott mattin"

which I process via:

    description.text.features <- DocumentTermMatrix(Corpus(VectorSource(description.text)), list(
        bounds = list(local = c(3, Inf)),
        tokenize = 'scan'
    ))

When I inspect the first row of the DTM, I get this:

    inspect(description.text.features[1,])
    <<DocumentTermMatrix (documents: 1, terms: 887)>>
    Non-/sparse entries: 0/887
    Sparsity           : 100%
    Maximal term length: 15
    Weighting          : term frequency (tf)
    Sample             :
        Terms
    Docs banc camill mar martin ospedal presid san sanitar torin vittor
       1    0      0   0      0       0      0   0       0     0      0

These terms don't correspond to the first document in the corpus description.text (e.g. banc or camill are not in the first document, and there is a 0 for e.g. martin or presid, which are).

Furthermore, if I run:

    description.text.features[1,] %>% as.matrix() %>% sum

I get zero, showing that in the first document there are no terms with frequency > zero!

What's going on here?

Thanks

Update

I created my own 'corpus to DTM' function, and indeed it gives different results. Apart from the document term weights being different from those of tm::DocumentTermMatrix (mine are what I would expect given the corpus), I get many more terms with my function than with the tm function (~3000 vs. 800 for tm).

Here's my function:

    corpus.to.dtm <- function(corpus, min.doc.freq = 3, minlength = 3, weight.fun = weightTfIdf) {
        library(dplyr)
        library(magrittr)
        library(stringr)   # provides str_length()
        library(tm)
        library(parallel)

        # Terms found in at least min.doc.freq documents and at least minlength characters long
        lvls <- mclapply(corpus, function(doc) words(doc) %>% unique, mc.cores = 8) %>%
            unlist %>%
            table %>%
            data.frame %>%
            set_colnames(c('term', 'freq')) %>%
            mutate(lengths = str_length(as.character(term))) %>%
            filter(freq >= min.doc.freq & lengths >= minlength) %>%
            use_series(term) %>%
            as.character

        # One row of term counts per document, restricted to the retained terms
        dtm <- mclapply(corpus, function(doc) factor(words(doc), levels = lvls) %>% table %>% as.vector, mc.cores = 8) %>%
            do.call(what = 'rbind') %>%
            set_colnames(lvls)

        # Apply the weighting function passed in as weight.fun
        as.DocumentTermMatrix(dtm, weighting = weight.fun) %>%
            as.matrix() %>%
            as.data.frame()
    }

Here's a workaround using an alternative to tm, quanteda. You might find the relative simplicity of the latter, combined with its speed and features, sufficient to use it for the rest of your analysis too!

    description.text <-
      c("azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram",
        "torin croll controsoffitt repart pediatr martin mag cartell compars sest pian ospedal martin torin ospedal tofan sol due anno riapertur",
        "ospedal martin croll controsoffitt repart pediatr mag ospedal martin croll controsoffitt repart pediatr distacc intonac avven nott mattin")

    require(quanteda)
    require(magrittr)

    qdfm <- dfm(description.text)
    head(qdfm, nfeat = 10)
    # Document-feature matrix of: 3 documents, 35 features (56.2% sparse).
    # (showing first 3 documents and first 10 features)
    #        features
    # docs    azi sanitar local to1 presid osp martin ospedalier tofan torin
    #   text1   1       1     1   1      2   1      2          1     1     1
    #   text2   0       0     0   0      0   0      2          0     1     2
    #   text3   0       0     0   0      0   0      2          0     0     0

    qdfm2 <- qdfm %>% dfm_trim(min_count = 3, min_docfreq = 3)
    qdfm2
    # Document-feature matrix of: 3 documents, 2 features (0% sparse).
    # (showing first 3 documents and first 2 features)
    #        features
    # docs    martin ospedal
    #   text1      2       1
    #   text2      2       2
    #   text3      2       2

To convert this to tm:

    convert(qdfm2, to = "tm")
    # <<DocumentTermMatrix (documents: 3, terms: 2)>>
    # Non-/sparse entries: 6/0
    # Sparsity           : 0%
    # Maximal term length: 7
    # Weighting          : term frequency (tf)
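A possible explanation for the all-zero first row in the tm result, offered as a sketch rather than a confirmed diagnosis: tm documents `bounds$local` as a limit on how often a term occurs within a single document, whereas `bounds$global` limits the number of documents a term occurs in. If that is how the installed version behaves, `local = c(3, Inf)` discards every term that appears fewer than three times inside a given document. The base-R illustration below does not call tm at all; the names `local_kept` and `global_kept` are made up for this example, and the two rules are only assumed to mirror tm's behaviour.

```r
# Base-R illustration only; it does not call tm or quanteda.
docs <- c(
  "azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram",
  "torin croll controsoffitt repart pediatr martin mag cartell compars sest pian ospedal martin torin ospedal tofan sol due anno riapertur",
  "ospedal martin croll controsoffitt repart pediatr mag ospedal martin croll controsoffitt repart pediatr distacc intonac avven nott mattin"
)

# Whitespace tokenisation, then a documents x terms count matrix.
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))
counts <- t(vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(counts) <- vocab

# "Local" rule (assumed to mirror bounds$local = c(3, Inf)):
# within each document, zero out counts below 3.
local_kept <- counts
local_kept[counts < 3] <- 0L

# "Global" rule (assumed to mirror bounds$global = c(3, Inf)):
# keep only terms that occur in at least 3 documents.
doc_freq    <- colSums(counts > 0)
global_kept <- counts[, doc_freq >= 3, drop = FALSE]

sum(local_kept)        # 0: no term occurs 3+ times inside any single document
colnames(global_kept)  # "martin" "ospedal"
```

Under the global rule the surviving terms match the `dfm_trim()` result above, so if a document-frequency threshold is what was intended, trying `bounds = list(global = c(3, Inf))` in the tm call may be worth a test.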

In your example you use tf-idf weighting. That's also easy in quanteda:

    dfm_weight(qdfm, "tfidf") %>% head
    # Document-feature matrix of: 3 documents, 35 features (56.2% sparse).
    # (showing first 3 documents and first 6 features)
    #          features
    # docs          azi   sanitar     local       to1    presid       osp
    #   text1 0.4771213 0.4771213 0.4771213 0.4771213 0.9542425 0.4771213
    #   text2 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
    #   text3 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
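The numbers printed above are consistent with a plain "count times log10(N / document frequency)" weighting: 0.4771213 is log10(3), and presid, which occurs twice in text1, gets twice that. The base-R sketch below reproduces those values under that assumption. The formula is inferred from the output, not taken from quanteda's source, and other packages may use a different log base or normalise by document length, so weights need not match across tools.

```r
# Base-R illustration only; the weighting formula is an assumption inferred
# from the printed quanteda output.
docs <- c(
  "azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram",
  "torin croll controsoffitt repart pediatr martin mag cartell compars sest pian ospedal martin torin ospedal tofan sol due anno riapertur",
  "ospedal martin croll controsoffitt repart pediatr mag ospedal martin croll controsoffitt repart pediatr distacc intonac avven nott mattin"
)

# Documents x terms count matrix, terms in order of first appearance.
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- unique(unlist(tokens))
counts <- t(vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(counts) <- vocab

# Inverse document frequency: log10(number of documents / documents containing the term).
n_docs <- length(docs)
idf    <- log10(n_docs / colSums(counts > 0))

# Multiply every column of counts by that term's idf.
tfidf <- sweep(counts, 2, idf, `*`)

round(tfidf[, c("azi", "presid", "martin")], 7)
```

Note that martin occurs in all three documents, so its idf is log10(3/3) = 0 and its weight vanishes everywhere, even though it is one of the most frequent terms.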
