r - TM DocumentTermMatrix gives results which are unexpected given the corpus -

June 15, 2014

Maybe I misinterpret how tm::DocumentTermMatrix works. I have a corpus which, after preprocessing, looks like this:

    head(description.text, 3)
    [1] "azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram"
    [2] "torin croll controsoffitt repart pediatr martin mag cartell compars sest pian ospedal martin torin ospedal tofan sol due anno riapertur"
    [3] "ospedal martin croll controsoffitt repart pediatr mag ospedal martin croll controsoffitt repart pediatr distacc intonac avven nott mattin"

which I process via:

    description.text.features <- DocumentTermMatrix(
        Corpus(VectorSource(description.text)),
        list(
            bounds = list(local = c(3, Inf)),
            tokenize = 'scan'
        )
    )

When I inspect the first row of the DTM, I get this:

    inspect(description.text.features[1,])
    <<DocumentTermMatrix (documents: 1, terms: 887)>>
    Non-/sparse entries: 0/887
    Sparsity           : 100%
    Maximal term length: 15
    Weighting          : term frequency (tf)
    Sample             :
        Terms
    Docs banc camill mar martin ospedal presid san sanitar torin vittor
       1    0      0   0      0       0      0   0       0     0      0

These terms don't correspond to the first document in the corpus description.text (e.g. banc or camill are not in the first document, and there is a zero for e.g. martin or presid, which are).

Furthermore, if I run:

    description.text.features[1,] %>% as.matrix() %>% sum

I get zero, showing that in the first document there are no terms with frequency > zero!

what's going on here?

Thanks

Update

I created my own 'corpus to DTM' function, and indeed it gives different results. Apart from the document term weights, which differ from those of tm::DocumentTermMatrix (mine are what you would expect given the corpus), I get many more terms with my function than with the tm function (~3000 vs. ~800 for tm).

Here's my function:

    corpus.to.dtm <- function(corpus, min.doc.freq = 3, minlength = 3, weight.fun = weightTfIdf) {
        library(dplyr)
        library(magrittr)
        library(stringr)   # provides str_length(), which was used but not loaded
        library(tm)
        library(parallel)

        # Vocabulary: terms found in at least min.doc.freq documents
        # and at least minlength characters long
        lvls <- mclapply(corpus, function(doc) words(doc) %>% unique, mc.cores = 8) %>%
            unlist %>%
            table %>%
            data.frame %>%
            set_colnames(c('term', 'freq')) %>%
            mutate(lengths = str_length(term)) %>%
            filter(freq >= min.doc.freq & lengths >= minlength) %>%
            use_series(term)

        # One row of term counts per document, in vocabulary order
        dtm <- mclapply(corpus, function(doc) factor(words(doc), levels = lvls) %>% table %>% as.vector, mc.cores = 8) %>%
            do.call(what = 'rbind') %>%
            set_colnames(lvls)

        # Apply the weighting passed in weight.fun (previously hard-coded to weightTfIdf)
        as.DocumentTermMatrix(dtm, weighting = weight.fun) %>%
            as.matrix() %>%
            as.data.frame()
    }

Here's a workaround using a tm alternative, quanteda. You might find the relative simplicity of the latter, combined with its speed and features, sufficient to use it for the rest of your analysis too!

    description.text <-
      c("azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram",
        "torin croll controsoffitt repart pediatr martin mag cartell compars sest pian ospedal martin torin ospedal tofan sol due anno riapertur",
        "ospedal martin croll controsoffitt repart pediatr mag ospedal martin croll controsoffitt repart pediatr distacc intonac avven nott mattin")

    require(quanteda)
    require(magrittr)

    qdfm <- dfm(description.text)
    head(qdfm, nfeat = 10)
    # Document-feature matrix of: 3 documents, 35 features (56.2% sparse).
    # (showing first 3 documents and first 10 features)
    #        features
    # docs    azi sanitar local to1 presid osp martin ospedalier tofan torin
    #   text1   1       1     1   1      2   1      2          1     1     1
    #   text2   0       0     0   0      0   0      2          0     1     2
    #   text3   0       0     0   0      0   0      2          0     0     0

    qdfm2 <- qdfm %>% dfm_trim(min_count = 3, min_docfreq = 3)
    qdfm2
    # Document-feature matrix of: 3 documents, 2 features (0% sparse).
    # (showing first 3 documents and first 2 features)
    #        features
    # docs    martin ospedal
    #   text1      2       1
    #   text2      2       2
    #   text3      2       2
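The calls above reflect the quanteda interface at the time of the answer. As far as I know, more recent quanteda releases expect a tokens object as input to dfm() and name the trimming argument min_termfreq rather than min_count. A minimal sketch of the same steps under that assumption (check it against your installed version):

```r
library(quanteda)

description.text <- c(
  "azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram",
  "torin croll controsoffitt repart pediatr martin mag cartell compars sest pian ospedal martin torin ospedal tofan sol due anno riapertur",
  "ospedal martin croll controsoffitt repart pediatr mag ospedal martin croll controsoffitt repart pediatr distacc intonac avven nott mattin"
)

# Tokenize first, then build the document-feature matrix from the tokens
toks <- tokens(description.text)
qdfm <- dfm(toks)

# Keep features occurring at least 3 times overall and in at least 3 documents
# (min_termfreq is assumed here to be the newer name for min_count)
qdfm2 <- dfm_trim(qdfm, min_termfreq = 3, min_docfreq = 3)

featnames(qdfm2)
```

For these three documents, only martin and ospedal occur in every document, so they should be the only features left after trimming, in line with the output shown above.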

To convert to tm:

    convert(qdfm2, to = "tm")
    # <<DocumentTermMatrix (documents: 3, terms: 2)>>
    # Non-/sparse entries: 6/0
    # Sparsity           : 0%
    # Maximal term length: 7
    # Weighting          : term frequency (tf)

In your example you use tf-idf weighting. That's also easy in quanteda:

    dfm_weight(qdfm, "tfidf") %>% head
    # Document-feature matrix of: 3 documents, 35 features (56.2% sparse).
    # (showing first 3 documents and first 6 features)
    #          features
    # docs          azi   sanitar     local       to1    presid       osp
    #   text1 0.4771213 0.4771213 0.4771213 0.4771213 0.9542425 0.4771213
    #   text2 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
    #   text3 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
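A possible explanation for the all-zero row, which the answer above does not state: in tm, bounds$local filters on how often a term occurs within a single document, while bounds$global filters on the number of documents a term occurs in. With local = c(3, Inf), any term that appears fewer than three times in a given document is dropped from that document's counts, which would produce the zeros seen in the question. If the intent was "terms found in at least 3 documents", a sketch with global bounds might look like this (VCorpus and the default tokenizer are my choices here, not the question's):

```r
library(tm)

description.text <- c(
  "azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram",
  "torin croll controsoffitt repart pediatr martin mag cartell compars sest pian ospedal martin torin ospedal tofan sol due anno riapertur",
  "ospedal martin croll controsoffitt repart pediatr mag ospedal martin croll controsoffitt repart pediatr distacc intonac avven nott mattin"
)

corpus <- VCorpus(VectorSource(description.text))

# global bounds act on document frequency: keep terms present in >= 3 documents
dtm_global <- DocumentTermMatrix(
  corpus,
  control = list(bounds = list(global = c(3, Inf)))
)

Terms(dtm_global)
as.matrix(dtm_global)
```

On this three-document sample, the surviving terms should match the trimmed quanteda result (martin and ospedal), with non-zero counts in the first row.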
