corpus - How to Extract keywords from a Data Frame in R
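I am new to text mining in R. I want to remove stopwords (i.e. extract keywords) from a column of my data frame and put the keywords in a new column.
I tried to make a corpus, but that didn't work for me.
df$c3 is what I have. I would like to add the column df$c4, but I can't get it to work.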
    df <- structure(list(
      c3 = structure(c(3L, 4L, 1L, 7L, 6L, 9L, 5L, 8L, 10L, 2L),
                     .Label = c("are doing good", "for help", "hello everyone",
                                "hope all", "i hope", "i need help", "in life",
                                "it work", "on text-mining", "thanks"),
                     class = "factor"),
      c4 = structure(c(2L, 4L, 1L, 6L, 3L, 7L, 5L, 9L, 8L, 3L),
                     .Label = c("doing good", "everyone", "help", "hope", "hope",
                                "life", "text-mining", "thanks", "work"),
                     class = "factor")),
      .Names = c("c3", "c4"), row.names = c(NA, -10L), class = "data.frame")

    head(df)
    #               c3          c4
    # 1 hello everyone    everyone
    # 2       hope all        hope
    # 3 are doing good  doing good
    # 4        in life        life
    # 5    i need help        help
    # 6 on text-mining text-mining
This solution uses the packages dplyr and tidytext.
    library(dplyr)
    library(tidytext)

    # subset of your dataset
    dt = data.frame(c1 = c(108, 20, 999, 52, 400),
                    c2 = c(1, 3, 7, 6, 9),
                    c3 = c("hello everyone", "hope all", "are doing good", "in life", "i need help"),
                    stringsAsFactors = FALSE)

    # function to combine words (by pasting one next to the other)
    f = function(x) { paste(x, collapse = " ") }

    dt %>%
      unnest_tokens(word, c3) %>%              # split phrases into words
      filter(!word %in% stop_words$word) %>%   # keep appropriate words
      group_by(c1, c2) %>%                     # for each combination of c1 and c2
      summarise(word = f(word)) %>%            # combine multiple words (if there are multiple)
      ungroup()                                # forget the grouping

    # # A tibble: 2 x 3
    #      c1    c2 word
    #   <dbl> <dbl> <chr>
    # 1    20     3 hope
    # 2    52     6 life
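To see why so few rows survive the filter, it can help to flag each token rather than drop it. The following is only a sketch reusing the example data; the names `flagged` and `is_stop` are introduced here for illustration and are not part of tidytext.

```r
library(dplyr)
library(tidytext)

# Same phrases as in the example data above
dt <- data.frame(
  c1 = c(108, 20, 999, 52, 400),
  c3 = c("hello everyone", "hope all", "are doing good", "in life", "i need help"),
  stringsAsFactors = FALSE
)

# Flag each token instead of dropping it, so the removed words stay visible
flagged <- dt %>%
  unnest_tokens(word, c3) %>%                      # one row per word
  mutate(is_stop = word %in% stop_words$word)      # TRUE if the built-in list would remove it

flagged
```

Every word marked `TRUE` in `is_stop` is one that the `filter()` step above would remove, which makes it easier to decide which words to keep by hand.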
The problem here is that the "stop words" built into that package filter out some of the words you want to keep. Therefore, you have to add a manual step where you specify the words you need to include. You can do something like this:
    dt %>%
      unnest_tokens(word, c3) %>%              # split phrases into words
      filter(!word %in% stop_words$word |
             word %in% c("everyone", "doing", "good")) %>%   # keep appropriate words
      group_by(c1, c2) %>%                     # for each combination of c1 and c2
      summarise(word = f(word)) %>%            # combine multiple words (if there are multiple)
      ungroup()                                # forget the grouping

    # # A tibble: 4 x 3
    #      c1    c2 word
    #   <dbl> <dbl> <chr>
    # 1    20     3 hope
    # 2    52     6 life
    # 3   108     1 everyone
    # 4   999     7 doing good
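If the built-in list stays too aggressive, another option is to skip it and supply your own stop words. The sketch below uses base R only and writes the result straight into a new column `c4`, as the question asks. The vector `my_stop_words` and the helper `extract_keywords` are hypothetical names made up for this example, and the word list is hand-picked for the sample phrases, not taken from any package.

```r
# Toy data matching the c3 column from the question
df <- data.frame(
  c3 = c("hello everyone", "hope all", "are doing good", "in life", "i need help"),
  stringsAsFactors = FALSE
)

# Hypothetical, hand-picked stop word list (an assumption, not tidytext's stop_words)
my_stop_words <- c("hello", "all", "are", "in", "i", "need")

# Remove the stop words from one phrase and paste the remaining words back together
extract_keywords <- function(phrase, stop_words) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  paste(words[!words %in% stop_words], collapse = " ")
}

# vapply returns exactly one string per row, so the new column lines up with c3
df$c4 <- vapply(df$c3, extract_keywords, character(1),
                stop_words = my_stop_words, USE.NAMES = FALSE)

df
```

Unlike the grouped pipeline above, this keeps every row: a phrase made up entirely of stop words ends up with an empty string in `c4` instead of disappearing from the result.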