r - Still struggling with handling large data set -


I have been reading around on this website, but I haven't been able to find an exact answer. If one exists, I apologize for the repost.

I am working with data sets that are extremely large (600 million rows, 64 columns, on a computer with 32 GB of RAM). I only need smaller subsets of this data, but I am struggling to perform any operations besides importing one data set with fread and selecting the 5 columns I need. After that, when I try to overwrite the data set with only the rows matching the specific conditions I need, I hit the RAM cap and get the message "Error: cannot allocate vector of size 4.5 GB". I looked at the ff and bigmemory packages as alternatives, but it seems you can't subset before importing with those packages? Is there a solution to this problem besides upgrading the RAM on my computer?

The tasks I am trying to perform:

> library(data.table)
> sampletable <- fread("my.csv", header = TRUE, sep = ",",
                       select = c("column1", "column2", "column7", "column12", "column15"))
> sampletable2 <- sampletable[column1 == "6" & column7 == "1"]

At that point, I hit the memory cap. Would it be better to try a different package and import all 64 columns of the 600 million rows? I don't want to spend hours upon hours performing a single import.

What about reading the csv file in chunks:

# define the subset of columns to keep
csv <- "my.csv"
colnames <- names(read.csv(csv, header = TRUE, nrows = 1))
colclasses <- rep(list(NULL), length(colnames))
ind <- c(1, 2, 7, 12, 15)
colclasses[ind] <- "double"

# read the header and the first data line
library(dplyr)
l_df <- list()
con <- file(csv, "rt")
df <- read.csv(con, header = TRUE, nrows = 1, colClasses = colclasses)
names(df) <- paste0("V", ind)
df <- filter(df, V1 == 6, V7 == 1)
l_df[[i <- 1]] <- df

# read the other lines in chunks, filter each chunk, and combine
repeat {
  i <- i + 1
  df <- read.csv(con, header = FALSE, nrows = 9973, colClasses = colclasses)
  names(df) <- paste0("V", ind)
  l_df[[i]] <- filter(df, V1 == 6, V7 == 1)
  if (nrow(df) < 9973) break
}
close(con)
df <- do.call("rbind", l_df)
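Side note: if data.table is already loaded (it is needed for the fread call in the question anyway), rbindlist() should also work for the final combine and is usually faster than do.call("rbind", ...) when there are many chunks; it returns a data.table rather than a data.frame.

library(data.table)
# bind the list of filtered chunks into a single data.table in one pass
df <- rbindlist(l_df)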

9973 is an arbitrary prime number, chosen because it has little chance of being an exact divisor of nlines - 1; otherwise the final read.csv call would be made on an empty connection and fail.
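Alternatively, the readr package wraps this chunked pattern in a single call. A minimal sketch, assuming the five columns are numeric (if they are actually stored as text, swap col_double() for col_character() and compare against "6" and "1" instead):

library(readr)
library(dplyr)

# read only the five needed columns; all other columns are skipped
keep_cols <- cols_only(
  column1  = col_double(),
  column2  = col_double(),
  column7  = col_double(),
  column12 = col_double(),
  column15 = col_double()
)

# filter each chunk as it is read, so only matching rows accumulate in memory
df <- read_csv_chunked(
  "my.csv",
  callback   = DataFrameCallback$new(function(x, pos) filter(x, column1 == 6, column7 == 1)),
  chunk_size = 100000,
  col_types  = keep_cols
)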

