sql - R - Leave-one-out aggregation on a grouping variable (NAs exist)
I want to compute new variables for the "have" data set, in R, as follows:
re: the average of the "r" values within a given value of the "cat" variable, excluding the specific observation (note: missing data exists, and when r is missing for an observation, re should be the group mean of the remaining non-missing r values).
ie: like re, the average of the "i" responses within a given value of the "cat" variable, excluding the specific observation (same missing-data technique).
An example data set and the desired output are given below.
have:
    id cat r  i  … (additional variables that need to be retained)
    1  1   1  3  …
    2  1   2  NA …
    3  1   1  1  …
    4  2   NA 3  …
    5  2   4  5  …
    6  2   4  NA …
The desired data set ("want") should be:
want:
    id cat r  i  re  ie  … (additional variables retained)
    1  1   1  3  1.5 1   …
    2  1   2  NA 1   2   …
    3  1   1  1  1.5 3   …
    4  2   NA 3  ... ... …
    5  2   4  5  …
    6  2   4  NA …
Notably, the following SQL-based solution produces the desired output in SAS, but I am unable to get it working in R (using the sqldf package). One issue I'm aware of is that the missing() function is SAS-specific (not available in SQL universally). That said, it might provide a helpful starting point for an SQL solution using the sqldf package:
    proc sql;
      create table want as
      select *,
             (sum(r) - coalesce(r, 0)) / (count(r) - 1 + missing(r)) as re,
             (sum(i) - coalesce(i, 0)) / (count(i) - 1 + missing(i)) as ie
      from have
      group by cat
      order by id, cat;
    quit;
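For reference, the arithmetic in this query is not tied to SQL: subtract the row's own value from the group sum, then divide by the group's non-missing count minus one (adding one back when the row's own value is missing). A minimal base R sketch of that idea, assuming the columns are named `cat`, `r` and `i` as in the example; `loo_mean_sum` is a hypothetical helper name, and `is.na()` stands in for SAS's `missing()`:

```r
# Example data from the question
have <- read.table(header = TRUE, text = "
id cat r i
1 1 1 3
2 1 2 NA
3 1 1 1
4 2 NA 3
5 2 4 5
6 2 4 NA")

# Hypothetical helper: leave-one-out mean from group sums and counts
loo_mean_sum <- function(x, g) {
  x <- as.numeric(x)
  # Group sum of non-missing values, repeated on every row of the group
  grp_sum <- ave(x, g, FUN = function(v) sum(v, na.rm = TRUE))
  # Group count of non-missing values, repeated on every row of the group
  grp_n <- ave(x, g, FUN = function(v) sum(!is.na(v)))
  # Remove the row's own contribution; a missing row contributes nothing
  (grp_sum - ifelse(is.na(x), 0, x)) / (grp_n - 1 + is.na(x))
}

want <- have
want$re <- loo_mean_sum(have$r, have$cat)
want$ie <- loo_mean_sum(have$i, have$cat)
print(want)
```

If a group has no other non-missing values, the denominator is zero and the result is `NaN`, which is also what `mean()` returns for an empty vector.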
Many thanks in advance for any help.
With dplyr you can apply a function to a subset of rows without affecting the other rows, a sort of 'windowing' as in SQL, if that is a familiar concept in your domain.
Create a function to carry out the desired calculation for a single group of ids. Group the rows using group_by(), pipe the result to mutate(), and run your custom function. On grouped data, mutate() will affect one group at a time and give the desired result.
    library(dplyr)

    # Example data
    have <- read.table(header = TRUE, text = "id cat r i
    1 1 1 3
    2 1 2 NA
    3 1 1 1
    4 2 NA 3
    5 2 4 5
    6 2 4 NA")

    # Create a leave-one-out mean function (for a single id group)
    leave_one_out_mean <- function(x) {
      result <- numeric(length(x))
      for (i in seq_along(x)) {
        # Note that minus-i subsetting drops one observation in each iteration,
        # and the na.rm option handles missing values
        result[i] <- mean(x[-i], na.rm = TRUE)
      }
      return(result)
    }

    # Use group_by(), but do _not_ pipe the result through summarize()
    want <- have %>%
      group_by(cat) %>%
      mutate(re = leave_one_out_mean(r),
             ie = leave_one_out_mean(i))
Result:

    want
    Source: local data frame [6 x 6]
    Groups: cat [2]

         id   cat     r     i    re    ie
      <int> <int> <int> <int> <dbl> <dbl>
    1     1     1     1     3   1.5     1
    2     2     1     2    NA   1.0     2
    3     3     1     1     1   1.5     3
    4     4     2    NA     3   4.0     5
    5     5     2     4     5   4.0     3
    6     6     2     4    NA   4.0     4
The for loop could be replaced with an apply function, but I wrote it this way to highlight the logic rather than to optimize execution.
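A loop-free variant along those lines might look like the sketch below. It uses base R's `vapply()`; the name `leave_one_out_mean_v` is made up for illustration, and the logic is otherwise the same minus-i subsetting with `na.rm = TRUE`.

```r
# Hypothetical apply-style version of the leave-one-out mean
leave_one_out_mean_v <- function(x) {
  # For each position i, average everything except x[i], ignoring NAs
  vapply(seq_along(x),
         function(i) mean(x[-i], na.rm = TRUE),
         numeric(1))
}

# The r and i values of the two groups in the example data
print(leave_one_out_mean_v(c(1, 2, 1)))
print(leave_one_out_mean_v(c(NA, 4, 4)))
print(leave_one_out_mean_v(c(3, NA, 1)))
```

It should drop into the same `group_by(cat) %>% mutate(...)` call in place of the loop version, since it also returns one value per input row.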