sql - R - Leave one out aggregation on a grouping variable (NA exist)

February 15, 2012

I want to compute new variables for the "have" data set, in R, as follows:

re: the average of the "r" values within a given "cat" value, excluding the specific observation (note: missing data exists, and re should still be the group mean when r is missing).

ie: like re, the average of the "i" responses within a given "cat" value, excluding the specific observation (same missing-data technique).

An example data set and the desired output are given below.

have:

    id  cat  r   i   … (additional variables that need to be retained)
    1   1    1   3   …
    2   1    2   NA  …
    3   1    1   1   …
    4   2    NA  3   …
    5   2    4   5   …
    6   2    4   NA  …

The desired data set ("want") should be:

want:

    id  cat  r   i   re   ie   … (additional variables retained)
    1   1    1   3   1.5  1    …
    2   1    2   NA  1    2    …
    3   1    1   1   1.5  3    …
    4   2    NA  3   ...  ...  …
    5   2    4   5             …
    6   2    4   NA            …

Notably, the following SQL-based solution produces the desired output in SAS, but I am unable to get it working in R (using the sqldf package). One issue I'm aware of is that the missing() function is SAS-specific (not available in SQL universally). That said, it might provide a helpful starting point for an SQL solution using the sqldf package:

    proc sql;
      create table want as
      select *,
        (sum(r) - coalesce(r, 0)) / (count(r) - 1 + missing(r)) as re,
        (sum(i) - coalesce(i, 0)) / (count(i) - 1 + missing(i)) as ie
      from have
      group by cat
      order by id, cat;
    quit;

Many thanks in advance for any help.

With dplyr you can apply a function to a subset of rows without affecting the other rows, a sort of 'windowing' as in SQL, if that is a familiar concept from that domain.

Create a function that carries out the desired calculation for a single group of ids. Then group the rows using group_by(), pipe the result to mutate(), and run your custom function. On grouped data it will operate on one group at a time and give the desired result.

    library(dplyr)

    # Example data
    have <- read.table(header = TRUE, text = "
    id  cat r   i
    1   1   1   3
    2   1   2   NA
    3   1   1   1
    4   2   NA  3
    5   2   4   5
    6   2   4   NA")

    # Create a leave-one-out mean function for a single group of ids
    leave_one_out_mean <- function(x) {
        result <- numeric(length(x))

        for (i in seq_along(x)) {
            # Negative (minus-i) subsetting drops one observation in each iteration,
            # and the na.rm option handles missing values
            result[i] <- mean(x[-i], na.rm = TRUE)
        }

        return(result)
    }

    # Use group_by(), but do _not_ pipe the result through summarize()
    want <- have %>%
        group_by(cat) %>%
        mutate(re = leave_one_out_mean(r),
               ie = leave_one_out_mean(i))

Result:

    want

    Source: local data frame [6 x 6]
    Groups: cat [2]

         id   cat     r     i    re    ie
      <int> <int> <int> <int> <dbl> <dbl>
    1     1     1     1     3   1.5     1
    2     2     1     2    NA   1.0     2
    3     3     1     1     1   1.5     3
    4     4     2    NA     3   4.0     5
    5     5     2     4     5   4.0     3
    6     6     2     4    NA   4.0     4
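As a cross-check, the arithmetic from the SQL query in the question can be reproduced in base R. The sketch below is not part of the original answer: the function name loo_mean_closed is hypothetical, and it assumes the usual SQL/SAS semantics, where sum() and count() skip missing values and missing() returns 1 for a missing value. It uses ave() from base R to apply the function within each "cat" group.

```r
# Closed-form leave-one-out mean for one group (hypothetical helper):
# (group sum - own value) / (non-missing count - 1 + own value is missing)
loo_mean_closed <- function(x) {
    own     <- ifelse(is.na(x), 0, x)   # mirrors coalesce(x, 0)
    total   <- sum(x, na.rm = TRUE)     # group sum, skipping missing values
    n_valid <- sum(!is.na(x))           # count of non-missing values
    (total - own) / (n_valid - 1 + is.na(x))
}

# Same example data as in the question
have <- data.frame(
    id  = 1:6,
    cat = c(1, 1, 1, 2, 2, 2),
    r   = c(1, 2, 1, NA, 4, 4),
    i   = c(3, NA, 1, 3, 5, NA)
)

# ave() applies the function within each "cat" group and keeps row order
have$re <- ave(have$r, have$cat, FUN = loo_mean_closed)
have$ie <- ave(have$i, have$cat, FUN = loo_mean_closed)
print(have)
```

Because it computes each group's sum and count only once, this form avoids recomputing a mean for every row, which may matter for large groups. When no other non-missing value exists in the group, the division is 0/0 and the result is NaN, the same value mean() returns for an empty vector.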

The for loop could be replaced with an apply function, but I wrote it this way to highlight the logic rather than to optimize execution.
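One way the loop-free variant might look is sketched here with vapply() from base R. The name loo_mean_vapply is hypothetical and only illustrates the idea; the leave-one-out logic is unchanged.

```r
# Leave-one-out mean without an explicit for loop (hypothetical helper).
# vapply() calls the anonymous function once per position and
# checks that each call returns a single numeric value.
loo_mean_vapply <- function(x) {
    vapply(seq_along(x),
           function(i) mean(x[-i], na.rm = TRUE),
           numeric(1))
}

# The "r" and "cat" columns from the example data
r      <- c(1, 2, 1, NA, 4, 4)
cat_id <- c(1, 1, 1, 2, 2, 2)

# Apply within each group; ave() is used here only to keep the sketch in base R
re <- ave(r, cat_id, FUN = loo_mean_vapply)
print(re)
```

The same function should also work as a drop-in replacement inside mutate() on grouped data, since it takes a vector and returns a vector of the same length.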
