sql - R - Leave-one-out aggregation on a grouping variable (NAs exist)


I want to compute new variables for the "have" data set, in R, as follows:

re: the average of the "r" values within a given "cat" value, excluding the specific observation (note: missing data exists, and re should be the group mean of r when r is missing).

ie: like re, the average of the "i" responses within a given "cat" value, excluding the specific observation (same missing-data technique).

An example data set and the desired output are given below.

have:

    id  cat  r   i   … (additional variables that need to be retained)
    1   1    1   3   …
    2   1    2   NA  …
    3   1    1   1   …
    4   2    NA  3   …
    5   2    4   5   …
    6   2    4   NA  …

The desired data set ("want") should be:

want:

    id  cat  r   i   re   ie   … (additional variables retained)
    1   1    1   3   1.5  1    …
    2   1    2   NA  1    2    …
    3   1    1   1   1.5  3    …
    4   2    NA  3   ...  ...  …
    5   2    4   5             …
    6   2    4   NA            …

Notably, the following SQL-based solution produces the desired output in SAS, but I am unable to get it working in R (using the sqldf package). One issue I'm aware of is that the missing() function is SAS-specific (not available in SQL universally). That said, it might provide a helpful starting point for an SQL solution using the sqldf package:

    proc sql;
      create table want as
      select *,
        (sum(r) - coalesce(r, 0)) / (count(r) - 1 + missing(r)) as re,
        (sum(i) - coalesce(i, 0)) / (count(i) - 1 + missing(i)) as ie
      from have
      group by cat
      order by id, cat;
    quit;

Many thanks in advance for your help.

With dplyr you can apply a function to a subset of rows without affecting the other rows, much like 'windowing' in SQL, if that is a familiar concept in your domain.

Create a function that carries out the desired calculation for a single group of ids. Group the rows using group_by(), then pipe the result into mutate() and run your custom function. On grouped data, mutate() works on one group at a time, which gives the desired result.

    library(dplyr)

    # Example data
    have <- read.table(header = TRUE, text =
    "id  cat r   i
     1   1   1   3
     2   1   2   NA
     3   1   1   1
     4   2   NA  3
     5   2   4   5
     6   2   4   NA")

    # Create a leave-one-out mean function for a single group of ids
    leave_one_out_mean <- function(x) {
        result <- numeric(length(x))

        for (i in seq_along(x)) {
            # Negative indexing, x[-i], drops observation i in each iteration,
            # and na.rm = TRUE handles the missing values
            result[i] <- mean(x[-i], na.rm = TRUE)
        }

        return(result)
    }

    # Use group_by() but do _not_ pipe the result through summarize()
    want <- have %>%
        group_by(cat) %>%
        mutate(re = leave_one_out_mean(r),
               ie = leave_one_out_mean(i))

Result:

    want

    Source: local data frame [6 x 6]
    Groups: cat [2]

         id   cat     r     i    re    ie
      <int> <int> <int> <int> <dbl> <dbl>
    1     1     1     1     3   1.5     1
    2     2     1     2    NA   1.0     2
    3     3     1     1     1   1.5     3
    4     4     2    NA     3   4.0     5
    5     5     2     4     5   4.0     3
    6     6     2     4    NA   4.0     4
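As a cross-check, the sum-minus-self arithmetic from the SAS query in the question can be translated to base R. This is only a sketch: `loo_mean_closed_form` is a hypothetical helper name, not part of dplyr or of the original answer, and the data frame is retyped by hand from the example.

```r
# Example data from the question, retyped by hand
have <- data.frame(
  id  = 1:6,
  cat = c(1, 1, 1, 2, 2, 2),
  r   = c(1, 2, 1, NA, 4, 4),
  i   = c(3, NA, 1, 3, 5, NA)
)

# Hypothetical helper: leave-one-out mean of x within the groups g,
# computed as (group sum - own value) / (group count - own count).
# A missing observation contributes nothing to the sum or the count,
# so for it the result is simply the mean of the group's non-missing values.
loo_mean_closed_form <- function(x, g) {
  group_sum <- ave(x, g, FUN = function(v) sum(v, na.rm = TRUE))
  group_n   <- ave(x, g, FUN = function(v) sum(!is.na(v)))
  own_value <- ifelse(is.na(x), 0, x)   # plays the role of coalesce(x, 0)
  own_count <- as.numeric(!is.na(x))    # plays the role of 1 - missing(x)
  (group_sum - own_value) / (group_n - own_count)
}

have$re <- loo_mean_closed_form(have$r, have$cat)
have$ie <- loo_mean_closed_form(have$i, have$cat)
print(have)
```

On the example data this gives the same re and ie values as the result shown above. If a group has no other non-missing value, the denominator is zero and the result is NaN, which is also what mean() returns for an empty vector.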

The for loop could be replaced with an apply function, but I wrote it this way to highlight the logic rather than to optimize execution.
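For reference, an apply-style version might look like the sketch below. It uses vapply() from the apply family, and the name `leave_one_out_mean_apply` is made up for illustration. The example values are the r column of the question's data, and base R's ave() stands in for the grouped mutate() so that the snippet runs without dplyr.

```r
# Hypothetical apply-style variant of the leave-one-out mean for one group.
# vapply() visits each position k and averages everything except x[k].
leave_one_out_mean_apply <- function(x) {
  vapply(seq_along(x),
         function(k) mean(x[-k], na.rm = TRUE),
         numeric(1))
}

# The r column and its grouping variable from the example data
r   <- c(1, 2, 1, NA, 4, 4)
cat <- c(1, 1, 1, 2, 2, 2)

# ave() applies the function within each level of cat and keeps row order
re <- ave(r, cat, FUN = leave_one_out_mean_apply)
print(re)
```

Inside the dplyr pipeline it could be dropped in wherever leave_one_out_mean() is called. It still computes one mean per row, so it is mainly a matter of style rather than speed.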

