R - Finding best reciprocal matches


I have a problem of finding the best reciprocal matches between two sets of strings that were mapped one against the other.

What I have are two data.frames that hold the results of mapping the first set against the other (df1 below), and of the other mapping direction (df2 below).

Each mapping obtains a score. In addition, every string has a parent, and several different strings can have the same parent.

Here is simulated data:

library(dplyr)
set.seed(1)

# generate the first string set's mapping data.frame
df1 <- data.frame(query = paste("a", unlist(sapply(1:5, function(i) rep(i, ceiling(runif(1, 1, 3))))), sep = "."), stringsAsFactors = FALSE)

# add hits and scores
df1 <- do.call(rbind, lapply(unique(df1$query), function(q) {
  dplyr::filter(df1, query == q) %>%
    dplyr::mutate(hit = paste("b", sample(5, nrow(.), replace = FALSE), sep = "."),
                  score = runif(nrow(.), 0, 100))
}))

# add parents
df1 <- left_join(df1, data.frame(query = unique(df1$query), parent = paste("p.a", sample(3, 5, replace = TRUE), sep = "."), stringsAsFactors = FALSE), by = c("query" = "query"))

# generate the second string set's mapping data.frame
df2 <- data.frame(query = paste("b", unlist(sapply(1:5, function(i) rep(i, ceiling(runif(1, 1, 3))))), sep = "."), stringsAsFactors = FALSE)

# add hits and scores
df2 <- do.call(rbind, lapply(unique(df2$query), function(q) {
  dplyr::filter(df2, query == q) %>%
    dplyr::mutate(hit = paste("a", sample(5, nrow(.), replace = FALSE), sep = "."),
                  score = runif(nrow(.), 0, 100))
}))

# add parents
df2 <- left_join(df2, data.frame(query = unique(df2$query), parent = paste("p.b", sample(3, 5, replace = TRUE), sep = "."), stringsAsFactors = FALSE), by = c("query" = "query"))

What I want to have is an efficient function that returns the following:

For each string in the first set (unique(df1$query)), if the string it maps to with the highest score also maps back to it with the highest score (a reciprocal best hit), return that hit's id, the mapping score, and the hit's parent id; otherwise return NA for the hit's id and score. For the parent, return the hit's parent id if the parents are best reciprocal matches; otherwise return NA.

The resulting data.frame should have these columns: set.a.id, set.a.parent.id, set.b.match, set.b.match.score, set.b.parent.id.

For df1 and df2, ordered by query and score:

df1[order(df1$query,-df1$score),]
   query hit    score parent
1    a.1 b.4 84.72860  p.a.1
2    a.1 b.1 57.64333  p.a.1
3    a.1 b.2 56.02473  p.a.1
5    a.2 b.3 82.18000  p.a.2
4    a.2 b.4 72.56507  p.a.2
6    a.2 b.1 59.24316  p.a.2
7    a.3 b.3 71.47014  p.a.3
8    a.3 b.4 22.60246  p.a.3
10   a.4 b.2 35.13778  p.a.2
11   a.4 b.1 26.89667  p.a.2
9    a.4 b.4 20.33758  p.a.2
12   a.5 b.2 26.85910  p.a.2
13   a.5 b.4 26.77832  p.a.2

df2[order(df2$query,-df2$score),]
   query hit    score parent
1    b.1 a.2 90.76623  p.b.2
2    b.1 a.5 24.57950  p.b.2
3    b.2 a.5 39.58975  p.b.1
4    b.2 a.1 39.11946  p.b.1
6    b.3 a.5 35.94917  p.b.1
5    b.3 a.4 11.57524  p.b.1
8    b.4 a.5 80.93301  p.b.2
7    b.4 a.4 57.07636  p.b.2
9    b.4 a.3 10.13098  p.b.2
11   b.5 a.5 49.70569  p.b.1
10   b.5 a.2 12.88346  p.b.1

This is the resulting data.frame:

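res.df <- data.frame(set.a.id = unique(df1$query),
                     # one parent per unique query (unique(df1$parent) would yield only 3 values for 5 queries)
                     set.a.parent.id = df1$parent[match(unique(df1$query), df1$query)],
                     set.b.match = c(NA, NA, NA, NA, "b.2"),
                     set.b.match.score = c(NA, NA, NA, NA, 26.85910),
                     set.b.parent.id = c(NA, "p.b.1", NA, "p.b.1", "p.b.1"))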
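One possible way to get there, sketched in base R rather than dplyr so it runs without extra packages. The function names best_hits and reciprocal_best_hits are hypothetical, and the data is hard-coded from the printed tables above instead of being regenerated from the seed. The sketch assumes that ties are broken by row order, that a hit's parent can be looked up in the other data.frame's query column, and that a parent's best hit is the parent of the hit in that parent's single highest-scoring row. That last reading is an interpretation of the question, though it does reproduce the expected result on this data.

```r
# Data copied from the printed df1 / df2 tables in the question
df1 <- data.frame(
  query  = rep(c("a.1", "a.2", "a.3", "a.4", "a.5"), times = c(3, 3, 2, 3, 2)),
  hit    = c("b.4", "b.1", "b.2", "b.3", "b.4", "b.1", "b.3", "b.4",
             "b.2", "b.1", "b.4", "b.2", "b.4"),
  score  = c(84.72860, 57.64333, 56.02473, 82.18000, 72.56507, 59.24316,
             71.47014, 22.60246, 35.13778, 26.89667, 20.33758, 26.85910, 26.77832),
  parent = rep(c("p.a.1", "p.a.2", "p.a.3", "p.a.2", "p.a.2"), times = c(3, 3, 2, 3, 2)),
  stringsAsFactors = FALSE)
df2 <- data.frame(
  query  = rep(c("b.1", "b.2", "b.3", "b.4", "b.5"), times = c(2, 2, 2, 3, 2)),
  hit    = c("a.2", "a.5", "a.5", "a.1", "a.5", "a.4", "a.5", "a.4", "a.3", "a.5", "a.2"),
  score  = c(90.76623, 24.57950, 39.58975, 39.11946, 35.94917, 11.57524,
             80.93301, 57.07636, 10.13098, 49.70569, 12.88346),
  parent = rep(c("p.b.2", "p.b.1", "p.b.1", "p.b.2", "p.b.1"), times = c(2, 2, 2, 3, 2)),
  stringsAsFactors = FALSE)

# Hypothetical helper: for each key, keep the row with the highest score
# (ties are resolved by taking the first row in sort order)
best_hits <- function(key, hit, score) {
  ord <- order(key, -score)
  top <- ord[!duplicated(key[ord])]
  data.frame(key = key[top], hit = hit[top], score = score[top],
             stringsAsFactors = FALSE)
}

# Hypothetical helper: TRUE where a key's best hit points back to that key
is_reciprocal <- function(fwd, rev) {
  back <- rev$hit[match(fwd$hit, rev$key)]
  !is.na(back) & back == fwd$key
}

reciprocal_best_hits <- function(df1, df2) {
  # string -> parent lookups (name indexing returns the first match)
  a_par <- setNames(df1$parent, df1$query)
  b_par <- setNames(df2$parent, df2$query)

  # string level
  a_best <- best_hits(df1$query, df1$hit, df1$score)
  b_best <- best_hits(df2$query, df2$hit, df2$score)
  rbh <- is_reciprocal(a_best, b_best)

  # parent level: replace each hit by its parent, then repeat the same logic
  pa_best <- best_hits(df1$parent, unname(b_par[df1$hit]), df1$score)
  pb_best <- best_hits(df2$parent, unname(a_par[df2$hit]), df2$score)
  p_rbh <- is_reciprocal(pa_best, pb_best)
  parent_match <- setNames(ifelse(p_rbh, pa_best$hit, NA_character_), pa_best$key)

  a_parent <- unname(a_par[a_best$key])
  data.frame(
    set.a.id          = a_best$key,
    set.a.parent.id   = a_parent,
    set.b.match       = ifelse(rbh, a_best$hit, NA_character_),
    set.b.match.score = ifelse(rbh, a_best$score, NA_real_),
    set.b.parent.id   = unname(parent_match[a_parent]),
    stringsAsFactors  = FALSE)
}

res <- reciprocal_best_hits(df1, df2)
print(res)
```

Each best-hit table is built with a single order() and duplicated() pass, so there is no per-query loop. Rows come back sorted by set.a.id, which matches unique(df1$query) here but may not for ids that sort differently (for example a.10).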

