R - Finding best reciprocal matches
I have this problem of finding the best reciprocal matches between two sets of strings that were mapped one against the other.
What I have are two data.frames that hold the results of mapping the first set against the other (df1 below), and of the other mapping direction (df2 below).
Each mapping obtains a score. In addition, every string has a parent, and several different strings may have the same parent.
Here is simulated data:
library(dplyr)
set.seed(1)

# generate the first string set mapping data.frame
df1 <- data.frame(query = paste("a", unlist(sapply(1:5, function(i) rep(i, ceiling(runif(1, 1, 3))))), sep = "."),
                  stringsAsFactors = FALSE)
# add hits and scores
df1 <- do.call(rbind, lapply(unique(df1$query), function(q) {
  dplyr::filter(df1, query == q) %>%
    dplyr::mutate(hit = paste("b", sample(5, nrow(.), replace = FALSE), sep = "."),
                  score = runif(nrow(.), 0, 100))
}))
# add parents
df1 <- left_join(df1,
                 data.frame(query = unique(df1$query),
                            parent = paste("p.a", sample(3, 5, replace = TRUE), sep = "."),
                            stringsAsFactors = FALSE),
                 by = c("query" = "query"))

# generate the second string set mapping data.frame
df2 <- data.frame(query = paste("b", unlist(sapply(1:5, function(i) rep(i, ceiling(runif(1, 1, 3))))), sep = "."),
                  stringsAsFactors = FALSE)
# add hits and scores
df2 <- do.call(rbind, lapply(unique(df2$query), function(q) {
  dplyr::filter(df2, query == q) %>%
    dplyr::mutate(hit = paste("a", sample(5, nrow(.), replace = FALSE), sep = "."),
                  score = runif(nrow(.), 0, 100))
}))
# add parents
df2 <- left_join(df2,
                 data.frame(query = unique(df2$query),
                            parent = paste("p.b", sample(3, 5, replace = TRUE), sep = "."),
                            stringsAsFactors = FALSE),
                 by = c("query" = "query"))
What I want to have is an efficient function that returns the following:
For each string in the first set (unique(df1$query)), if the string it maps to with the highest score also maps back to it with the highest score (a reciprocal best hit), return the hit's id, the mapping score, and the hit's parent id; otherwise return NA for the hit's id and score. As for the parent, return the hit's parent id if the parents are best reciprocal matches, otherwise return NA.
The resulting data.frame should have these columns: set.a.id, set.a.parent.id, set.b.match, set.b.match.score, set.b.parent.id.
For df1 and df2, ordered by query and score:
df1[order(df1$query,-df1$score),]
   query hit    score parent
1    a.1 b.4 84.72860  p.a.1
2    a.1 b.1 57.64333  p.a.1
3    a.1 b.2 56.02473  p.a.1
5    a.2 b.3 82.18000  p.a.2
4    a.2 b.4 72.56507  p.a.2
6    a.2 b.1 59.24316  p.a.2
7    a.3 b.3 71.47014  p.a.3
8    a.3 b.4 22.60246  p.a.3
10   a.4 b.2 35.13778  p.a.2
11   a.4 b.1 26.89667  p.a.2
9    a.4 b.4 20.33758  p.a.2
12   a.5 b.2 26.85910  p.a.2
13   a.5 b.4 26.77832  p.a.2

df2[order(df2$query,-df2$score),]
   query hit    score parent
1    b.1 a.2 90.76623  p.b.2
2    b.1 a.5 24.57950  p.b.2
3    b.2 a.5 39.58975  p.b.1
4    b.2 a.1 39.11946  p.b.1
6    b.3 a.5 35.94917  p.b.1
5    b.3 a.4 11.57524  p.b.1
8    b.4 a.5 80.93301  p.b.2
7    b.4 a.4 57.07636  p.b.2
9    b.4 a.3 10.13098  p.b.2
11   b.5 a.5 49.70569  p.b.1
10   b.5 a.2 12.88346  p.b.1
This is the resulting data.frame I expect:
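# one parent per unique query (unique(df1$parent) has only 3 values, which does not fit 5 rows)
res.df <- data.frame(set.a.id = unique(df1$query),
                     set.a.parent.id = df1$parent[!duplicated(df1$query)],
                     set.b.match = c(NA, NA, NA, NA, "b.2"),
                     set.b.match.score = c(NA, NA, NA, NA, 26.85910),
                     set.b.parent.id = c(NA, "p.b.1", NA, "p.b.1", "p.b.1"))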
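To make the rules above concrete, here is a minimal base R sketch of how such a function might look. It is not a definitive solution: the names best_hit and reciprocal_best_hits are made up for illustration, and the parent-level rule is an assumption, since the question does not say how two parents are scored against each other. The sketch scores a pair of parents by the highest string-level score between any of their children, and reports the hit's parent only when that parent is the reciprocal best partner of the query's own parent. Ties are broken by keeping the first row, and query, hit and parent are assumed to be character columns.

```r
# Keep, for every query, the single row with the highest score.
# Ties are resolved by keeping the first such row (an arbitrary choice).
best_hit <- function(df) {
  df <- df[order(df$query, -df$score), ]
  df[!duplicated(df$query), c("query", "hit", "score")]
}

# Hypothetical sketch of the requested function (name and parent rule are assumptions).
reciprocal_best_hits <- function(df1, df2) {
  # String level: best b-hit of every a-string, and where that hit maps back to.
  a <- best_hit(df1)
  b <- best_hit(df2)
  back <- b$hit[match(a$hit, b$query)]
  is_rbh <- !is.na(back) & back == a$query

  # Parent level (assumption): re-express every mapping as parent -> parent,
  # so the best hit of a parent is the parent of its highest-scoring child hit.
  p1 <- data.frame(query = df1$parent,
                   hit = df2$parent[match(df1$hit, df2$query)],
                   score = df1$score, stringsAsFactors = FALSE)
  p2 <- data.frame(query = df2$parent,
                   hit = df1$parent[match(df2$hit, df1$query)],
                   score = df2$score, stringsAsFactors = FALSE)
  pa_best <- best_hit(p1[!is.na(p1$hit), ])
  pb_best <- best_hit(p2[!is.na(p2$hit), ])
  p_back <- pb_best$hit[match(pa_best$hit, pb_best$query)]
  p_rbh <- pa_best[!is.na(p_back) & p_back == pa_best$query, ]

  # Report the hit's parent only if it is the reciprocal partner of the query's parent.
  a_parent <- df1$parent[match(a$query, df1$query)]
  hit_parent <- df2$parent[match(a$hit, df2$query)]
  partner <- p_rbh$hit[match(a_parent, p_rbh$query)]
  parent_ok <- !is.na(partner) & !is.na(hit_parent) & partner == hit_parent

  data.frame(set.a.id = a$query,
             set.a.parent.id = a_parent,
             set.b.match = ifelse(is_rbh, a$hit, NA_character_),
             set.b.match.score = ifelse(is_rbh, a$score, NA_real_),
             set.b.parent.id = ifelse(parent_ok, hit_parent, NA_character_),
             stringsAsFactors = FALSE, row.names = NULL)
}
```

Calling reciprocal_best_hits(df1, df2) on the two tables printed above should give a.5 and b.2 as the only string-level reciprocal pair, and p.b.1 as the parent match for a.2, a.4 and a.5, in line with the expected result. Everything is done with order, duplicated and match rather than a per-query loop, which is one way to keep it reasonably fast on larger inputs.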