Alternative approach to using agrep() for fuzzy matching in R

May 15, 2011

I have a large file of administrative data, about 1 million records. Individual people can be represented multiple times in this dataset. Half the records have an identifying code that maps records to individuals; for the half that don't, I need to fuzzy match the names to flag records that potentially belong to the same person.

From looking at the records that have an identifying code, I've created a list of the differences that have occurred in the recording of names for the same individual:

inclusion of middle name e.g. jon snow vs jon targaryen snow
inclusion of second last name e.g. jon snow vs jon targaryen-snow
nickname / shortening of first name e.g. jonathon snow vs jon snow
reversal of names e.g. jon snow vs snow jon
misspellings/typos/variants, e.g. samual/samuel, monica/monika, rafael/raphael

Given the types of matches I'm after, is there a better approach than using agrep()/Levenshtein distance that can be implemented in R?

Edit: agrep() in R doesn't do a good job on this problem. Because of the large number of insertions and substitutions I need to allow to account for the ways names are recorded differently, a lot of false matches are thrown up.

I would make multiple passes.

"jon .* snow" - middle name

"jon .*snow" - second last name

Nicknames will require a dictionary of mappings from long form to short; there's no regular expression that'll handle this.
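A nickname dictionary could be sketched as a named character vector in base R. The table entries and the helper name `expand_nickname` below are hypothetical, made up for illustration; a real table would have to be compiled or sourced elsewhere. This sketch looks up the short form to get the long form so that both variants end up in one canonical spelling.

```r
# Hypothetical lookup table: names are short forms, values are long forms.
# The entries are illustrative only, not a real nickname list.
nicknames <- c(jon = "jonathon", sam = "samuel", monty = "montgomery")

# Replace a first name with its long form when the table has an entry;
# names without an entry are returned unchanged (lower-cased).
expand_nickname <- function(first) {
  first <- tolower(first)
  hit <- first %in% names(nicknames)
  first[hit] <- unname(nicknames[first[hit]])
  first
}

expand_nickname(c("Jon", "jonathon", "Arya"))
```

After this step "jon" and "jonathon" compare equal, so the later passes only have to deal with the remaining kinds of variation.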

"snow jon" - reversal (duh)
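The regular-expression passes can be sketched with `grepl()`. The function name `name_passes` and the sample records are assumptions for illustration, and the `^`/`$` anchors are an addition of this sketch (the patterns above are unanchored). It also assumes the names contain no regex metacharacters.

```r
# One logical column per pass, for a given first/last name pair.
name_passes <- function(first, last, records) {
  data.frame(
    record = records,
    # Pass 1: something between first and last name (a middle name).
    middle_name = grepl(paste0("^", first, " .* ", last, "$"), records),
    # Pass 2: anything before the last name, including a hyphenated
    # surname. This also matches the plain name and the middle-name case.
    second_last = grepl(paste0("^", first, " .*", last, "$"), records),
    # Pass 3: last name written first.
    reversed = records == paste(last, first),
    stringsAsFactors = FALSE
  )
}

records <- c("jon snow", "jon targaryen snow", "jon targaryen-snow",
             "snow jon", "jonathon snow")
res <- name_passes("jon", "snow", records)
res
```

Note that "jonathon snow" is not caught by any of these passes, which is why the nickname step is handled separately.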

agrep will handle minor misspellings.
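As a rough illustration of the misspelling pass, the sketch below uses `agrep()` with a small `max.distance`, plus `adist()` from base R's utils package to get whole-string edit distances for the example pairs from the question. The threshold of 1 is an assumption to be tuned on real data, not a recommendation.

```r
candidates <- c("samual", "samuel", "monica", "rafael")

# Approximate matching, allowing at most one edit.
# agrep() matches approximately against substrings, so on real data it
# can also hit longer strings that merely contain something close.
hits <- agrep("samuel", candidates, max.distance = 1, value = TRUE)

# Whole-string Levenshtein distance for each misspelling pair.
a <- c("samual", "monica", "rafael")
b <- c("samuel", "monika", "raphael")
d <- diag(adist(a, b))
d
```

Running this pass on a single tokenised name part, rather than on the full name string, keeps the number of edits that must be allowed small, which should reduce false matches.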

You probably want to tokenise the names into first, middle and last.
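Tokenising could look something like the following in base R. The helper names `tokenise_name` and `name_key` are hypothetical, and the sketch makes two assumptions: hyphens are treated as separators (so "targaryen-snow" splits into two tokens), and the first and last tokens are the ones that matter for matching.

```r
# Split a name into first / middle / last tokens.
tokenise_name <- function(name) {
  # Lower-case, trim, then split on runs of spaces or hyphens.
  tokens <- strsplit(tolower(trimws(name)), "[ -]+")[[1]]
  n <- length(tokens)
  list(
    first  = tokens[1],
    middle = if (n > 2) tokens[2:(n - 1)] else character(0),
    last   = tokens[n]
  )
}

# Order-independent key built from the first and last tokens, so that
# reversed names and names with extra middle tokens share the same key.
name_key <- function(name) {
  t <- tokenise_name(name)
  paste(sort(c(t$first, t$last)), collapse = " ")
}

keys <- vapply(c("jon snow", "Jon Targaryen-Snow", "snow jon"),
               name_key, character(1))
keys
```

Records that share a key become candidate matches; nickname expansion and the approximate-matching pass can then be applied to the individual tokens.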
