Alternative approach to using agrep() for fuzzy matching in R


I have a large file of administrative data, about 1 million records. Individual people can be represented multiple times in this dataset. Half of the records have an identifying code that maps records to individuals; for the half that don't, I need to fuzzy match the names to flag records that potentially belong to the same person.

From looking at the records that have the identifying code, I've created a list of differences that have occurred in the recording of names for the same individual:

  • inclusion of middle name e.g. jon snow vs jon targaryen snow
  • inclusion of second last name e.g. jon snow vs jon targaryen-snow
  • nickname / shortening of first name e.g. jonathon snow vs jon snow
  • reversal of names e.g. jon snow vs snow jon
  • misspellings/typos/variants: e.g. samual/samuel, monica/monika, rafael/raphael

Given the types of matches I'm after, is there a better approach than using agrep()/Levenshtein distance that can be implemented in R?

Edit: agrep() in R doesn't do a good job on this problem. Because of the large number of insertions and substitutions I need to allow to account for the ways names are recorded differently, a lot of false matches are thrown up.

I would make multiple passes.

"jon .* snow" - middle name

"jon .*snow" - second last name

Nicknames will require a dictionary of mappings from long form to short; there's no regular expression that'll handle this.
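A minimal sketch of such a dictionary lookup in base R. The entries in `nicknames` and the helper name `to_short_form` are made up for illustration; a real dictionary would have to be compiled or sourced separately.

```r
# Hypothetical dictionary: long form -> short form (entries are illustrative).
nicknames <- c(jonathon = "jon", samuel = "sam", rafael = "raf")

# Replace a first name by its short form when the dictionary has an entry;
# names without an entry are returned unchanged.
to_short_form <- function(first) {
  short <- unname(nicknames[first])
  ifelse(is.na(short), first, short)
}

to_short_form(c("jonathon", "jon", "arya"))
```

After this step "jonathon snow" and "jon snow" share the same first name, so they can be compared exactly or handed on to the other passes.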

"snow jon" - reversal (duh)

agrep will handle minor misspellings.
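Put together, the passes above might look roughly like this in base R. The sample vector `recorded`, the `^`/`$` anchors and the threshold of 2 edits are assumptions for illustration, not part of the original patterns; names containing regex metacharacters would also need escaping first.

```r
# Hypothetical sample of recorded names (not taken from the real data).
recorded <- c("jon snow", "jon targaryen snow", "jon targaryen-snow",
              "snow jon", "jonn snow", "sam tarly")

first <- "jon"
last  <- "snow"

# Pass 1: middle name, i.e. "jon .* snow" (anchors added as an assumption).
middle_name <- grepl(sprintf("^%s .* %s$", first, last), recorded)

# Pass 2: second last name, i.e. "jon .*snow". This pattern also matches the
# plain "jon snow" and everything found by pass 1.
second_last <- grepl(sprintf("^%s .*%s$", first, last), recorded)

# Pass 3: reversal, i.e. "snow jon".
reversed <- recorded == paste(last, first)

# Pass 4: minor misspellings, allowing at most 2 edits (illustrative threshold).
misspelt <- agrepl(paste(first, last), recorded, max.distance = 2)

# A record is flagged if any pass matched it.
flagged <- middle_name | second_last | reversed | misspelt
recorded[flagged]
```

Keeping each pass as its own logical vector makes it easy to see which rule flagged a record, and to keep the agrep threshold small so it only has to cover typos.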

You probably also want to tokenise the names into first, middle and last names.
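One way to tokenise, as a rough sketch: split on whitespace, treat the first token as the first name, the last token as the last name, and anything in between as the middle name. The function name `tokenise_name` and that splitting rule are assumptions; hyphenated last names such as "targaryen-snow" stay as a single token here.

```r
# Split one name string into first, middle and last parts (hypothetical helper).
tokenise_name <- function(x) {
  tokens <- strsplit(trimws(x), "\\s+")[[1]]
  n <- length(tokens)
  list(
    first  = if (n >= 1) tokens[1] else NA_character_,
    middle = if (n >= 3) paste(tokens[2:(n - 1)], collapse = " ") else NA_character_,
    last   = if (n >= 2) tokens[n] else NA_character_
  )
}

tokenise_name("jon targaryen snow")
tokenise_name("jon snow")
```

With the parts separated, first and last names can be compared on their own, so a present or missing middle name no longer needs a regular expression.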

