Alternative approach to using agrep() for fuzzy matching in R
I have a large file of administrative data, about 1 million records. Individual people can be represented multiple times in this dataset. Half of the records have an identifying code that maps records to individuals; for the half that don't, I need to fuzzy match the names to flag records that potentially belong to the same person.
From looking at the records that have the identifying code, I've created a list of the differences that have occurred in the recording of names for the same individual:
- inclusion of middle name e.g. jon snow vs jon targaryen snow
- inclusion of second last name e.g. jon snow vs jon targaryen-snow
- nickname / shortening of first name e.g. jonathon snow vs jon snow
- reversal of names e.g. jon snow vs snow jon
- misspellings/typos/variants: e.g. samual/samuel, monica/monika, rafael/raphael
Given the types of matches I'm after, is there a better approach than using agrep()/Levenshtein distance that can be implemented in R?
Edit: agrep() in R doesn't do a good job on this problem. Because of the large number of insertions and substitutions I need to allow to account for the ways names are recorded differently, a lot of false matches are thrown up.
I would make multiple passes.
"jon .* snow"
- middle name
"jon .*snow"
- second last name
Nicknames will require a dictionary of mappings from long form to short; there's no regular expression that'll handle this.
"snow jon"
- reversal (duh)
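As a rough sketch of how those passes might be run together: the sample names, the `first`/`last` variables and the `passes` vector below are made up for illustration, and the patterns are anchored with `^` and `$` (my addition) so that e.g. "jonathon snow" is not picked up by accident. On a million records you would build the patterns per candidate name rather than hard-code them.

```r
# Hypothetical sample of recorded names (lower-case, as in the examples above)
recorded <- c("jon snow", "jon targaryen snow", "jon targaryen-snow",
              "snow jon", "jonathon snow", "sam tarly")

first <- "jon"
last  <- "snow"

# One regular expression per pass; ^ and $ anchor the whole name
passes <- c(
  exact       = paste0("^", first, " ", last, "$"),
  middle_name = paste0("^", first, " .* ", last, "$"),
  second_last = paste0("^", first, " .*", last, "$"),
  reversed    = paste0("^", last, " ", first, "$")
)

# Logical matrix: one row per record, one column per pass
hits <- sapply(passes, function(p) grepl(p, recorded))
rownames(hits) <- recorded

# A record is flagged if any pass matched it
flagged <- recorded[rowSums(hits) > 0]
print(hits)
print(flagged)
```

Here `flagged` holds the first four names. Note that the second-last-name pattern also matches the plain and middle-name forms, so the columns of `hits` overlap; the nickname form "jonathon snow" is left for the dictionary step.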
agrep will handle minor misspellings.
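A small sketch of that last pass, using the variant pairs from the question. The edit budget of 2 is an assumption on my part, not a recommended setting; `adist()` (in base R's utils package) is included because it returns the edit distance itself, which makes it easier to see how tight the threshold can be kept.

```r
# First-name variants taken from the question
a <- c("samual", "monica", "rafael")
b <- c("samuel", "monika", "raphael")

# agrepl() with a small absolute edit budget (assumed value: 2 edits)
close_enough <- mapply(function(x, y) agrepl(x, y, max.distance = 2), a, b)

# adist() returns the pairwise Levenshtein distances; the diagonal
# holds the distance for each (a[i], b[i]) pair
d <- diag(adist(a, b))

# Distance relative to the longer name, to compare short and long names fairly
rel <- d / pmax(nchar(a), nchar(b))

print(close_enough)
print(d)
print(round(rel, 2))
```

Keeping the budget this small on single tokens, rather than on whole names, is one way to cut down the false matches you get when agrep has to absorb a whole middle name as insertions.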
You may want to tokenise the names into first-, middle- and last-.
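Something along these lines could do the tokenising and the nickname lookup in one go. The `nick` dictionary, `tokenise()` and `same_person_candidate()` are hypothetical names for this sketch, the dictionary has only two made-up entries, and comparing just the first and last tokens is a deliberate simplification.

```r
# Hypothetical nickname dictionary: names are short forms, values are long forms.
# A real one would have to be compiled or sourced separately.
nick <- c(jon = "jonathon", sam = "samuel")

# Split a name into lower-case tokens on whitespace or hyphens,
# then replace any known short form with its long form
tokenise <- function(name) {
  tokens <- strsplit(tolower(trimws(name)), "[[:space:]-]+")[[1]]
  known <- tokens %in% names(nick)
  tokens[known] <- nick[tokens[known]]
  unname(tokens)
}

# Two records are candidates for the same person if their first and
# last tokens agree, in either order (middle tokens are ignored)
same_person_candidate <- function(x, y) {
  tx <- tokenise(x)
  ty <- tokenise(y)
  ex <- c(tx[1], tx[length(tx)])
  ey <- c(ty[1], ty[length(ty)])
  setequal(ex, ey)
}

print(tokenise("Jon Targaryen-Snow"))
print(same_person_candidate("jon snow", "snow jonathon"))
print(same_person_candidate("jon snow", "sam tarly"))
```

Once names are tokens, the reversal and middle-name cases no longer need regular expressions, and agrep or `adist()` can be applied token by token with a much smaller edit budget.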