python - How to compute accuracy of NER system?
I am using several NER tools in order to extract the named entities present in a corpus, and I want to test their accuracy using the nltk Python module.
Some of the tools I have used are:
nltk
Stanford NER: https://nlp.stanford.edu/software/crf-ner.shtml
MeaningCloud: https://www.meaningcloud.com/products/topics-extraction
In order to obtain a system's accuracy, nltk's accuracy function takes 2 arguments: the correctly annotated dataset (containing the tokens in the corpus along with their classification: person, location, organization or 'o' [which represents a token that is not a named entity]) and the output of the NER system.
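For concreteness, here is a hedged sketch of that comparison: two parallel, token-aligned tag lists scored with nltk's accuracy (the tag values below are invented purely for illustration):

    from nltk.metrics import accuracy

    # Gold annotations and system output as parallel per-token tag lists
    # (made-up example values).
    reference = ['person', 'o', 'o', 'location', 'o']
    test      = ['person', 'o', 'o', 'o', 'o']

    # accuracy() returns the fraction of positions where the two lists agree.
    print(accuracy(reference, test))  # 0.8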
This is fine when the NER tool returns a list of tokens with their classification. However, some tools, such as MeaningCloud, only return the classification of the named entities recognized in the corpus. That makes the accuracy impossible to obtain (in order to obtain it, the full list of words would have to be returned so that a comparison between both annotations is doable).
What is the right approach here, then? How can I obtain the accuracy in such cases?
You should create a list of tags for the tokens that defaults to 'o'.
Looking at MeaningCloud, it seems that variant_list includes the locations in the original string where each token was detected, so you can use that to map labels onto tokens.
Some pseudocode:
    def get_label(token, meaningcloud_data):
        variants = ...  # [get them out of meaningcloud_data somehow]
        for variant in variants:
            if token.start_char >= variant.inip and token.end_char <= variant.endp:
                return variant.label
        return False

    meaningcloud = get_meaningcloud_data(text)
    labels = []
    for token in tokens:
        # default to 'o' when no entity variant covers the token
        labels.append(get_label(token, meaningcloud) or 'o')
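With those predicted labels and the gold annotations as two parallel, token-aligned lists, you can then score them with nltk's accuracy as described in the question.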
Note that nltk's standard tokenizer, word_tokenize, doesn't save token positions, so you can't reconstruct the original document unless you're absolutely sure of the spacing conventions. That said, nltk does have means of tokenizing and getting positions as well; see here for details.
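A sketch of two such options, assuming a reasonably recent nltk (the sample text, and therefore the spans, are just an illustration; align_tokens can fail when the tokenizer rewrites characters such as quotes):

    from nltk.tokenize import WhitespaceTokenizer, word_tokenize
    from nltk.tokenize.util import align_tokens

    text = "Barack Obama visited Paris."

    # Option 1: align word_tokenize output back onto the original string
    # to recover (start, end) character offsets for each token.
    tokens = word_tokenize(text)
    spans = align_tokens(tokens, text)
    print(list(zip(tokens, spans)))
    # [('Barack', (0, 6)), ('Obama', (7, 12)), ('visited', (13, 20)),
    #  ('Paris', (21, 26)), ('.', (26, 27))]

    # Option 2: a tokenizer with built-in span support.
    print(list(WhitespaceTokenizer().span_tokenize(text)))
    # [(0, 6), (7, 12), (13, 20), (21, 27)]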