python - How to compute accuracy of NER system?


I am using several NER tools to extract the named entities present in a corpus, and I want to test their accuracy using the NLTK Python module.

Some of the tools I have used are:

To obtain a system's accuracy, NLTK's accuracy function takes 2 arguments: the correctly annotated dataset (containing each token in the corpus along with its classification: person, location, organization, or 'O', which marks a token that is not a named entity) and the output of the NER system.
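As a minimal sketch of that comparison, NLTK's `accuracy` function (in `nltk.metrics`) simply computes the fraction of positions where the two tag sequences agree; the tag lists below are hypothetical example data:

```python
from nltk.metrics import accuracy

# Gold-standard tags and system output for the same token sequence
# (hypothetical example data).
reference = ['PERSON', 'O', 'O', 'LOCATION', 'O']
test      = ['PERSON', 'O', 'O', 'O',        'O']

# Fraction of corresponding tags that match: 4 of 5.
print(accuracy(reference, test))  # -> 0.8
```

Both lists must be the same length and aligned token-by-token, which is exactly why tools that only return the recognized entities are a problem.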

This works fine when the NER tool returns a list of tokens with their classifications. However, some tools, such as MeaningCloud, return only the classifications of the named entities recognized in the corpus. That makes the accuracy impossible to obtain directly (to compute it, the full list of words would have to be returned so that a comparison between both annotations is doable).

What is the right approach here, then? How can I obtain the accuracy in such cases?

You should create a list of tags for your tokens, defaulting to 'O'.

Looking at MeaningCloud, it looks like variant_list includes the locations in the original string where each entity was detected, so you can use that to map labels onto tokens.

some pseudocode:

def get_label(token, meaningcloud_data):
    variants = ...[get meaningcloud_data somehow]...
    for variant in variants:
        # Keep the entity label if the token falls inside the variant's span.
        if token.start_char >= variant.inip and token.end_char <= variant.endp:
            return variant.label
    return None

meaningcloud = get_meaningcloud_data(text)
labels = []
for token in tokens:
    # Default to 'O' when no entity covers this token.
    labels.append(get_label(token, meaningcloud) or 'O')

Note that NLTK's standard tokenizer, word_tokenize, doesn't save token positions, so you can't reconstruct the original document from its output unless you're absolutely sure of the spacing conventions. That said, NLTK does have means of tokenizing and getting positions; see the NLTK tokenizer documentation for details.
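For instance, NLTK tokenizers that implement `span_tokenize` yield character offsets you can compare against MeaningCloud's positions; a minimal sketch using `WhitespaceTokenizer` (which only splits on whitespace, so punctuation stays attached to tokens):

```python
from nltk.tokenize import WhitespaceTokenizer

text = "Acme Corp. is based in London."

# span_tokenize yields (start, end) character offsets into the original text.
spans = list(WhitespaceTokenizer().span_tokenize(text))
tokens = [text[start:end] for start, end in spans]

print(spans[0])   # -> (0, 4)
print(tokens)     # -> ['Acme', 'Corp.', 'is', 'based', 'in', 'London.']
```

Each (start, end) pair gives you the token's position in the original string, which is what the `token.start_char` / `token.end_char` comparison above needs.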

