python - How to compute accuracy of NER system?
I am using several NER tools in order to extract the named entities present in a corpus, and I want to test their accuracy using the nltk Python module.
Some of the tools I have used are:
nltk
Stanford NER: https://nlp.stanford.edu/software/crf-ner.shtml
MeaningCloud: https://www.meaningcloud.com/products/topics-extraction
In order to obtain a system's accuracy, nltk's accuracy function takes 2 arguments: the correctly annotated dataset (containing the tokens in the corpus along with their classification: person, location, organization or 'o' [which represents a token that is not a named entity]) and the output of the NER system.
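For concreteness, here is a hedged sketch of that comparison: two parallel, token-aligned tag lists scored with nltk's accuracy (the tag values below are invented purely for illustration):

    from nltk.metrics import accuracy

    # Gold annotations and system output as parallel per-token tag lists
    # (made-up example values).
    reference = ['person', 'o', 'o', 'location', 'o']
    test      = ['person', 'o', 'o', 'o', 'o']

    # accuracy() returns the fraction of positions where the two lists agree.
    print(accuracy(reference, test))  # 0.8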
This is fine when the NER tool returns a list of tokens with their classification. However, some tools, such as MeaningCloud, only return the classification of the named entities recognized in the corpus. That makes the accuracy impossible to obtain (in order to obtain it, the full list of words would have to be returned so that a comparison between both annotations is doable).
What is the right approach here, then? How can I obtain the accuracy in such cases?
You should create a list of tags for the tokens that defaults to 'o'.
Looking at MeaningCloud, it seems that variant_list includes the locations in the original string where each token was detected, so you can use that to map labels onto tokens.
Some pseudocode:
    def get_label(token, meaningcloud_data):
        variants = ...  # [get them out of meaningcloud_data somehow]
        for variant in variants:
            if token.start_char >= variant.inip and token.end_char <= variant.endp:
                return variant.label
        return False

    meaningcloud = get_meaningcloud_data(text)
    labels = []
    for token in tokens:
        # default to 'o' when no entity variant covers the token
        labels.append(get_label(token, meaningcloud) or 'o')
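With those predicted labels and the gold annotations as two parallel, token-aligned lists, you can then score them with nltk's accuracy as described in the question.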
Note that nltk's standard tokenizer, word_tokenize, doesn't save token positions, so you can't reconstruct the original document unless you're absolutely sure of the spacing conventions. That said, nltk does have means of tokenizing and getting positions as well; see here for details.
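A sketch of two such options, assuming a reasonably recent nltk (the sample text, and therefore the spans, are just an illustration; align_tokens can fail when the tokenizer rewrites characters such as quotes):

    from nltk.tokenize import WhitespaceTokenizer, word_tokenize
    from nltk.tokenize.util import align_tokens

    text = "Barack Obama visited Paris."

    # Option 1: align word_tokenize output back onto the original string
    # to recover (start, end) character offsets for each token.
    tokens = word_tokenize(text)
    spans = align_tokens(tokens, text)
    print(list(zip(tokens, spans)))
    # [('Barack', (0, 6)), ('Obama', (7, 12)), ('visited', (13, 20)),
    #  ('Paris', (21, 26)), ('.', (26, 27))]

    # Option 2: a tokenizer with built-in span support.
    print(list(WhitespaceTokenizer().span_tokenize(text)))
    # [(0, 6), (7, 12), (13, 20), (21, 27)]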