python - How to filter tokens from a spaCy document
I parse a document using spaCy and apply a token filter, so that the final spaCy document does not include the filtered tokens. I know I can keep the sequence of filtered tokens, but I am interested in having an actual Doc
structure.
text = u"This document is an example. " \
       u"I want to create a custom pipeline that removes specific tokens from the final document."
doc = nlp(text)

def keep_token(tok):
    # example rule; spaCy's coarse-grained POS tags are uppercase
    return tok.pos_ not in {'PUNCT', 'NUM', 'SYM'}

final_tokens = list(filter(keep_token, doc))
# How do I get a spacy Doc from final_tokens?
I tried to reconstruct a new spaCy Doc from the list of tokens, but the API is not clear on how to do it.
Depending on what you want to achieve, there are several approaches.
1. Use the original document
Tokens in spaCy keep a reference to their document, so you can do this:
original_doc = final_tokens[0].doc
This way you can still access the POS tags, parse data, etc. of the original sentence.
2. Construct a new document without the removed tokens
You can join the strings of the tokens, including their trailing whitespace, and create a new document from the result. See the Token docs for information on text_with_ws.
doc = nlp(''.join(map(lambda x: x.text_with_ws, final_tokens)))
This is not going to give you exactly what you want, though - the POS tags will not necessarily be the same, and the resulting sentence may not make sense.
If neither of these is what you had in mind, let me know, and maybe I can help.