python - How to filter tokens from spaCy document -


I parse a document using spaCy and apply a token filter, so that the final spaCy document does not include the filtered tokens. I know I can take the sequence of filtered tokens, but I am interested in having an actual Doc structure.

text = u"This document is an example. " \
       u"I want to create a custom pipeline that removes specific tokens from the final document."

doc = nlp(text)

def keep_token(tok):
    # example rule (pos_ values are uppercase, e.g. 'PUNCT')
    return tok.pos_ not in {'PUNCT', 'NUM', 'SYM'}

final_tokens = list(filter(keep_token, doc))

# How do I make a spacy.Doc from final_tokens?

I tried to reconstruct a new spaCy Doc from the token list, but the API is not clear on how to do it.

Depending on what you want, there are several approaches.

1. Use the original document

Tokens in spaCy hold references to their document, so you can do this:

original_doc = final_tokens[0].doc 

This way you can still get the POS tags, parse data, etc. for the original sentence.
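As a minimal sketch of this idea: the filtered tokens all keep a back-reference to the full Doc, so nothing is lost by filtering. (This uses a blank English pipeline and a simple `is_punct` rule so it runs without a trained model; with a model such as en_core_web_sm loaded, the same tokens would also expose `pos_` and dependency attributes.)

```python
import spacy

# a blank pipeline is enough to demonstrate the back-reference;
# a trained model would additionally provide POS tags and parses
nlp = spacy.blank("en")
doc = nlp("This document is an example.")

def keep_token(tok):
    # with a blank pipeline pos_ is empty, so filter on a lexical rule
    return not tok.is_punct

final_tokens = [tok for tok in doc if keep_token(tok)]

# each filtered token still points back to the original Doc
original_doc = final_tokens[0].doc
assert original_doc is doc
```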

2. Construct a new document without the removed tokens

You can append the strings of all the tokens with whitespace and create a new document from the result. See the token docs for information on text_with_ws.

doc = nlp(''.join(map(lambda x: x.text_with_ws, final_tokens))) 

This is not necessarily going to give you what you want, though - the POS tags will not be the same, and the resulting sentence may not make sense.
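A variant of this approach (not from the original answer) is to build the new Doc directly with the Doc constructor, passing the surviving words and their trailing-space flags. This preserves the whitespace layout without re-running the tokenizer, though the linguistic annotations are still not carried over:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = nlp("Hello , world ! This is spaCy .")

final_tokens = [tok for tok in doc if not tok.is_punct]

# rebuild a Doc from the kept tokens, preserving each token's trailing space
words = [tok.text for tok in final_tokens]
spaces = [bool(tok.whitespace_) for tok in final_tokens]
new_doc = Doc(nlp.vocab, words=words, spaces=spaces)

print(new_doc.text)
```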

If neither of these is what you had in mind, let me know and maybe I can help.

