Spacy-powered tokenization. Similar in interface to `xan tokenize`.

Should use parallelization (built-in from Spacy, or our own if theirs is too slow and we can do better).

The loaded pipelines should be customized to avoid components that are not required (e.g. don't lemmatize if the user has no need for it).
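A minimal sketch of what this could look like, relying on spaCy's stock `spacy.load(..., disable=...)` and `nlp.pipe(..., n_process=...)` APIs (the `load_pipeline` helper, its arguments and the sample texts are hypothetical):

```python
import spacy

def load_pipeline(model: str, need_lemmas: bool, need_pos: bool, need_parse: bool):
    # Hypothetical helper: disable every pipeline component the current
    # subcommand doesn't need, so loading and inference stay fast.
    # Component names below match the stock en_core_web_* pipelines;
    # they can vary from one model to another.
    disable = ["ner"]  # entities are never needed for tokenization
    if not need_lemmas:
        disable.append("lemmatizer")
    if not need_pos:
        disable.extend(["tagger", "attribute_ruler"])
    if not need_parse:
        disable.append("parser")  # sentences/noun-chunks need the parser
    return spacy.load(model, disable=disable)

nlp = load_pipeline("en_core_web_sm", need_lemmas=False, need_pos=True, need_parse=False)

# Spacy's built-in parallelization: nlp.pipe() streams documents in
# batches and can fan work out over several worker processes.
texts = ["First transcript...", "Second transcript..."]
for doc in nlp.pipe(texts, n_process=4, batch_size=500):
    print([token.text for token in doc])
```

For plain `words` output with no POS filter, most of the statistical pipeline can be switched off this way, which is likely where the bulk of the speedup would come from.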
Subcommands (see the sketch after this list):

- `words`
- `sentences`
- `noun-chunks`
- `triples` (with textacy)
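For reference, a rough sketch of the spaCy/textacy calls each subcommand could wrap; the sample text is made up, and textacy's triple extraction has moved between modules across versions, so the import path may need adjusting:

```python
import spacy
import textacy.extract

nlp = spacy.load("en_core_web_sm")
doc = nlp("The president delivered the speech. Congress applauded loudly.")

# words: one row of output per token
words = [token.text for token in doc]

# sentences: one row per sentence (requires the parser or a sentencizer)
sentences = [sent.text for sent in doc.sents]

# noun-chunks: flat noun phrases derived from the dependency parse
noun_chunks = [chunk.text for chunk in doc.noun_chunks]

# triples: (subject, verb, object) tuples extracted by textacy
triples = list(textacy.extract.subject_verb_object_triples(doc))
```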
Common flags:

- `-l, --lang`: automatic model selection based on the language (will select the smallest model by default for supported languages), defaulting to English (a possible resolver is sketched after this list)
- a flag for model size (Spacy usually offers `sm`, `md`, `lg` and `trf`)
- `-m, --model`: specify a model name directly if you want something custom anyway
- `--keep-text`, like in `xan tokenize`
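A possible resolver behind the lang and size flags, assuming spaCy's model naming convention (English pipelines are trained on web text, `*_core_web_*`; most other languages on news text, `*_core_news_*`); the `select_model` helper is hypothetical:

```python
def select_model(lang: str = "en", size: str = "sm") -> str:
    # Hypothetical resolver for -l/--lang plus the size flag.
    # Not every language ships every size, so a real implementation
    # would validate (lang, size) against a table of known models.
    genre = "web" if lang == "en" else "news"
    return f"{lang}_core_{genre}_{size}"

assert select_model() == "en_core_web_sm"            # smallest English model
assert select_model("fr", "md") == "fr_core_news_md"
```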
`words` flags:

- `-L, --lemmatize`: return lemmatized tokens
- POS tag filters (whitelist at first); also offer some high-level filters, because people, myself included, don't have a working knowledge of all the POS tags there are
- a flag dropping stopwords (the sketch below shows how these filters could combine)
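A hedged sketch of how the `words` flags could combine during extraction; the `POS_GROUPS` table and the `extract_words` helper are hypothetical, but the token attributes (`pos_`, `lemma_`, `is_stop`) are standard spaCy:

```python
import spacy

# Hypothetical high-level groups so users can say --keep verb or
# --keep content-words instead of memorizing the Universal POS tagset.
POS_GROUPS = {
    "verb": {"VERB", "AUX"},
    "noun": {"NOUN", "PROPN"},
    "content-words": {"NOUN", "PROPN", "VERB", "ADJ", "ADV"},
}

def extract_words(doc, lemmatize=False, keep=None, drop_stopwords=False):
    # Resolve a high-level group name, or fall back to a raw POS tag.
    whitelist = POS_GROUPS.get(keep, {keep.upper()} if keep else None)
    for token in doc:
        if token.is_space or token.is_punct:
            continue
        if drop_stopwords and token.is_stop:
            continue
        if whitelist is not None and token.pos_ not in whitelist:
            continue
        yield token.lemma_ if lemmatize else token.text

nlp = spacy.load("fr_core_news_sm")
doc = nlp("Les présidents prononcent leurs discours.")
print(list(extract_words(doc, lemmatize=True, keep="verb")))
# e.g. ['prononcer'], depending on the model's tagging
```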
Examples

Finding the most specific noun chunks per president:

```
xzar tokenize noun-chunks transcript sotu.csv | xan vocab doc-token -D president | xan top -g president token | xan view -g president -A
```

Finding the most used French verbs in a corpus:

```
xzar tokenize words transcript --lang fr --keep verb | xan vocab token | xan top pigeon | xan v
```