xzar tokenize #1

Yomguithereal · 2025-02-13T15:23:46Z

spaCy-powered tokenization, similar in interface to xan tokenize.

Should use parallelization: either spaCy's built-in one, or our own if the built-in one proves too slow and we can do better.

The loaded pipelines should be customized to skip components that are not required (e.g. don't lemmatize if the user has no need for it).
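A minimal sketch of how both points could fit together, assuming spaCy v3 and the component names shipped with English pipelines such as en_core_web_sm (other languages may expose different components, so the list below would need checking per model). The helper names `components_to_exclude` and `tokenize` are hypothetical, as are the `lemmatize`, `pos` and `noun_chunks` switches. Only `spacy.load(..., exclude=...)` and `nlp.pipe(..., n_process=..., batch_size=...)` are actual spaCy calls.

```python
from typing import Iterable, Iterator, List

# Component names as found in spaCy v3 English pipelines (assumption:
# other languages may ship a different set, e.g. a morphologizer).
OPTIONAL_COMPONENTS = (
    "tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner",
)


def components_to_exclude(lemmatize: bool = False, pos: bool = False,
                          noun_chunks: bool = False) -> List[str]:
    """Return the components that can be skipped for the requested features."""
    needed = set()
    if lemmatize or pos or noun_chunks:
        # POS tags feed the lemmatizer and the noun chunk iterator.
        needed.update({"tok2vec", "tagger", "attribute_ruler"})
    if lemmatize:
        needed.add("lemmatizer")
    if noun_chunks:
        # Noun chunks rely on the dependency parse.
        needed.add("parser")
    return [name for name in OPTIONAL_COMPONENTS if name not in needed]


def tokenize(texts: Iterable[str], model: str = "en_core_web_sm",
             n_process: int = 1, **needs: bool) -> Iterator[List[str]]:
    """Yield one list of token strings per input text."""
    import spacy  # lazy import: requires spaCy and the model to be installed

    nlp = spacy.load(model, exclude=components_to_exclude(**needs))
    # spaCy's built-in multiprocessing; n_process=1 keeps everything in-process.
    for doc in nlp.pipe(texts, n_process=n_process, batch_size=256):
        yield [token.text for token in doc]
```

With no feature requested, every trainable component is excluded and only the rule-based tokenizer runs, which should be the cheapest path for plain word tokenization.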

Subcommands:

-l,--lang: automatic model selection based on language (selects the smallest model by default for supported languages), defaulting to English
a flag for model size (spaCy usually offers sm, md, lg and trf)
a -m,--model flag to specify the model name if you want something custom anyway
--keep-text like in xan tokenize
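The model selection described by these flags could look roughly like the sketch below. The function `resolve_model` and its `size` parameter are hypothetical (the size flag has no name yet), and the table only covers three languages for illustration. The model names themselves are the ones published by spaCy. Note that transformer models do not always follow the same naming scheme as the smaller ones, so a lookup table seems safer than string formatting.

```python
from typing import Optional

# Illustrative subset of spaCy's published pipelines, keyed by language and size.
MODELS = {
    "en": {"sm": "en_core_web_sm", "md": "en_core_web_md",
           "lg": "en_core_web_lg", "trf": "en_core_web_trf"},
    "fr": {"sm": "fr_core_news_sm", "md": "fr_core_news_md",
           "lg": "fr_core_news_lg", "trf": "fr_dep_news_trf"},
    "de": {"sm": "de_core_news_sm", "md": "de_core_news_md",
           "lg": "de_core_news_lg", "trf": "de_dep_news_trf"},
}


def resolve_model(lang: str = "en", size: str = "sm",
                  model: Optional[str] = None) -> str:
    """Pick a model name: an explicit -m/--model value wins over lang and size."""
    if model is not None:
        return model
    if lang not in MODELS:
        raise ValueError(f"unsupported language: {lang!r}")
    if size not in MODELS[lang]:
        raise ValueError(f"unsupported size {size!r} for language {lang!r}")
    return MODELS[lang][size]
```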

-L,--lemmatize: return lemmatized tokens
POS tag filters (whitelist at first); offer some high-level filters because people, myself included, don't have a working knowledge of all the existing POS tags
a flag dropping stopwords
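A possible shape for the filtering logic, as a sketch only. The `POS_GROUPS` table is a hypothetical set of high-level filters mapped onto Universal POS tags (whether "verb" should include auxiliaries is an open choice), and `select_tokens` is a made-up name. It only relies on the `text`, `lemma_`, `pos_` and `is_stop` attributes that spaCy tokens expose, so it can be tried without loading any model.

```python
from typing import Iterable, Iterator, Optional

# Hypothetical high-level filters expressed as sets of Universal POS tags.
POS_GROUPS = {
    "noun": {"NOUN", "PROPN"},
    "verb": {"VERB", "AUX"},
    "adj": {"ADJ"},
    "adv": {"ADV"},
    "content": {"NOUN", "PROPN", "VERB", "ADJ", "ADV"},
}


def select_tokens(tokens: Iterable, keep: Optional[Iterable[str]] = None,
                  drop_stopwords: bool = False,
                  lemmatize: bool = False) -> Iterator[str]:
    """Filter tokens by POS whitelist and stopword status, then emit strings."""
    allowed = None
    if keep is not None:
        allowed = set()
        for name in keep:
            # A known group name expands to its tags; anything else is
            # treated as a raw POS tag such as "INTJ".
            allowed |= POS_GROUPS.get(name.lower(), {name.upper()})
    for token in tokens:
        if allowed is not None and token.pos_ not in allowed:
            continue
        if drop_stopwords and token.is_stop:
            continue
        yield token.lemma_ if lemmatize else token.text
```

Passing a spaCy `Doc` as `tokens` should work as is, provided the tagger (and the lemmatizer when `lemmatize` is set) were kept in the pipeline.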

Finding the most specific noun chunks per president:

xzar tokenize noun-chunks transcript sotu.csv | xan vocab doc-token -D president | xan top -g president token | xan view -g president -A

Finding the most used French verbs in corpus:

xzar tokenize words transcript --lang fr --keep verb | xan vocab token | xan top pigeon | xan v

Yomguithereal mentioned this issue Feb 13, 2025

xzar ner #2

Open

Yomguithereal added the enhancement New feature or request label Feb 20, 2025