xzar tokenize #1

Open
4 tasks
Yomguithereal opened this issue Feb 13, 2025 · 0 comments
Labels
enhancement New feature or request


Yomguithereal commented Feb 13, 2025

spaCy-powered tokenization, with an interface similar to xan tokenize.

Should use parallelization: either spaCy's builtin multiprocessing, or our own if the builtin one is too slow and we can do better.
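
As a rough sketch of the builtin option: spaCy's `Language.pipe` accepts `n_process` and `batch_size`, and with `as_tuples=True` it carries a context value (here a row key) alongside each text. The helper name `tokenize_rows` and the `(text, key)` row shape are assumptions made for illustration, not part of xzar.

```python
from typing import Any, Iterable, Iterator, Tuple


def tokenize_rows(
    nlp: Any,
    rows: Iterable[Tuple[str, Any]],
    n_process: int = 1,
    batch_size: int = 256,
) -> Iterator[Tuple[Any, str]]:
    """Yield (row_key, token_text) pairs for every token of every row.

    `nlp` is expected to behave like a spaCy Language object.
    `rows` is an iterable of (text, row_key) tuples (hypothetical shape).
    """
    # as_tuples=True makes pipe() yield (doc, context) pairs, so each
    # document stays attached to the key of the row it came from.
    docs = nlp.pipe(
        rows,
        as_tuples=True,
        n_process=n_process,
        batch_size=batch_size,
    )
    for doc, key in docs:
        for token in doc:
            yield key, token.text
```

With spaCy installed, this could be called as `tokenize_rows(spacy.blank("en"), rows, n_process=4)`. Whether that is fast enough, or whether a custom worker pool is needed, would have to be measured.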

The loaded pipelines should be customized to skip components that are not required (e.g. don't lemmatize if the user has no need for it).
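
One possible way to decide what to skip is sketched below. The component names are the ones used by spaCy's standard English pipelines; pipelines for other languages may differ (some use a morphologizer for POS tags, for instance), so the dependency table here is an assumption to verify per model. The function names are hypothetical.

```python
from typing import List, Set

# Component names as found in spaCy's standard English pipelines (assumption:
# other languages or custom models may use different names).
ALL_COMPONENTS = [
    "tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner",
]


def required_components(
    subcommand: str, lemmatize: bool = False, pos_filter: bool = False
) -> Set[str]:
    """Return the components a given subcommand and flag combination needs."""
    needed: Set[str] = set()
    if subcommand in ("sentences", "noun-chunks", "triples"):
        # Sentence boundaries and noun chunks come from the dependency parse.
        needed |= {"tok2vec", "parser"}
    if subcommand in ("noun-chunks", "triples"):
        needed |= {"tagger", "attribute_ruler"}
    if subcommand == "words" and (lemmatize or pos_filter):
        # POS tags are needed for filtering and for rule-based lemmatization.
        needed |= {"tok2vec", "tagger", "attribute_ruler"}
    if subcommand == "words" and lemmatize:
        needed.add("lemmatizer")
    return needed


def components_to_exclude(
    subcommand: str, lemmatize: bool = False, pos_filter: bool = False
) -> List[str]:
    """Return the components that can be left out when loading the pipeline."""
    needed = required_components(subcommand, lemmatize, pos_filter)
    return [name for name in ALL_COMPONENTS if name not in needed]
```

The result can be passed to `spacy.load(name, exclude=...)`. Plain `words` with no flags excludes everything, leaving only the tokenizer, which is also enough for stopword detection since `is_stop` is a lexical attribute.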

Subcommands:

  • words
  • sentences
  • noun-chunks
  • triples (with textacy)

Common flags:

  • -l,--lang: automatic model selection based on language (selects the smallest model by default for supported languages), defaulting to English
  • a flag for model size (spaCy usually offers sm, md, lg and trf)
  • a -m,--model flag to specify the model name if you want something custom anyway
  • --keep-text like in xan tokenize
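
How the three model-related flags could interact is sketched below, assuming an explicit model name always wins over language and size. The prefix table covers only a few languages for illustration, and `trf` is left out on purpose because transformer pipelines do not follow a uniform naming scheme across languages. `resolve_model` is a hypothetical helper.

```python
from typing import Optional

# Illustrative subset of spaCy pipeline name prefixes (assumption: the real
# command would cover every language it claims to support).
MODEL_PREFIXES = {
    "en": "en_core_web",
    "fr": "fr_core_news",
    "de": "de_core_news",
    "es": "es_core_news",
}

# Sizes whose names follow the "<prefix>_<size>" pattern.
SIZES = ("sm", "md", "lg")


def resolve_model(
    lang: str = "en", size: str = "sm", model: Optional[str] = None
) -> str:
    """Pick a pipeline name from --lang, the size flag, and --model."""
    if model is not None:
        # A custom model name overrides automatic selection.
        return model
    if lang not in MODEL_PREFIXES:
        raise ValueError(f"unsupported language: {lang!r}")
    if size not in SIZES:
        raise ValueError(f"unsupported model size: {size!r}")
    return f"{MODEL_PREFIXES[lang]}_{size}"
```

Called with no arguments this returns `en_core_web_sm`, matching the "smallest model, English by default" behavior described above.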

words flags:

  • -L,--lemmatize: return lemmatized tokens
  • POS tag filters (whitelist at first); offer some high-level filters because people, myself included, don't have a working knowledge of all the POS tags there are
  • a flag dropping stopwords
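
The words flags could combine roughly as follows. The tag names are Universal POS tags as exposed by spaCy's `token.pos_`; the group names and their contents are hypothetical and only meant to show what a high-level filter might look like. The function works on any objects exposing `text`, `lemma_`, `pos_` and `is_stop`, which spaCy tokens do.

```python
from typing import Any, Iterable, Iterator, Optional, Set

# Hypothetical high-level groups mapped to Universal POS tags.
POS_GROUPS = {
    "noun": {"NOUN", "PROPN"},
    "verb": {"VERB"},
    "adj": {"ADJ"},
    "adv": {"ADV"},
}


def filter_words(
    tokens: Iterable[Any],
    keep: Optional[Iterable[str]] = None,
    lemmatize: bool = False,
    drop_stopwords: bool = False,
) -> Iterator[str]:
    """Yield token strings after applying the POS whitelist and stopword flag."""
    allowed: Optional[Set[str]] = None
    if keep is not None:
        # Expand each high-level group into its underlying POS tags.
        allowed = set()
        for group in keep:
            allowed |= POS_GROUPS[group]
    for token in tokens:
        if allowed is not None and token.pos_ not in allowed:
            continue
        if drop_stopwords and token.is_stop:
            continue
        yield token.lemma_ if lemmatize else token.text
```

A whitelist keeps the first version simple; a blacklist could be added later with a second set checked the opposite way.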

Examples

Finding the most specific noun chunks per president:

xzar tokenize noun-chunks transcript sotu.csv | xan vocab doc-token -D president | xan top -g president token | xan view -g president -A

Finding the most used French verbs in corpus:

xzar tokenize words transcript --lang fr --keep verb | xan vocab token | xan top pigeon | xan v

Yomguithereal added the enhancement label Feb 20, 2025