All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Change tokenizer from HuggingFace's tokenizers library to the original SentencePiece tokenizer
- Change the generation algorithm for chatlm, as required by the switch to the SentencePiece tokenizer
- Add a tokenizer trainer for training your own customized tokenizer
- Fix the bad word filter, which did not filter out bad ids because of the order in which the top_k-top_p and bad_word filters were applied
- Adopted the generate function provided by HuggingFace instead of our own implementation
- Dropped the MeCab+BPE-based tokenizer and adopted a SentencePiece-based custom tokenizer instead
- Implement bad_words option for chatlm.generator
- Implement response argument for chatlm.generator
- Removed PyTorch dependency and introduced TensorFlow instead
- Introduced a YAML config file for model hyperparameters
- Removed Papermill dependency; adopted a CLI with a YAML config file instead
- Introduced a Jupyter notebook executed with Papermill.
- Implemented ChatLM model, a simple sequence-to-sequence model using GPT-2.
- Implemented TopPKGenerator to specify both top-p and top-k filtering.
- Removed ChatModel from the current version because it has some bugs. It will be implemented again in the future.
- Remove all special tokens from generated text to extract the response.
- Add license file.
- Fix vocab_size and num_labels in the BaseModel training script to adapt to the Transformers upgrade from v2.2.0 to v2.3.0.
- Add scripts for BaseModel and ChatModel.