@Mhdaw Mhdaw commented Jul 30, 2025

What this PR does:

  • Added snapshot saving: enables continued pre-training by saving the optimizer state and hyperparameters (sketch below).
  • Added model downcasting to lower precision: halves the memory footprint, with some numerical risk, especially in FP16. When the model is downcast to BF16/FP16, the logits are cast back to FP32 to avoid numerical instability in the loss (sketch below).
  • Added better device/dtype handling: automatically selects the best dtype, data loader parameters, seed, etc. (sketch below).
  • Better torch.compile handling: it also checks whether torch.compile is available (see the device/dtype sketch below).
  • Added a mini-config for the 135M-parameter version.
  • Added automatic lmms-eval installation when enabled in the config (sketch below).
  • Added requirements.txt for easier installation, including einops, which was missing before!
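
A minimal sketch of the snapshot idea, assuming a plain PyTorch training loop; the function names and snapshot fields here are illustrative, not the PR's exact API:

```python
import torch

def save_snapshot(path, model, optimizer, step, hparams):
    # Persist everything needed to resume pre-training mid-run:
    # weights, optimizer state (momentum buffers etc.), step counter,
    # and the hyperparameters the run was launched with.
    torch.save({
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "step": step,
        "hparams": hparams,  # e.g. lr, batch size, schedule position
    }, path)

def load_snapshot(path, model, optimizer):
    # Restore the saved state and return where training left off.
    snap = torch.load(path, map_location="cpu")
    model.load_state_dict(snap["model_state"])
    optimizer.load_state_dict(snap["optimizer_state"])
    return snap["step"], snap["hparams"]
```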
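
A sketch of the downcast-then-upcast pattern, with a stand-in `nn.Linear` in place of the real model; the key point is casting the logits to FP32 before the loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(16, 8)                 # stand-in for the real model
model = model.to(dtype=torch.bfloat16)   # downcast halves the memory footprint

x = torch.randn(4, 16, dtype=torch.bfloat16)
targets = torch.randint(0, 8, (4,))

logits = model(x)
# Cast logits up to FP32 before cross-entropy: the softmax/log-sum-exp
# inside the loss can overflow or lose precision in FP16/BF16.
loss = F.cross_entropy(logits.float(), targets)
```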
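
A sketch of the kind of automatic device/dtype selection, seeding, and torch.compile guard described above; the helper names are hypothetical:

```python
import random
import numpy as np
import torch

def pick_device_and_dtype():
    if torch.cuda.is_available():
        device = torch.device("cuda")
        # Prefer BF16 on GPUs that support it; otherwise fall back to FP16.
        dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    elif torch.backends.mps.is_available():
        device, dtype = torch.device("mps"), torch.float32
    else:
        device, dtype = torch.device("cpu"), torch.float32
    return device, dtype

def seed_everything(seed=42):
    # Seed all RNG sources for reproducible runs.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

def maybe_compile(model):
    # torch.compile only exists on PyTorch >= 2.0, so check before using it.
    if hasattr(torch, "compile"):
        return torch.compile(model)
    return model
```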
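
A sketch of the conditional lmms-eval installation; the config flag is hypothetical, but the check-then-pip-install pattern is standard:

```python
import importlib.util
import subprocess
import sys

def ensure_lmms_eval(enabled: bool):
    # Only install when the config asks for it and the package is missing.
    if enabled and importlib.util.find_spec("lmms_eval") is None:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "lmms-eval"])
```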
