@Mhdaw Mhdaw commented Jul 30, 2025

What this PR does:

  • Added snapshot saving: enables continued pre-training by saving the optimizer state and hyperparameters (sketch below).
  • Added model downcasting to lower precision: halves the memory footprint, with some numerical risk, especially in FP16. When the model is downcast to BF16/FP16, the logits are cast back to FP32 to avoid numerical instability in the loss (sketch below).
  • Added better device/dtype handling: automatically selects the best dtype, data loader parameters, seed, etc. (sketch below).
  • Better torch.compile handling: it also checks whether torch.compile is available (see the device/dtype sketch below).
  • Added a mini-config for the 135M-parameter version.
  • Added automatic lmms-eval installation when enabled in the config (sketch below).
  • Added requirements.txt for easier installation, including einops, which was missing before!
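
A minimal sketch of the snapshot idea, assuming a plain PyTorch training loop; the function names and snapshot fields here are illustrative, not the PR's exact API:

```python
import torch

def save_snapshot(path, model, optimizer, step, hparams):
    # Persist everything needed to resume pre-training mid-run:
    # weights, optimizer state (momentum buffers etc.), step counter,
    # and the hyperparameters the run was launched with.
    torch.save({
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "step": step,
        "hparams": hparams,  # e.g. lr, batch size, schedule position
    }, path)

def load_snapshot(path, model, optimizer):
    # Restore the saved state and return where training left off.
    snap = torch.load(path, map_location="cpu")
    model.load_state_dict(snap["model_state"])
    optimizer.load_state_dict(snap["optimizer_state"])
    return snap["step"], snap["hparams"]
```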
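
A sketch of the downcast-then-upcast pattern, with a stand-in `nn.Linear` in place of the real model; the key point is casting the logits to FP32 before the loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(16, 8)                 # stand-in for the real model
model = model.to(dtype=torch.bfloat16)   # downcast halves the memory footprint

x = torch.randn(4, 16, dtype=torch.bfloat16)
targets = torch.randint(0, 8, (4,))

logits = model(x)
# Cast logits up to FP32 before cross-entropy: the softmax/log-sum-exp
# inside the loss can overflow or lose precision in FP16/BF16.
loss = F.cross_entropy(logits.float(), targets)
```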
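
A sketch of the kind of automatic device/dtype selection, seeding, and torch.compile guard described above; the helper names are hypothetical:

```python
import random
import numpy as np
import torch

def pick_device_and_dtype():
    if torch.cuda.is_available():
        device = torch.device("cuda")
        # Prefer BF16 on GPUs that support it; otherwise fall back to FP16.
        dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    elif torch.backends.mps.is_available():
        device, dtype = torch.device("mps"), torch.float32
    else:
        device, dtype = torch.device("cpu"), torch.float32
    return device, dtype

def seed_everything(seed=42):
    # Seed all RNG sources for reproducible runs.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

def maybe_compile(model):
    # torch.compile only exists on PyTorch >= 2.0, so check before using it.
    if hasattr(torch, "compile"):
        return torch.compile(model)
    return model
```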
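
A sketch of the conditional lmms-eval installation; the config flag is hypothetical, but the check-then-pip-install pattern is standard:

```python
import importlib.util
import subprocess
import sys

def ensure_lmms_eval(enabled: bool):
    # Only install when the config asks for it and the package is missing.
    if enabled and importlib.util.find_spec("lmms_eval") is None:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "lmms-eval"])
```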
