My own PyTorch implementation (from scratch) of some well-known large language models (LLMs), for inference only. This is a practice and learning project; DO NOT use it in serious applications.
- download the model files from Hugging Face into the checkpoints directory
- modify pipeline.py to suit your test run
- run:
python pipeline.py
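
For the download step, one possible approach is the `huggingface_hub` package. The sketch below is only an illustration: the repo id `org/model-name` is a placeholder, `checkpoint_dir` and `download_checkpoint` are hypothetical helpers, and the folder layout that pipeline.py expects inside checkpoints is an assumption, so adjust it to match your setup.

```python
from pathlib import Path


def checkpoint_dir(repo_id, root="checkpoints"):
    """Map a Hub repo id such as 'org/name' to the local folder 'checkpoints/name'.

    Assumption: one sub-folder per model; this README does not state the layout.
    """
    return Path(root) / repo_id.split("/")[-1]


def download_checkpoint(repo_id, root="checkpoints"):
    """Download every file of a Hub repo into its checkpoint folder."""
    # Imported lazily so the path helper works without the package installed.
    from huggingface_hub import snapshot_download

    target = checkpoint_dir(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target


if __name__ == "__main__":
    # Placeholder repo id; replace it with the model you want to test.
    print(checkpoint_dir("org/model-name"))
```

Calling `download_checkpoint("org/model-name")` needs network access, and gated models additionally require a Hugging Face login.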
- KV caching
  - static cache
  - paged cache
- Sampling method
  - temperature
  - top_k
  - top_p
  - repetition_penalty
  - beam search
- Position Embedding
  - rotary position embeddings
  - partial rotary position embeddings
- Attention Mechanism
  - multi-head
  - multi-query
  - grouped-query
- Decoding
- context window extension
- Quantization
- continuous batching
- Streamer
- torch.compile
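
As a rough illustration of how the sampling options listed above (temperature, top_k, top_p, repetition_penalty) are commonly combined, here is a minimal sketch for a single sequence. It is not taken from this project's code: `filter_logits` and `sample_next_token` are hypothetical names, the order of the steps is one common convention among several, and beam search is not covered.

```python
import torch


def filter_logits(logits, generated_ids, temperature=1.0, top_k=0, top_p=1.0,
                  repetition_penalty=1.0):
    """Return adjusted logits for one sequence; logits has shape (vocab_size,)."""
    logits = logits.clone().float()

    # Repetition penalty: make tokens that were already generated less likely.
    if repetition_penalty != 1.0 and len(generated_ids) > 0:
        ids = torch.tensor(sorted(set(generated_ids)), dtype=torch.long)
        seen = logits[ids]
        logits[ids] = torch.where(seen > 0, seen / repetition_penalty,
                                  seen * repetition_penalty)

    # Temperature: values below 1 sharpen the distribution, values above 1 flatten it.
    # The clamp avoids division by zero; a tiny temperature behaves almost greedily.
    logits = logits / max(temperature, 1e-5)

    # Top-k: keep only the k highest-scoring tokens (0 disables the filter).
    if top_k > 0:
        k = min(top_k, logits.numel())
        kth_value = torch.topk(logits, k).values[-1]
        logits[logits < kth_value] = float("-inf")

    # Top-p (nucleus): keep the smallest set of tokens whose probability mass
    # reaches top_p. The most likely token is always kept.
    if top_p < 1.0:
        sorted_logits, sorted_ids = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        mass_before = probs.cumsum(dim=-1) - probs
        logits[sorted_ids[mass_before >= top_p]] = float("-inf")

    return logits


def sample_next_token(logits, generated_ids, **kwargs):
    """Draw one token id from the filtered distribution."""
    probs = torch.softmax(filter_logits(logits, generated_ids, **kwargs), dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```

For example, `sample_next_token(logits, generated_ids, temperature=0.8, top_k=50, top_p=0.95, repetition_penalty=1.1)` would draw the next token id from the last-position logits of a model; these parameter values are arbitrary illustrations, not the project's defaults.
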
Please buy me a ☕ if you find this project useful.