This project demonstrates how to fine-tune the LLaMA 3.2-1B Instruct model using text extracted from the Harry Potter book series. The training was conducted using LoRA (Low-Rank Adaptation) and QLoRA techniques for parameter-efficient fine-tuning. The goal is to create a custom model that generates Harry Potter-themed text and understands the context specific to the book series.
- Extracted text from all Harry Potter books and chunked it for training.
- Fine-tuned LLaMA 3.2-1B using LoRA and QLoRA for causal language modeling.
- Efficient parameter tuning with reduced memory requirements.
- Saved fine-tuned model weights for inference or further fine-tuning.
LoRA (Low-Rank Adaptation) is a technique designed to make fine-tuning large language models more efficient. Instead of updating all model parameters during fine-tuning, LoRA introduces additional low-rank trainable matrices into specific layers of the model, significantly reducing the number of parameters that need to be updated.
- Significantly reduces memory usage.
- Faster fine-tuning on large-scale models.
- Maintains high performance with fewer trainable parameters.
For a deeper understanding, refer to the LoRA paper:
LoRA: Low-Rank Adaptation of Large Language Models by Edward J. Hu et al.
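Conceptually, LoRA freezes a pretrained weight matrix W and learns a low-rank update scaled by alpha/r. A minimal PyTorch sketch of the idea (illustrative only, not code from this repository):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # up-projection, starts at zero
        self.scaling = alpha / r

    def forward(self, x):
        # Output of the frozen layer plus the low-rank correction
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

Only A and B are updated during fine-tuning, so the trainable parameters per adapted layer drop from in_features × out_features to r × (in_features + out_features).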
QLoRA (Quantized LoRA) builds upon the LoRA framework by using quantization techniques. It leverages 4-bit quantization for the model weights to further reduce memory usage while maintaining the ability to fine-tune efficiently.
- 4-bit quantized base models for reduced memory consumption.
- Parameter-efficient fine-tuning to adapt the model to new tasks.
- Allows training of large-scale models on a single GPU.
- Reduces the computational footprint without sacrificing performance.
For more details, check the QLoRA paper:
QLoRA: Efficient Finetuning of Quantized LLMs by Tim Dettmers et al.
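In practice, QLoRA-style loading is typically set up with bitsandbytes 4-bit quantization through Hugging Face Transformers. A hedged sketch (the base checkpoint name and quantization settings are assumptions, not necessarily those used in train.py):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",  # assumed base checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # prepares the quantized model for LoRA training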
git clone https://github.com/AnanthaPadmanaban-KrishnaKumar/EffiLLaMA.git
cd EffiLLaMA
python -m venv env
source env/bin/activate # On Linux/macOS
env\Scripts\activate # On Windows
pip install -r requirements.txt
The dataset is based on the text extracted from the Harry Potter book series. The preprocessing steps included:
- Loading PDF files using PyPDFDirectoryLoader.
- Splitting text into chunks using RecursiveCharacterTextSplitter with a chunk size of 1500 tokens and a chunk overlap of 50 tokens.
- Normalizing the text (e.g., removing unnecessary characters and newlines).
The resulting dataset is stored as a list of dictionaries with the format:
[
    {
        "text": "Harry Potter and the Philosopher's Stone begins with..."
    },
    ...
]
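A minimal sketch of this preprocessing pipeline (the directory path and whitespace normalization are illustrative, and import paths depend on your LangChain version):

from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load every PDF in the (hypothetical) books directory
loader = PyPDFDirectoryLoader("data/harry_potter_books")
documents = loader.load()

# Split into overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# Normalize whitespace/newlines and store as {"text": ...} records
dataset = [{"text": " ".join(chunk.page_content.split())} for chunk in chunks]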
The train.py script implements the following:
- Data Preprocessing: Loads, chunks, and normalizes the text.
- Dataset Preparation: Converts text into a Hugging Face Dataset for training.
- Model Initialization: Loads the LLaMA 3.2-1B Instruct model and tokenizer.
- LoRA Configuration: Applies parameter-efficient tuning with LoRA.
- Training: Fine-tunes the model using the Hugging Face Trainer with fp16 mixed precision (sketched after the LoRA configuration below).
python train.py
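For the dataset-preparation step, a hedged sketch (assuming the `dataset` list built during preprocessing and a loaded `tokenizer`; the maximum length is illustrative):

from datasets import Dataset

# Wrap the list of {"text": ...} chunks in a Hugging Face Dataset
hf_dataset = Dataset.from_list(dataset)

# Tokenize each chunk for causal language modeling
tokenized = hf_dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=["text"],
)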
The LoRA configuration used in train.py:

from peft import LoraConfig

lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=[
"q_proj", "v_proj",
"k_proj", "o_proj",
"gate_proj", "up_proj",
"down_proj"
],
lora_dropout=0.1,
bias="none",
task_type="CAUSAL_LM"
)
- r: Low-rank dimension for decomposition matrices.
- lora_alpha: Scaling factor for LoRA outputs.
- target_modules: Specifies which layers to adapt with LoRA.
- lora_dropout: Dropout rate for regularization.
- bias: Bias handling; "none" keeps all bias terms frozen.
- task_type: Task type set to CAUSAL_LM (causal language modeling).
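The configuration above is attached to the base model with PEFT, and training then runs through the Hugging Face Trainer with fp16 mixed precision. A hedged sketch (assuming the `model`, `tokenizer`, and `tokenized` dataset from the previous steps; hyperparameter values are illustrative, not the exact ones used in train.py):

from peft import get_peft_model
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

# Wrap the base model with the LoRA adapters
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable

training_args = TrainingArguments(
    output_dir="final_model",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,                 # mixed-precision training
    logging_steps=50,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("final_model")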
- The fine-tuned model is saved in the final_model directory.
- Logs and checkpoints are saved during training for monitoring progress.
The fine-tuned model is hosted on Hugging Face as EffiLLaMA (AIAlbus/EffiLLaMA).
Here’s how to use the model for inference:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load tokenizer and model
model_name = "AIAlbus/EffiLLaMA"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers ship without a pad token
base_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base_model, model_name)

# Prepare input and move it to the model's device
input_text = "Why did Harry Potter survive Voldemort's attack?"
inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).to(model.device)

# Generate response
output = model.generate(**inputs, max_length=150, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
- Hugging Face Transformers for tools to load and fine-tune the model.
- LangChain for efficient text preprocessing.
- Research on LoRA and QLoRA for parameter-efficient fine-tuning methods.
This project is licensed under the MIT License. See the LICENSE file for details.