This project demonstrates how to fine-tune the DistilBERT model on the GoEmotions dataset to perform sentiment analysis. The notebook provides a step-by-step guide, from data preprocessing to model evaluation.
Sentiment analysis is a core task in natural language processing (NLP): determining the emotional tone of a piece of text. This project uses the GoEmotions dataset, a collection of human-annotated Reddit comments labeled with 27 emotion categories plus a neutral label, to fine-tune DistilBERT, a smaller, faster, and lighter version of BERT.
The GoEmotions dataset consists of approximately 58,000 Reddit comments labeled with 27 emotion categories (such as joy, sadness, and anger) plus a neutral label. A comment can carry more than one label, so the dataset supports models that detect fine-grained, overlapping emotions.
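Because a GoEmotions comment can be annotated with several emotions at once, one common way to prepare targets is a multi-hot vector per comment. The following is a minimal sketch, not necessarily what the notebook does: the helper name `to_multi_hot`, the label count of 28 (27 emotions plus neutral), and the example label ids are all assumptions made for illustration.

```python
import numpy as np

# Assumption: 27 emotion categories plus one neutral label.
NUM_LABELS = 28


def to_multi_hot(label_ids, num_labels=NUM_LABELS):
    """Turn per-comment lists of label ids into a float multi-hot matrix."""
    encoded = np.zeros((len(label_ids), num_labels), dtype=np.float32)
    for row, ids in enumerate(label_ids):
        # Set every column named in this comment's label list to 1.
        encoded[row, ids] = 1.0
    return encoded


# Hypothetical annotations: the first comment has two labels, the second one.
example_labels = [[2, 17], [27]]
targets = to_multi_hot(example_labels)
print(targets.shape)
```

Float targets are used here because multi-label losses such as binary cross-entropy expect floating-point labels.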
DistilBERT is a distilled version of BERT. According to the DistilBERT paper, it is 40% smaller than BERT and 60% faster while retaining 97% of its language understanding capabilities. Fine-tuning DistilBERT on GoEmotions therefore gives an efficient model for emotion-level sentiment analysis.
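As a rough illustration of the starting point for fine-tuning, the sketch below loads a pretrained DistilBERT with a fresh classification head through the Transformers `Auto*` classes. The checkpoint name `distilbert-base-uncased`, the 28 output labels, the multi-label problem type, and the helper name `build_classifier` are assumptions; the notebook may configure the model differently.

```python
# Assumptions: the "distilbert-base-uncased" checkpoint and 28 output labels
# (27 emotions plus neutral); the notebook may use different values.
CHECKPOINT = "distilbert-base-uncased"
NUM_LABELS = 28


def build_classifier(checkpoint=CHECKPOINT, num_labels=NUM_LABELS):
    """Load a tokenizer and a DistilBERT model with a new classification head."""
    # Imported inside the function so that defining it does not load the library.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=num_labels,
        # Assumption: treat the task as multi-label, so the model uses a
        # per-label sigmoid loss instead of a softmax over all labels.
        problem_type="multi_label_classification",
    )
    return tokenizer, model
```

Calling `build_classifier()` downloads the checkpoint on first use. The returned model can then be trained with a standard PyTorch loop or the Transformers `Trainer`, using tokenized comments and float multi-hot label vectors.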
To run the notebook, ensure you have the following dependencies installed:
- Python 3.x
- Transformers
- Datasets
- PyTorch
- scikit-learn
- pandas
- numpy
You can install the required packages using:
pip install transformers datasets torch scikit-learn pandas numpy
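After installing, a quick check like the one below can confirm that each dependency is importable. This is only a convenience sketch using the standard library; the helper name `find_missing` is hypothetical. Note that some packages are imported under a different name than the one passed to pip (for example, `scikit-learn` is imported as `sklearn`).

```python
from importlib.util import find_spec

# Maps each pip package name to the module name used in import statements.
REQUIRED_MODULES = {
    "transformers": "transformers",
    "datasets": "datasets",
    "torch": "torch",
    "scikit-learn": "sklearn",
    "pandas": "pandas",
    "numpy": "numpy",
}


def find_missing(modules=REQUIRED_MODULES):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, import_name in modules.items()
        if find_spec(import_name) is None
    ]


missing = find_missing()
print("Missing packages:", ", ".join(missing) if missing else "none")
```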
- GoEmotions Dataset: https://github.com/google-research/google-research/tree/master/goemotions
- DistilBERT Paper: https://arxiv.org/abs/1910.01108
- Hugging Face Transformers: https://github.com/huggingface/transformers