This repository focuses on detecting explicit content in images using Vision Transformers (ViTs) and Swin Transformers. The project explores multiple architectures, including convolution-based patch extraction and autoencoder-based feature extraction, to improve detection accuracy.
- Feed Forward Patch Extraction - Standard ViT approach with feed-forward networks.
- Convolution-Based Patch Extraction - Using CNNs to extract patch embeddings before feeding them into transformers (see the sketch after this list).
- Swin Transformer Model - A hierarchical vision transformer, the best-performing model in our experiments (98.35% training / 86.89% validation accuracy).
- Autoencoder Feature Extraction (Upcoming) - Enhancing feature representation with autoencoders.
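As a reference for the convolution-based variant, here is a minimal sketch of CNN patch extraction, assuming a TensorFlow/Keras setup; the layer sizes, names, and hyper-parameters are illustrative, not the repository's actual configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def cnn_patch_extractor(image_size=224, patch_size=16, embed_dim=64):
    """Extract patch embeddings with a strided convolution instead of
    flat reshaping: each patch_size x patch_size region becomes one token."""
    inputs = layers.Input(shape=(image_size, image_size, 3))
    # A conv whose kernel size equals its stride yields exactly one
    # embedding vector per non-overlapping patch.
    x = layers.Conv2D(embed_dim, kernel_size=patch_size, strides=patch_size)(inputs)
    num_patches = (image_size // patch_size) ** 2
    x = layers.Reshape((num_patches, embed_dim))(x)  # (batch, tokens, dim)
    return tf.keras.Model(inputs, x, name="cnn_patch_extractor")
```

The resulting `(batch, tokens, dim)` sequence is what the transformer encoder consumes in place of flat-reshaped patches.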
Clone the repository:

```bash
git clone https://github.com/yaseeng-md/Explicit-Content-Detection
cd Explicit-Content-Detection
```

Install the necessary dependencies from `requirements.txt`:

```bash
pip install -r requirements.txt
```
Run the provided Jupyter notebooks to preprocess the data, train models, and make predictions.
Before you continue with the implementation, use `Dataset Collection.py` and `Remove Corrupt Files.py` in the `Helpers` directory:
- `Remove Corrupt Files.py` removes corrupted files from the existing dataset (a sketch of this step follows below).
- `Dataset Collection.py` lets you choose how much of the dataset to train and experiment on.
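For orientation, here is a minimal sketch of what such a cleanup step typically does, assuming a PIL-based validity check; the `data/train` path is hypothetical, and this is not necessarily how the repository's script is implemented:

```python
import os
from PIL import Image

def remove_corrupt_images(root_dir):
    """Delete files that PIL cannot open and verify as images."""
    removed = 0
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with Image.open(path) as img:
                    img.verify()  # raises on truncated/corrupt image data
            except Exception:
                os.remove(path)
                removed += 1
    print(f"Removed {removed} corrupt file(s) from {root_dir}")

remove_corrupt_images("data/train")  # hypothetical dataset path
```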
Trained models are provided on request.
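Once you have a checkpoint, loading it for inference is standard Keras usage. The filename and input size below are assumptions, and models containing custom ViT/Swin layers may additionally require `custom_objects`:

```python
import numpy as np
import tensorflow as tf

# Hypothetical checkpoint name; use the file provided on request.
model = tf.keras.models.load_model("swin_transformer.h5")

# Preprocess a single image the same way the training notebooks do.
img = tf.keras.utils.load_img("sample.jpg", target_size=(224, 224))  # assumed input size
x = np.expand_dims(tf.keras.utils.img_to_array(img) / 255.0, axis=0)
print(model.predict(x))  # predicted class probabilities
```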
Model | Training Accuracy | Training Loss | Validation Accuracy | Validation Loss |
---|---|---|---|---|
ViT (Feed Forward) | 78.95% | 0.4638 | 81.89% | 0.4351 |
ViT (CNN Patch Extractor) | 88.05% | 0.430 | 81.86% | 0.4893 |
Swin Transformer | 98.35% | 0.0458 | 86.89% | 0.3443 |
- Implement Feed-Forward Autoencoder-based Feature Extraction.
- Implement Convolutional Autoencoder-based Feature Extraction.
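As a rough illustration of the planned convolutional variant, one common pattern is to pretrain an autoencoder on reconstruction and then reuse its encoder as the feature extractor. This is a sketch under that assumption, not the repository's final design; all layer sizes are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_autoencoder(image_size=224):
    """Convolutional autoencoder; after pretraining, the encoder's
    feature maps can replace raw patches as transformer input."""
    inputs = layers.Input(shape=(image_size, image_size, 3))
    # Encoder: downsample to a compact feature map.
    x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
    encoded = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)
    # Decoder: reconstruct the input image for the pretraining objective.
    x = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(encoded)
    x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
    outputs = layers.Conv2DTranspose(3, 3, strides=2, padding="same", activation="sigmoid")(x)
    autoencoder = tf.keras.Model(inputs, outputs, name="conv_autoencoder")
    encoder = tf.keras.Model(inputs, encoded, name="encoder")
    return autoencoder, encoder
```

After pretraining `autoencoder` with a reconstruction loss, `encoder` produces a 28x28x128 feature map (for 224x224 inputs) that can be flattened into tokens for the transformer.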