Explicit Content Detection using Vision Transformers

Overview

This repository focuses on detecting explicit content in images using Vision Transformers (ViTs) and Swin Transformers. The project explores multiple architectures, including convolution-based patch extraction and autoencoder-based feature extraction, to improve detection accuracy.

Models Implemented

  1. Feed Forward Patch Extraction - A standard ViT that flattens non-overlapping image patches and embeds them with a feed-forward (linear) projection.
  2. Convolution-Based Patch Extraction - Uses a CNN stem to extract patch embeddings before feeding them into the transformer.
  3. Swin Transformer Model - A hierarchical vision transformer, the best-performing model in this project (98.35% training accuracy).
  4. Autoencoder Feature Extraction (Upcoming) - Enhancing feature representation with autoencoders.
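For reference, the feed-forward patch extraction step in (1) can be sketched as follows. This is a generic illustration of the standard ViT patchify operation with assumed image and patch sizes, not the exact code from the notebooks in this repository.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C);
    in a ViT, each row is then linearly projected to the embedding dim.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Reshape into a grid of patches, move the grid axes together,
    # then flatten each patch into a single token vector.
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, patch_size * patch_size * c)

# A 224x224 RGB image with 16x16 patches yields 196 tokens of dim 768.
img = np.zeros((224, 224, 3))
tokens = patchify(img, 16)
print(tokens.shape)  # (196, 768)
```

The convolution-based variant in (2) replaces this reshape with a strided convolution whose kernel and stride equal the patch size.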

How to Use This Repository

Clone the repository using:

  git clone https://github.com/yaseeng-md/Explicit-Content-Detection
  cd Explicit-Content-Detection

Install the necessary dependencies from the requirements.txt file.

pip install -r requirements.txt

Run the provided Jupyter notebooks to preprocess the data, train models, and make predictions.

Prerequisites

Before you continue with the implementation, use Dataset Collection.py and Remove Corrupt Files.py in the Helpers folder.

Remove Corrupt Files.py removes corrupted files from the existing dataset.

Dataset Collection.py lets you decide how much of the dataset you want to train and experiment on.
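As a rough sketch, the cleanup step performed by Remove Corrupt Files.py might look like the following (the actual script may differ); it assumes a flat directory of image files and uses Pillow to detect unreadable images.

```python
import os
from PIL import Image

def remove_corrupt_images(directory: str) -> list:
    """Delete files that Pillow cannot open/verify; return the removed paths.

    A minimal illustration of corrupt-file cleanup, not the repository's
    actual helper script.
    """
    removed = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        try:
            with Image.open(path) as img:
                img.verify()  # raises on unreadable or truncated image data
        except Exception:
            os.remove(path)
            removed.append(path)
    return removed
```

Running this once before training ensures the data loaders never hit a file that fails to decode mid-epoch.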

Trained Models

Download the trained models from here!

Dataset

Provided on request.

Results

| Model | Training Accuracy | Training Loss | Validation Accuracy | Validation Loss |
|---|---|---|---|---|
| ViT (Feed Forward) | 78.95% | 0.4638 | 81.89% | 0.4351 |
| ViT (CNN Patch Extractor) | 88.05% | 0.4300 | 81.86% | 0.4893 |
| Swin Transformer | 98.35% | 0.0458 | 86.89% | 0.3443 |

Future Work

  • Implement Feed Forward Autoencoder-based Feature Extraction.
  • Implement Convolutional Autoencoder-based Feature Extraction.
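As a rough illustration of the planned autoencoder-based feature extraction, a single forward pass through a feed-forward encoder/decoder might look like this. The weights are untrained and the dimensions are hypothetical; this is only a sketch of the idea, not the upcoming implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def autoencoder_features(x, enc_w, dec_w):
    """Encode an input vector to a low-dim code and reconstruct it.

    Returns (code, reconstruction); after training, the bottleneck code
    would serve as the feature vector fed to a downstream classifier.
    """
    code = np.maximum(enc_w @ x, 0.0)  # ReLU bottleneck features
    recon = dec_w @ code               # linear decoder
    return code, recon

x = rng.standard_normal(768)                    # e.g. one flattened patch
enc_w = rng.standard_normal((64, 768)) * 0.01   # 768 -> 64 encoder
dec_w = rng.standard_normal((768, 64)) * 0.01   # 64 -> 768 decoder
code, recon = autoencoder_features(x, enc_w, dec_w)
print(code.shape, recon.shape)  # (64,) (768,)
```

In practice the encoder and decoder would be trained to minimize reconstruction loss before the codes are used as features.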
