Welcome to the CV Parsing project! This project leverages the BEiT model for layout detection on CVs. The model consists of three key components:
- Backbone: We employ BEiT (BERT Pre-training of Image Transformers), a self-supervised pretrained image transformer.
- Neck: Our model incorporates a Feature Pyramid Network (FPN) to fuse multi-scale features from the backbone.
- Head: We use Faster R-CNN for object detection and recognition.
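The three components above map directly onto an MMDetection-style model config. The sketch below is illustrative only: the registered backbone name `BEiT` and all channel values are assumptions, not the project's actual settings.

```python
# Illustrative MMDetection-style model config (a sketch, not the
# project's real config; backbone name and channel counts are assumed).
model = dict(
    type='FasterRCNN',
    backbone=dict(type='BEiT'),              # pretrained image transformer
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],  # per-stage channels (example values)
        out_channels=256,
        num_outs=5),
    rpn_head=dict(type='RPNHead'),           # region proposals
    roi_head=dict(type='StandardRoIHead'))   # box classification/regression
```

In MMDetection, each `type` string is looked up in the model registry, which is why the BEiT backbone must be registered before a config can reference it (see the integration step below).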
To achieve accurate layout detection, our backbone (BEIT) is pretrained on a self-supervised task based on masked image modeling. For detailed information on the pretraining process, please refer to the pretraining readme.
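As a rough illustration of that objective (a toy sketch, not the actual BEiT pipeline; the patch size and mask ratio below are arbitrary), masked image modeling hides a random subset of image patches, and the model must predict the content of the hidden patches from the visible ones:

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(image, patch=16, mask_ratio=0.4):
    """Split an image into non-overlapping patches and zero out a random subset.

    Toy illustration of the masked-image-modeling objective: the model sees
    only the unmasked patches and is trained to predict the masked ones.
    """
    h, w = image.shape[:2]
    n_rows, n_cols = h // patch, w // patch
    n_patches = n_rows * n_cols
    n_masked = int(mask_ratio * n_patches)

    # Boolean mask over the flattened patch grid.
    mask = np.zeros(n_patches, dtype=bool)
    mask[rng.choice(n_patches, size=n_masked, replace=False)] = True

    # Zero out the pixels of every masked patch.
    corrupted = image.copy()
    for idx in np.flatnonzero(mask):
        r, c = divmod(idx, n_cols)
        corrupted[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0
    return corrupted, mask

img = rng.random((224, 224, 3))
corrupted, mask = mask_patches(img)
print(mask.sum(), mask.size)  # → 78 masked out of 196 patches
```

The pretraining loss is computed only on the masked positions, which forces the backbone to learn contextual representations of document layout rather than copying pixels.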
Before you start working with this project, ensure you have the following prerequisites in place:
- MMDetection 3.1.0: Make sure you have MMDetection version 3.1.0 installed.
- BEiT Backbone Integration: Move the file `layout detection/backbone/beit.py` to `mmdetection/mmdet/models/backbones` within your MMDetection installation. Additionally, import BEiT in `mmdetection/mmdet/models/backbones/__init__.py`.
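The `__init__.py` edit can look like the fragment below (a sketch: the class name `BEiT` is an assumption and must match whatever `beit.py` actually defines):

```python
# mmdetection/mmdet/models/backbones/__init__.py (excerpt)
from .beit import BEiT  # class name assumed; match the definition in beit.py

__all__ = [
    # ...existing backbones...
    'BEiT',
]
```

Adding the class to `__all__` alongside the import makes the backbone discoverable by name from MMDetection configs.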
To fine-tune the model, you can use the following command (note that MMDetection 3.x uses `--resume` rather than the older `--resume-from`):

```shell
python tools/train.py <config> --resume <last_checkpoint>
```
To test the model, you can use the following command:

```shell
python tools/test.py <config> <checkpoint> --show-dir <directory_results>
```
Here are some results on PubLayNet:
For user interface examples and inference, please refer to the Gradio UI notebook included in this repository. You'll find examples of how to interact with the model through the Gradio user interface.
If you need to extract small pieces of information, such as names and company details, we provide a notebook that applies Pix2Struct to this task.