This project provides a video processing tool that uses advanced AI models, specifically Florence2 and SAM2, to detect and segment objects or activities in a video based on textual descriptions. The system first identifies significant motion in the video frames and then runs deep-learning inference to locate the objects or actions described by the user's text input.
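For orientation, the sketch below shows one plausible way the text-grounded detection step could be run on a single frame with Florence-2 through the Hugging Face transformers API. The model ID, task prompt, frame path, and decoding settings are illustrative assumptions rather than the exact code in main.py; in the full pipeline the resulting boxes would then be handed to SAM2 for mask generation.

```python
# Hedged sketch: text-grounded detection on one frame with Florence-2 via transformers.
# The model ID, task token, prompt, and file name are illustrative assumptions.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("frame.jpg")          # a single video frame
task = "<CAPTION_TO_PHRASE_GROUNDING>"   # grounds a text phrase to bounding boxes
prompt = task + "person carrying a weapon"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
result = processor.post_process_generation(
    generated_text, task=task, image_size=(image.width, image.height)
)
print(result)  # e.g. {'<CAPTION_TO_PHRASE_GROUNDING>': {'bboxes': [...], 'labels': [...]}}
```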
Before running the script, ensure that all dependencies are installed. You can install the necessary packages using the following command:
pip install -r requirements.txt
To download the model checkpoints, run the following commands:
cd checkpoints
./download_ckpts.sh
cd ..
The tool requires the following:
- Python 3.7+
- OpenCV
- Pillow (PIL)
- PyTorch
- tqdm
Additionally, install the following packages:
pip install -q einops spaces timm transformers samv2 gradio supervision opencv-python
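A quick way to confirm that the core dependencies resolved correctly is an import check like the one below (the samv2 and spaces packages are omitted here because their import names may differ from their pip names):

```python
# Sanity check: import the main dependencies and print a few versions.
import cv2, torch, tqdm, einops, timm, transformers, gradio, supervision
from PIL import Image

print("OpenCV:", cv2.__version__)
print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
```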
The video processing can be executed from the command line with various arguments to specify the input video, output video, mask video, text input, and processing options.
python main.py --input_video_path <path_to_input_video> --output_video_path <path_to_output_video> --mask_video_path <path_to_mask_video> --text_input "your text here"
The following arguments are supported:
- --input_video_path: Required. Path to the source video file.
- --output_video_path: Required. Path to save the processed output video.
- --mask_video_path: Required. Path to save the mask video that highlights detected objects.
- --text_input: Required. Textual description of the object or activity to detect and segment in the video.
- --fps: Frames per second for the output video. Default is 20.
- --history: Background subtraction history length. Default is 500.
- --var_threshold: Background subtraction variance threshold. Default is 16.
- --detect_shadows: Enable shadow detection in background subtraction. Default is True. (These three background-subtraction options map directly onto OpenCV's MOG2 parameters; see the sketch below this list.)
- --use_flow: Use RAFT-based optical flow instead of background subtraction. Default is False.
- --raft_path: Path to the RAFT directory (required if --use_flow is enabled). Default is /kaggle/input/raft-pytorch.
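For reference, the three background-subtraction options above correspond directly to the parameters of OpenCV's createBackgroundSubtractorMOG2. The minimal sketch below shows the mapping; the file name and the read loop are illustrative, not the project's actual code.

```python
# Minimal sketch: how --history, --var_threshold, and --detect_shadows map onto
# OpenCV's MOG2 background subtractor. The path and loop are illustrative only.
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2(
    history=500,         # --history
    varThreshold=16,     # --var_threshold
    detectShadows=True,  # --detect_shadows
)

cap = cv2.VideoCapture("input_video.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)  # 255 = foreground, 127 = shadow, 0 = background
cap.release()
```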
Example invocations:
python main.py --input_video_path ./input_video.mp4 --output_video_path ./output_video.mp4 --mask_video_path ./mask_video.mp4 --text_input "person carrying a weapon"
python main.py --input_video_path ./input_video.mp4 --output_video_path ./output_video.mp4 --mask_video_path ./mask_video.mp4 --text_input "person carrying a weapon" --use_flow --raft_path /path/to/raft
A web-based user interface is available using Streamlit. To launch the web interface, run:
streamlit run app.py
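The snippet below is a hedged sketch of the kind of controls such a Streamlit front end typically exposes (file upload, text prompt, processing-mode toggle). The widget layout and the process_video helper are illustrative assumptions, not the actual contents of app.py.

```python
# Hedged sketch of a Streamlit front end; process_video is a hypothetical helper,
# not a function exported by this project.
import tempfile
import streamlit as st

st.title("Video Object/Action Segmentation")

uploaded = st.file_uploader("Upload a video", type=["mp4", "avi", "mov"])
text_input = st.text_input("What should be detected?", "person carrying a weapon")
use_flow = st.checkbox("Use RAFT-based optical flow instead of background subtraction")

if uploaded is not None and st.button("Process"):
    # Persist the upload to disk so the processing pipeline can read it by path.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".mp4") as tmp:
        tmp.write(uploaded.read())
        input_path = tmp.name
    # output_path = process_video(input_path, text_input, use_flow=use_flow)
    # st.video(output_path)
    st.write("Processing", input_path, "for:", text_input)
```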
Key features:
- Motion Detection: Detect significant motion in the video to focus processing on relevant segments (a sketch of this gating step follows this list).
- Object and Action Detection: Use state-of-the-art models (Florence2 and SAM2) to detect and segment objects or actions based on the provided text input.
- Dual Processing Modes: Choose between traditional background subtraction and RAFT-based optical flow for foreground extraction.
- Output Generation: Generate an annotated video along with a corresponding mask video showing the detected segments.
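As a rough illustration of the motion-gating idea above, the sketch below flags frames whose mean optical-flow magnitude exceeds a threshold. Farneback flow is used here as a lightweight stand-in for RAFT, and the 1.0-pixel threshold is an assumption for illustration, not the project's exact logic.

```python
# Hedged sketch: gate frames on motion using dense optical-flow magnitude.
# Farneback flow stands in for RAFT; the threshold is an illustrative assumption.
import cv2
import numpy as np

def motion_frames_by_flow(video_path, mag_threshold=1.0):
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        return []
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    flagged, index = [], 1
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        magnitude = np.linalg.norm(flow, axis=2)  # per-pixel displacement in pixels
        if magnitude.mean() > mag_threshold:
            flagged.append(index)                 # frame worth running detection on
        prev_gray = gray
        index += 1
    cap.release()
    return flagged
```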
- When using optical flow, ensure that the RAFT model and its weights are correctly set up in your environment.
- The web interface (app.py) allows you to upload videos and toggle between processing modes, providing a convenient user experience.
- WebUI
- Robust Video Synopsis