This project provides a video processing tool that uses advanced AI models, specifically Florence2 and SAM2, to detect and segment objects or activities in a video based on textual descriptions. The system first identifies significant motion in the video frames and then runs deep-learning inference to locate the objects or actions described by the user's text input.
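For orientation, the sketch below shows one plausible way the text-grounded detection step could be run on a single frame with Florence-2 through the Hugging Face transformers API. The model ID, task prompt, frame path, and decoding settings are illustrative assumptions rather than the exact code in main.py; in the full pipeline the resulting boxes would then be handed to SAM2 for mask generation.

```python
# Hedged sketch: text-grounded detection on one frame with Florence-2 via transformers.
# The model ID, task token, prompt, and file name are illustrative assumptions.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("frame.jpg")          # a single video frame
task = "<CAPTION_TO_PHRASE_GROUNDING>"   # grounds a text phrase to bounding boxes
prompt = task + "person carrying a weapon"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
result = processor.post_process_generation(
    generated_text, task=task, image_size=(image.width, image.height)
)
print(result)  # e.g. {'<CAPTION_TO_PHRASE_GROUNDING>': {'bboxes': [...], 'labels': [...]}}
```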
Before running the script, ensure that all dependencies are installed. You can install the necessary packages using the following command:
pip install -r requirements.txt
To download the model checkpoints, run the following commands:
cd checkpoints
./download_ckpts.sh
cd ..
The tool requires the following:
- Python 3.7+
- OpenCV
- Pillow (PIL)
- PyTorch
- tqdm
Additionally, install the following packages:
pip install -q einops spaces timm transformers samv2 gradio supervision opencv-python
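A quick way to confirm that the core dependencies resolved correctly is an import check like the one below (the samv2 and spaces packages are omitted here because their import names may differ from their pip names):

```python
# Sanity check: import the main dependencies and print a few versions.
import cv2, torch, tqdm, einops, timm, transformers, gradio, supervision
from PIL import Image

print("OpenCV:", cv2.__version__)
print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
```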
The video processing can be executed from the command line with various arguments to specify the input video, output video, mask video, text input, and processing options.
python main.py --input_video_path <path_to_input_video> --output_video_path <path_to_output_video> --mask_video_path <path_to_mask_video> --text_input "your text here"
The following arguments are supported:
- --input_video_path: Required. Path to the source video file.
- --output_video_path: Required. Path to save the processed output video.
- --mask_video_path: Required. Path to save the mask video that highlights detected objects.
- --text_input: Required. Textual description of the object or activity to detect and segment in the video.
- --fps: Frames per second for the output video. Default is 20.
- --history: Background subtraction history length. Default is 500.
- --var_threshold: Background subtraction variance threshold. Default is 16.
- --detect_shadows: Enable shadow detection in background subtraction. Default is True. (These three background-subtraction options map directly onto OpenCV's MOG2 parameters; see the sketch below this list.)
- --use_flow: Use RAFT-based optical flow instead of background subtraction. Default is False.
- --raft_path: Path to the RAFT directory (required if --use_flow is enabled). Default is /kaggle/input/raft-pytorch.
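For reference, the three background-subtraction options above correspond directly to the parameters of OpenCV's createBackgroundSubtractorMOG2. The minimal sketch below shows the mapping; the file name and the read loop are illustrative, not the project's actual code.

```python
# Minimal sketch: how --history, --var_threshold, and --detect_shadows map onto
# OpenCV's MOG2 background subtractor. The path and loop are illustrative only.
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2(
    history=500,         # --history
    varThreshold=16,     # --var_threshold
    detectShadows=True,  # --detect_shadows
)

cap = cv2.VideoCapture("input_video.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)  # 255 = foreground, 127 = shadow, 0 = background
cap.release()
```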
Example invocations:
python main.py --input_video_path ./input_video.mp4 --output_video_path ./output_video.mp4 --mask_video_path ./mask_video.mp4 --text_input "person carrying a weapon"
python main.py --input_video_path ./input_video.mp4 --output_video_path ./output_video.mp4 --mask_video_path ./mask_video.mp4 --text_input "person carrying a weapon" --use_flow --raft_path /path/to/raft
A web-based user interface is available using Streamlit. To launch the web interface, run:
streamlit run app.py
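The snippet below is a hedged sketch of the kind of controls such a Streamlit front end typically exposes (file upload, text prompt, processing-mode toggle). The widget layout and the process_video helper are illustrative assumptions, not the actual contents of app.py.

```python
# Hedged sketch of a Streamlit front end; process_video is a hypothetical helper,
# not a function exported by this project.
import tempfile
import streamlit as st

st.title("Video Object/Action Segmentation")

uploaded = st.file_uploader("Upload a video", type=["mp4", "avi", "mov"])
text_input = st.text_input("What should be detected?", "person carrying a weapon")
use_flow = st.checkbox("Use RAFT-based optical flow instead of background subtraction")

if uploaded is not None and st.button("Process"):
    # Persist the upload to disk so the processing pipeline can read it by path.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".mp4") as tmp:
        tmp.write(uploaded.read())
        input_path = tmp.name
    # output_path = process_video(input_path, text_input, use_flow=use_flow)
    # st.video(output_path)
    st.write("Processing", input_path, "for:", text_input)
```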
Key features:
- Motion Detection: Detect significant motion in the video to focus processing on relevant segments (a sketch of this gating step follows this list).
- Object and Action Detection: Use state-of-the-art models (Florence2 and SAM2) to detect and segment objects or actions based on the provided text input.
- Dual Processing Modes: Choose between traditional background subtraction and RAFT-based optical flow for foreground extraction.
- Output Generation: Generate an annotated video along with a corresponding mask video showing the detected segments.
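As a rough illustration of the motion-gating idea above, the sketch below flags frames whose mean optical-flow magnitude exceeds a threshold. Farneback flow is used here as a lightweight stand-in for RAFT, and the 1.0-pixel threshold is an assumption for illustration, not the project's exact logic.

```python
# Hedged sketch: gate frames on motion using dense optical-flow magnitude.
# Farneback flow stands in for RAFT; the threshold is an illustrative assumption.
import cv2
import numpy as np

def motion_frames_by_flow(video_path, mag_threshold=1.0):
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        return []
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    flagged, index = [], 1
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        magnitude = np.linalg.norm(flow, axis=2)  # per-pixel displacement in pixels
        if magnitude.mean() > mag_threshold:
            flagged.append(index)                 # frame worth running detection on
        prev_gray = gray
        index += 1
    cap.release()
    return flagged
```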
- When using optical flow, ensure that the RAFT model and its weights are correctly set up in your environment.
- The web interface (app.py) allows you to upload videos and toggle between processing modes, providing a convenient user experience.
- WebUI
- Robust Video Synopsis