🎥 FAR: Frame Autoregressive Model for Both Short- and Long-Context Video Modeling 🚀

Project Page · arXiv · Hugging Face Weights · Colab Demo

Long-Context Autoregressive Video Modeling with Next-Frame Prediction


📢 News

  • 2025-04: Added a multi-level KV cache for faster inference on long videos. 🎉 Check our updated paper for details. We also release a Colab demo for inference speed testing.
  • 2025-04: Released a Colab demo for quick inference! 🎉
  • 2025-03: Paper and code of FAR are released! 🎉

🌟 What's the Potential of FAR?

🔥 Introducing FAR: a new baseline for autoregressive video generation

FAR (Frame AutoRegressive Model) learns to predict continuous frames from an autoregressive context. Its objective aligns naturally with video modeling, analogous to next-token prediction in language modeling.
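As a rough conceptual sketch only (the regression loss below is a stand-in, not FAR's actual training objective, and all module names are placeholders), the toy PyTorch snippet illustrates the frame-autoregressive setup: each frame is predicted from the clean frames before it, mirroring next-token prediction in language models.

# Toy frame-autoregressive model: a causal Transformer over latent frames.
import torch
import torch.nn as nn

class ToyFrameAR(nn.Module):
    def __init__(self, latent_dim=16, hidden=128):
        super().__init__()
        self.proj_in = nn.Linear(latent_dim, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.proj_out = nn.Linear(hidden, latent_dim)

    def forward(self, frames):  # frames: (B, T, latent_dim)
        T = frames.shape[1]
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.backbone(self.proj_in(frames), mask=causal)
        return self.proj_out(h)  # per-position prediction of the next frame

model = ToyFrameAR()
frames = torch.randn(2, 8, 16)                      # a batch of latent frame sequences
pred = model(frames[:, :-1])                        # predict frame t+1 from frames <= t
loss = nn.functional.mse_loss(pred, frames[:, 1:])  # placeholder loss for illustration
loss.backward()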


🔥 FAR achieves better convergence than video diffusion models with the same continuous latent space:

🔥 FAR leverages clean visual context without additional image-to-video fine-tuning:

Unconditional pretraining on UCF-101 achieves state-of-the-art results in both video generation (context frame = 0) and video prediction (context frame ≥ 1) within a single model.

🔥 FAR supports efficient training on long video sequences with manageable token lengths:

The key technique behind this is long short-term context modeling, where we use regular patchification for short-term context to ensure fine-grained temporal consistency and aggressive patchification for long-term context to reduce redundant tokens.
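A minimal sketch of the idea (the patch sizes, context split, and resolution below are illustrative assumptions, not the paper's settings): recent frames keep a fine patch size, while older frames are patchified aggressively, so the total token count grows slowly with video length.

# Token-count illustration of long short-term context patchification.
import torch

def patchify(frames, patch):  # frames: (T, C, H, W) -> (T, tokens_per_frame, dim)
    T, C, H, W = frames.shape
    x = frames.unfold(2, patch, patch).unfold(3, patch, patch)  # (T, C, H/p, W/p, p, p)
    x = x.reshape(T, C, -1, patch, patch).permute(0, 2, 1, 3, 4)
    return x.reshape(T, -1, C * patch * patch)

video = torch.randn(64, 3, 128, 128)           # a 64-frame clip
short_ctx, long_ctx = video[-8:], video[:-8]   # last 8 frames form the short-term context

short_tokens = patchify(short_ctx, patch=8)    # fine patches: 8 frames x 256 tokens = 2048
long_tokens = patchify(long_ctx, patch=32)     # coarse patches: 56 frames x 16 tokens = 896

In a real model both token streams would be projected to a shared embedding dimension before attention; the sketch only shows how aggressive patchification keeps the long-term context cheap.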

🔥 FAR exploits the multi-level KV-Cache to speed up autoregressive inference on long videos:
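As a very rough illustration only (the window size, pooling-based compression, and cache layout below are assumptions; see the paper for the actual multi-level KV-cache design), the idea is to keep full-resolution keys/values for a short window of recent frames and a compressed copy for older frames:

# Two-level KV cache sketch: fine K/V for recent frames, pooled K/V for older ones.
import torch
import torch.nn.functional as F

class TwoLevelKVCache:
    def __init__(self, window=4, pool=4):
        self.window, self.pool = window, pool
        self.fine, self.coarse = [], []              # lists of per-frame (K, V) pairs

    def append(self, k, v):                          # k, v: (tokens_per_frame, dim)
        self.fine.append((k, v))
        if len(self.fine) > self.window:             # oldest fine entry is demoted ...
            old_k, old_v = self.fine.pop(0)          # ... to a pooled, cheaper copy
            compress = lambda x: F.avg_pool1d(x.T.unsqueeze(0), self.pool).squeeze(0).T
            self.coarse.append((compress(old_k), compress(old_v)))

    def context(self):                               # K/V attended to at the current step
        ks = [k for k, _ in self.coarse] + [k for k, _ in self.fine]
        vs = [v for _, v in self.coarse] + [v for _, v in self.fine]
        return torch.cat(ks), torch.cat(vs)

cache = TwoLevelKVCache()
for _ in range(10):                                  # simulate decoding 10 frames
    cache.append(torch.randn(256, 64), torch.randn(256, 64))
keys, values = cache.context()                       # 1408 rows instead of 10 * 256 = 2560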

📚 For more details, check out our paper.

🏋️‍♂️ FAR Model Zoo

We provide the FAR models trained in our paper for reproduction.

Video Generation

We use seeds [0, 2, 4, 6] for evaluation, following the evaluation protocol of Latte:

| Model (Config) | #Params | Resolution | Condition | FVD | HF Weights | Pre-Computed Samples | Train Cost (H100 Days) | Memory (Per GPU) |
|---|---|---|---|---|---|---|---|---|
| FAR-L | 457 M | 128x128 | | 280 ± 11.7 | Model-HF | Google Drive | 12.2 | 22 G |
| FAR-L | 457 M | 128x128 | | 99 ± 5.9 | Model-HF | Google Drive | 12.2 | 22 G |
| FAR-L | 457 M | 256x256 | | 303 ± 13.5 | Model-HF | Google Drive | 12.7 | 22 G |
| FAR-L | 457 M | 256x256 | | 113 ± 3.6 | Model-HF | Google Drive | 12.7 | 22 G |
| FAR-XL | 657 M | 256x256 | | 279 ± 9.2 | Model-HF | Google Drive | 14.6 | 22 G |
| FAR-XL | 657 M | 256x256 | | 108 ± 4.2 | Model-HF | Google Drive | 14.6 | 22 G |

Short-Video Prediction

We follow the evaluation protocol of MCVD and ExtDM:

| Model (Config) | #Params | Dataset | PSNR | SSIM | LPIPS | FVD | HF Weights | Pre-Computed Samples | Train Cost (H100 Days) | Memory (Per GPU) |
|---|---|---|---|---|---|---|---|---|---|---|
| FAR-B | 130 M | UCF101 | 25.64 | 0.818 | 0.037 | 194.1 | Model-HF | Google Drive | 3.6 | 9 G |
| FAR-B | 130 M | BAIR (c=2, p=28) | 19.40 | 0.819 | 0.049 | 144.3 | Model-HF | Google Drive | 2.6 | 12 G |

Long-Video Prediction

We use seeds [0, 2, 4, 6] for evaluation, following the evaluation protocol of TECO:

| Model (Config) | #Params | Dataset | PSNR | SSIM | LPIPS | FVD | HF Weights | Pre-Computed Samples | Train Cost (H100 Days) | Memory (Per GPU) |
|---|---|---|---|---|---|---|---|---|---|---|
| FAR-B-Long | 150 M | DMLab | 22.3 | 0.687 | 0.104 | 64 | Model-HF | Google Drive | 17.5 | 13 G |
| FAR-M-Long | 280 M | Minecraft | 16.9 | 0.448 | 0.251 | 39 | Model-HF | Google Drive | 18.2 | 19 G |

🔧 Dependencies and Installation

1. Setup Environment:

# Setup Conda Environment
conda create -n FAR python=3.10
conda activate FAR

# Install Pytorch
conda install pytorch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 pytorch-cuda=12.4 -c pytorch -c nvidia

# Install Other Dependencies
pip install -r requirements.txt

2. Prepare Dataset:

We have uploaded the datasets used in this paper to Hugging Face Datasets for faster download. Please follow the instructions below to prepare them.

from huggingface_hub import snapshot_download, hf_hub_download

dataset_url = {
    "ucf101": "guyuchao/UCF101",
    "bair": "guyuchao/BAIR",
    "minecraft": "guyuchao/Minecraft",
    "minecraft_latent": "guyuchao/Minecraft_Latent",
    "dmlab": "guyuchao/DMLab",
    "dmlab_latent": "guyuchao/DMLab_Latent"
}

for key, url in dataset_url.items():
    snapshot_download(
        repo_id=url,
        repo_type="dataset",
        local_dir=f"datasets/{key}",
        token="input your hf token here"
    )

Then, enter each dataset's directory and extract the shards:

find . -name "shard-*.tar" -exec tar -xvf {} \;
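Equivalently, here is a small Python sketch (assuming the datasets/<name>/ layout created by the download script above) that extracts every shard in place:

# Extract all "shard-*.tar" files under datasets/ (same effect as the find/tar command).
import tarfile
from pathlib import Path

for shard in Path("datasets").rglob("shard-*.tar"):
    with tarfile.open(shard) as tar:
        tar.extractall(path=shard.parent)    # unpack next to the shard file
    print(f"extracted {shard}")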

3. Prepare Pretrained Models of FAR:

We have uploaded the pretrained FAR models to Hugging Face. Please follow the instructions below to download them if you want to evaluate FAR.

from huggingface_hub import snapshot_download, hf_hub_download

snapshot_download(
    repo_id="guyuchao/FAR_Models",
    repo_type="model",
    local_dir="experiments/pretrained_models/FAR_Models",
    token="input your hf token here"
)
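If you only need a subset of the checkpoints, snapshot_download also accepts an allow_patterns filter. The pattern below is purely an example and assumes the checkpoint filenames contain the model name; check the model page for the actual filenames:

from huggingface_hub import snapshot_download

# Download only FAR-B checkpoints (the "*FAR_B*" pattern is an assumption about
# how files are named in the repo; adjust it after browsing guyuchao/FAR_Models).
snapshot_download(
    repo_id="guyuchao/FAR_Models",
    repo_type="model",
    local_dir="experiments/pretrained_models/FAR_Models",
    allow_patterns=["*FAR_B*"],
    token="input your hf token here"
)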

🚀 Training

To train different models, you can run the following command:

accelerate launch \
    --num_processes 8 \
    --num_machines 1 \
    --main_process_port 19040 \
    train.py \
    -opt train_config.yml
  • Wandb: Set use_wandb to True in the config to enable wandb monitoring.
  • Periodic Evaluation: Set val_freq to control how often evaluation runs during training.
  • Auto Resume: Simply rerun the script; training resumes from the latest checkpoint and the wandb log resumes automatically.
  • Efficient Training on Pre-Extracted Latents: Set use_latent to True and point data_list to the corresponding latent path list.

💻 Sampling & Evaluation

To evaluate a pretrained model, copy the training config and set pretrain_network (currently ~) to the path of your trained model. Then run the following script:

accelerate launch \
    --num_processes 8 \
    --num_machines 1 \
    --main_process_port 10410 \
    test.py \
    -opt test_config.yml

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

📖 Citation

If our work assists your research, feel free to give us a star ⭐ or cite us using:

@article{gu2025long,
    title={Long-Context Autoregressive Video Modeling with Next-Frame Prediction},
    author={Gu, Yuchao and Mao, Weijia and Shou, Mike Zheng},
    journal={arXiv preprint arXiv:2503.19325},
    year={2025}
}