MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision

MoGe is a powerful model for recovering 3D geometry from monocular open-domain images. The model consists of a ViT encoder and a convolutional decoder. It directly predicts an affine-invariant point map as well as a mask that excludes regions with undefined geometry (e.g., sky), from which the camera shift, camera focal length and depth map can be further derived.
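The sketch below illustrates how a focal length and a shift can be derived from an affine-invariant point map. It is a simplified illustration, not MoGe's actual solver. It assumes two things: the predicted points differ from the true camera-space points by an unknown scale and an unknown z-shift t, and `uv` holds image coordinates centered on the principal point, so that u = f·x/(z + t). For a fixed t, the least-squares focal has the closed form f = (a·b)/(a·a), so the sketch simply grid-searches t. The function name `recover_focal_shift_grid` and the search grid are hypothetical.

```python
import numpy as np


def recover_focal_shift_grid(points, uv, shifts):
    """Estimate focal f and z-shift t such that uv ~= f * points[:, :2] / (points[:, 2] + t).

    points: (N, 3) affine-invariant points (unknown z-shift).
    uv:     (N, 2) centered image coordinates of the same pixels.
    shifts: 1-D array of candidate shifts (hypothetical grid search).
    """
    best_f, best_t, best_err = None, None, np.inf
    for t in shifts:
        z = points[:, 2] + t
        if np.any(z <= 0):
            continue  # skip shifts that would put points behind the camera
        a = (points[:, :2] / z[:, None]).ravel()  # projections for unit focal
        b = uv.ravel()
        f = a @ b / (a @ a)  # closed-form least-squares focal for this shift
        err = np.sum((f * a - b) ** 2)
        if err < best_err:
            best_f, best_t, best_err = f, t, err
    return best_f, best_t


# Synthetic check: true focal 1.5, prediction offset by an unknown z-shift of 0.8.
rng = np.random.default_rng(0)
true_pts = np.column_stack([rng.uniform(-1, 1, 200), rng.uniform(-1, 1, 200), rng.uniform(2, 5, 200)])
uv = 1.5 * true_pts[:, :2] / true_pts[:, 2:3]
pred = true_pts - np.array([0.0, 0.0, 0.8])
f, t = recover_focal_shift_grid(pred, uv, np.linspace(0.0, 2.0, 201))
```

A real solver would refine t continuously rather than on a fixed grid. The closed-form focal reduces the problem to a 1-D search over t either way.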

Check our website for videos and interactive results!

Features

- Accurately estimates 3D geometry from a single image, as a point map or a mesh.
- Supports various image resolutions and aspect ratios, from 2:1 to 1:2.
- Handles a wide depth range: the ratio between the farthest and nearest distances can reach 1000x.
- Fast inference: typically 0.2 s per image on an A100 or RTX 3090 GPU.

TODO List

- Release inference code & ViT-Large model.
- Release ViT-Base and ViT-Giant models.
- Release evaluation and training code.

🌟Updated on 2024/11/28 - CHANGELOG:

- Added support for a user-provided camera FOV.
- Added the script scripts/infer_panorama.py for panorama images.

Usage

Prerequisite

Clone this repository:

```
git clone https://github.com/microsoft/MoGe.git
cd MoGe
```

Python (>= 3.10) environment:
- torch (>= 2.0) and torchvision (a version compatible with your torch installation).
- Other requirements:
```
pip install -r requirements.txt
```
MoGe should be compatible with most versions of these requirements. If you have concerns, check requirements.txt for details.

Pretrained model

The ViT-Large model has been uploaded to Hugging Face hub at Ruicheng/moge-vitl. You may load the model via MoGeModel.from_pretrained("Ruicheng/moge-vitl") without manually downloading.

If you prefer to load the model from a local file, download it manually from the Hugging Face Hub and load it via MoGeModel.from_pretrained("PATH_TO_LOCAL_MODEL.pt").

Minimal example

Here is a minimal example for loading the model and inferring on a single image.

```
import cv2
import torch
from moge.model import MoGeModel

device = torch.device("cuda")

# Load the model from the Hugging Face Hub (or from a local file).
model = MoGeModel.from_pretrained("Ruicheng/moge-vitl").to(device)

# Read the input image, convert it to a (3, H, W) tensor and normalize it to [0, 1].
input_image = cv2.cvtColor(cv2.imread("PATH_TO_IMAGE.jpg"), cv2.COLOR_BGR2RGB)
input_image = torch.tensor(input_image / 255, dtype=torch.float32, device=device).permute(2, 0, 1)

# Infer
output = model.infer(input_image)
# `output` has keys "points", "depth", "mask" and "intrinsics".
# The maps have the same size as the input image.
# {
#     "points": (H, W, 3),    # scale-invariant point map in OpenCV camera coordinate system (x right, y down, z forward)
#     "depth": (H, W),        # scale-invariant depth map
#     "mask": (H, W),         # a binary mask for valid pixels
#     "intrinsics": (3, 3),   # normalized camera intrinsics
# }
# For more usage details, see the `MoGeModel.infer` docstring.
```
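The `intrinsics` output is normalized. The sketch below recovers pixel-space intrinsics and the field of view from it. It assumes that the first row (fx, cx) is divided by the image width, the second row (fy, cy) by the image height, and that the principal point sits at (0.5, 0.5). Please verify this convention against the `MoGeModel.infer` docstring. The helper names are hypothetical. Convert the tensor first, e.g. with `output["intrinsics"].cpu().numpy()`.

```python
import numpy as np


def denormalize_intrinsics(K_norm, width, height):
    """Convert normalized intrinsics (assumed: row 0 scaled by 1/width, row 1 by 1/height) to pixels."""
    K = np.array(K_norm, dtype=np.float64)
    K[0, :] *= width   # fx, skew, cx back to pixel units
    K[1, :] *= height  # fy, cy back to pixel units
    return K


def fov_from_normalized_intrinsics(K_norm):
    """Horizontal and vertical FOV in degrees, assuming the principal point is at the image center."""
    fov_x = 2 * np.degrees(np.arctan(0.5 / K_norm[0][0]))
    fov_y = 2 * np.degrees(np.arctan(0.5 / K_norm[1][1]))
    return fov_x, fov_y


# Example: a 640x480 image with a 320-pixel focal length in both directions.
K_norm = np.array([[0.5, 0.0, 0.5], [0.0, 0.5 * 640 / 480, 0.5], [0.0, 0.0, 1.0]])
K_px = denormalize_intrinsics(K_norm, 640, 480)
fov_x, fov_y = fov_from_normalized_intrinsics(K_norm)
```

Under the same assumption, the `--fov_x` option of scripts/infer.py corresponds to the inverse relation fx = 0.5 / tan(fov_x / 2).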

Using scripts/app.py for a web demo

Make sure that gradio is installed and then run the following command to start the web demo:

```
python scripts/app.py   # --share for Gradio public sharing
```

The web demo is also available at our Hugging Face space.

Using scripts/infer.py

Run the script scripts/infer.py via the following command:

```
# Save the output [maps], [glb] and [ply] files
python scripts/infer.py --input IMAGES_FOLDER_OR_IMAGE_PATH --output OUTPUT_FOLDER --maps --glb --ply

# Show the result in a window (requires pyglet < 2.0, e.g. pip install pyglet==1.5.29)
python scripts/infer.py --input IMAGES_FOLDER_OR_IMAGE_PATH --output OUTPUT_FOLDER --show
```

For detailed options, run python scripts/infer.py --help:

```
Usage: infer.py [OPTIONS]

  Inference script for the MoGe model.

Options:
  --input PATH                Input image or folder path. "jpg" and "png" are
                              supported.
  --fov_x FLOAT               If camera parameters are known, set the
                              horizontal field of view in degrees. Otherwise,
                              MoGe will estimate it.
  --output PATH               Output folder path
  --pretrained TEXT           Pretrained model name or path. Default is
                              "Ruicheng/moge-vitl"
  --device TEXT               Device name (e.g. "cuda", "cuda:0", "cpu").
                              Default is "cuda"
  --resize INTEGER            Resize the image(s) & output maps to a specific
                              size. Default is None (no resizing).
  --resolution_level INTEGER  An integer [0-9] for the resolution level of
                              inference. The higher, the better but slower.
                              Default is 9. Note that it is irrelevant to the
                              output resolution.
  --threshold FLOAT           Threshold for removing edges. Default is 0.03.
                              Smaller value removes more edges. "inf" means no
                              thresholding.
  --maps                      Whether to save the output maps and fov(image,
                              depth, mask, points, fov).
  --glb                       Whether to save the output as a.glb file. The
                              color will be saved as a texture.
  --ply                       Whether to save the output as a.ply file. The
                              color will be saved as vertex colors.
  --show                      Whether show the output in a window. Note that
                              this requires pyglet<2 installed as required by
                              trimesh.
  --help                      Show this message and exit.
```
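The `--threshold` option removes pixels that lie on depth discontinuities. The sketch below shows one way to implement such a relative-depth edge test: a pixel is flagged when the depth range in its 3×3 neighbourhood, divided by its own depth, exceeds the threshold. This is an assumed approximation, not necessarily the exact rule scripts/infer.py applies, and `depth_edge_mask` is a hypothetical name. It does reproduce the documented behaviour that a smaller threshold removes more edges and that "inf" disables the test.

```python
import numpy as np


def depth_edge_mask(depth, threshold=0.03, kernel=3):
    """Flag pixels whose local depth range, relative to their own depth, exceeds `threshold`."""
    if np.isinf(threshold):
        return np.zeros_like(depth, dtype=bool)  # "inf" disables edge removal
    r = kernel // 2
    padded = np.pad(depth, r, mode="edge")
    h, w = depth.shape
    # Stack the shifted copies of the depth map that make up each kernel x kernel window.
    windows = np.stack([
        padded[dy:dy + h, dx:dx + w]
        for dy in range(kernel)
        for dx in range(kernel)
    ])
    local_range = windows.max(axis=0) - windows.min(axis=0)
    return local_range / depth > threshold


# A step edge between columns 2 and 3 of a toy depth map.
depth = np.ones((5, 5))
depth[:, 3:] = 2.0
edges = depth_edge_mask(depth, threshold=0.03)
keep = ~edges  # in practice, combine with MoGe's validity mask: mask & ~edges
```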

Using scripts/infer_panorama.py for 360° panorama images

NOTE: This is an experimental extension of MoGe.

The script splits the 360° panorama image into multiple perspective views, runs inference on each view separately, and combines the output maps into a panorama depth map and point map.

Note that the panorama image must use a spherical parameterization (e.g., an environment map or an equirectangular image). Other formats must be converted to this spherical format before using this script. Run python scripts/infer_panorama.py --help for detailed options.
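Splitting a panorama into perspective views can be done by casting a ray through each pixel of a virtual pinhole camera, rotating that ray by the view's yaw, and looking up the matching longitude/latitude in the equirectangular image. The sketch below keeps things short by using yaw-only rotation and nearest-neighbour sampling. The view layout, angles, and the name `perspective_from_equirect` are illustrative assumptions, not what scripts/infer_panorama.py actually does.

```python
import numpy as np


def perspective_from_equirect(pano, yaw_deg, fov_deg, size):
    """Sample a square perspective view (size x size) looking at `yaw_deg` from an equirectangular image."""
    H, W = pano.shape[:2]
    f = 0.5 * size / np.tan(np.radians(fov_deg) / 2)  # focal length in pixels
    # Pixel centers of the virtual camera, OpenCV convention (x right, y down, z forward).
    u, v = np.meshgrid(np.arange(size) + 0.5, np.arange(size) + 0.5)
    x = (u - size / 2) / f
    y = (v - size / 2) / f
    z = np.ones_like(x)
    # Rotate the rays about the vertical (y) axis by the view's yaw.
    yaw = np.radians(yaw_deg)
    xr = np.cos(yaw) * x + np.sin(yaw) * z
    zr = -np.sin(yaw) * x + np.cos(yaw) * z
    lon = np.arctan2(xr, zr)                              # [-pi, pi], 0 = panorama center
    lat = np.arcsin(-y / np.sqrt(xr**2 + y**2 + zr**2))  # [-pi/2, pi/2], positive = up
    # Map spherical angles to equirectangular pixel indices (nearest neighbour).
    col = np.clip(((lon / (2 * np.pi) + 0.5) * W).astype(int), 0, W - 1)
    row = np.clip(((0.5 - lat / np.pi) * H).astype(int), 0, H - 1)
    return pano[row, col]


# Four horizontal 90-degree views from a placeholder panorama.
pano = np.zeros((512, 1024, 3), dtype=np.uint8)
views = [perspective_from_equirect(pano, yaw, 90, 256) for yaw in (0, 90, 180, 270)]
```

Four yaw-only views like these leave the poles uncovered. A full split would also add tilted views, and merging the per-view predictions back requires the inverse mapping.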


MoGrammetry: Integrating MoGe with COLMAP for Enhanced 3D Reconstruction

We’ve extended MoGe by combining it with a classical Structure-from-Motion (SfM) pipeline such as COLMAP. This hybrid workflow pairs MoGe’s dense monocular geometry estimation with COLMAP’s robust camera pose estimation, with the aim of producing faster and more complete 3D reconstructions from image sets.

Overview

MoGrammetry merges single-image 3D point maps generated by MoGe with camera poses and intrinsics recovered by COLMAP. By doing so, we achieve a dense, consistent 3D scene representation without relying solely on multi-view stereo. This pipeline can be particularly beneficial when:

- You have a sequence of images aligned and registered by COLMAP.
- You want to quickly generate dense and consistent point clouds from each image using MoGe.
- You aim to integrate those point clouds into a unified 3D model with minimal manual intervention.

Steps

1. Run COLMAP or Metashape (COLMAP export) to obtain camera parameters:
   Use COLMAP (or export from Metashape in COLMAP-compatible format) to compute camera poses and intrinsics. You should have:
   - images.txt and cameras.txt (and optionally points3D.txt) in the standard COLMAP text format.
   - A set of aligned and (optionally) converted images.
2. MoGe inference per image:
   For each input image, run MoGe’s model.infer to get:
   - A dense point map in camera coordinates (predicted as affine-invariant; model.infer returns it scale-invariant).
   - A mask to filter out sky and other undefined geometry.
   - Intrinsics and a scale-invariant depth map, if needed.
3. Alignment & fusion:
   - Parse COLMAP’s camera parameters and poses.
   - Adjust MoGe’s output to match the COLMAP camera model, resolve its unknown scale, and transform each image’s points into the global (COLMAP world) coordinate system.
   - Discard sky pixels and outliers, then merge the resulting point clouds.
   - Optionally apply outlier removal and meshing techniques for a clean, unified 3D model.
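The alignment step can be sketched as follows. The assumptions here are ours, not the repository's, and scripts/colmap_integration.py may do this differently; the function names are hypothetical. First, MoGe's point map is taken to be in the OpenCV camera frame (x right, y down, z forward), which matches COLMAP's camera convention. Second, MoGe's unknown scale is fixed by the median ratio between COLMAP depths and MoGe depths at shared pixels, for example keypoints with known 3D points. Third, COLMAP's pose maps world to camera as X_cam = R X_world + t, so X_world = Rᵀ(X_cam − t).

```python
import numpy as np


def estimate_scale(moge_depth_at_kp, colmap_depth_at_kp):
    """Median ratio between COLMAP depths and scale-invariant MoGe depths at shared pixels."""
    valid = (moge_depth_at_kp > 0) & (colmap_depth_at_kp > 0)
    return float(np.median(colmap_depth_at_kp[valid] / moge_depth_at_kp[valid]))


def camera_points_to_world(points_cam, R, t, scale, mask=None):
    """Scale MoGe camera-space points and map them into COLMAP world coordinates.

    points_cam: (H, W, 3) or (N, 3) points; R: (3, 3) world-to-camera rotation; t: (3,) translation.
    """
    pts = points_cam.reshape(-1, 3)
    if mask is not None:
        pts = pts[mask.reshape(-1)]  # drop sky / undefined pixels
    pts = pts * scale
    # X_world = R^T (X_cam - t); with row vectors this is (X_cam - t) @ R.
    return (pts - t) @ R


# Synthetic check: world points seen by a camera rotated 30 degrees about z.
rng = np.random.default_rng(1)
X_world = rng.uniform([-1, -1, 2], [1, 1, 5], size=(100, 3))
a = np.radians(30)
R = np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]])
t = np.array([1.0, 2.0, 3.0])
X_cam = X_world @ R.T + t      # X_cam = R X_world + t
moge_points = 0.5 * X_cam      # MoGe is correct only up to scale
s = estimate_scale(moge_points[:, 2], X_cam[:, 2])
recovered = camera_points_to_world(moge_points, R, t, s)
```

The median makes the scale estimate tolerant of a few bad correspondences. A per-image affine fit in depth is a possible alternative if MoGe's depth ordering disagrees with COLMAP near the far range.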

Example Code

Please refer to the newly added Python script scripts/colmap_integration.py in this repository for a working example. The script demonstrates:

- Parsing COLMAP’s cameras.txt and images.txt.
- Running MoGe inference on each image.
- Aligning and merging the resulting point clouds.
- Saving a final .ply file of the reconstructed scene.
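For reference, COLMAP's text format stores one camera per line in cameras.txt (CAMERA_ID, MODEL, WIDTH, HEIGHT, PARAMS[]). images.txt uses two lines per image: IMAGE_ID, QW, QX, QY, QZ, TX, TY, TZ, CAMERA_ID, NAME, followed by a line of 2D points that may be empty. The minimal parser below is a sketch written independently of scripts/colmap_integration.py. As a simplifying assumption, it only handles a few pinhole-like camera models and ignores lens distortion.

```python
import numpy as np

SINGLE_FOCAL = {"SIMPLE_PINHOLE", "SIMPLE_RADIAL", "RADIAL"}  # params start with f, cx, cy
DUAL_FOCAL = {"PINHOLE", "OPENCV", "FULL_OPENCV"}             # params start with fx, fy, cx, cy


def parse_cameras_txt(text):
    """Return {camera_id: (width, height, K)} with a 3x3 pixel-space intrinsic matrix (distortion ignored)."""
    cameras = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        elems = line.split()
        cam_id, model = int(elems[0]), elems[1]
        width, height = int(elems[2]), int(elems[3])
        params = [float(p) for p in elems[4:]]
        if model in SINGLE_FOCAL:
            fx = fy = params[0]
            cx, cy = params[1], params[2]
        elif model in DUAL_FOCAL:
            fx, fy, cx, cy = params[:4]
        else:
            raise ValueError(f"camera model {model} is not handled in this sketch")
        cameras[cam_id] = (width, height, np.array([[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]))
    return cameras


def qvec_to_rotmat(qw, qx, qy, qz):
    """Rotation matrix from a unit quaternion (w, x, y, z), as stored by COLMAP."""
    return np.array([
        [1 - 2 * (qy**2 + qz**2), 2 * (qx * qy - qz * qw), 2 * (qx * qz + qy * qw)],
        [2 * (qx * qy + qz * qw), 1 - 2 * (qx**2 + qz**2), 2 * (qy * qz - qx * qw)],
        [2 * (qx * qz - qy * qw), 2 * (qy * qz + qx * qw), 1 - 2 * (qx**2 + qy**2)],
    ])


def parse_images_txt(text):
    """Return {image_name: (camera_id, R, t)} where X_cam = R @ X_world + t."""
    lines = [ln for ln in text.splitlines() if not ln.startswith("#")]
    images = {}
    i = 0
    while i < len(lines):
        elems = lines[i].split()
        if not elems:  # tolerate stray blank lines between entries
            i += 1
            continue
        qw, qx, qy, qz, tx, ty, tz = map(float, elems[1:8])
        camera_id, name = int(elems[8]), " ".join(elems[9:])
        images[name] = (camera_id, qvec_to_rotmat(qw, qx, qy, qz), np.array([tx, ty, tz]))
        i += 2  # skip the POINTS2D line that follows every image line
    return images


cams = parse_cameras_txt("# Camera list\n1 PINHOLE 640 480 500 500 320 240\n")
imgs = parse_images_txt(
    "# Image list\n"
    "1 1 0 0 0 0.5 0 2 1 frame_0001.jpg\n"
    "\n"
    "2 0.7071067811865476 0 0 0.7071067811865476 0 0 0 1 frame_0002.jpg\n"
    "100.0 200.0 -1\n"
)
```

The camera center of each image, if needed, is −Rᵀt.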

Requirements

In addition to MoGe’s prerequisites, you’ll need:

- COLMAP for camera alignment and pose estimation.
- Open3D or another library for point cloud processing and merging.
- A compatible Python environment (as described in MoGe’s prerequisites).
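Open3D offers voxel downsampling and statistical outlier removal for the merging step. If you would rather avoid that dependency, the NumPy sketch below approximates the merge: it concatenates the per-image clouds, keeps one averaged point and color per voxel, and writes the result as an ASCII PLY with vertex colors. The voxel size and the function names are illustrative assumptions.

```python
import io

import numpy as np


def merge_voxel(clouds, colors, voxel_size):
    """Concatenate point clouds and average all points (and colors) falling into the same voxel."""
    pts = np.concatenate(clouds, axis=0)
    cols = np.concatenate(colors, axis=0).astype(np.float64)
    keys = np.floor(pts / voxel_size).astype(np.int64)
    # inverse[i] is the index of the voxel that point i belongs to.
    _, inverse, counts = np.unique(keys, axis=0, return_inverse=True, return_counts=True)
    inverse = inverse.reshape(-1)  # some NumPy versions return a non-1-D inverse with axis=0
    merged_pts = np.zeros((len(counts), 3))
    merged_cols = np.zeros((len(counts), 3))
    np.add.at(merged_pts, inverse, pts)
    np.add.at(merged_cols, inverse, cols)
    return merged_pts / counts[:, None], (merged_cols / counts[:, None]).round().astype(np.uint8)


def write_ply_ascii(f, points, colors):
    """Write an ASCII PLY with float x/y/z and uchar RGB per vertex."""
    f.write("ply\nformat ascii 1.0\n")
    f.write(f"element vertex {len(points)}\n")
    f.write("property float x\nproperty float y\nproperty float z\n")
    f.write("property uchar red\nproperty uchar green\nproperty uchar blue\n")
    f.write("end_header\n")
    for (x, y, z), (r, g, b) in zip(points, colors):
        f.write(f"{x:.6f} {y:.6f} {z:.6f} {r} {g} {b}\n")


# Two points share one 5 cm voxel; a third lies far away.
cloud_a = np.array([[0.01, 0.01, 0.01], [0.02, 0.02, 0.02]])
cloud_b = np.array([[1.0, 1.0, 1.0]])
merged, merged_colors = merge_voxel(
    [cloud_a, cloud_b], [np.full((2, 3), 200), np.full((1, 3), 10)], voxel_size=0.05
)
buf = io.StringIO()  # use open("scene.ply", "w") to write to disk instead
write_ply_ascii(buf, merged, merged_colors)
```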

By combining MoGe’s single-view geometry with COLMAP’s robust camera alignment, MoGrammetry aims to streamline and accelerate your image-based 3D reconstruction workflow.

License

MoGe code is released under the MIT license, except for DINOv2 code in moge/model/dinov2 which is released by Meta AI under the Apache 2.0 license. See LICENSE for more details.

Citation

If you find our work useful in your research, we gratefully request that you consider citing our paper:

```
@misc{wang2024moge,
    title={MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision},
    author={Wang, Ruicheng and Xu, Sicheng and Dai, Cassie and Xiang, Jianfeng and Deng, Yu and Tong, Xin and Yang, Jiaolong},
    year={2024},
    eprint={2410.19115},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2410.19115},
}
```
