Lihe Yang1 · Bingyi Kang2+ · Zilong Huang2 · Xiaogang Xu3,4 · Jiashi Feng2 · Hengshuang Zhao1+
1The University of Hong Kong · 2TikTok · 3Zhejiang Lab · 4Zhejiang University
+corresponding authors
This work presents Depth Anything, a highly practical solution for robust monocular depth estimation by training on a combination of 1.5M labeled images and 62M+ unlabeled images.
- 2024-01-25: Support video depth visualization.
- 2024-01-23: The new ControlNet based on Depth Anything is integrated into ControlNet WebUI and ComfyUI's ControlNet.
- 2024-01-23: Depth Anything ONNX and TensorRT versions are supported.
- 2024-01-22: Paper, project page, code, models, and demo (HuggingFace, OpenXLab) are released.
- Relative depth estimation: Our foundation models listed here robustly provide relative depth estimation for any given image. Please refer here for details.
- Metric depth estimation: We fine-tune our Depth Anything model with metric depth information from NYUv2 or KITTI. The fine-tuned models offer strong in-domain and zero-shot metric depth estimation. Please refer here for details.
- Better depth-conditioned ControlNet: We re-train a better depth-conditioned ControlNet based on Depth Anything. It offers more precise synthesis than the previous MiDaS-based ControlNet. Please refer here for details. You can also use our new ControlNet based on Depth Anything in ControlNet WebUI or ComfyUI's ControlNet.
- Downstream high-level scene understanding: The Depth Anything encoder can be fine-tuned for downstream high-level perception tasks such as semantic segmentation, where it reaches 86.2 mIoU on Cityscapes and 59.4 mIoU on ADE20K. Please refer here for details.
Here we compare Depth Anything with the previous best model, MiDaS v3.1 BEiTL-512.
Please note that the latest MiDaS is also trained on KITTI and NYUv2, while Depth Anything is not.
Method | Params | KITTI AbsRel | KITTI δ1 | NYUv2 AbsRel | NYUv2 δ1 | Sintel AbsRel | Sintel δ1 | DDAD AbsRel | DDAD δ1 | ETH3D AbsRel | ETH3D δ1 | DIODE AbsRel | DIODE δ1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MiDaS | 345.0M | 0.127 | 0.850 | 0.048 | 0.980 | 0.587 | 0.699 | 0.251 | 0.766 | 0.139 | 0.867 | 0.075 | 0.942 |
Ours-S | 24.8M | 0.080 | 0.936 | 0.053 | 0.972 | 0.464 | 0.739 | 0.247 | 0.768 | 0.127 | 0.885 | 0.076 | 0.939 |
Ours-B | 97.5M | 0.080 | 0.939 | 0.046 | 0.979 | 0.432 | 0.756 | 0.232 | 0.786 | 0.126 | 0.884 | 0.069 | 0.946 |
Ours-L | 335.3M | 0.076 | 0.947 | 0.043 | 0.981 | 0.458 | 0.760 | 0.230 | 0.789 | 0.127 | 0.882 | 0.066 | 0.952 |
We highlight the best and second-best results in bold and italic, respectively (better results: lower AbsRel, higher δ1).
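The table reports two standard depth metrics per dataset: absolute relative error (AbsRel, lower is better) and threshold accuracy δ1 (higher is better). As a rough illustration, the sketch below implements their commonly used definitions in plain Python on flat lists of depth values. The function names are illustrative, and the alignment and masking steps of the actual evaluation protocol are not reproduced here.

```python
def _valid_pairs(pred, gt):
    """Keep only pixels where both prediction and ground truth are positive."""
    return [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]


def abs_rel(pred, gt):
    """Absolute relative error: mean(|pred - gt| / gt) over valid pixels."""
    pairs = _valid_pairs(pred, gt)
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta1(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred / gt, gt / pred) < threshold."""
    pairs = _valid_pairs(pred, gt)
    hits = sum(1 for p, g in pairs if max(p / g, g / p) < threshold)
    return hits / len(pairs)


# Toy example with four "pixels" (made-up values, not from any dataset).
gt = [1.0, 2.0, 4.0, 10.0]
pred = [1.1, 2.0, 3.0, 10.5]
print(round(abs_rel(pred, gt), 4))  # 0.1
print(delta1(pred, gt))             # 0.75
```

In the toy example, three of the four predictions fall within a factor of 1.25 of the ground truth, which gives δ1 = 0.75.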
We provide three models of varying scales for robust relative depth estimation:
Model | Params | V100 inference time (ms) | A100 inference time (ms) | RTX4090 (TensorRT) inference time (ms) |
---|---|---|---|---|
Depth-Anything-Small | 24.8M | 12 | 8 | 3 |
Depth-Anything-Base | 97.5M | 13 | 9 | 6 |
Depth-Anything-Large | 335.3M | 20 | 13 | 12 |
Note that the V100 and A100 inference times (without TensorRT) exclude the pre-processing and post-processing stages, whereas the RTX4090 times in the last column (with TensorRT) include both stages (please refer to Depth-Anything-TensorRT).
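To make the distinction concrete, here is a minimal timing sketch that measures only the forward call, so any pre-processing and post-processing stays outside the measurement. It is not the benchmarking script behind the table; `mean_latency_ms`, `fps`, and the `sync` argument are hypothetical names chosen for illustration. On a GPU you would typically pass `torch.cuda.synchronize` as `sync` so that queued work is finished before the clock is read.

```python
import time


def mean_latency_ms(model_fn, model_input, warmup=3, runs=10, sync=None):
    """Average wall-clock time of model_fn(model_input) in milliseconds.

    Only the forward call is timed. `sync` is an optional callable that
    blocks until pending device work has finished.
    """
    for _ in range(warmup):  # warm-up runs are not timed
        model_fn(model_input)
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(runs):
        model_fn(model_input)
    if sync is not None:
        sync()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / runs


def fps(latency_ms):
    """Convert a per-image latency in milliseconds to images per second."""
    return 1000.0 / latency_ms


# Stand-in "model" so the sketch runs without a GPU or any weights.
print(mean_latency_ms(sum, list(range(1000))))
print(round(fps(12), 1))  # a 12 ms latency corresponds to 83.3 images per second
```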
You can easily load our pre-trained models by:
from depth_anything.dpt import DepthAnything
encoder = 'vits' # can also be 'vitb' or 'vitl'
depth_anything = DepthAnything.from_pretrained('LiheYoung/depth_anything_{:}14'.format(encoder))
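If the encoder name comes from user input, it can help to validate it before building the model identifier. The helper below is a small sketch of that idea; `hub_id` and `ENCODERS` are illustrative names, not part of the `depth_anything` package, and the identifier pattern is taken from the snippet above.

```python
ENCODERS = ("vits", "vitb", "vitl")


def hub_id(encoder):
    """Build the model identifier for a given encoder name."""
    if encoder not in ENCODERS:
        raise ValueError(
            "unknown encoder {!r}, expected one of {}".format(encoder, ENCODERS)
        )
    return "LiheYoung/depth_anything_{}14".format(encoder)


for name in ENCODERS:
    print(hub_id(name))
```

The returned string can then be passed to `DepthAnything.from_pretrained`.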
Depth Anything is also supported in transformers. You can use it for depth prediction within 3 lines of code (credit to @niels).
If the models cannot be downloaded automatically (for example, because your server has no network connection), you can load them from local files:
- First, manually download our models (both the config and checkpoint files) from here: depth-anything-small, depth-anything-base, and depth-anything-large.
- Second, upload the folder that contains the config and checkpoint files to your remote server.
- Lastly, load the model locally by:
# suppose the config and checkpoint files are stored under the folder checkpoints/depth_anything_vitb14
depth_anything = DepthAnything.from_pretrained('checkpoints/depth_anything_vitb14', local_files_only=True)
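Before loading from a local folder, a quick check that the expected files were actually uploaded can save a confusing error later. The sketch below assumes the folder should contain `config.json` and `pytorch_model.bin`; these file names are an assumption, so adjust them to match what you downloaded. `missing_files` is a hypothetical helper, not part of the repository.

```python
import os
import tempfile

# Assumed file names; change them if your download uses different ones.
REQUIRED_FILES = ("config.json", "pytorch_model.bin")


def missing_files(folder, required=REQUIRED_FILES):
    """Return the required file names that are not present in `folder`."""
    return [
        name for name in required
        if not os.path.isfile(os.path.join(folder, name))
    ]


# Demo on a temporary folder that only contains the config file.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "config.json"), "w").close()
    print(missing_files(tmp))  # ['pytorch_model.bin']
```

An empty list means the folder looks complete and can be passed to `from_pretrained` with `local_files_only=True`.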
git clone https://github.com/LiheYoung/Depth-Anything
cd Depth-Anything
pip install -r requirements.txt
python run.py --encoder <vits | vitb | vitl> --img-path <img-directory | single-img | txt-file> --outdir <outdir>
For the img-path argument, you can either 1) point it to a directory containing all images of interest, 2) point it to a single image, or 3) point it to a text file listing all image paths.
For example:
python run.py --encoder vitl --img-path assets/examples --outdir depth_vis
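The three accepted forms of the image path can be resolved into one list of files with a few lines of standard-library code. The function below is a sketch of that logic under the rules described above, not a copy of what `run.py` does; `resolve_image_paths` is a hypothetical name, and the `.txt` suffix check is an assumption about how a path list is recognized.

```python
import os
import tempfile


def resolve_image_paths(img_path):
    """Turn a directory, a single image, or a .txt list into a list of paths."""
    if os.path.isfile(img_path):
        if img_path.endswith(".txt"):
            # Text file: one image path per line, blank lines ignored.
            with open(img_path, "r") as f:
                return [line.strip() for line in f if line.strip()]
        return [img_path]  # single image
    if os.path.isdir(img_path):
        # Directory: every non-hidden entry, in a stable order.
        return sorted(
            os.path.join(img_path, name)
            for name in os.listdir(img_path)
            if not name.startswith(".")
        )
    raise FileNotFoundError(img_path)


# Demo with a temporary directory holding two empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.png", "a.png"):
        open(os.path.join(tmp, name), "w").close()
    print([os.path.basename(p) for p in resolve_image_paths(tmp)])  # ['a.png', 'b.png']
```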
If you want to use Depth Anything on videos:
python run_video.py --encoder vitl --video-path assets/examples_video --outdir video_depth_vis
To use our gradio demo locally:
python app.py
You can also try our online demo.
If you want to use Depth Anything in your own project, you can simply follow run.py to load our models and define data pre-processing.
Code snippet (note the difference between our data pre-processing and that of MiDaS):
from depth_anything.dpt import DepthAnything
from depth_anything.util.transform import Resize, NormalizeImage, PrepareForNet
from torchvision.transforms import Compose
import cv2
import torch

encoder = 'vits'  # can also be 'vitb' or 'vitl'
depth_anything = DepthAnything.from_pretrained('LiheYoung/depth_anything_{:}14'.format(encoder)).eval()

# Resize, normalize with ImageNet statistics, and convert to the network's input layout
transform = Compose([
    Resize(
        width=518,
        height=518,
        resize_target=False,
        keep_aspect_ratio=True,
        ensure_multiple_of=14,
        resize_method='lower_bound',
        image_interpolation_method=cv2.INTER_CUBIC,
    ),
    NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    PrepareForNet(),
])

# Read the image as RGB and scale pixel values to [0, 1]
image = cv2.cvtColor(cv2.imread('your image path'), cv2.COLOR_BGR2RGB) / 255.0
image = transform({'image': image})['image']
image = torch.from_numpy(image).unsqueeze(0)  # add a batch dimension

# depth shape: 1xHxW
with torch.no_grad():  # inference only, no gradients needed
    depth = depth_anything(image)
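The `Resize` arguments above determine the spatial size the network actually sees. The sketch below reproduces the size computation as we understand it for `keep_aspect_ratio=True`, `resize_method='lower_bound'`, and `ensure_multiple_of=14`: the image is scaled so that neither side falls below 518, and each side is then rounded to a multiple of 14. This is an approximation of the transform's behavior written for illustration, with hypothetical function names, not the library code itself.

```python
import math


def constrain_to_multiple(x, multiple, min_val):
    """Round x to the nearest multiple, rounding up if that falls below min_val."""
    y = round(x / multiple) * multiple
    if y < min_val:
        y = math.ceil(x / multiple) * multiple
    return int(y)


def target_size(width, height, target=518, multiple=14):
    """Return the (width, height) fed to the network for a given input size."""
    # 'lower_bound': use the larger scale so both sides are at least `target`.
    scale = max(target / width, target / height)
    new_w = constrain_to_multiple(scale * width, multiple, target)
    new_h = constrain_to_multiple(scale * height, multiple, target)
    return new_w, new_h


print(target_size(640, 480))    # (686, 518)
print(target_size(1920, 1080))  # (924, 518)
```

Both sides end up divisible by 14, which matches the `14` in the model names above.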
Easily use Depth Anything through transformers within 3 lines of code! Please refer to these instructions (credit to @niels).
A brief demo:
from transformers import pipeline
from PIL import Image
image = Image.open('Your-image-path')
pipe = pipeline(task="depth-estimation", model="LiheYoung/depth-anything-small-hf")
depth = pipe(image)["depth"]
We sincerely appreciate all the extensions built on Depth Anything by the community. Thank you very much!
Here we list the extensions we have found:
- Depth Anything ONNX: https://github.com/fabio-sim/Depth-Anything-ONNX
- Depth Anything TensorRT: https://github.com/spacewalk01/depth-anything-tensorrt
- Depth Anything in ControlNet WebUI: https://github.com/Mikubill/sd-webui-controlnet
- Depth Anything in ComfyUI's ControlNet: https://github.com/Fannovel16/comfyui_controlnet_aux
- Depth Anything in X-AnyLabeling: https://github.com/CVHub520/X-AnyLabeling
- Depth Anything in OpenXLab: https://openxlab.org.cn/apps/detail/yyfan/depth_anything
If you have a project that supports or improves Depth Anything (e.g., its speed), please feel free to open an issue. We will add it here.
We would like to express our deepest gratitude to AK(@_akhaliq) and the awesome HuggingFace team (@niels, @hysts, and @yuvraj) for helping improve the online demo and build the HF models.
Besides, we thank the MagicEdit team for providing some video examples for video depth estimation, and Tiancheng Shen for evaluating the depth maps with MagicEdit.
If you find this project useful, please consider citing:
@article{depthanything,
  title={Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data},
  author={Yang, Lihe and Kang, Bingyi and Huang, Zilong and Xu, Xiaogang and Feng, Jiashi and Zhao, Hengshuang},
  journal={arXiv:2401.10891},
  year={2024}
}