This repository contains the code associated with the ACE0 paper:
Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer
Eric Brachmann, Jamie Wynn, Shuai Chen, Tommaso Cavallari, Áron Monszpart, Daniyar Turmukhambetov, and Victor Adrian Prisacariu
ECCV 2024
For further information please visit:
This code uses PyTorch and has been tested on Ubuntu 20.04 with a V100 Nvidia GPU, although it should reasonably run with other Linux distributions and GPUs as well. Note our FAQ if you want to run ACE0 on GPUs with less memory.
We provide a pre-configured conda environment containing all dependencies required to run our code.
You can re-create and activate the environment with:
conda env create -f environment.yml
conda activate ace0
All the following commands in this file need to be run from the repository root and in the ace0 environment.
ACE0 represents a scene using an ACE scene coordinate regression model. In order to register cameras to the scene, it relies on the RANSAC implementation of the DSAC* paper (Brachmann and Rother, TPAMI 2021), which is written in C++. As such, you need to build and install the C++/Python bindings of those functions. You can do this with:
cd dsacstar
python setup.py install
cd ..
Having done the steps above, you are ready to experiment with ACE0!
Important note: the first time you run ACE0, the script may ask you to confirm that you are happy to download the ZoeDepth depth estimation code and its pretrained weights from GitHub. See this link for its license and details. ACE0 uses that model to estimate the depth for the seed images. The depth estimator can be replaced; please see the FAQ section below for details.
We explain how to run ACE0 to reconstruct images from scratch, with and without knowledge about the image intrinsics. We also explain how to use ACE0 to refine existing poses, or to initialise reconstruction with a subset of poses. Furthermore, we cover the visualization capabilities of ACE0, including export of the reconstruction as a video and as 3D models.
In the minimal case, you can run ACE0 on a set of images as defined by a glob pattern.
# running on a set of images with default parameters
python ace_zero.py "/path/to/some/images/*.jpg" result_folder
Note the quotes around the glob pattern to ensure it is passed to the ACE0 script rather than being expanded by the shell.
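To illustrate why the quoting matters, here is a minimal sketch of how a script can expand a glob pattern itself once it receives the pattern as a single string. The helper name expand_image_glob is hypothetical and not part of the ACE0 code; we only assume that the pattern is expanded inside Python, e.g. with the standard glob module.

```python
import glob
import os
import tempfile


def expand_image_glob(pattern):
    """Expand a glob pattern inside Python and return the matches in sorted order."""
    return sorted(glob.glob(pattern))


# Demonstration on a temporary folder with two images and one unrelated file.
with tempfile.TemporaryDirectory() as folder:
    for name in ("b.jpg", "a.jpg", "notes.txt"):
        open(os.path.join(folder, name), "w").close()
    matches = expand_image_glob(os.path.join(folder, "*.jpg"))
    demo_names = [os.path.basename(match) for match in matches]

print(demo_names)
```

Without the quotes, the shell would replace the pattern with the individual file names before the script starts, so the script would receive many arguments instead of one pattern.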
If you want to run ACE0 on a video, you can extract frames from the video and run ACE0 on the extracted frames, see our Utility Scripts.
The ACE0 script will call ACE training (train_ace.py) and camera registration (register_mapping.py) in a loop until all images have been registered to the scene representation, or there is no change between iterations.
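The control flow of this loop can be sketched as follows. This is not the code of ace_zero.py: the callables train and register are hypothetical stand-ins for the calls to train_ace.py and register_mapping.py, and we assume that "no change" means the set of registered images stays the same between two iterations.

```python
def incremental_reconstruction(images, seed_poses, train, register, max_iterations=50):
    """Alternate mapping and registration until all images are registered
    or the set of registered images stops changing (assumed stopping rule)."""
    registered = dict(seed_poses)
    for iteration in range(max_iterations):
        model = train(registered)                # stand-in for train_ace.py
        new_registered = register(model, images)  # stand-in for register_mapping.py
        converged = set(new_registered) == set(registered)
        registered = new_registered
        if converged or len(registered) == len(images):
            break
    return registered, iteration + 1


# Toy stand-ins: an image can be registered if it or a direct neighbour
# in the sequence is already part of the model.
def toy_train(registered):
    return set(registered)


def toy_register(model, images):
    result = {}
    for index, name in enumerate(images):
        neighbours = {images[j] for j in (index - 1, index, index + 1) if 0 <= j < len(images)}
        if neighbours & model:
            result[name] = "pose"
    return result


images = [f"img{i}" for i in range(5)]
poses, iterations = incremental_reconstruction(images, {"img0": "pose"}, toy_train, toy_register)
print(len(poses), iterations)
```

In the toy example, each iteration registers one more neighbouring image, so the loop ends as soon as all five images are covered.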
The result of an ACE0 reconstruction is the file poses_final.txt in the result folder.
This file contains the estimated image poses in the following format:
filename qw qx qy qz x y z focal_length confidence
filename is the path of the image file relative to the repository root.
qw qx qy qz is the camera rotation as a quaternion, and x y z is the camera translation.
Camera poses are world-to-camera transformations, using the OpenCV camera convention.
focal_length is the focal length estimated by ACE0 or set externally (see below).
confidence is the reliability of an estimate.
If the confidence is less than 1000, it should be treated as unreliable and possibly ignored.
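As an illustration of the format above, the following sketch parses pose lines and keeps only the estimates with a confidence of at least 1000. The names PoseEntry, parse_pose_line and reliable_poses are hypothetical helpers, not part of the repository, and the example lines are made up for demonstration.

```python
from dataclasses import dataclass


@dataclass
class PoseEntry:
    filename: str
    quaternion: tuple   # (qw, qx, qy, qz), world-to-camera rotation
    translation: tuple  # (x, y, z), world-to-camera translation
    focal_length: float
    confidence: float


def parse_pose_line(line):
    """Parse one line: filename qw qx qy qz x y z focal_length confidence."""
    # Split from the right so that the last nine tokens are the numeric fields.
    filename, *values = line.strip().rsplit(maxsplit=9)
    qw, qx, qy, qz, x, y, z, focal_length, confidence = map(float, values)
    return PoseEntry(filename, (qw, qx, qy, qz), (x, y, z), focal_length, confidence)


def reliable_poses(lines, min_confidence=1000.0):
    """Return the entries whose confidence reaches the threshold."""
    entries = [parse_pose_line(line) for line in lines if line.strip()]
    return [entry for entry in entries if entry.confidence >= min_confidence]


# Made-up example lines in the documented format.
example_lines = [
    "datasets/scene/frame_0001.jpg 1 0 0 0 0.0 0.0 0.0 525.0 2500.0",
    "datasets/scene/frame_0002.jpg 0.7071 0 0.7071 0 0.1 0.0 -0.3 525.0 120.0",
]
kept = reliable_poses(example_lines)
print([entry.filename for entry in kept])
```

To read an actual result, pass the lines of result_folder/poses_final.txt to reliable_poses.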
The pose files can be used e.g. to train a Nerfacto model, using our benchmarking scripts, see Benchmarking. Our benchmarking scripts also allow you to only convert our pose files to the format required by Nerfstudio, without running the benchmark itself.
Other content of the result folder explained.
The result folder will contain files such as the following:
iterationX.pt: The ACE scene model (the MLP network) at iteration X. Output of train_ace.py in iteration X.
iterationX.txt: Training statistics of the ACE model at iteration X, e.g. loss values, pose statistics, etc. See ace_trainer.py. Output of train_ace.py in iteration X.
poses_iterationX_preliminary.txt: Poses of cameras after the mapping iteration but before relocalization. Contains poses refined by the MLP, rather than poses re-estimated by RANSAC. Output of train_ace.py in iteration X.
poses_iterationX.txt: Final poses of iteration X, after relocalization, i.e. re-estimated by RANSAC. Output of register_mapping.py in iteration X.
poses_final.txt: The final poses of the images in the scene. Corresponds to the poses of the last relocalisation iteration, i.e. the output of the last register_mapping.py call.
Using default parameters, ACE0 will estimate the focal length of the images, starting from a heuristic value (70% of the image diagonal). If you have a better estimate of the focal length, you can provide it as an initialisation parameter.
# running ACE0 with an initial guess for the focal length
python ace_zero.py "/path/to/some/images/*.jpg" result_folder --use_external_focal_length <focal_length>
Using the call above, ACE0 will refine the focal length throughout the reconstruction process. If you are confident that your focal length value is correct, you can disable focal length refinement.
# running ACE0 with a fixed focal length
python ace_zero.py "/path/to/some/images/*.jpg" result_folder --use_external_focal_length <focal_length> --refine_calibration False
Note: The current implementation of ACE0 supports only a single focal length value shared by all images. ACE0 currently also assumes that the principal point is at the image center, and that pixels are square and unskewed. Changing these assumptions should be possible, but requires some implementation effort.
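The heuristic initialisation and the intrinsics assumptions described above can be written down in a few lines. This is only a sketch: the function names are hypothetical, and placing the principal point at exactly (width / 2, height / 2) is our assumption about what "image center" means.

```python
import math


def heuristic_focal_length(width, height, fraction=0.7):
    """Initial focal length guess in pixels: a fraction of the image diagonal."""
    return fraction * math.hypot(width, height)


def pinhole_intrinsics(focal_length, width, height):
    """3x3 calibration matrix with a single focal length (square, unskewed pixels)
    and the principal point at the image centre (assumed to be width/2, height/2)."""
    return [
        [focal_length, 0.0, width / 2.0],
        [0.0, focal_length, height / 2.0],
        [0.0, 0.0, 1.0],
    ]


# A 640x480 image has a diagonal of 800 pixels, so the guess is about 560 pixels.
focal = heuristic_focal_length(640, 480)
K = pinhole_intrinsics(focal, 640, 480)
print(round(focal, 3), K[0][2], K[1][2])
```

A value computed this way could also serve as a sanity check for a focal length you intend to pass via --use_external_focal_length.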
ACE0 can visualize the reconstruction process as a video.
# running ACE0 with visualisation enabled
python ace_zero.py "/path/to/some/images/*.jpg" result_folder --render_visualization True
With visualisation enabled, ACE0 will render individual frames in a subfolder renderings and call ffmpeg at the end.
The visualisation will be saved as a video in the results folder, named reconstruction.mp4.
Other content of the renderings folder explained.
frame_N.png: The Nth frame of the video.
iterationX_mapping.pkl: The visualisation buffer of the mapping call in iteration X. It stores the 3D point cloud of the scene, the last rendering camera for a smooth transition, and the last frame index.
iterationX_register.pkl: The visualisation buffer of the relocalization call in iteration X.
Note that enabling the visualisation will slow down the reconstruction considerably. Alternatively, you can run without visualisation enabled and export the final reconstruction as a 3D model, see Utility Scripts.
You can combine the ACE0 meta script with custom calls to train_ace.py and register_mapping.py to cater to more advanced use cases.
train_ace.py: Trains an ACE model on a set of images with corresponding poses.
register_mapping.py: Estimates poses of images in a scene given an ACE model.
ace_zero.py: Can start from an existing ACE model.
You are free to switch image sets between the calls to these functions. We provide some examples of advanced use cases that also cover some of the experiments in our paper.
If you have an initial guess of all image poses, you can use ACE to refine them quickly. We combine a single ACE mapping call with pose refinement enabled, and a single relocalization call.
# running ACE mapping with pose refinement enabled
python train_ace.py "/path/to/some/images/*.jpg" result_folder/ace_network.pt --pose_files "/path/to/some/images/*.txt" --pose_refinement mlp --pose_refinement_wait 5000 --use_external_focal_length <focal_length> --refine_calibration False
# re-estimate poses of all images
python register_mapping.py "/path/to/some/images/*.jpg" result_folder/ace_network.pt --use_external_focal_length <focal_length> --session ace_network
In this example, ACE takes the existing poses in the 7-Scenes format as input: one text file per image with the camera-to-world pose stored as a 4x4 matrix.
The option --pose_refinement mlp enables pose refinement using a refinement network.
The option --pose_refinement_wait 5000 freezes poses for the first 5000 iterations, which increases stability if you are mapping from scratch with pose refinement.
After calling register_mapping.py, the result folder will contain the refined poses in poses_ace_network.txt.
Note that the example above assumes a known, fixed focal length. If you let ACE refine the calibration, you need to pass the refined focal length of train_ace.py to register_mapping.py.
Please see scripts/reconstruct_7scenes_warmstart.sh for a complete example where we refine KinectFusion poses with ACE.
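If your initial poses are stored as world-to-camera quaternions and translations rather than in the 7-Scenes format, they need to be converted to one 4x4 camera-to-world matrix per text file first. The following sketch shows one way to do this. All function names are hypothetical, the quaternion is assumed to follow the common (w, x, y, z) Hamilton convention, and the number formatting of the text file is our choice rather than a requirement stated by the repository.

```python
import math
import os
import tempfile


def quaternion_to_rotation(qw, qx, qy, qz):
    """Convert a (w, x, y, z) quaternion to a 3x3 rotation matrix (Hamilton convention assumed)."""
    norm = math.sqrt(qw * qw + qx * qx + qy * qy + qz * qz)
    w, x, y, z = qw / norm, qx / norm, qy / norm, qz / norm
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def camera_to_world_matrix(quaternion, translation):
    """Invert a world-to-camera pose (R, t) into the 4x4 matrix [R^T | -R^T t]."""
    R = quaternion_to_rotation(*quaternion)
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    centre = [-sum(Rt[i][k] * translation[k] for k in range(3)) for i in range(3)]
    return [Rt[i] + [centre[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def save_pose(path, matrix):
    """Write a 4x4 matrix as four whitespace-separated rows."""
    with open(path, "w") as handle:
        for row in matrix:
            handle.write(" ".join(f"{value:.10f}" for value in row) + "\n")


def load_pose(path):
    """Read a 4x4 matrix back from a text file."""
    with open(path) as handle:
        rows = [[float(v) for v in line.split()] for line in handle if line.strip()]
    if len(rows) != 4 or any(len(row) != 4 for row in rows):
        raise ValueError(f"{path} does not contain a 4x4 matrix")
    return rows


# World-to-camera pose: 90 degree rotation about the z axis, translation (1, 0, 0).
half = math.sqrt(0.5)
pose = camera_to_world_matrix((half, 0.0, 0.0, half), (1.0, 0.0, 0.0))
with tempfile.TemporaryDirectory() as folder:
    pose_path = os.path.join(folder, "frame_0001.txt")
    save_pose(pose_path, pose)
    reloaded = load_pose(pose_path)
print(reloaded)
```

The last column of the resulting matrix holds the camera centre in world coordinates, here (0, 1, 0).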
If you have pose estimates for subsets of images, you can use ACE0 to complete the reconstruction. First, you call ACE mapping on the subset of images with poses which results in an ACE scene model. You pass this model to ACE0, which will then register the remaining images to the scene.
# running ACE mapping on a subset of images with poses
python train_ace.py "/images/with/poses/*.jpg" result_folder/iteration0_seed0.pt --pose_files "/poses/of/images/*.txt" --use_external_focal_length <focal_length> --refine_calibration False
# running ACE0 with the ACE model as a seed, and the complete set of images
python ace_zero.py "/all/images/*.jpg" result_folder --seed_network result_folder/iteration0_seed0.pt --use_external_focal_length <focal_length> --refine_calibration False
ACE0 will store the final poses in poses_final.txt
in the result folder, containing poses of all images.
Note that the example above assumes a known, fixed focal length.
You can also let ACE or ACE0 estimate or refine the focal length, but you need to take care of passing the correct focal length between the calls.
Please see scripts/reconstruct_t2_training_videos_warmstart.sh
for a complete example where we reconstruct the Tanks and Temples training scenes starting from a partial reconstruction by COLMAP. More information about this example can be found under Tanks and Temples.
You can use ACE0 to map a set of images, and call register_mapping.py on a different set of images to relocalize them.
Here, ACE0 would run on the set of mapping images, while register_mapping.py would run on the set of query images.
# running ACE0 on the mapping images
python ace_zero.py "/path/to/mapping/images/*.jpg" result_folder --use_external_focal_length <focal_length> --refine_calibration False
# running relocalization on the query images
python register_mapping.py "/path/to/query/images/*.jpg" result_folder/iterationX.pt --use_external_focal_length <focal_length> --session query
You need to point register_mapping.py to the ACE model from the last mapping iteration (e.g. iterationX.pt).
The relocalization results will be stored in poses_query.txt.
Note that ACE0 reconstructions are only approximately metric.
If you compare the query poses to ground truth, you need to fit a similarity transform first.
We provide a script for doing that.
python eval_poses.py result_folder/poses_query.txt "/path/to/ground/truth/poses/*.txt"
More information about the evaluation script can be found under Utility Scripts.
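For readers who want to see what fitting a similarity transform involves, here is a sketch of the classic closed-form least-squares solution by Umeyama, applied to corresponding 3D points such as estimated and ground truth camera centres. It is not taken from eval_poses.py, which may use a different or more robust procedure; the function name fit_similarity and the example data are our own.

```python
import numpy as np


def fit_similarity(source, target):
    """Least-squares similarity transform (Umeyama) between corresponding 3D points.

    Returns scale s, rotation R (3x3) and translation t such that
    target is approximately s * R @ source + t for every point pair.
    """
    source = np.asarray(source, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mean_source = source.mean(axis=0)
    mean_target = target.mean(axis=0)
    source_centred = source - mean_source
    target_centred = target - mean_target

    # Cross-covariance between the centred point sets.
    covariance = target_centred.T @ source_centred / len(source)
    U, singular_values, Vt = np.linalg.svd(covariance)

    # Guard against a reflection in the recovered rotation.
    sign = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        sign[2, 2] = -1.0

    rotation = U @ sign @ Vt
    source_variance = (source_centred ** 2).sum() / len(source)
    scale = np.trace(np.diag(singular_values) @ sign) / source_variance
    translation = mean_target - scale * rotation @ mean_source
    return scale, rotation, translation


# Synthetic check: transform four non-coplanar points with a known similarity.
true_rotation = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
true_scale, true_translation = 2.0, np.array([1.0, 2.0, 3.0])
estimated_centres = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
ground_truth_centres = true_scale * estimated_centres @ true_rotation.T + true_translation

scale, rotation, translation = fit_similarity(estimated_centres, ground_truth_centres)
print(scale, translation)
```

After alignment, the estimated camera centres can be mapped into the ground truth frame with scale * rotation @ centre + translation before measuring errors.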
We provide a script for extracting frames from MP4 videos via ffmpeg.
python datasets/video_to_dataset.py datasets
The script looks for all MP4 files in the target folder (here datasets) and extracts frames into a subfolder datasets/video_<mp4_file_name> for each video.
We provide a script for exporting ACE point clouds from a network and a pose file.
python export_point_cloud.py point_cloud_out.txt --network /path/to/ace_network.pt --pose_file /path/to/poses_final.txt
The script will write the point cloud into a text file in the format x y z r g b per line for each point.
This format can be imported into most 3D software, e.g. Meshlab, CloudCompare, etc.
Note, you can also point the script to an existing visualization buffer, result_folder/renderings/iterationX_mapping.pkl, which already contains the point cloud so it does not have to be re-generated.
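The exported text format is simple enough to read back without extra tooling. The sketch below parses x y z r g b lines and computes the bounding box of the points. The function names are hypothetical, the sample lines are made up, and we keep the colour values as floats because the value range written by the script is not stated here.

```python
def parse_point_cloud(lines):
    """Parse lines of the form 'x y z r g b' into point and colour lists."""
    points, colours = [], []
    for line in lines:
        values = line.split()
        if len(values) != 6:
            continue  # skip empty or malformed lines
        numbers = [float(value) for value in values]
        points.append(tuple(numbers[:3]))
        colours.append(tuple(numbers[3:]))
    return points, colours


def bounding_box(points):
    """Return the per-axis minimum and maximum of a list of 3D points."""
    minimum = tuple(min(point[axis] for point in points) for axis in range(3))
    maximum = tuple(max(point[axis] for point in points) for axis in range(3))
    return minimum, maximum


# Made-up sample content in the documented format.
sample_lines = [
    "0.0 1.0 2.0 255 0 0",
    "-1.5 0.5 4.0 0 255 0",
    "",
    "2.0 -3.0 1.0 0 0 255",
]
points, colours = parse_point_cloud(sample_lines)
box_min, box_max = bounding_box(points)
print(len(points), box_min, box_max)
```

For a real export, pass the open file object of point_cloud_out.txt to parse_point_cloud instead of the sample list.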
We provide a script for exporting an ACE pose file to PLY showing the cameras.
python export_cameras.py /path/to/ace/pose_file.txt /path/to/output.ply
The script will color-code the cameras by their confidence value. The PLY format can be imported into most 3D software, e.g. Meshlab, CloudCompare, etc.
We provide a script that measures the pose error of a set of estimated poses against a set of ground truth poses.
python eval_poses.py /path/to/ace/pose_file.txt "/path/to/ground/truth/poses/*.txt"
The ground truth poses are given as a glob pattern, where each file contains the pose of a single image as a 4x4 camera-to-world transformation (e.g. as provided by the 7-Scenes dataset). Correspondence between ACE estimates and ground truth files will be established via alphabetical order of the image filenames.
By default, the script will calculate the percentage of poses below 5cm and 5 degrees error, as well as median rotation and translation errors. Since ACE0 poses are only approximately metric and in an arbitrary reference frame, the script will fit a similarity transform between estimates and ground truth before calculating error. This behaviour can be disabled with the appropriate command line flags.
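The reported statistics can be illustrated with a short sketch. It assumes that the rotation error is the angle of the relative rotation between estimate and ground truth, that the translation error is the Euclidean distance between camera positions in metres, and that a pose counts as accurate only if it is below both thresholds. The function names are hypothetical and the error values are made up; eval_poses.py may differ in detail.

```python
import math
import statistics


def rotation_error_deg(R_est, R_gt):
    """Angle in degrees of the relative rotation between two 3x3 rotation matrices."""
    # trace(R_est^T R_gt) equals the element-wise product summed over all entries.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error_m(t_est, t_gt):
    """Euclidean distance between two camera positions."""
    return math.dist(t_est, t_gt)


def summarise(rot_errors, trans_errors, max_rot_deg=5.0, max_trans_m=0.05):
    """Percentage of poses below both thresholds, plus median errors."""
    accurate = sum(
        1 for rot, trans in zip(rot_errors, trans_errors)
        if rot < max_rot_deg and trans < max_trans_m
    )
    return {
        "accuracy_percent": 100.0 * accurate / len(rot_errors),
        "median_rot_deg": statistics.median(rot_errors),
        "median_trans_m": statistics.median(trans_errors),
    }


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
quarter_turn_z = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
angle = rotation_error_deg(identity, quarter_turn_z)
distance = translation_error_m((0.0, 0.0, 0.0), (0.03, 0.04, 0.0))

# Made-up per-image errors for three poses.
summary = summarise([1.0, 2.0, 10.0], [0.01, 0.10, 0.02])
print(angle, distance, summary)
```

In the made-up example, only the first pose is below both 5 degrees and 5cm, so one out of three poses counts as accurate.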
In our paper, we benchmark the ACE0 reconstruction by training a Nerfacto model and measuring PSNR on a dataset-specific training/test split of images. To set up the benchmark, follow the instructions in the Benchmark README.
Note that the benchmark lives in its own conda environment, so you have to change environments between reconstruction and benchmarking.
The benchmark takes an ACE0 pose file and fits a Nerfacto model. Optionally, you can also use our benchmarking scripts to generate the input files for Nerfstudio without running the benchmark, see the --no_run_nerfstudio flag.
If you do run the benchmark, it will apply a 1/8 split of the images by default to calculate PSNR. The scripts we provide for our Paper Experiments do optionally run the benchmark on each dataset using the correct split.
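A 1/8 split of this kind can be sketched as follows: every 8th image in sorted order is held out for testing and the rest is used for training. The function name is hypothetical, and which image the split starts at is our assumption; the benchmark code may select a different offset.

```python
def split_every_nth(image_names, n=8):
    """Hold out every n-th image (in sorted order) for testing."""
    ordered = sorted(image_names)
    test = ordered[::n]
    train = [name for index, name in enumerate(ordered) if index % n != 0]
    return train, test


# Made-up file names for illustration.
names = [f"frame_{index:02d}.jpg" for index in range(16)]
train_split, test_split = split_every_nth(names)
print(len(train_split), test_split)
```

With 16 images, this holds out two images and trains on the remaining 14.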
Since the benchmarking results are stored in a nested structure, we provide a script to extract the PSNR values:
# show the benchmark results of all scenes as sub-folders in the provided top-level folder
python scripts/show_benchmark_results.py /path/to/top/level/results/folder
The script assumes a folder structure where each scene is a sub-folder in a dataset-specific top-level folder.
E.g. results_7scenes contains sub-folders chess, fire, heads, etc.
After running benchmarking on a reconstruction, you can also render a NeRF video using Nerfstudio's ns-render command, e.g.:
ns-render interpolate --load-config /path/to/nerf/config.yaml --output-path /path/to/output/video.mp4 --pose-source eval
This will render the NeRF reconstruction using interpolated query poses (every 8th image by default).
If your images are not sequential, try --order-poses True.
We provide scripts to run the main experiments of the paper. We also provide pre-computed results for all these experiments, along with the corresponding visualizations, in the respective sections below.
Set up the dataset.
# setup the 7-Scenes dataset in the datasets folder
cd datasets
# download and unpack the dataset
python setup_7scenes.py
# back to root directory
cd ..
The script can optionally convert the dataset to the ACE format, download alternative pseudo ground truth poses, calibrate depth maps, etc. However, it is not required for the ACE0 experiments.
(Optional for the benchmark) Create a benchmarking train/test split for the 7-Scenes dataset, see the Benchmark README for details.
python scripts/create_splits_7scenes.py splits_files
Reconstruct each scene (corresponding to "ACE0" in Table 1, left).
bash scripts/reconstruct_7scenes.sh
By default, the script will run with benchmarking enabled (make sure you set it up, see Nerfacto Benchmark) and visualisation disabled.
Flip the appropriate flags in the script to change this behaviour.
The ACE0 reconstruction files will be stored in results_7scenes while the benchmarking results will be stored in results_7scenes_benchmark.
To show the benchmarking results, call:
python scripts/show_benchmark_results.py results_7scenes_benchmark
To refine KinectFusion poses using ACE (corresponding to "KF+ACE0" in Table 1, left), run:
bash scripts/reconstruct_7scenes_warmstart.sh
# show the benchmark results
python scripts/show_benchmark_results.py results_7scenes_warmstart_benchmark
Find pre-computed poses and reconstruction videos for 7-Scenes here. These results are from a different run of ACE0 than the one we used for the paper results, but PSNR values are very close (± 0.1dB PSNR on average).
For some experiments in the paper (see right side of Table 1), we run ACE0 and baselines on a subset of images for each scene. We provide the lists of images, together with how they have been split for the view synthesis benchmark here: 200 images per scene and 50 images per scene.
Set up the dataset.
# setup the Mip-NeRF 360 dataset in the datasets folder
cd datasets
# download and unpack the dataset
python setup_mip360.py
# back to root directory
cd ..
The script can optionally convert the COLMAP ground truth to the ACE format, but it is not required for the ACE0 experiments.
(Optional for the benchmark) Create a benchmarking train/test split for the Mip-NeRF 360 dataset, see the Benchmark README for details. This uses a slightly different 1/8 split than the default benchmark split.
python scripts/create_splits_mip360.py splits_files
Reconstruct each scene (corresponding to "ACE0" in Table 2 (b)).
bash scripts/reconstruct_mip360.sh
By default, the script will run with benchmarking enabled (make sure you set it up, see Nerfacto Benchmark) and visualisation disabled.
Flip the appropriate flags in the script to change this behaviour.
The ACE0 reconstruction files will be stored in results_mip360 while the benchmarking results will be stored in results_mip360_benchmark.
To show the benchmarking results, call:
python scripts/show_benchmark_results.py results_mip360_benchmark
Find pre-computed poses and reconstruction videos for the Mip-NeRF 360 dataset here. These results are from a different run of ACE0 than the one we used for the paper results, but PSNR values are very close (± 0.1dB PSNR on average).
You have to manually download the dataset.
Our dataset script assumes you downloaded the group archives into datasets/t2 without unpacking them:
datasets/t2/training.zip
datasets/t2/training_videos.zip
datasets/t2/intermediate.zip
datasets/t2/intermediate_videos.zip
datasets/t2/advanced.zip
datasets/t2/advanced_videos.zip
Set up the dataset.
# setup the T&T dataset in the datasets folder
cd datasets
# unpack the dataset
python setup_t2.py
# back to root directory
cd ..
Optionally, the script can download and set up COLMAP ground truth poses, and convert them to the ACE format.
This is required for the ACE0 experiments which reconstruct the dataset videos starting from a sparse COLMAP reconstruction.
Call the script with --with-colmap.
This will create an additional folder t2_colmap in the datasets folder where each scene folder not only has the image files, but also corresponding *_pose.txt files with COLMAP poses as 4x4 camera-to-world transformations.
Also, per scene, a single focal_length.txt file is created with the COLMAP focal length estimate.
We provide scripts for Tanks and Temples separated by scene group, i.e. training, intermediate, and advanced. The following explanations are for the training group, but the scripts for the intermediate and advanced groups are similar.
Reconstruct each scene from a few hundred images (corresponding to "ACE0" in Table 3, left).
bash scripts/reconstruct_t2_training.sh
By default, the script will run with benchmarking enabled (make sure you set it up, see Nerfacto Benchmark) and visualisation disabled.
Flip the appropriate flags in the script to change this behaviour.
The ACE0 reconstruction files will be stored in results_t2_training while the benchmarking results will be stored in results_t2_training_benchmark.
To show the benchmarking results, call:
python scripts/show_benchmark_results.py results_t2_training_benchmark
Note that no benchmarking split files need to be generated for Tanks and Temples. The benchmark will apply a default 1/8 split.
To reconstruct the full videos of each scene (corresponding to "ACE0" in Table 3, right), call:
bash scripts/reconstruct_t2_training_videos.sh
# show benchmarking results
python scripts/show_benchmark_results.py results_t2_training_videos_benchmark
To reconstruct the full videos of each scene starting from a COLMAP reconstruction (corresponding to "Sparse COLMAP + ACE0" in Table 3, left), call:
bash scripts/reconstruct_t2_training_videos_warmstart.sh
# show benchmarking results
python scripts/show_benchmark_results.py results_t2_training_videos_warmstart_benchmark
Note that the last experiment assumes that you set up the dataset with --with-colmap.
The code will first call ACE mapping on the images with COLMAP poses to create an initial scene model.
This model is then passed to ACE0 which will use it as a seed for the full video reconstruction.
In this example, we trust the focal length estimate of COLMAP and keep it fixed throughout the reconstruction.
Find pre-computed poses and reconstruction videos for Tanks and Temples here: Training scenes, Intermediate scenes, Advanced scenes. These results are from a different run of ACE0 than the one we used for the paper results, but PSNR values are very close (± 0.3dB PSNR on average).
Q: I run out of GPU memory during the ACE0 reconstruction. What can I do?
A: All experiments in the paper were performed with 16GB of GPU memory (e.g. NVIDIA V100/T4) and the default settings should work with such a GPU.
The bulk of the memory is used by the ACE training buffer (up to ~8GB).
You can run ACE0 with the flag --training_buffer_cpu True to keep the training buffer on the CPU at the expense of reconstruction speed.
With that option, ACE0 should require ~1GB of GPU memory.
Q: I have an image collection with various image sizes, aspect ratios and intrinsics. Can I use ACE0?
A: No. ACE0 assumes that all images share their intrinsics, particularly the focal length.
This is a limitation of the current implementation, rather than the method.
Supporting images with varying intrinsics should work, but would require some implementation effort, particularly in refine_calibration.py.
Q: Does ACE0 estimate intrinsics other than the focal length?
A: No. ACE0 assumes that the principal point is at the image center, and pixels are square and unskewed. The focal length, shared by all images, is the only intrinsic parameter estimated and/or refined by ACE0.
Q: I have images from a complex camera model, e.g. with severe image distortion. Can I use ACE0?
A: No. The scene coordinate regression network might be able to compensate for some distortion, but presumably not much. The reprojection loss of ACE and the RANSAC pose estimator assume a pinhole camera model. These parts would need to implement a camera distortion model. If the distortion parameters are known, we recommend undistorting the images before passing them to ACE0.
Q: How can I run ACE0 with depth other than ZoeDepth estimates?
A: If you have pre-calculated depth maps, you can call ace_zero.py with --depth_files "/path/to/depths/*.png".
In this case, ACE0 will use the provided depth maps for the seed images instead of estimating depth.
Otherwise, the functions get_depth_model() and estimate_depth() in dataset_io.py can be adapted to use a depth estimator other than ZoeDepth.
Note that we found the impact of the depth estimation model to be rather small in our experiments.
Q: Is ACE0 able to reconstruct from a small set of sparse views?
A: It can work but this scenario is challenging for ACE0.
We expect other methods, and even COLMAP, to work much better in this case.
ACE0 relies on images having sufficient visual overlap, particularly when registering new images to the reconstruction.
You can lower the registration threshold when running ace_zero.py via --registration_confidence, setting it to 300 or 100, but at some point ACE0 will become unstable.
ACE0 shines if you have dense coverage of a scene, and reconstruct it from many images in reasonable time.
If you use ACE0 or parts of its code in your own work, please cite:
@inproceedings{brachmann2024acezero,
title={Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer},
author={Brachmann, Eric and Wynn, Jamie and Chen, Shuai and Cavallari, Tommaso and Monszpart, {\'{A}}ron and Turmukhambetov, Daniyar and Prisacariu, Victor Adrian},
booktitle={ECCV},
year={2024},
}
This code builds on the ACE relocalizer and uses the DSAC* pose estimator. Please consider citing:
@inproceedings{brachmann2023ace,
title={Accelerated Coordinate Encoding: Learning to Relocalize in Minutes using RGB and Poses},
author={Brachmann, Eric and Cavallari, Tommaso and Prisacariu, Victor Adrian},
booktitle={CVPR},
year={2023},
}
@article{brachmann2021dsacstar,
title={Visual Camera Re-Localization from {RGB} and {RGB-D} Images Using {DSAC}},
author={Brachmann, Eric and Rother, Carsten},
journal={TPAMI},
year={2021}
}
ACE0 estimates depth of seed images using ZoeDepth. Please consider citing:
@article{bhat2023zoedepth,
title={Zoe{D}epth: Zero-shot transfer by combining relative and metric depth},
author={Bhat, Shariq Farooq and Birkl, Reiner and Wofk, Diana and Wonka, Peter and M{\"u}ller, Matthias},
journal={arXiv},
year={2023}
}
This repository relies on Nerfstudio for benchmarking. Please consider citing according to their docs.
Copyright © Niantic, Inc. 2024. Patent Pending. All rights reserved. Please see the license file for terms.