# CoAtXNet: A Dual-Stream Hybrid Transformer Based on Relative Cross-Attention for End-to-End Camera Localization from RGB-D Images
CoAtXNet is a hybrid model that leverages the strengths of both Convolutional Neural Networks (CNNs) and Transformers to enhance vision-based camera localization. By integrating RGB and depth images through cross-attention mechanisms, CoAtXNet significantly improves feature representation and bidirectional information flow between modalities. This approach combines the local feature extraction capabilities of CNNs with the global context modeling strengths of Transformers, resulting in superior performance across various indoor scenes.
- Introduction
- Methodology
- Implementation Details
- Experiments
- Results
- Discussion
- Conclusion
- How to Use
- Contact
## Introduction

Camera localization, the process of determining a camera's position and orientation within an environment, plays a pivotal role in applications such as robotics, augmented reality, and autonomous navigation. Traditional localization methods, including structure-based approaches and those built on convolutional neural networks (CNNs), often struggle in dynamic or visually complex environments.
This repository contains the implementation of CoAtXNet, a novel hybrid architecture that merges CNNs and Transformers using cross-attention mechanisms to efficiently integrate RGB and depth images. CoAtXNet processes these modalities independently through convolutional layers and then combines them with cross-attention, resulting in enhanced feature representation.
## Methodology

CoAtXNet utilizes dual streams to process RGB and depth information independently through convolutional layers and then fuses these features with cross-attention mechanisms. This design leverages the detailed texture information from RGB images and the geometric cues from depth maps to enhance localization accuracy.
- Cross-Attention Mechanism: Novel cross-attention mechanisms fuse features from the RGB and depth streams, helping the model capture both local and global context.
- Dual-Stream Hybrid Architecture: A dual-stream variant of the hybrid CNN-Transformer network processes the RGB and depth images through separate convolutional layers and then combines them with transformer-based cross-attention, exploiting the strengths of both approaches (a minimal code sketch follows this list).
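To make the bidirectional fusion concrete, here is a minimal PyTorch sketch of the idea. It is illustrative only, not the actual CoAtXNet blocks (those follow the CoAtNet design and are defined in `Model.py`): the `DualStreamCrossAttention` name, the stem shapes, the use of `nn.MultiheadAttention`, and the 7-DoF (translation + quaternion) pose head are simplifying assumptions.

```python
import torch
import torch.nn as nn

class DualStreamCrossAttention(nn.Module):
    """Illustrative dual-stream fusion: two conv stems, bidirectional cross-attention."""

    def __init__(self, dim=128, heads=4):
        super().__init__()
        # Independent convolutional stems, one per modality (256x256 -> 64x64 tokens)
        self.rgb_stem = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3), nn.GELU())
        self.depth_stem = nn.Sequential(
            nn.Conv2d(1, dim, kernel_size=7, stride=4, padding=3), nn.GELU())
        # Bidirectional cross-attention: each stream queries the other
        self.rgb_from_depth = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.depth_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Pose head: 3-DoF translation + 4-DoF quaternion orientation (an assumption)
        self.head = nn.Linear(2 * dim, 7)

    def forward(self, rgb, depth):
        r = self.rgb_stem(rgb).flatten(2).transpose(1, 2)      # (B, N, dim) tokens
        d = self.depth_stem(depth).flatten(2).transpose(1, 2)
        r2, _ = self.rgb_from_depth(query=r, key=d, value=d)   # RGB attends to depth
        d2, _ = self.depth_from_rgb(query=d, key=r, value=r)   # depth attends to RGB
        fused = torch.cat([r2.mean(dim=1), d2.mean(dim=1)], dim=-1)  # global pooling
        return self.head(fused)                                # (B, 7) camera pose

pose = DualStreamCrossAttention()(torch.rand(2, 3, 256, 256),
                                  torch.rand(2, 1, 256, 256))  # -> torch.Size([2, 7])
```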
## Implementation Details

We implemented the proposed model in PyTorch, using the Adam optimizer with an initial learning rate of 0.0001 and a batch size of 32. Input images were resized to 256 × 256 pixels. During both training and testing, images were preprocessed by resizing them and converting them to normalized tensors.
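A hedged illustration of that setup (the exact transform chain lives in `LoadData.py`, and the placeholder model and data stand in for the real CoAtXNet pipeline):

```python
import torch
from torchvision import transforms

# Resize to 256x256 and convert to a tensor scaled to [0, 1], as described above
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])

model = torch.nn.Linear(10, 7)  # placeholder; the real model is defined in Model.py
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # initial LR 0.0001

dataset = torch.utils.data.TensorDataset(torch.rand(100, 10))  # placeholder data
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
```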
The network was trained with 5-fold cross-validation, with each fold trained for 150 epochs. The learning rate was adjusted dynamically by a ReduceLROnPlateau scheduler, which reduces it by a factor of 0.1 when the validation loss does not improve for 10 epochs.
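A minimal sketch of that schedule, assuming scikit-learn's `KFold` for the splits; the per-epoch training and validation steps are stubbed out:

```python
import torch
from sklearn.model_selection import KFold

indices = list(range(1000))  # placeholder for the dataset sample indices
for train_idx, val_idx in KFold(n_splits=5, shuffle=True).split(indices):
    model = torch.nn.Linear(10, 7)  # placeholder: a fresh model per fold
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # Drop the LR by 10x when the validation loss stalls for 10 epochs
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.1, patience=10)
    for epoch in range(150):  # 150 epochs per fold
        val_loss = 1.0        # stub: train on train_idx, then evaluate on val_idx
        scheduler.step(val_loss)  # the scheduler watches the validation loss
```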
## Experiments

In this work, we use the 7Scenes and 12Scenes datasets, two well-known benchmarks for evaluating vision-based camera localization. Both datasets consist of RGB-D sequences captured in a variety of indoor scenes.
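For reference, 7Scenes-style sequences store each frame as a color image, a depth image, and a 4 × 4 camera-to-world pose matrix. Below is a hedged sketch of reading one sample; the repository's `LoadData.py` is the authoritative loader, and `load_frame` is just an illustrative name:

```python
import numpy as np
from PIL import Image

def load_frame(seq_dir, idx):
    # 7Scenes naming convention: frame-000000.color.png / .depth.png / .pose.txt
    stem = f"{seq_dir}/frame-{idx:06d}"
    rgb = Image.open(f"{stem}.color.png")    # RGB image
    depth = Image.open(f"{stem}.depth.png")  # 16-bit depth map in millimeters
    pose = np.loadtxt(f"{stem}.pose.txt")    # 4x4 homogeneous camera-to-world matrix
    t = pose[:3, 3]    # translation target for pose regression
    R = pose[:3, :3]   # rotation, typically converted to a quaternion target
    return rgb, depth, t, R
```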
## Results

The results demonstrate that all variants of the CoAtXNet model achieve competitive performance across different scenes, with CoAtXNet-4 showing the best overall accuracy in terms of both translation and orientation errors.
## Discussion

The experimental results highlight the superior performance of the CoAtXNet model in the domain of absolute pose regression. By combining the strengths of traditional Convolutional Neural Networks (CNNs) with transformers, CoAtXNet effectively utilizes both local and global features, leading to improvements in position and orientation accuracy.
## Conclusion

CoAtXNet represents a substantial advancement in the field of camera localization by effectively combining CNNs and Transformers through cross-attention mechanisms. This work not only enhances the accuracy and robustness of camera localization but also opens new avenues for research in hybrid models for various vision-based tasks.
## How to Use

To run the implementation on Google Colab, use the provided `CoAtXNet.ipynb` notebook:

- Open Google Colab
- Upload the `CoAtXNet.ipynb` notebook
- Follow the instructions in the notebook to run the complete implementation
The repository includes the following files:

- `CoAtXNet.ipynb`: Jupyter notebook for running the complete implementation on Google Colab.
- `requirements.txt`: List of dependencies required to run the code.
- `Trainig.py`: Script to train the CoAtXNet model.
- `Model.py`: Script to define the CoAtXNet model.
- `LoadData.py`: Script to load and preprocess data from the 7Scenes dataset.
We used the CoAtNet implementation from CoAtNet PyTorch.
## Contact

If you have any questions or need further assistance, please feel free to email Hossein Hasan at hossein.h.hasan@gmail.com.