
TransFuseGrid: Transformer-based Lidar-RGB fusion for semantic grid prediction

Semantic grids are a succinct and convenient way to represent the environment for mobile robotics and autonomous driving applications. While the use of Lidar sensors is now widespread in robotics, most semantic grid prediction approaches in the literature focus only on RGB data. In this paper, we present an approach for semantic grid prediction that uses a transformer architecture to fuse Lidar sensor data with RGB images from multiple cameras. Our proposed method, TransFuseGrid, first transforms both input streams into top-view embeddings, and then fuses these embeddings at multiple scales with Transformers. Finally, a decoder transforms the fused, top-view feature map into a semantic grid of the vehicle's environment. We evaluate the performance of our approach on the nuScenes dataset for the vehicle, drivable area, lane divider and walkway segmentation tasks. The results show that TransFuseGrid outperforms competing RGB-only and Lidar-only methods. Additionally, the Transformer feature fusion yields a significant improvement over naive RGB-Lidar concatenation. In particular, for the segmentation of vehicles, our model outperforms state-of-the-art RGB-only and Lidar-only methods by 24% and 53%, respectively.
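To make the fusion idea above concrete, here is a minimal PyTorch sketch of multi-modal token fusion at a single scale: the top-view RGB and Lidar feature maps are flattened into tokens, jointly processed by a standard transformer encoder, and merged back into a fused top-view map. The class name, dimensions and the use of nn.TransformerEncoder are illustrative assumptions and do not reflect the repository's actual implementation.

```python
# Minimal sketch of transformer-based fusion of top-view RGB and Lidar
# feature maps (illustrative only; names and shapes are assumptions).
import torch
import torch.nn as nn

class TopViewFusion(nn.Module):
    def __init__(self, channels=64, num_heads=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, rgb_bev, lidar_bev):
        # rgb_bev, lidar_bev: (B, C, H, W) top-view embeddings at one scale
        b, c, h, w = rgb_bev.shape
        # Flatten each map into a token sequence of length H*W
        rgb_tokens = rgb_bev.flatten(2).transpose(1, 2)      # (B, H*W, C)
        lidar_tokens = lidar_bev.flatten(2).transpose(1, 2)  # (B, H*W, C)
        # Joint self-attention over the concatenated token set
        fused = self.encoder(torch.cat([rgb_tokens, lidar_tokens], dim=1))
        # Sum the two halves back into a single fused top-view map
        fused_rgb, fused_lidar = fused.split(h * w, dim=1)
        fused_map = (fused_rgb + fused_lidar).transpose(1, 2).reshape(b, c, h, w)
        return fused_map

# In the full model this kind of fusion would be applied at multiple scales,
# with the fused maps fed to a decoder that outputs the semantic grid.
fusion = TopViewFusion()
out = fusion(torch.randn(1, 64, 25, 25), torch.randn(1, 64, 25, 25))
print(out.shape)  # torch.Size([1, 64, 25, 25])
```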

Architecture

Comparison with competing methods

Results on validation split

Results on the nuScenes validation split. We compare the Intersection over Union (IoU) of the generated semantic grids against the competing approaches on the validation set. Best results are shown in bold, second best in 🔹blue.

| Models | Vehicle | Drivable area | Lane divider | Walkway |
|---|---|---|---|---|
| Lift-Splat-Shoot (pre-trained) | 32.80 | - | - | - |
| Lift-Splat-Shoot * | 28.94 | 61.98 | **37.41** | 🔹50.07 |
| Pillar Feature Net | 23.43 | 69.19 | 26.05 | 30.57 |
| TFGrid (concat) | 🔹32.88 | 🔹74.18 | 30.41 | 43.78 |
| TFGrid (ours) | **35.88** | **78.87** | 🔹35.70 | **50.98** |

We also compare the IoU of the competing approaches under night-only and rain-only conditions. Best results are shown in bold, second best in 🔹blue.

| Models | Vehicle (Night) | Vehicle (Rain) | Drivable area (Night) | Drivable area (Rain) | Lane divider (Night) | Lane divider (Rain) | Walkway (Night) | Walkway (Rain) |
|---|---|---|---|---|---|---|---|---|
| Lift-Splat-Shoot (pre-trained) | 31.06 | 🔹34.06 | - | - | - | - | - | - |
| Lift-Splat-Shoot * | 28.41 | 30.81 | 49.13 | 56.86 | **15.87** | **34.72** | 🔹24.56 | **46.59** |
| Pillar Feature Net | 28.54 | 19.56 | 63.89 | 64.01 | 3.60 | 23.72 | 23.52 | 27.22 |
| TFGrid (concat) | 🔹37.8 | 33.52 | 🔹68.98 | 🔹67.25 | 7.27 | 27.90 | 23.60 | 40.98 |
| TFGrid (ours) | **39.33** | **37.14** | **70.45** | **69.74** | 🔹15.17 | 🔹30.25 | **24.65** | 🔹44.26 |
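For reference, here is a minimal sketch of how the IoU numbers in these tables can be computed from thresholded probability grids and binary ground-truth grids; the 0.5 threshold and tensor layout are assumptions and may differ from the evaluation code used in the paper.

```python
# Minimal IoU sketch for binary semantic grids (illustrative; the actual
# evaluation code and threshold used in the paper may differ).
import torch

def grid_iou(pred_probs, gt, threshold=0.5, eps=1e-6):
    """pred_probs, gt: (B, H, W) tensors; gt is binary {0, 1}."""
    pred = (pred_probs >= threshold).float()
    intersection = (pred * gt).sum()
    union = pred.sum() + gt.sum() - intersection
    return (intersection / (union + eps)).item()

# Example on random data
probs = torch.rand(2, 200, 200)
gt = (torch.rand(2, 200, 200) > 0.9).float()
print(f"IoU: {grid_iou(probs, gt):.4f}")
```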

Qualitative results

Comparison between ground truth and semantic grid predictions for vehicles (top row), walkway (second row) and drivable area (third row) for different samples in the validation split. We display the class probabilities as a colorscale, without thresholding. In the top row, the models with Lidar input detect all vehicles in the scene, while LSS fails to do so, probably due to occlusions. Furthermore, the detected vehicles are much more sharply defined with the TFGrid model than with the TFGrid concat model. In the second row, TFGrid outperforms both baselines for the walkway class. In the third row, the ability of the models with Lidar input to predict the details of the drivable area is clearly superior to that of LSS. The bottom row shows the combination of the three classes compared with the ground truth. (*: Lift-Splat model trained by us, using the source code provided by the authors.)
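As a rough illustration of the visualization described above (class probabilities rendered as a colorscale without thresholding), here is a small matplotlib sketch; the colormap and layout are assumptions, not the code used to produce the figures.

```python
# Sketch: show an unthresholded class-probability grid as a colorscale
# next to its binary ground truth (illustrative only).
import numpy as np
import matplotlib.pyplot as plt

probs = np.random.rand(200, 200)          # stand-in for a predicted grid
gt = (np.random.rand(200, 200) > 0.9)     # stand-in for the ground truth

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].imshow(gt, cmap="gray")
axes[0].set_title("Ground truth")
axes[1].imshow(probs, cmap="viridis", vmin=0.0, vmax=1.0)
axes[1].set_title("Predicted probabilities (no threshold)")
for ax in axes:
    ax.axis("off")
plt.tight_layout()
plt.show()
```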

Comparison

Scene prediction visualization

A scene from the nuScenes validation split is used to validate the vehicle, drivable_area, lane_divider and walkway classes.

Video release

Use TransFuseGrid

To train TransFuseGrid for the drivable_area class, use the following command:

python main.py train_multigpu_2T -s trainval -c ./configs/config_nT_da.py --logdir ./runs/v2T_da --gpus 8

This script supports training on the following classes: vehicle, drivable_area, walkway, lane_divider, ped_crossing and stop_line.

Acknowledgement

This repository builds on Philion and Fidler's work Lift, Splat, Shoot. The pillar feature net is based on GndNet, and the multi-modal transformer fuser is based on TransFuser by Prakash et al.
