[Enhance] Add abstract and a figure to provide a brief for each algorithm. #1086

Merged 2 commits on Dec 1, 2021
15 changes: 15 additions & 0 deletions configs/3dssd/README.md
@@ -1,5 +1,20 @@
# 3DSSD: Point-based 3D Single Stage Object Detector

## Abstract

<!-- [ABSTRACT] -->

Currently, there have been many kinds of voxel-based 3D single-stage detectors, while point-based single-stage methods are still underexplored. In this paper, we first present a lightweight and effective point-based 3D single-stage object detector, named 3DSSD, achieving a good balance between accuracy and efficiency. In this paradigm, all upsampling layers and the refinement stage, which are indispensable in existing point-based methods, are abandoned to reduce the large computation cost. We propose a novel fusion sampling strategy in the downsampling process to make detection on less representative points feasible. A delicate box prediction network, including a candidate generation layer and an anchor-free regression head with a 3D center-ness assignment strategy, is designed to meet our demands for accuracy and speed. Our paradigm is an elegant single-stage anchor-free framework, showing great superiority over other existing methods. We evaluate 3DSSD on the widely used KITTI dataset and the more challenging nuScenes dataset. Our method outperforms all state-of-the-art voxel-based single-stage methods by a large margin, and has comparable performance to two-stage point-based methods as well, with an inference speed of more than 25 FPS, 2x faster than former state-of-the-art point-based methods.

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/30491025/143854187-54ed1257-a046-4764-81cd-d2c8404137d3.png" width="800"/>
</div>

<!-- [PAPER_TITLE: 3DSSD: Point-based 3D Single Stage Object Detector] -->
<!-- [PAPER_URL: https://arxiv.org/abs/2002.10187] -->
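
To make the fusion sampling described above concrete, here is a minimal NumPy sketch of the idea. It is illustrative only and not the implementation used by this config: the function names are hypothetical, and the real F-FPS additionally combines feature and spatial distances.

```python
# A minimal, self-contained sketch of fusion sampling (D-FPS + F-FPS).
# Names are made up for illustration; not the MMDetection3D ops.
import numpy as np


def farthest_point_sample(feats, num_samples):
    """Greedy farthest point sampling under Euclidean distance on `feats`."""
    n = feats.shape[0]
    selected = [0]                       # start from an arbitrary point
    min_dist = np.full(n, np.inf)        # distance to the nearest selected point
    for _ in range(num_samples - 1):
        diff = feats - feats[selected[-1]]
        min_dist = np.minimum(min_dist, np.einsum('ij,ij->i', diff, diff))
        selected.append(int(np.argmax(min_dist)))
    return np.array(selected)


def fusion_sample(xyz, features, num_samples):
    """Half D-FPS (spatial distance) + half F-FPS (feature distance), so that
    foreground points with distinctive features survive heavy downsampling.
    Overlapping indices are merged, so the result can be slightly smaller."""
    d_idx = farthest_point_sample(xyz, num_samples // 2)
    f_idx = farthest_point_sample(features, num_samples - num_samples // 2)
    return np.unique(np.concatenate([d_idx, f_idx]))


points = np.random.rand(1024, 3)         # toy point cloud
feats = np.random.rand(1024, 64)         # per-point backbone features
print(points[fusion_sample(points, feats, 256)].shape)
```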

## Introduction

<!-- [ALGORITHM] -->
15 changes: 15 additions & 0 deletions configs/centerpoint/README.md
@@ -1,5 +1,20 @@
# Center-based 3D Object Detection and Tracking

## Abstract

<!-- [ABSTRACT] -->

Three-dimensional objects are commonly represented as 3D boxes in a point cloud. This representation mimics the well-studied image-based 2D bounding-box detection but comes with additional challenges. Objects in a 3D world do not follow any particular orientation, and box-based detectors have difficulties enumerating all orientations or fitting an axis-aligned bounding box to rotated objects. In this paper, we instead propose to represent, detect, and track 3D objects as points. Our framework, CenterPoint, first detects centers of objects using a keypoint detector and regresses to other attributes, including 3D size, 3D orientation, and velocity. In a second stage, it refines these estimates using additional point features on the object. In CenterPoint, 3D object tracking simplifies to greedy closest-point matching. The resulting detection and tracking algorithm is simple, efficient, and effective. CenterPoint achieved state-of-the-art performance on the nuScenes benchmark for both 3D detection and tracking, with 65.5 NDS and 63.8 AMOTA for a single model. On the Waymo Open Dataset, CenterPoint outperforms all previous single-model methods by a large margin and ranks first among all LiDAR-only submissions.

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/30491025/143854976-11af75ae-e828-43ad-835d-ac1146f99925.png" width="800"/>
</div>

<!-- [PAPER_TITLE: Center-based 3D Object Detection and Tracking] -->
<!-- [PAPER_URL: https://arxiv.org/abs/2006.11275] -->
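
The greedy closest-point matching mentioned above is simple enough to sketch in a few lines. The function and argument names below are illustrative assumptions, not the released CenterPoint API; they only demonstrate the matching rule.

```python
# Minimal sketch of greedy closest-point tracking, assuming each detection
# carries a predicted center and velocity.
import numpy as np


def greedy_track(prev_centers, prev_ids, curr_centers, curr_velocities,
                 dt, max_dist=2.0, next_id=0):
    """Project each current center backwards by its velocity and match it to
    the closest unmatched previous center; unmatched detections start tracks."""
    projected = curr_centers - curr_velocities * dt    # where the object was
    ids = np.full(len(curr_centers), -1, dtype=int)
    used = set()
    for i, c in enumerate(projected):                  # e.g. score-sorted order
        if len(prev_centers):
            d = np.linalg.norm(prev_centers - c, axis=1)
            for j in used:
                d[j] = np.inf                          # each track used once
            j = int(np.argmin(d))
            if d[j] < max_dist:
                ids[i] = prev_ids[j]
                used.add(j)
                continue
        ids[i] = next_id                               # start a new track
        next_id += 1
    return ids, next_id


prev_c = np.array([[0., 0., 0.], [10., 0., 0.]])
curr_c = np.array([[0.5, 0., 0.], [20., 5., 0.]])
curr_v = np.array([[1., 0., 0.], [0., 0., 0.]])
ids, _ = greedy_track(prev_c, np.array([7, 8]), curr_c, curr_v, dt=0.5, next_id=9)
print(ids)                                             # [7 9]
```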

## Introduction

<!-- [ALGORITHM] -->
15 changes: 15 additions & 0 deletions configs/dgcnn/README.md
@@ -1,5 +1,20 @@
# Dynamic Graph CNN for Learning on Point Clouds

## Abstract

<!-- [ABSTRACT] -->

Point clouds provide a flexible geometric representation suitable for countless applications in computer graphics; they also comprise the raw output of most 3D data acquisition devices. While hand-designed features on point clouds have long been proposed in graphics and vision, the recent overwhelming success of convolutional neural networks (CNNs) for image analysis suggests the value of adapting insights from CNNs to the point cloud world. Point clouds inherently lack topological information, so designing a model to recover topology can enrich the representation power of point clouds. To this end, we propose a new neural network module dubbed EdgeConv, suitable for CNN-based high-level tasks on point clouds including classification and segmentation. EdgeConv acts on graphs dynamically computed in each layer of the network. It is differentiable and can be plugged into existing architectures. Compared to existing modules operating in extrinsic space or treating each point independently, EdgeConv has several appealing properties: it incorporates local neighborhood information; it can be stacked to learn global shape properties; and, in multi-layer systems, affinity in feature space captures semantic characteristics over potentially long distances in the original embedding. We show the performance of our model on standard benchmarks including ModelNet40, ShapeNetPart, and S3DIS.

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/30491025/143855852-3d7888ed-2cfc-416c-9ec8-57621edeaa34.png" width="800"/>
</div>

<!-- [PAPER_TITLE: Dynamic Graph CNN for Learning on Point Clouds] -->
<!-- [PAPER_URL: https://arxiv.org/abs/1801.07829] -->
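
A compact PyTorch sketch of an EdgeConv layer follows, using the [x_i, x_j - x_i] edge feature and recomputing the k-NN graph from the current features, which is what makes the graph dynamic. This is a simplified illustration, not the backbone configuration used in this repo.

```python
# Minimal EdgeConv sketch (single layer, no batching) for illustration only.
import torch
import torch.nn as nn


class EdgeConv(nn.Module):
    def __init__(self, in_dim, out_dim, k=16):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU())

    def forward(self, x):                      # x: (N, C) point features
        # pairwise distances in the *current* feature space
        dist = torch.cdist(x, x)               # (N, N)
        idx = dist.topk(self.k + 1, largest=False).indices[:, 1:]   # drop self
        neighbors = x[idx]                     # (N, k, C)
        center = x.unsqueeze(1).expand_as(neighbors)
        edge = torch.cat([center, neighbors - center], dim=-1)      # (N, k, 2C)
        return self.mlp(edge).max(dim=1).values                     # (N, out_dim)


feats = torch.randn(1024, 3)                   # raw xyz as first-layer input
print(EdgeConv(3, 64)(feats).shape)            # torch.Size([1024, 64])
```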

## Introduction

<!-- [ALGORITHM] -->
15 changes: 15 additions & 0 deletions configs/dynamic_voxelization/README.md
@@ -1,5 +1,20 @@
# Dynamic Voxelization

## Abstract

<!-- [ABSTRACT] -->

Recent work on 3D object detection advocates point cloud voxelization in bird's-eye view, where objects preserve their physical dimensions and are naturally separable. When represented in this view, however, point clouds are sparse and have highly variable point density, which may cause detectors difficulty in detecting distant or small objects (pedestrians, traffic signs, etc.). On the other hand, the perspective view provides dense observations, which could allow more favorable feature encoding for such cases. In this paper, we aim to synergize the bird's-eye view and the perspective view and propose a novel end-to-end multi-view fusion (MVF) algorithm, which can effectively learn to utilize the complementary information from both. Specifically, we introduce dynamic voxelization, which has four merits compared to existing voxelization methods: i) it removes the need to pre-allocate a tensor with a fixed size; ii) it overcomes the information loss due to stochastic point/voxel dropout; iii) it yields deterministic voxel embeddings and more stable detection outcomes; and iv) it establishes a bi-directional relationship between points and voxels, which potentially lays a natural foundation for cross-view feature fusion. By employing dynamic voxelization, the proposed feature fusion architecture enables each point to learn to fuse context information from different views. MVF operates on points and can be naturally extended to other approaches using LiDAR point clouds. We evaluate our MVF model extensively on the newly released Waymo Open Dataset and on the KITTI dataset, and demonstrate that it significantly improves detection accuracy over the comparable single-view PointPillars baseline.

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/30491025/143856017-98b77ecb-7c13-4164-9c1d-e3011a7645e6.png" width="600"/>
</div>

<!-- [PAPER_TITLE: End-to-End Multi-View Fusion for 3D Object Detection in LiDAR Point Clouds] -->
<!-- [PAPER_URL: https://arxiv.org/abs/1910.06528] -->
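
The core of dynamic voxelization is that every point is kept and mapped to its voxel, with no fixed-size point or voxel buffers. A minimal NumPy sketch of that mapping is shown below; the function name and feature reduction (mean) are assumptions for illustration, not the CUDA ops used in this repo.

```python
# Dynamic voxelization sketch: deterministic point->voxel assignment with no
# pre-allocated buffers and no dropped points.
import numpy as np


def dynamic_voxelize(points, voxel_size, pc_range_min):
    """Return occupied voxel coords, per-voxel mean features and the
    point->voxel mapping (the bi-directional relationship)."""
    coords = np.floor((points[:, :3] - pc_range_min) / voxel_size).astype(np.int64)
    voxels, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)                        # point -> voxel row
    feats = np.zeros((len(voxels), points.shape[1]))
    np.add.at(feats, inverse, points)                    # scatter-sum
    counts = np.bincount(inverse, minlength=len(voxels))
    return voxels, feats / counts[:, None], inverse      # mean per voxel


pts = np.random.uniform(0, 10, size=(5000, 4))           # x, y, z, intensity
vox, vox_feats, point2voxel = dynamic_voxelize(pts, voxel_size=0.5,
                                               pc_range_min=np.zeros(3))
print(vox.shape, vox_feats.shape)
```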

## Introduction

<!-- [ALGORITHM] -->
15 changes: 15 additions & 0 deletions configs/fcos3d/README.md
@@ -1,5 +1,20 @@
# FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection

## Abstract

<!-- [ABSTRACT] -->

Monocular 3D object detection is an important task for autonomous driving considering its advantage of low cost. It is much more challenging than conventional 2D cases due to its inherently ill-posed nature, which is mainly reflected in the lack of depth information. Recent progress on 2D detection offers opportunities to better solve this problem. However, it is non-trivial to make a general, adapted 2D detector work in this 3D task. In this paper, we study this problem with a practice built on a fully convolutional single-stage detector and propose a general framework, FCOS3D. Specifically, we first transform the commonly defined 7-DoF 3D targets to the image domain and decouple them into 2D and 3D attributes. Then the objects are distributed to different feature levels with consideration of their 2D scales, and are assigned only according to the projected 3D center for the training procedure. Furthermore, the center-ness is redefined with a 2D Gaussian distribution based on the 3D center to fit the 3D target formulation. All of this makes the framework simple yet effective, getting rid of any 2D detection or 2D-3D correspondence priors. Our solution achieves 1st place among all vision-only methods in the nuScenes 3D detection challenge of NeurIPS 2020.

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/30491025/143856739-93b7c4ff-e116-4824-8cc3-8cf1a433a84c.png" width="800"/>
</div>

<!-- [PAPER_TITLE: FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection] -->
<!-- [PAPER_URL: https://arxiv.org/abs/2104.10956] -->
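
The redefined center-ness can be illustrated with a tiny sketch: a 2D Gaussian around the projected 3D center, evaluated at each pixel. The exact functional form and the value of `alpha` here are assumptions for illustration, not the constants used in this config.

```python
# Gaussian center-ness sketch around the projected 3D center.
import numpy as np


def gaussian_centerness(pixel_xy, projected_center_xy, stride, alpha=2.5):
    """Soft target in (0, 1]: 1 exactly at the projected 3D center, decaying
    with the stride-normalized distance to it."""
    delta = (pixel_xy - projected_center_xy) / stride
    return np.exp(-alpha * np.sum(delta ** 2, axis=-1))


# all pixel centers on one toy FPN level (4x4 grid, stride 8)
ys, xs = np.mgrid[0:4, 0:4]
pixels = np.stack([xs, ys], axis=-1).reshape(-1, 2) * 8 + 4
print(gaussian_centerness(pixels, projected_center_xy=np.array([12., 20.]),
                          stride=8).round(3))
```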

## Introduction

<!-- [ALGORITHM] -->
16 changes: 16 additions & 0 deletions configs/free_anchor/README.md
@@ -1,5 +1,21 @@
# FreeAnchor for 3D Object Detection

## Abstract

<!-- [ABSTRACT] -->

Modern CNN-based object detectors assign anchors to ground-truth objects under the restriction of object-anchor Intersection-over-Union (IoU). In this study, we propose a learning-to-match approach that breaks the IoU restriction, allowing objects to match anchors in a flexible manner. Our approach, referred to as FreeAnchor, updates hand-crafted anchor assignment to "free" anchor matching by formulating detector training as a maximum likelihood estimation (MLE) procedure. FreeAnchor aims to learn features that best explain a class of objects in terms of both classification and localization. FreeAnchor is implemented by optimizing a detection-customized likelihood and can be fused with CNN-based detectors in a plug-and-play manner. Experiments on COCO demonstrate that FreeAnchor consistently outperforms its counterparts by significant margins.

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/36950400/143866685-e3ac08bb-cd0c-4ada-ba8a-18e03cccdd0f.png" width="600"/>
</div>

<!-- [PAPER_TITLE: FreeAnchor: Learning to Match Anchors for Visual Object Detection] -->
<!-- [PAPER_URL: https://arxiv.org/abs/1909.02466.pdf] -->
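
The "learning-to-match" objective can be sketched as maximizing, for each object, a saturated mean-max of the joint classification-localization likelihoods over a bag of candidate anchors. The snippet below is a simplified reading of that idea, not the training loss used in this repo, and all names are hypothetical.

```python
# Simplified positive-bag likelihood sketch for the learning-to-match idea.
import torch


def mean_max(x, eps=1e-12):
    """Smooth selection: close to the mean early in training, close to the max
    once one anchor in the bag clearly dominates."""
    weights = 1.0 / (1.0 - x).clamp(min=eps)
    return (weights * x).sum(dim=-1) / weights.sum(dim=-1)


def positive_bag_loss(cls_prob, loc_prob):
    """cls_prob, loc_prob: (num_objects, bag_size) likelihoods per anchor."""
    bag_prob = mean_max(cls_prob * loc_prob)          # (num_objects,)
    return -torch.log(bag_prob.clamp(min=1e-12)).mean()


cls = torch.rand(5, 50)     # toy: 5 objects, 50 candidate anchors each
loc = torch.rand(5, 50)
print(positive_bag_loss(cls, loc))
```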

## Introduction

<!-- [ALGORITHM] -->
15 changes: 15 additions & 0 deletions configs/groupfree3d/README.md
@@ -1,5 +1,20 @@
# Group-Free 3D Object Detection via Transformers

## Abstract

<!-- [ABSTRACT] -->

Recently, directly detecting 3D objects from 3D point clouds has received increasing attention. To extract an object representation from an irregular point cloud, existing methods usually take a point grouping step to assign the points to an object candidate, so that a PointNet-like network can be used to derive object features from the grouped points. However, the inaccurate point assignments caused by the hand-crafted grouping scheme decrease the performance of 3D object detection. In this paper, we present a simple yet effective method for directly detecting 3D objects from the 3D point cloud. Instead of grouping local points for each object candidate, our method computes the feature of an object from all the points in the point cloud with the help of an attention mechanism in Transformers, where the contribution of each point is automatically learned during network training. With an improved attention stacking scheme, our method fuses object features from different stages and generates more accurate object detection results. With few bells and whistles, the proposed method achieves state-of-the-art 3D object detection performance on two widely used benchmarks, ScanNet V2 and SUN RGB-D.

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/36950400/143868101-09787c2a-9e0b-4013-8800-b4e315d535f0.png" width="800"/>
</div>

<!-- [PAPER_TITLE: Group-Free 3D Object Detection via Transformers] -->
<!-- [PAPER_URL: https://arxiv.org/abs/2104.00678] -->
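
The group-free idea, where each object candidate attends to all points rather than a hand-crafted local group, can be sketched with plain cross-attention. This is only a minimal illustration of the aggregation step, not the model defined by this config.

```python
# Cross-attention sketch: object candidates attend to all point features, so
# the per-point contribution is learned rather than defined by grouping.
import torch
import torch.nn as nn


class CandidateToPointsAttention(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, candidates, point_feats):
        # candidates: (B, K, C) object queries; point_feats: (B, N, C)
        out, weights = self.attn(candidates, point_feats, point_feats)
        return out, weights        # weights: learned per-point contribution


B, K, N, C = 2, 64, 1024, 128
layer = CandidateToPointsAttention(C)
obj_feats, w = layer(torch.randn(B, K, C), torch.randn(B, N, C))
print(obj_feats.shape, w.shape)    # (2, 64, 128) (2, 64, 1024)
```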

## Introduction

<!-- [ALGORITHM] -->
15 changes: 15 additions & 0 deletions configs/h3dnet/README.md
@@ -1,5 +1,20 @@
# H3DNet: 3D Object Detection Using Hybrid Geometric Primitives

## Abstract

<!-- [ABSTRACT] -->

We introduce H3DNet, which takes a colorless 3D point cloud as input and outputs a collection of oriented object bounding boxes (BBs) and their semantic labels. The critical idea of H3DNet is to predict a hybrid set of geometric primitives, i.e., BB centers, BB face centers, and BB edge centers. We show how to convert the predicted geometric primitives into object proposals by defining a distance function between an object and the geometric primitives. This distance function enables continuous optimization of object proposals, and its local minima provide high-fidelity object proposals. H3DNet then utilizes a matching and refinement module to classify object proposals into detected objects and fine-tune the geometric parameters of the detected objects. The hybrid set of geometric primitives not only provides more accurate signals for object detection than a single type of geometric primitive, but also provides an overcomplete set of constraints on the resulting 3D layout. Therefore, H3DNet can tolerate outliers in the predicted geometric primitives. Our model achieves state-of-the-art 3D detection results on two large datasets with real 3D scans, ScanNet and SUN RGB-D.

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/36950400/143868884-26f7fc63-93fd-48cb-a469-e2f55fda5550.png" width="800"/>
</div>

<!-- [PAPER_TITLE: H3DNet: 3D Object Detection Using Hybrid Geometric Primitives] -->
<!-- [PAPER_URL: https://arxiv.org/abs/2006.05682] -->
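
A simplified sketch of the distance-function idea: score a candidate box by how close its own geometric primitives are to the nearest predicted primitives, so proposals can be refined by descending that score. The sketch below uses only the center and face centers of an axis-aligned box as an assumption; the real model handles oriented boxes and learned weights.

```python
# Toy proposal-to-primitives distance for an axis-aligned box.
import numpy as np


def box_primitives(center, size):
    """Center plus the six face centers of an axis-aligned box."""
    offsets = np.zeros((7, 3))
    for axis in range(3):
        offsets[1 + 2 * axis, axis] = size[axis] / 2
        offsets[2 + 2 * axis, axis] = -size[axis] / 2
    return center + offsets


def proposal_distance(center, size, predicted_primitives):
    """Sum of distances from each box primitive to its nearest prediction;
    lower values indicate better-supported proposals."""
    prims = box_primitives(center, size)                       # (7, 3)
    d = np.linalg.norm(prims[:, None] - predicted_primitives[None], axis=-1)
    return d.min(axis=1).sum()


preds = np.random.rand(50, 3) * 4                              # toy primitives
print(proposal_distance(np.array([2., 2., 1.]), np.array([1., 2., 0.5]), preds))
```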

## Introduction

<!-- [ALGORITHM] -->
15 changes: 15 additions & 0 deletions configs/imvotenet/README.md
@@ -1,5 +1,20 @@
# ImVoteNet: Boosting 3D Object Detection in Point Clouds with Image Votes

## Abstract

<!-- [ABSTRACT] -->

3D object detection has seen quick progress thanks to advances in deep learning on point clouds. A few recent works have even shown state-of-the-art performance with point cloud input alone (e.g., VoteNet). However, point cloud data have inherent limitations: they are sparse, lack color information, and often suffer from sensor noise. Images, on the other hand, have high resolution and rich texture, and can thus complement the 3D geometry provided by point clouds. Yet how to effectively use image information to assist point-cloud-based detection is still an open question. In this work, we build on top of VoteNet and propose a 3D detection architecture called ImVoteNet, specialized for RGB-D scenes. ImVoteNet is based on fusing 2D votes in images and 3D votes in point clouds. Compared to prior work on multi-modal detection, we explicitly extract both geometric and semantic features from the 2D images. We leverage camera parameters to lift these features to 3D. To improve the synergy of 2D-3D feature fusion, we also propose a multi-tower training scheme. We validate our model on the challenging SUN RGB-D dataset, advancing state-of-the-art results by 5.7 mAP. We also provide rich ablation studies to analyze the contribution of each design choice.

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/36950400/143869878-a2ae7f43-55c3-4b95-af09-8f97dfd975f4.png" width="800"/>
</div>

<!-- [PAPER_TITLE: ImVoteNet: Boosting 3D Object Detection in Point Clouds with Image Votes] -->
<!-- [PAPER_URL: https://arxiv.org/abs/2001.10692] -->
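
The lifting of image cues with camera parameters can be illustrated with simple pinhole geometry: a 2D vote at the pixel where a 3D seed projects is back-projected at the seed's depth to give a 3D offset. This is a rough geometric sketch under toy intrinsics, not the exact ImVoteNet feature definition used in this repo.

```python
# Toy lifting of a 2D image vote into a pseudo 3D vote at the seed's depth.
import numpy as np

K = np.array([[500., 0., 320.],       # toy pinhole intrinsics
              [0., 500., 240.],
              [0., 0., 1.]])


def lift_2d_vote(seed_xyz, vote_uv):
    """Back-project the 2D-voted pixel at the seed's depth, giving a 3D offset
    toward the (approximate) object center that the image vote points to."""
    z = seed_xyz[2]
    u, v = (K @ seed_xyz / z)[:2]      # pixel where the seed projects
    target = np.linalg.inv(K) @ np.array([u + vote_uv[0], v + vote_uv[1], 1.]) * z
    return target - seed_xyz           # pseudo 3D vote (geometric cue)


print(lift_2d_vote(np.array([0.5, 0.2, 3.0]), vote_uv=np.array([15., -8.])))
```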

## Introduction

<!-- [ALGORITHM] -->
15 changes: 15 additions & 0 deletions configs/imvoxelnet/README.md
@@ -1,5 +1,20 @@
# ImVoxelNet: Image to Voxels Projection for Monocular and Multi-View General-Purpose 3D Object Detection

## Abstract

<!-- [ABSTRACT] -->

In this paper, we introduce the task of multi-view RGB-based 3D object detection as an end-to-end optimization problem. To address this problem, we propose ImVoxelNet, a novel fully convolutional method for 3D object detection based on posed monocular or multi-view RGB images. The number of monocular images in each multi-view input can vary during training and inference; in fact, this number may be different for each multi-view input. ImVoxelNet successfully handles both indoor and outdoor scenes, which makes it general-purpose. Specifically, it achieves state-of-the-art results in car detection on the KITTI (monocular) and nuScenes (multi-view) benchmarks among all methods that accept RGB images. Moreover, it surpasses existing RGB-based 3D object detection methods on the SUN RGB-D dataset. On ScanNet, ImVoxelNet sets a new benchmark for multi-view 3D object detection.

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/36950400/143871445-38a55168-b8cd-4520-8ed6-f5c8c8ea304a.png" width="800"/>
</div>

<!-- [PAPER_TITLE: ImVoxelNet: Image to Voxels Projection for Monocular and Multi-View General-Purpose 3D Object Detection] -->
<!-- [PAPER_URL: https://arxiv.org/abs/2106.01178] -->
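
The image-to-voxels projection at the heart of the method can be sketched as follows: each voxel center is projected into a view with the camera matrix and the 2D feature at that pixel is written into the voxel. The sketch uses nearest-neighbour sampling and a single view for brevity; it is an illustration under toy camera parameters, not the implementation in this repo.

```python
# Toy image-to-voxels feature projection for a single posed view.
import numpy as np


def project_features_to_voxels(voxel_centers, image_feats, cam_matrix):
    """voxel_centers: (V, 3); image_feats: (H, W, C); cam_matrix: (3, 4)."""
    H, W, C = image_feats.shape
    homog = np.concatenate([voxel_centers, np.ones((len(voxel_centers), 1))], 1)
    proj = homog @ cam_matrix.T                     # (V, 3)
    z = proj[:, 2]
    uv = np.round(proj[:, :2] / z[:, None]).astype(int)
    valid = ((z > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < W)
             & (uv[:, 1] >= 0) & (uv[:, 1] < H))
    volume = np.zeros((len(voxel_centers), C))
    volume[valid] = image_feats[uv[valid, 1], uv[valid, 0]]
    return volume, valid                            # valid mask = "seen" voxels


centers = np.random.uniform(-4, 4, (1000, 3)) + np.array([0, 0, 6])
feats2d = np.random.rand(60, 80, 32)                # toy 2D feature map
K = np.array([[100., 0., 40.], [0., 100., 30.], [0., 0., 1.]])
vol, seen = project_features_to_voxels(centers, feats2d,
                                       np.hstack([K, np.zeros((3, 1))]))
print(vol.shape, int(seen.sum()))
```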

## Introduction

<!-- [ALGORITHM] -->
13 changes: 13 additions & 0 deletions configs/mvxnet/README.md
@@ -1,5 +1,18 @@
# MVX-Net: Multimodal VoxelNet for 3D Object Detection

## Abstract

Many recent works on 3D object detection have focused on designing neural network architectures that can consume point cloud data. While these approaches demonstrate encouraging performance, they are typically based on a single modality and are unable to leverage information from other modalities, such as a camera. Although a few approaches fuse data from different modalities, these methods either use a complicated pipeline to process the modalities sequentially, or perform late fusion and are unable to learn interactions between different modalities at early stages. In this work, we present PointFusion and VoxelFusion: two simple yet effective early-fusion approaches to combine the RGB and point cloud modalities, by leveraging the recently introduced VoxelNet architecture. Evaluation on the KITTI dataset demonstrates significant improvements in performance over approaches that use only point cloud data. Furthermore, the proposed method provides results competitive with the state-of-the-art multimodal algorithms, achieving a top-2 ranking in five of the six bird's-eye-view and 3D detection categories on the KITTI benchmark, by using a simple single-stage network.

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/79644370/143880819-560675ca-e7e3-4d77-8808-ea661ff8e6e6.png" width="800"/>
</div>

<!-- [PAPER_TITLE: MVX-Net: Multimodal VoxelNet for 3D Object Detection] -->
<!-- [PAPER_URL: https://arxiv.org/abs/1904.01649] -->
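
The PointFusion variant can be illustrated in a few lines: each LiDAR point is projected into the image with the calibration matrix, the image feature at that pixel is gathered, and the result is concatenated with the point feature before voxelization. The shapes and calibration matrix below are toy assumptions, not the KITTI setup or this repo's fusion layer.

```python
# Toy PointFusion sketch: gather per-point image features and concatenate.
import numpy as np


def point_fusion(points, point_feats, image_feats, lidar2img):
    """points: (N, 3); point_feats: (N, Cp); image_feats: (H, W, Ci);
    lidar2img: (3, 4) projection matrix. Returns (N, Cp + Ci) fused features."""
    H, W, Ci = image_feats.shape
    homog = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    proj = homog @ lidar2img.T
    uv = (proj[:, :2] / proj[:, 2:3]).round().astype(int)
    valid = ((proj[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < W)
             & (uv[:, 1] >= 0) & (uv[:, 1] < H))
    img_part = np.zeros((len(points), Ci))
    img_part[valid] = image_feats[uv[valid, 1], uv[valid, 0]]
    return np.concatenate([point_feats, img_part], axis=1)


pts = np.random.uniform(0.1, 20, (2000, 3))
fused = point_fusion(pts, np.random.rand(2000, 4), np.random.rand(96, 128, 16),
                     np.hstack([np.eye(3), np.zeros((3, 1))]))
print(fused.shape)                                   # (2000, 20)
```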

## Introduction

<!-- [ALGORITHM] -->