From a443ee178e81909d16c600c3fd5ca986259cb94d Mon Sep 17 00:00:00 2001
From: ChaimZhu
Date: Wed, 1 Dec 2021 19:11:47 +0800
Subject: [PATCH 1/2] add abstract for methods

---
 configs/3dssd/README.md | 15 +++++++++++++++
 configs/centerpoint/README.md | 15 +++++++++++++++
 configs/dgcnn/README.md | 15 +++++++++++++++
 configs/dynamic_voxelization/README.md | 15 +++++++++++++++
 configs/fcos3d/README.md | 15 +++++++++++++++
 configs/free_anchor/README.md | 16 ++++++++++++++++
 configs/groupfree3d/README.md | 15 +++++++++++++++
 configs/h3dnet/README.md | 15 +++++++++++++++
 configs/imvotenet/README.md | 15 +++++++++++++++
 configs/imvoxelnet/README.md | 15 +++++++++++++++
 configs/mvxnet/README.md | 13 +++++++++++++
 configs/paconv/README.md | 16 ++++++++++++++++
 configs/parta2/README.md | 15 +++++++++++++++
 configs/pgd/README.md | 15 +++++++++++++++
 configs/pointnet2/README.md | 15 +++++++++++++++
 configs/pointpillars/README.md | 15 +++++++++++++++
 configs/regnet/README.md | 15 +++++++++++++++
 configs/second/README.md | 15 +++++++++++++++
 configs/smoke/README.md | 15 +++++++++++++++
 configs/ssn/README.md | 15 +++++++++++++++
 configs/votenet/README.md | 15 +++++++++++++++
 21 files changed, 315 insertions(+)

diff --git a/configs/3dssd/README.md b/configs/3dssd/README.md
index 06caec5d7d..61d9c71afb 100644
--- a/configs/3dssd/README.md
+++ b/configs/3dssd/README.md
@@ -1,5 +1,20 @@
 # 3DSSD: Point-based 3D Single Stage Object Detector
+## Abstract
+
+
+
+Currently, there have been many kinds of voxel-based 3D single stage detectors, while point-based single stage methods are still underexplored. In this paper, we first present a lightweight and effective point-based 3D single stage object detector, named 3DSSD, achieving a good balance between accuracy and efficiency. In this paradigm, all upsampling layers and refinement stage, which are indispensable in all existing point-based methods, are abandoned to reduce the large computation cost. We novelly propose a fusion sampling strategy in downsampling process to make detection on less representative points feasible. A delicate box prediction network including a candidate generation layer, an anchor-free regression head with a 3D center-ness assignment strategy is designed to meet with our demand of accuracy and speed. Our paradigm is an elegant single stage anchor-free framework, showing great superiority to other existing methods. We evaluate 3DSSD on widely used KITTI dataset and more challenging nuScenes dataset. Our method outperforms all state-of-the-art voxel-based single stage methods by a large margin, and has comparable performance to two stage point-based methods as well, with inference speed more than 25 FPS, 2x faster than former state-of-the-art point-based methods.
+
+
+
+
+
+
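
The fusion sampling mentioned in the abstract mixes farthest point sampling over coordinates (D-FPS) with farthest point sampling over learned features (F-FPS). The following NumPy sketch illustrates the idea only; the half-and-half split, the pure feature-space metric, and the function names are simplifying assumptions rather than the 3DSSD implementation.

```python
import numpy as np

def farthest_point_sampling(feats, num_samples):
    """Greedy farthest point sampling under Euclidean distance in `feats` space."""
    n = feats.shape[0]
    selected = np.zeros(num_samples, dtype=np.int64)
    min_dist = np.full(n, np.inf)  # distance to the closest already-selected point
    selected[0] = 0
    for i in range(1, num_samples):
        diff = feats - feats[selected[i - 1]]
        min_dist = np.minimum(min_dist, np.einsum('ij,ij->i', diff, diff))
        selected[i] = np.argmax(min_dist)
    return selected

def fusion_sampling(xyz, features, num_samples):
    """Half of the samples come from D-FPS (xyz space), half from F-FPS (feature space)."""
    k = num_samples // 2
    d_idx = farthest_point_sampling(xyz, k)
    f_idx = farthest_point_sampling(features, num_samples - k)
    return np.concatenate([d_idx, f_idx])  # may contain duplicates in this toy version

xyz = np.random.rand(1024, 3)
features = np.random.rand(1024, 32)
print(fusion_sampling(xyz, features, 64).shape)  # (64,)
```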
+
+
+
+
 ## Introduction
diff --git a/configs/centerpoint/README.md b/configs/centerpoint/README.md
index 69d4cdf90a..9b545a7428 100644
--- a/configs/centerpoint/README.md
+++ b/configs/centerpoint/README.md
@@ -1,5 +1,20 @@
 # Center-based 3D Object Detection and Tracking
+## Abstract
+
+
+
+Three-dimensional objects are commonly represented as 3D boxes in a point-cloud. This representation mimics the well-studied image-based 2D bounding-box detection but comes with additional challenges. Objects in a 3D world do not follow any particular orientation, and box-based detectors have difficulties enumerating all orientations or fitting an axis-aligned bounding box to rotated objects. In this paper, we instead propose to represent, detect, and track 3D objects as points. Our framework, CenterPoint, first detects centers of objects using a keypoint detector and regresses to other attributes, including 3D size, 3D orientation, and velocity. In a second stage, it refines these estimates using additional point features on the object. In CenterPoint, 3D object tracking simplifies to greedy closest-point matching. The resulting detection and tracking algorithm is simple, efficient, and effective. CenterPoint achieved state-of-the-art performance on the nuScenes benchmark for both 3D detection and tracking, with 65.5 NDS and 63.8 AMOTA for a single model. On the Waymo Open Dataset, CenterPoint outperforms all previous single model method by a large margin and ranks first among all Lidar-only submissions.
+
+
+
+
+
+
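
Because objects are represented as points, the tracking step reduces to greedy closest-point matching between previous object centers and current detections. A minimal sketch of that matching, with an assumed distance gate and without the velocity compensation CenterPoint applies, could look like this:

```python
import numpy as np

def greedy_closest_point_matching(prev_centers, cur_centers, max_dist=2.0):
    """Greedily match current detections to previous tracks by center distance."""
    matches = []
    dists = np.linalg.norm(prev_centers[:, None, :] - cur_centers[None, :, :], axis=-1)
    for _ in range(min(len(prev_centers), len(cur_centers))):
        i, j = np.unravel_index(np.argmin(dists), dists.shape)
        if dists[i, j] > max_dist:
            break                 # remaining pairs are too far apart to be the same object
        matches.append((i, j))
        dists[i, :] = np.inf      # mark the track as used
        dists[:, j] = np.inf      # mark the detection as used
    return matches

prev_centers = np.array([[0.0, 0.0], [10.0, 0.0]])
cur_centers = np.array([[0.5, 0.1], [10.2, -0.3], [50.0, 50.0]])
print(greedy_closest_point_matching(prev_centers, cur_centers))  # two nearby pairs matched
```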
+
+
+
+
 ## Introduction
diff --git a/configs/dgcnn/README.md b/configs/dgcnn/README.md
index fa31e43f34..5b4bddca74 100644
--- a/configs/dgcnn/README.md
+++ b/configs/dgcnn/README.md
@@ -1,5 +1,20 @@
 # Dynamic Graph CNN for Learning on Point Clouds
+## Abstract
+
+
+
+Point clouds provide a flexible geometric representation suitable for countless applications in computer graphics; they also comprise the raw output of most 3D data acquisition devices. While hand-designed features on point clouds have long been proposed in graphics and vision, however, the recent overwhelming success of convolutional neural networks (CNNs) for image analysis suggests the value of adapting insight from CNN to the point cloud world. Point clouds inherently lack topological information so designing a model to recover topology can enrich the representation power of point clouds. To this end, we propose a new neural network module dubbed EdgeConv suitable for CNN-based high-level tasks on point clouds including classification and segmentation. EdgeConv acts on graphs dynamically computed in each layer of the network. It is differentiable and can be plugged into existing architectures. Compared to existing modules operating in extrinsic space or treating each point independently, EdgeConv has several appealing properties: It incorporates local neighborhood information; it can be stacked applied to learn global shape properties; and in multi-layer systems affinity in feature space captures semantic characteristics over potentially long distances in the original embedding. We show the performance of our model on standard benchmarks including ModelNet40, ShapeNetPart, and S3DIS.
+
+
+
+
+
+
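
An EdgeConv layer builds a kNN graph from the current features, forms edge features [x_i, x_j - x_i], transforms them with a shared MLP and max-pools over neighbors. The toy sketch below replaces the learned MLP with a fixed random projection, so it only illustrates the data flow, not the trained operator:

```python
import numpy as np

def edge_conv(points, k=4, out_channels=16, rng=None):
    """One EdgeConv-style layer on an (n, c) array of point features."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, c = points.shape
    # kNN graph recomputed from the current features (the "dynamic" part of DGCNN)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    knn = np.argsort(dists, axis=1)[:, 1:k + 1]           # (n, k) neighbor indices
    centers = np.repeat(points[:, None, :], k, axis=1)    # (n, k, c)
    edges = np.concatenate([centers, points[knn] - centers], axis=-1)  # (n, k, 2c)
    weight = rng.standard_normal((2 * c, out_channels))   # stand-in for the shared MLP
    h = np.maximum(edges @ weight, 0.0)                   # ReLU(MLP(edge feature))
    return h.max(axis=1)                                  # max over neighbors -> (n, out)

out = edge_conv(np.random.rand(100, 3))
print(out.shape)  # (100, 16)
```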
+
+
+
+
 ## Introduction
diff --git a/configs/dynamic_voxelization/README.md b/configs/dynamic_voxelization/README.md
index eab62d48f0..e41b438e43 100644
--- a/configs/dynamic_voxelization/README.md
+++ b/configs/dynamic_voxelization/README.md
@@ -1,5 +1,20 @@
 # Dynamic Voxelization
+## Abstract
+
+
+
+Recent work on 3D object detection advocates point cloud voxelization in birds-eye view, where objects preserve their physical dimensions and are naturally separable. When represented in this view, however, point clouds are sparse and have highly variable point density, which may cause detectors difficulties in detecting distant or small objects (pedestrians, traffic signs, etc.). On the other hand, perspective view provides dense observations, which could allow more favorable feature encoding for such cases. In this paper, we aim to synergize the birds-eye view and the perspective view and propose a novel end-to-end multi-view fusion (MVF) algorithm, which can effectively learn to utilize the complementary information from both. Specifically, we introduce dynamic voxelization, which has four merits compared to existing voxelization methods, i) removing the need of pre-allocating a tensor with fixed size; ii) overcoming the information loss due to stochastic point/voxel dropout; iii) yielding deterministic voxel embeddings and more stable detection outcomes; iv) establishing the bi-directional relationship between points and voxels, which potentially lays a natural foundation for cross-view feature fusion. By employing dynamic voxelization, the proposed feature fusion architecture enables each point to learn to fuse context information from different views. MVF operates on points and can be naturally extended to other approaches using LiDAR point clouds. We evaluate our MVF model extensively on the newly released Waymo Open Dataset and on the KITTI dataset and demonstrate that it significantly improves detection accuracy over the comparable single-view PointPillars baseline.
+
+
+
+
+
+
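
Merits i) and iv) above follow from representing the voxelization as a point-to-voxel index map instead of a pre-allocated dense buffer. A minimal sketch of such a mapping (the voxel size and array layout here are arbitrary choices, not the MVF configuration):

```python
import numpy as np

def dynamic_voxelize(points, voxel_size=(0.2, 0.2, 4.0)):
    """Assign every point to a voxel without pre-allocating a fixed-size buffer.

    Returns the integer voxel coordinate of each point and a dict mapping each
    occupied voxel to the indices of the points it contains (no point is dropped).
    """
    coords = np.floor(points[:, :3] / np.asarray(voxel_size)).astype(np.int64)
    voxel_to_points = {}
    for idx, c in enumerate(map(tuple, coords)):
        voxel_to_points.setdefault(c, []).append(idx)
    return coords, voxel_to_points

points = np.random.rand(1000, 4) * np.array([40.0, 40.0, 4.0, 1.0])
coords, voxel_to_points = dynamic_voxelize(points)
print(len(voxel_to_points), "occupied voxels for", len(points), "points")
```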
+
+
+
+
 ## Introduction
diff --git a/configs/fcos3d/README.md b/configs/fcos3d/README.md
index 5e22e27606..e95a34c76f 100644
--- a/configs/fcos3d/README.md
+++ b/configs/fcos3d/README.md
@@ -1,5 +1,20 @@
 # FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection
+## Abstract
+
+
+
+Monocular 3D object detection is an important task for autonomous driving considering its advantage of low cost. It is much more challenging than conventional 2D cases due to its inherent ill-posed property, which is mainly reflected in the lack of depth information. Recent progress on 2D detection offers opportunities to better solving this problem. However, it is non-trivial to make a general adapted 2D detector work in this 3D task. In this paper, we study this problem with a practice built on a fully convolutional single-stage detector and propose a general framework FCOS3D. Specifically, we first transform the commonly defined 7-DoF 3D targets to the image domain and decouple them as 2D and 3D attributes. Then the objects are distributed to different feature levels with consideration of their 2D scales and assigned only according to the projected 3D-center for the training procedure. Furthermore, the center-ness is redefined with a 2D Gaussian distribution based on the 3D-center to fit the 3D target formulation. All of these make this framework simple yet effective, getting rid of any 2D detection or 2D-3D correspondence priors. Our solution achieves 1st place out of all the vision-only methods in the nuScenes 3D detection challenge of NeurIPS 2020.
+
+
+
+
+
+
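
The redefined center-ness is a 2D Gaussian around the projected 3D center. The sketch below only shows the shape of that target; the intrinsic matrix, the value of `alpha`, and the use of raw pixel distances (the paper computes distances on the downsampled feature map) are assumptions for illustration:

```python
import numpy as np

def project_to_image(center_3d, cam_intrinsic):
    """Project a 3D box center given in camera coordinates onto the image plane."""
    uvw = cam_intrinsic @ np.asarray(center_3d, dtype=float)
    return uvw[:2] / uvw[2]

def centerness_3d(pixel_xy, projected_center_xy, alpha=2.5):
    """Center-ness target: a 2D Gaussian around the projected 3D center."""
    d2 = np.sum((np.asarray(pixel_xy, dtype=float) - projected_center_xy) ** 2)
    return float(np.exp(-alpha * d2))

K = np.array([[1260.0, 0.0, 800.0],
              [0.0, 1260.0, 450.0],
              [0.0, 0.0, 1.0]])
center_2d = project_to_image([2.0, 1.0, 30.0], K)
print(center_2d, centerness_3d(center_2d + 0.3, center_2d))
```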
+
+
+
+
 ## Introduction
diff --git a/configs/free_anchor/README.md b/configs/free_anchor/README.md
index e88c024871..42110465d7 100644
--- a/configs/free_anchor/README.md
+++ b/configs/free_anchor/README.md
@@ -1,5 +1,21 @@
 # FreeAnchor for 3D Object Detection
+## Abstract
+
+
+
+Modern CNN-based object detectors assign anchors for ground-truth objects under the restriction of object-anchor Intersection-over-Union (IoU). In this study, we propose a learning-to-match approach to break IoU restriction, allowing objects to match anchors in a flexible manner. Our approach, referred to as FreeAnchor, updates hand-crafted anchor assignment to "free" anchor matching by formulating detector training as a maximum likelihood estimation (MLE) procedure. FreeAnchor targets at learning features which best explain a class of objects in terms of both classification and localization. FreeAnchor is implemented by optimizing detection customized likelihood and can be fused with CNN-based detectors in a plug-and-play manner. Experiments on COCO demonstrate that FreeAnchor consistently outperforms the counterparts with significant margins.
+
+
+
+
+
+
+
+
+
+
 ## Introduction
diff --git a/configs/groupfree3d/README.md b/configs/groupfree3d/README.md
index fede6af7f4..f8d734738f 100644
--- a/configs/groupfree3d/README.md
+++ b/configs/groupfree3d/README.md
@@ -1,5 +1,20 @@
 # Group-Free 3D Object Detection via Transformers
+## Abstract
+
+
+
+Recently, directly detecting 3D objects from 3D point clouds has received increasing attention. To extract object representation from an irregular point cloud, existing methods usually take a point grouping step to assign the points to an object candidate so that a PointNet-like network could be used to derive object features from the grouped points. However, the inaccurate point assignments caused by the hand-crafted grouping scheme decrease the performance of 3D object detection. In this paper, we present a simple yet effective method for directly detecting 3D objects from the 3D point cloud. Instead of grouping local points to each object candidate, our method computes the feature of an object from all the points in the point cloud with the help of an attention mechanism in the Transformers, where the contribution of each point is automatically learned in the network training. With an improved attention stacking scheme, our method fuses object features in different stages and generates more accurate object detection results. With few bells and whistles, the proposed method achieves state-of-the-art 3D object detection performance on two widely used benchmarks, ScanNet V2 and SUN RGB-D.
+
+
+
+
+
+
+
+
+
+
 ## Introduction
diff --git a/configs/h3dnet/README.md b/configs/h3dnet/README.md
index 3eceaf7fb3..cad4a63ce0 100644
--- a/configs/h3dnet/README.md
+++ b/configs/h3dnet/README.md
@@ -1,5 +1,20 @@
 # H3DNet: 3D Object Detection Using Hybrid Geometric Primitives
+## Abstract
+
+
+
+We introduce H3DNet, which takes a colorless 3D point cloud as input and outputs a collection of oriented object bounding boxes (or BB) and their semantic labels. The critical idea of H3DNet is to predict a hybrid set of geometric primitives, i.e., BB centers, BB face centers, and BB edge centers. We show how to convert the predicted geometric primitives into object proposals by defining a distance function between an object and the geometric primitives. This distance function enables continuous optimization of object proposals, and its local minimums provide high-fidelity object proposals. H3DNet then utilizes a matching and refinement module to classify object proposals into detected objects and fine-tune the geometric parameters of the detected objects. The hybrid set of geometric primitives not only provides more accurate signals for object detection than using a single type of geometric primitives, but it also provides an overcomplete set of constraints on the resulting 3D layout. Therefore, H3DNet can tolerate outliers in predicted geometric primitives. Our model achieves state-of-the-art 3D detection results on two large datasets with real 3D scans, ScanNet and SUN RGB-D.
+
+
+
+
+
+
+
+
+
+
 ## Introduction
diff --git a/configs/imvotenet/README.md b/configs/imvotenet/README.md
index f1e09ad802..038e8e9cd6 100644
--- a/configs/imvotenet/README.md
+++ b/configs/imvotenet/README.md
@@ -1,5 +1,20 @@
 # ImVoteNet: Boosting 3D Object Detection in Point Clouds with Image Votes
+## Abstract
+
+
+
+3D object detection has seen quick progress thanks to advances in deep learning on point clouds. A few recent works have even shown state-of-the-art performance with just point clouds input (e.g. VOTENET). However, point cloud data have inherent limitations. They are sparse, lack color information and often suffer from sensor noise. Images, on the other hand, have high resolution and rich texture. Thus they can complement the 3D geometry provided by point clouds. Yet how to effectively use image information to assist point cloud based detection is still an open question. In this work, we build on top of VOTENET and propose a 3D detection architecture called IMVOTENET specialized for RGB-D scenes. IMVOTENET is based on fusing 2D votes in images and 3D votes in point clouds. Compared to prior work on multi-modal detection, we explicitly extract both geometric and semantic features from the 2D images. We leverage camera parameters to lift these features to 3D. To improve the synergy of 2D-3D feature fusion, we also propose a multi-tower training scheme. We validate our model on the challenging SUN RGB-D dataset, advancing state-of-the-art results by 5.7 mAP. We also provide rich ablation studies to analyze the contribution of each design choice.
+
+
+
+
+
+
+
+
+
+
 ## Introduction
diff --git a/configs/imvoxelnet/README.md b/configs/imvoxelnet/README.md
index 325570fe5b..87999dc77c 100644
--- a/configs/imvoxelnet/README.md
+++ b/configs/imvoxelnet/README.md
@@ -1,5 +1,20 @@
 # ImVoxelNet: Image to Voxels Projection for Monocular and Multi-View General-Purpose 3D Object Detection
+## Abstract
+
+
+
+In this paper, we introduce the task of multi-view RGBbased 3D object detection as an end-to-end optimization problem. To address this problem, we propose ImVoxelNet, a novel fully convolutional method of 3D object detection based on posed monocular or multi-view RGB images. The number of monocular images in each multiview input can variate during training and inference; actually, this number might be unique for each multi-view input. ImVoxelNet successfully handles both indoor and outdoor scenes, which makes it general-purpose. Specifically, it achieves state-of-the-art results in car detection on KITTI (monocular) and nuScenes (multi-view) benchmarks among all methods that accept RGB images. Moreover, it surpasses existing RGB-based 3D object detection methods on the SUN RGB-D dataset. On ScanNet, ImVoxelNet sets a new benchmark for multi-view 3D object detection.
+
+
+
+
+
+
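
The core operation is lifting image features into a 3D voxel volume by projecting each voxel center into the image. A brute-force single-view sketch of that back-projection, where the intrinsics, the camera-frame volume, and nearest-neighbor sampling are illustrative assumptions:

```python
import numpy as np

def backproject_to_voxels(image_feats, cam_intrinsic, voxel_origin, voxel_size, grid):
    """Fill a 3D voxel volume with image features by projecting every voxel center
    into the image and sampling the feature at the nearest pixel."""
    h, w, c = image_feats.shape
    volume = np.zeros(grid + (c,))
    for idx in np.ndindex(*grid):
        center = voxel_origin + (np.array(idx) + 0.5) * voxel_size  # voxel center in 3D
        if center[2] <= 0:
            continue                                                # behind the camera
        uvw = cam_intrinsic @ center
        u, v = int(uvw[0] / uvw[2]), int(uvw[1] / uvw[2])
        if 0 <= u < w and 0 <= v < h:
            volume[idx] = image_feats[v, u]
    return volume

K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
feats = np.random.rand(64, 64, 8)
vol = backproject_to_voxels(feats, K, np.array([-2.0, -2.0, 2.0]), 0.25, (16, 16, 16))
print(vol.shape)  # (16, 16, 16, 8)
```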
+
+
+
+
 ## Introduction
diff --git a/configs/mvxnet/README.md b/configs/mvxnet/README.md
index 6a56d5563f..da7eb35da4 100644
--- a/configs/mvxnet/README.md
+++ b/configs/mvxnet/README.md
@@ -1,5 +1,18 @@
 # MVX-Net: Multimodal VoxelNet for 3D Object Detection
+## Abstract
+
+Many recent works on 3D object detection have focused on designing neural network architectures that can consume point cloud data. While these approaches demonstrate encouraging performance, they are typically based on a single modality and are unable to leverage information from other modalities, such as a camera. Although a few approaches fuse data from different modalities, these methods either use a complicated pipeline to process the modalities sequentially, or perform late-fusion and are unable to learn interaction between different modalities at early stages. In this work, we present PointFusion and VoxelFusion: two simple yet effective early-fusion approaches to combine the RGB and point cloud modalities, by leveraging the recently introduced VoxelNet architecture. Evaluation on the KITTI dataset demonstrates significant improvements in performance over approaches which only use point cloud data. Furthermore, the proposed method provides results competitive with the state-of-the-art multimodal algorithms, achieving top-2 ranking in five of the six bird's eye view and 3D detection categories on the KITTI benchmark, by using a simple single stage network.
+
+
+
+
+
+
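
PointFusion, the simpler of the two fusion variants, projects every LiDAR point into the image and concatenates the sampled image feature with the point before voxelization. A rough sketch, assuming a 4x4 lidar-to-image projection matrix and nearest-pixel sampling (the real method samples CNN feature maps, not raw pixels):

```python
import numpy as np

def point_fusion(points_lidar, image_feats, lidar2img):
    """Gather an image feature for every LiDAR point by projecting it into the image.

    points_lidar: (N, 3); image_feats: (H, W, C); lidar2img: (4, 4) projection matrix.
    Points falling outside the image (or behind the camera) get a zero feature.
    """
    h, w, c = image_feats.shape
    homo = np.concatenate([points_lidar, np.ones((len(points_lidar), 1))], axis=1)
    proj = homo @ lidar2img.T
    valid = proj[:, 2] > 1e-3
    uv = np.zeros((len(points_lidar), 2))
    uv[valid] = proj[valid, :2] / proj[valid, 2:3]
    cols = np.clip(uv[:, 0].astype(int), 0, w - 1)
    rows = np.clip(uv[:, 1].astype(int), 0, h - 1)
    inside = valid & (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    fused = np.zeros((len(points_lidar), c))
    fused[inside] = image_feats[rows[inside], cols[inside]]
    # early fusion: concatenate the point with the sampled image feature
    return np.concatenate([points_lidar, fused], axis=1)

pts = np.random.rand(100, 3) * 20
feats = np.random.rand(224, 224, 16)
print(point_fusion(pts, feats, np.eye(4)).shape)  # (100, 19)
```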
+
+
+
+
 ## Introduction
diff --git a/configs/paconv/README.md b/configs/paconv/README.md
index 38b31337d6..6f1789005b 100644
--- a/configs/paconv/README.md
+++ b/configs/paconv/README.md
@@ -1,5 +1,21 @@
 # PAConv: Position Adaptive Convolution with Dynamic Kernel Assembling on Point Clouds
+## Abstract
+
+
+
+We introduce Position Adaptive Convolution (PAConv), a generic convolution operation for 3D point cloud processing. The key of PAConv is to construct the convolution kernel by dynamically assembling basic weight matrices stored in Weight Bank, where the coefficients of these weight matrices are self-adaptively learned from point positions through ScoreNet. In this way, the kernel is built in a data-driven manner, endowing PAConv with more flexibility than 2D convolutions to better handle the irregular and unordered point cloud data. Besides, the complexity of the learning process is reduced by combining weight matrices instead of brutally predicting kernels from point positions.
+Furthermore, different from the existing point convolution operators whose network architectures are often heavily engineered, we integrate our PAConv into classical MLP-based point cloud pipelines without changing network configurations. Even built on simple networks, our method still approaches or even surpasses the state-of-the-art models, and significantly improves baseline performance on both classification and segmentation tasks, yet with decent efficiency. Thorough ablation studies and visualizations are provided to understand PAConv.
+
+
+
+
+
+
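
The kernel assembly amounts to a score-weighted sum over a bank of weight matrices, with scores predicted from relative point positions. A compact sketch; the random `score_net` and the plain normalization are stand-ins for the learned ScoreNet, and the bank size is arbitrary:

```python
import numpy as np

def paconv_kernel(rel_pos, weight_bank, score_net):
    """Assemble a position-adaptive kernel as a score-weighted sum of weight matrices.

    rel_pos: (K, 3) relative positions of K neighbors; weight_bank: (M, Cin, Cout);
    score_net: callable mapping (K, 3) -> (K, M) non-negative coefficients.
    """
    scores = score_net(rel_pos)                              # (K, M)
    scores = scores / scores.sum(axis=1, keepdims=True)      # normalize per neighbor
    # kernel for each neighbor = sum_m score_m * W_m
    return np.einsum('km,mio->kio', scores, weight_bank)     # (K, Cin, Cout)

rng = np.random.default_rng(0)
bank = rng.standard_normal((8, 16, 32))                      # M=8 basis weight matrices
proj = rng.standard_normal((3, 8))
score_net = lambda p: np.exp(p @ proj)                       # softmax-like positive scores
kernels = paconv_kernel(rng.standard_normal((20, 3)), bank, score_net)
print(kernels.shape)  # (20, 16, 32)
```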
+
+
+
+
 ## Introduction
diff --git a/configs/parta2/README.md b/configs/parta2/README.md
index 1c35aa38a2..4fc4a86fb3 100644
--- a/configs/parta2/README.md
+++ b/configs/parta2/README.md
@@ -1,5 +1,20 @@
 # From Points to Parts: 3D Object Detection from Point Cloud with Part-aware and Part-aggregation Network
+## Abstract
+
+
+
+3D object detection from LiDAR point cloud is a challenging problem in 3D scene understanding and has many practical applications. In this paper, we extend our preliminary work PointRCNN to a novel and strong point-cloud-based 3D object detection framework, the part-aware and aggregation neural network (Part-A2 net). The whole framework consists of the part-aware stage and the part-aggregation stage. Firstly, the part-aware stage for the first time fully utilizes free-of-charge part supervisions derived from 3D ground-truth boxes to simultaneously predict high quality 3D proposals and accurate intra-object part locations. The predicted intra-object part locations within the same proposal are grouped by our new-designed RoI-aware point cloud pooling module, which results in an effective representation to encode the geometry-specific features of each 3D proposal. Then the part-aggregation stage learns to re-score the box and refine the box location by exploring the spatial relationship of the pooled intra-object part locations. Extensive experiments are conducted to demonstrate the performance improvements from each component of our proposed framework. Our Part-A2 net outperforms all existing 3D detection methods and achieves new state-of-the-art on KITTI 3D object detection dataset by utilizing only the LiDAR point cloud data.
+
+
+
+
+
+
+
+
+
+
 ## Introduction
diff --git a/configs/pgd/README.md b/configs/pgd/README.md
index 1d8a320ccf..e10f2675ab 100644
--- a/configs/pgd/README.md
+++ b/configs/pgd/README.md
@@ -1,5 +1,20 @@
 # Probabilistic and Geometric Depth: Detecting Objects in Perspective
+## Abstract
+
+
+
+3D object detection is an important capability needed in various practical applications such as driver assistance systems. Monocular 3D detection, as a representative general setting among image-based approaches, provides a more economical solution than conventional settings relying on LiDARs but still yields unsatisfactory results. This paper first presents a systematic study on this problem. We observe that the current monocular 3D detection can be simplified as an instance depth estimation problem: The inaccurate instance depth blocks all the other 3D attribute predictions from improving the overall detection performance. Moreover, recent methods directly estimate the depth based on isolated instances or pixels while ignoring the geometric relations across different objects. To this end, we construct geometric relation graphs across predicted objects and use the graph to facilitate depth estimation. As the preliminary depth estimation of each instance is usually inaccurate in this ill-posed setting, we incorporate a probabilistic representation to capture the uncertainty. It provides an important indicator to identify confident predictions and further guide the depth propagation. Despite the simplicity of the basic idea, our method, PGD, obtains significant improvements on KITTI and nuScenes benchmarks, achieving 1st place out of all monocular vision-only methods while still maintaining real-time efficiency. Code and models will be released at [this https URL](https://github.com/open-mmlab/mmdetection3d).
+
+
+
+
+
+
+
+
+
+
 ## Introduction
diff --git a/configs/pointnet2/README.md b/configs/pointnet2/README.md
index fc6978bf8e..2cef3c91d7 100644
--- a/configs/pointnet2/README.md
+++ b/configs/pointnet2/README.md
@@ -1,5 +1,20 @@
 # PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space
+## Abstract
+
+
+
+Few prior works study deep learning on point sets. PointNet by Qi et al. is a pioneer in this direction. However, by design PointNet does not capture local structures induced by the metric space points live in, limiting its ability to recognize fine-grained patterns and generalizability to complex scenes. In this work, we introduce a hierarchical neural network that applies PointNet recursively on a nested partitioning of the input point set. By exploiting metric space distances, our network is able to learn local features with increasing contextual scales. With further observation that point sets are usually sampled with varying densities, which results in greatly decreased performance for networks trained on uniform densities, we propose novel set learning layers to adaptively combine features from multiple scales. Experiments show that our network called PointNet++ is able to learn deep point set features efficiently and robustly. In particular, results significantly better than state-of-the-art have been obtained on challenging benchmarks of 3D point clouds.
+
+
+
+
+
+
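
Each level of the hierarchy samples centroids, groups neighbors within a radius and applies a shared PointNet followed by max-pooling. The sketch below uses random sampling instead of farthest point sampling and a single linear layer in place of the MLP, purely to show the structure of one set-abstraction level:

```python
import numpy as np

def set_abstraction(xyz, feats, num_centroids, radius, mlp_weight, rng=None):
    """One set-abstraction level: sample centroids, group by radius, apply a shared
    projection to each group and max-pool (PointNet on local regions)."""
    rng = np.random.default_rng(0) if rng is None else rng
    centroid_idx = rng.choice(len(xyz), num_centroids, replace=False)  # stand-in for FPS
    centroids = xyz[centroid_idx]
    new_feats = []
    for c in centroids:
        mask = np.linalg.norm(xyz - c, axis=1) < radius
        group = np.concatenate([xyz[mask] - c, feats[mask]], axis=1)   # local coords + feats
        h = np.maximum(group @ mlp_weight, 0.0)                        # shared "MLP"
        new_feats.append(h.max(axis=0) if len(h) else np.zeros(mlp_weight.shape[1]))
    return centroids, np.stack(new_feats)

xyz = np.random.rand(2048, 3)
feats = np.random.rand(2048, 8)
w = np.random.randn(11, 64)   # (3 + 8) input channels -> 64 output channels
centers, pooled = set_abstraction(xyz, feats, 512, 0.2, w)
print(centers.shape, pooled.shape)  # (512, 3) (512, 64)
```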
+
+
+
+
 ## Introduction
diff --git a/configs/pointpillars/README.md b/configs/pointpillars/README.md
index 3d260e3f36..dc8bebd930 100644
--- a/configs/pointpillars/README.md
+++ b/configs/pointpillars/README.md
@@ -1,5 +1,20 @@
 # PointPillars: Fast Encoders for Object Detection from Point Clouds
+## Abstract
+
+
+
+Object detection in point clouds is an important aspect of many robotics applications such as autonomous driving. In this paper we consider the problem of encoding a point cloud into a format appropriate for a downstream detection pipeline. Recent literature suggests two types of encoders; fixed encoders tend to be fast but sacrifice accuracy, while encoders that are learned from data are more accurate, but slower. In this work we propose PointPillars, a novel encoder which utilizes PointNets to learn a representation of point clouds organized in vertical columns (pillars). While the encoded features can be used with any standard 2D convolutional detection architecture, we further propose a lean downstream network. Extensive experimentation shows that PointPillars outperforms previous encoders with respect to both speed and accuracy by a large margin. Despite only using lidar, our full detection pipeline significantly outperforms the state of the art, even among fusion methods, with respect to both the 3D and bird's eye view KITTI benchmarks. This detection performance is achieved while running at 62 Hz: a 2 - 4 fold runtime improvement. A faster version of our method matches the state of the art at 105 Hz. These benchmarks suggest that PointPillars is an appropriate encoding for object detection in point clouds.
+
+
+
+
+
+
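
The encoder groups points into vertical columns, summarizes each pillar with a small PointNet and scatters the result back to a dense BEV pseudo-image for a standard 2D CNN. The sketch below substitutes mean pooling of raw point features for the learned pillar encoder; the pillar size and grid extent are arbitrary example values:

```python
import numpy as np

def pillar_scatter(points, pillar_size=0.16, grid=(64, 64), channels=4):
    """Group points into pillars, mean-pool a per-pillar feature and scatter it to a
    dense BEV pseudo-image."""
    cols = (points[:, 0] / pillar_size).astype(int)
    rows = (points[:, 1] / pillar_size).astype(int)
    keep = (cols >= 0) & (cols < grid[1]) & (rows >= 0) & (rows < grid[0])
    bev = np.zeros((channels, grid[0], grid[1]))
    counts = np.zeros((grid[0], grid[1]))
    for p, r, c in zip(points[keep], rows[keep], cols[keep]):
        bev[:, r, c] += p[:channels]   # stand-in for the learned PointNet pillar encoder
        counts[r, c] += 1
    nonzero = counts > 0
    bev[:, nonzero] /= counts[nonzero]  # mean pooling over points in each pillar
    return bev

pts = np.random.rand(5000, 4) * np.array([10.0, 10.0, 3.0, 1.0])
print(pillar_scatter(pts).shape)  # (4, 64, 64)
```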
+
+
+
+
 ## Introduction
diff --git a/configs/regnet/README.md b/configs/regnet/README.md
index 3e74a37a82..35704526e3 100644
--- a/configs/regnet/README.md
+++ b/configs/regnet/README.md
@@ -1,5 +1,20 @@
 # Designing Network Design Spaces
+## Abstract
+
+
+
+In this work, we present a new network design paradigm. Our goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network instances, we design network design spaces that parametrize populations of networks. The overall process is analogous to classic manual design of networks, but elevated to the design space level. Using our methodology we explore the structure aspect of network design and arrive at a low-dimensional design space consisting of simple, regular networks that we call RegNet. The core insight of the RegNet parametrization is surprisingly simple: widths and depths of good networks can be explained by a quantized linear function. We analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes. Under comparable training settings and flops, the RegNet models outperform the popular EfficientNet models while being up to 5x faster on GPUs.
+
+
+
+
+
+
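
The quantized linear rule referred to above generates per-block widths as u_j = w_0 + w_a * j and snaps them to powers of w_m (and multiples of 8). A direct sketch of that rule; the specific parameter values below are illustrative, not a particular RegNet model:

```python
import numpy as np

def regnet_widths(w0=24, wa=36.0, wm=2.5, depth=13):
    """Per-block widths from the quantized linear rule described in the abstract."""
    j = np.arange(depth)
    u = w0 + wa * j                                   # continuous linear widths
    s = np.round(np.log(u / w0) / np.log(wm))         # nearest power of wm
    widths = w0 * np.power(wm, s)
    return (np.round(widths / 8) * 8).astype(int)     # snap to multiples of 8

print(regnet_widths())  # monotone, quantized widths, e.g. 24, 64, 64, 152, ...
```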
+
+
+
+
 ## Introduction
diff --git a/configs/second/README.md b/configs/second/README.md
index 6e7e9e5296..81f1e02f3d 100644
--- a/configs/second/README.md
+++ b/configs/second/README.md
@@ -1,5 +1,20 @@
 # Second: Sparsely embedded convolutional detection
+## Abstract
+
+
+
+LiDAR-based or RGB-D-based object detection is used in numerous applications, ranging from autonomous driving to robot vision. Voxel-based 3D convolutional networks have been used for some time to enhance the retention of information when processing point cloud LiDAR data. However, problems remain, including a slow inference speed and low orientation estimation performance. We therefore investigate an improved sparse convolution method for such networks, which significantly increases the speed of both training and inference. We also introduce a new form of angle loss regression to improve the orientation estimation performance and a new data augmentation approach that can enhance the convergence speed and performance. The proposed network produces state-of-the-art results on the KITTI 3D object detection benchmarks while maintaining a fast inference speed.
+
+
+
+
+
+
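
The new angle loss mentioned above regresses the sine of the yaw difference, which removes the discontinuity at pi while a separate direction classifier resolves flipped headings. A sketch of the regression target (the surrounding smooth-L1 loss and direction head are omitted):

```python
import numpy as np

def sine_angle_error(pred_yaw, target_yaw):
    """Angle regression target: sin(pred - target) instead of the raw difference,
    so boxes rotated by pi produce a small error rather than a large one."""
    return np.sin(pred_yaw - target_yaw)

print(sine_angle_error(np.pi, 0.0))      # ~0: opposite headings overlap exactly
print(sine_angle_error(np.pi / 2, 0.0))  # 1.0: worst case
```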
+
+
+
+
 ## Introduction
diff --git a/configs/smoke/README.md b/configs/smoke/README.md
index cfabfcca10..a7306eaa54 100644
--- a/configs/smoke/README.md
+++ b/configs/smoke/README.md
@@ -1,5 +1,20 @@
 # SMOKE: Single-Stage Monocular 3D Object Detection via Keypoint Estimation
+## Abstract
+
+
+
+Estimating 3D orientation and translation of objects is essential for infrastructure-less autonomous navigation and driving. In case of monocular vision, successful methods have been mainly based on two ingredients: (i) a network generating 2D region proposals, (ii) a R-CNN structure predicting 3D object pose by utilizing the acquired regions of interest. We argue that the 2D detection network is redundant and introduces non-negligible noise for 3D detection. Hence, we propose a novel 3D object detection method, named SMOKE, in this paper that predicts a 3D bounding box for each detected object by combining a single keypoint estimate with regressed 3D variables. As a second contribution, we propose a multi-step disentangling approach for constructing the 3D bounding box, which significantly improves both training convergence and detection accuracy. In contrast to previous 3D detection techniques, our method does not require complicated pre/post-processing, extra data, and a refinement stage. Despite of its structural simplicity, our proposed SMOKE network outperforms all existing monocular 3D detection methods on the KITTI dataset, giving the best state-of-the-art result on both 3D object detection and Bird's eye view evaluation.
+
+
+
+
+
+
+
+
+
+
 ## Introduction
diff --git a/configs/ssn/README.md b/configs/ssn/README.md
index d47f0d90e9..eb027e9377 100644
--- a/configs/ssn/README.md
+++ b/configs/ssn/README.md
@@ -1,5 +1,20 @@
 # SSN: Shape Signature Networks for Multi-class Object Detection from Point Clouds
+## Abstract
+
+
+
+Multi-class 3D object detection aims to localize and classify objects of multiple categories from point clouds. Due to the nature of point clouds, i.e. unstructured, sparse and noisy, some features benefitting multi-class discrimination are underexploited, such as shape information. In this paper, we propose a novel 3D shape signature to explore the shape information from point clouds. By incorporating operations of symmetry, convex hull and Chebyshev fitting, the proposed shape signature is not only compact and effective but also robust to the noise, which serves as a soft constraint to improve the feature capability of multi-class discrimination. Based on the proposed shape signature, we develop the shape signature networks (SSN) for 3D object detection, which consist of pyramid feature encoding part, shape-aware grouping heads and explicit shape encoding objective. Experiments show that the proposed method performs remarkably better than existing methods on two large-scale datasets. Furthermore, our shape signature can act as a plug-and-play component and ablation study shows its effectiveness and good scalability.
+
+
+
+
+
+
+
+
+
+
 ## Introduction
diff --git a/configs/votenet/README.md b/configs/votenet/README.md
index f21316a2d8..dd2a315f52 100644
--- a/configs/votenet/README.md
+++ b/configs/votenet/README.md
@@ -1,5 +1,20 @@
 # Deep Hough Voting for 3D Object Detection in Point Clouds
+## Abstract
+
+
+
+Current 3D object detection methods are heavily influenced by 2D detectors. In order to leverage architectures in 2D detectors, they often convert 3D point clouds to regular grids (i.e., to voxel grids or to bird's eye view images), or rely on detection in 2D images to propose 3D boxes. Few works have attempted to directly detect objects in point clouds. In this work, we return to first principles to construct a 3D detection pipeline for point cloud data and as generic as possible. However, due to the sparse nature of the data -- samples from 2D manifolds in 3D space -- we face a major challenge when directly predicting bounding box parameters from scene points: a 3D object centroid can be far from any surface point thus hard to regress accurately in one step. To address the challenge, we propose VoteNet, an end-to-end 3D object detection network based on a synergy of deep point set networks and Hough voting. Our model achieves state-of-the-art 3D detection on two large datasets of real 3D scans, ScanNet and SUN RGB-D with a simple design, compact model size and high efficiency. Remarkably, VoteNet outperforms previous methods by using purely geometric information without relying on color images.
+
+
+
+
+
+
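
In Hough voting, each sampled seed point predicts an offset to the object center it belongs to, and the votes are then clustered into object candidates. A toy sketch with naive radius clustering; VoteNet learns the offsets and groups votes with sampling-and-grouping rather than this greedy loop:

```python
import numpy as np

def hough_voting(seed_xyz, vote_offsets, cluster_radius=0.3):
    """Seeds vote for object centers via predicted offsets; votes are grouped by a
    simple radius clustering to form object candidates."""
    votes = seed_xyz + vote_offsets                      # each seed votes for a center
    centers = []
    remaining = np.ones(len(votes), dtype=bool)
    while remaining.any():
        anchor = votes[remaining][0]
        in_cluster = remaining & (np.linalg.norm(votes - anchor, axis=1) < cluster_radius)
        centers.append(votes[in_cluster].mean(axis=0))   # candidate = mean of its votes
        remaining &= ~in_cluster
    return np.stack(centers)

seeds = np.random.rand(256, 3)
offsets = np.array([0.5, 0.5, 0.5]) - seeds + 0.01 * np.random.randn(256, 3)
print(hough_voting(seeds, offsets).shape)  # most votes collapse near (0.5, 0.5, 0.5)
```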
+
+
+
+
 ## Introduction

From e136fa045fd0046761bde27e2e63feff323c390e Mon Sep 17 00:00:00 2001
From: ChaimZhu
Date: Wed, 1 Dec 2021 20:35:53 +0800
Subject: [PATCH 2/2] fix typos

---
 configs/imvoxelnet/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/configs/imvoxelnet/README.md b/configs/imvoxelnet/README.md
index 87999dc77c..6746deeb7c 100644
--- a/configs/imvoxelnet/README.md
+++ b/configs/imvoxelnet/README.md
@@ -4,7 +4,7 @@
-In this paper, we introduce the task of multi-view RGBbased 3D object detection as an end-to-end optimization problem. To address this problem, we propose ImVoxelNet, a novel fully convolutional method of 3D object detection based on posed monocular or multi-view RGB images. The number of monocular images in each multiview input can variate during training and inference; actually, this number might be unique for each multi-view input. ImVoxelNet successfully handles both indoor and outdoor scenes, which makes it general-purpose. Specifically, it achieves state-of-the-art results in car detection on KITTI (monocular) and nuScenes (multi-view) benchmarks among all methods that accept RGB images. Moreover, it surpasses existing RGB-based 3D object detection methods on the SUN RGB-D dataset. On ScanNet, ImVoxelNet sets a new benchmark for multi-view 3D object detection.
+In this paper, we introduce the task of multi-view RGB-based 3D object detection as an end-to-end optimization problem. To address this problem, we propose ImVoxelNet, a novel fully convolutional method of 3D object detection based on posed monocular or multi-view RGB images. The number of monocular images in each multiview input can variate during training and inference; actually, this number might be unique for each multi-view input. ImVoxelNet successfully handles both indoor and outdoor scenes, which makes it general-purpose. Specifically, it achieves state-of-the-art results in car detection on KITTI (monocular) and nuScenes (multi-view) benchmarks among all methods that accept RGB images. Moreover, it surpasses existing RGB-based 3D object detection methods on the SUN RGB-D dataset. On ScanNet, ImVoxelNet sets a new benchmark for multi-view 3D object detection.