Detectron Model Zoo and Baselines

Introduction

This file documents a large collection of baselines trained with Detectron, primarily in late December 2017. We refer to these results as the 12_2017_baselines. All configurations for these baselines are located in the configs/12_2017_baselines directory. The tables below provide results and useful statistics about training and inference. Links to the trained models as well as their output are provided. Unless noted differently below (see "Notes" under each table), the following common settings are used for all training and inference runs.

Common Settings and Notes

All baselines were run on Big Basin servers with 8 NVIDIA Tesla P100 GPU accelerators (with 16GB GPU memory, CUDA 8.0, and cuDNN 6.0.21).
All baselines were trained using 8 GPU data parallel sync SGD with a minibatch size of either 8 or 16 images (see the im/gpu column).
For training, only horizontal flipping data augmentation was used.
For inference, no test-time augmentations (e.g., multiple scales, flipping) were used.
All models were trained on the union of coco_2014_train and coco_2014_valminusminival, which is exactly equivalent to the recently defined coco_2017_train dataset.
All models were tested on the coco_2014_minival dataset, which is exactly equivalent to the recently defined coco_2017_val dataset.
Inference times are often expressed as "X + Y", in which X is time taken in reasonably well-optimized GPU code and Y is time taken in unoptimized CPU code. (The CPU code time could be reduced substantially with additional engineering.)
Inference results for boxes, masks, and keypoints ("kps") are provided in the COCO json format.
The model id column is provided for ease of reference.
To check downloaded file integrity: for any download URL on this page, simply append .md5sum to the URL to download the file's md5 hash.
All models and results below are on the COCO dataset.
Baseline models and results for the Cityscapes dataset are coming soon!
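
The integrity check described above (appending .md5sum to a download URL) can be sketched in Python as follows. This is a minimal illustration, not part of Detectron: the function names are hypothetical, and the code assumes the .md5sum response begins with the hex digest, optionally followed by a file name, as in typical md5sum output.

```python
import hashlib
import urllib.request


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a local file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_expected_md5(download_url):
    """Download <download_url>.md5sum and return the hash it contains.

    Assumption: the first whitespace-separated token is the hex digest.
    """
    with urllib.request.urlopen(download_url + ".md5sum") as resp:
        return resp.read().decode("utf-8").split()[0].lower()


def verify_download(path, download_url):
    """True if the local file's MD5 matches the published hash."""
    return md5_of_file(path) == fetch_expected_md5(download_url)
```

A call such as verify_download("R-50.pkl", url), where url is the link the file was downloaded from, would then report whether the local copy is intact.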

Training Schedules

We use three training schedules, indicated by the lr schd column in the tables below.

1x: For minibatch size 16, this schedule starts at a LR of 0.02, multiplies the LR by 0.1 after 60k and again after 80k iterations, and terminates at 90k iterations. This schedule results in 12.17 epochs over the 118,287 images in coco_2014_train union coco_2014_valminusminival (or equivalently, coco_2017_train).
2x: Twice as long as the 1x schedule with the LR change points scaled proportionally.
s1x ("stretched 1x"): This schedule scales the 1x schedule by roughly 1.44x, but also extends the duration of the first learning rate. With a minibatch size of 16, it multiplies the LR by 0.1 at 100k and again at 120k iterations, and ends after 130k iterations.

All training schedules also use a 500 iteration linear learning rate warm up. When changing the minibatch size between 8 and 16 images, we adjust the number of SGD iterations and the base learning rate according to the principles outlined in our paper Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.
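
As a rough illustration of the arithmetic behind these schedules, the sketch below rescales the 1x schedule for a different minibatch size and computes the resulting epochs and wall-clock training time. It assumes the adjustment is the linear scaling rule (LR proportional to minibatch size, iteration counts inversely proportional); the function and constant names are hypothetical, and warm up is not modeled.

```python
COCO_2017_TRAIN_IMAGES = 118287  # image count quoted in the 1x description

# 1x schedule at a total minibatch size of 16 images, as described above.
REFERENCE_BATCH = 16
REFERENCE_1X = {"base_lr": 0.02, "steps": (60000, 80000), "max_iter": 90000}


def scale_schedule(schedule, batch_size, reference_batch=REFERENCE_BATCH):
    """Apply the linear scaling rule (an assumption about the exact recipe).

    The LR scales with the minibatch size; iteration counts scale inversely,
    so the number of epochs stays the same.
    """
    k = batch_size / reference_batch
    return {
        "base_lr": schedule["base_lr"] * k,
        "steps": tuple(round(s / k) for s in schedule["steps"]),
        "max_iter": round(schedule["max_iter"] / k),
    }


def epochs(max_iter, batch_size, num_images=COCO_2017_TRAIN_IMAGES):
    """Number of passes over the training set."""
    return max_iter * batch_size / num_images


def train_hours(seconds_per_iter, max_iter):
    """Total training time in hours from a measured s/iter value."""
    return seconds_per_iter * max_iter / 3600.0


if __name__ == "__main__":
    half = scale_schedule(REFERENCE_1X, batch_size=8)
    print(half)
    print(round(epochs(REFERENCE_1X["max_iter"], 16), 2))
    print(round(train_hours(0.285, 90000), 1))
```

Under these assumptions, a minibatch of 8 images gives a base LR of 0.01 with LR changes at 120k and 160k iterations and 180k iterations in total, still about 12.17 epochs. The time formula is consistent with the tables below: 0.285 s/iter over 90k iterations is about 7.1 hours, and 0.456 s/iter over 180k iterations is 22.8 hours.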

License

All models available for download through this document are licensed under the Creative Commons Attribution-ShareAlike 3.0 license.

ImageNet Pretrained Models

The backbone models pretrained on ImageNet are available in the format used by Detectron. Unless otherwise noted, these models are trained on the standard ImageNet-1k dataset.

R-50.pkl: converted copy of MSRA's original ResNet-50 model
R-101.pkl: converted copy of MSRA's original ResNet-101 model
X-101-64x4d.pkl: converted copy of FB's original ResNeXt-101-64x4d model trained with Torch7
X-101-32x8d.pkl: ResNeXt-101-32x8d model trained with Caffe2 at FB
X-152-32x8d-IN5k.pkl: ResNeXt-152-32x8d model trained on ImageNet-5k with Caffe2 at FB (see our ResNeXt paper for details on ImageNet-5k)

Proposal, Box, and Mask Detection Baselines

RPN Proposal Baselines

backbone	type	lr schd	im/gpu	train mem (GB)	train time (s/iter)	train time total (hr)	inference time (s/im)	box AP	mask AP	kp AP	prop. AR	model id	download links
R-50-C4	RPN	1x	2	4.3	0.187	4.7	0.113	-	-	-	51.6	35998355	model | props: 1, 2, 3
R-50-FPN	RPN	1x	2	6.4	0.416	10.4	0.080	-	-	-	57.2	35998814	model | props: 1, 2, 3
R-101-FPN	RPN	1x	2	8.1	0.503	12.6	0.108	-	-	-	58.2	35998887	model | props: 1, 2, 3
X-101-64x4d-FPN	RPN	1x	2	11.5	1.395	34.9	0.292	-	-	-	59.4	35998956	model | props: 1, 2, 3
X-101-32x8d-FPN	RPN	1x	2	11.6	1.102	27.6	0.222	-	-	-	59.5	36760102	model | props: 1, 2, 3

Notes:

Inference time only includes RPN proposal generation.
"prop. AR" is proposal average recall at 1000 proposals per image.
Proposal download links ("props"): "1" is coco_2014_train; "2" is coco_2014_valminusminival; and "3" is coco_2014_minival.

Fast & Mask R-CNN Baselines Using Precomputed RPN Proposals

backbone	type	lr schd	im/gpu	train mem (GB)	train time (s/iter)	train time total (hr)	inference time (s/im)	box AP	mask AP	kp AP	prop. AR	model id	download links
R-50-C4	Fast	1x	1	6.0	0.456	22.8	0.241 + 0.003	34.4	-	-	-	36224013	model | boxes
R-50-C4	Fast	2x	1	6.0	0.453	45.3	0.241 + 0.003	35.6	-	-	-	36224046	model | boxes
R-50-FPN	Fast	1x	2	6.0	0.285	7.1	0.076 + 0.004	36.4	-	-	-	36225147	model | boxes
R-50-FPN	Fast	2x	2	6.0	0.287	14.4	0.077 + 0.004	36.8	-	-	-	36225249	model | boxes
R-101-FPN	Fast	1x	2	7.7	0.448	11.2	0.102 + 0.003	38.5	-	-	-	36228880	model | boxes
R-101-FPN	Fast	2x	2	7.7	0.449	22.5	0.103 + 0.004	39.0	-	-	-	36228933	model | boxes
X-101-64x4d-FPN	Fast	1x	1	6.3	0.994	49.7	0.292 + 0.003	40.4	-	-	-	36226250	model | boxes
X-101-64x4d-FPN	Fast	2x	1	6.3	0.980	98.0	0.291 + 0.003	39.8	-	-	-	36226326	model | boxes
X-101-32x8d-FPN	Fast	1x	1	6.4	0.721	36.1	0.217 + 0.003	40.6	-	-	-	37119777	model | boxes
X-101-32x8d-FPN	Fast	2x	1	6.4	0.720	72.0	0.217 + 0.003	39.7	-	-	-	37121469	model | boxes
R-50-C4	Mask	1x	1	6.4	0.466	23.3	0.252 + 0.020	35.5	31.3	-	-	36224121	model | boxes | masks
R-50-C4	Mask	2x	1	6.4	0.464	46.4	0.253 + 0.019	36.9	32.5	-	-	36224151	model | boxes | masks
R-50-FPN	Mask	1x	2	7.9	0.377	9.4	0.082 + 0.019	37.3	33.7	-	-	36225401	model | boxes | masks
R-50-FPN	Mask	2x	2	7.9	0.377	18.9	0.083 + 0.018	37.7	34.0	-	-	36225732	model | boxes | masks
R-101-FPN	Mask	1x	2	9.6	0.539	13.5	0.111 + 0.018	39.4	35.6	-	-	36229407	model | boxes | masks
R-101-FPN	Mask	2x	2	9.6	0.537	26.9	0.109 + 0.016	40.0	35.9	-	-	36229740	model | boxes | masks
X-101-64x4d-FPN	Mask	1x	1	7.3	1.036	51.8	0.292 + 0.016	41.3	37.0	-	-	36226382	model | boxes | masks
X-101-64x4d-FPN	Mask	2x	1	7.3	1.035	103.5	0.292 + 0.014	41.1	36.6	-	-	36672114	model | boxes | masks
X-101-32x8d-FPN	Mask	1x	1	7.4	0.766	38.3	0.223 + 0.017	41.3	37.0	-	-	37121516	model | boxes | masks
X-101-32x8d-FPN	Mask	2x	1	7.4	0.765	76.5	0.222 + 0.014	40.7	36.3	-	-	37121596	model | boxes | masks

Notes:

Each row uses precomputed RPN proposals from the corresponding table row above that uses the same backbone.
Inference time excludes proposal generation.

End-to-End Faster & Mask R-CNN Baselines

backbone	type	lr schd	im/gpu	train mem (GB)	train time (s/iter)	train time total (hr)	inference time (s/im)	box AP	mask AP	kp AP	prop. AR	model id	download links
R-50-C4	Faster	1x	1	6.3	0.566	28.3	0.167 + 0.003	34.8	-	-	-	35857197	model | boxes
R-50-C4	Faster	2x	1	6.3	0.569	56.9	0.174 + 0.003	36.5	-	-	-	35857281	model | boxes
R-50-FPN	Faster	1x	2	7.2	0.544	13.6	0.093 + 0.004	36.7	-	-	-	35857345	model | boxes
R-50-FPN	Faster	2x	2	7.2	0.546	27.3	0.092 + 0.004	37.9	-	-	-	35857389	model | boxes
R-101-FPN	Faster	1x	2	8.9	0.647	16.2	0.120 + 0.004	39.4	-	-	-	35857890	model | boxes
R-101-FPN	Faster	2x	2	8.9	0.647	32.4	0.119 + 0.004	39.8	-	-	-	35857952	model | boxes
X-101-64x4d-FPN	Faster	1x	1	6.9	1.057	52.9	0.305 + 0.003	41.5	-	-	-	35858015	model | boxes
X-101-64x4d-FPN	Faster	2x	1	6.9	1.055	105.5	0.304 + 0.003	40.8	-	-	-	35858198	model | boxes
X-101-32x8d-FPN	Faster	1x	1	7.0	0.799	40.0	0.233 + 0.004	41.3	-	-	-	36761737	model | boxes
X-101-32x8d-FPN	Faster	2x	1	7.0	0.800	80.0	0.233 + 0.003	40.6	-	-	-	36761786	model | boxes
R-50-C4	Mask	1x	1	6.6	0.620	31.0	0.181 + 0.018	35.8	31.4	-	-	35858791	model | boxes | masks
R-50-C4	Mask	2x	1	6.6	0.620	62.0	0.182 + 0.017	37.8	32.8	-	-	35858828	model | boxes | masks
R-50-FPN	Mask	1x	2	8.6	0.889	22.2	0.099 + 0.019	37.7	33.9	-	-	35858933	model | boxes | masks
R-50-FPN	Mask	2x	2	8.6	0.897	44.9	0.099 + 0.018	38.6	34.5	-	-	35859007	model | boxes | masks
R-101-FPN	Mask	1x	2	10.2	1.008	25.2	0.126 + 0.018	40.0	35.9	-	-	35861795	model | boxes | masks
R-101-FPN	Mask	2x	2	10.2	0.993	49.7	0.126 + 0.017	40.9	36.4	-	-	35861858	model | boxes | masks
X-101-64x4d-FPN	Mask	1x	1	7.6	1.217	60.9	0.309 + 0.018	42.4	37.5	-	-	36494496	model | boxes | masks
X-101-64x4d-FPN	Mask	2x	1	7.6	1.210	121.0	0.309 + 0.015	42.2	37.2	-	-	35859745	model | boxes | masks
X-101-32x8d-FPN	Mask	1x	1	7.7	0.961	48.1	0.239 + 0.019	42.1	37.3	-	-	36761843	model | boxes | masks
X-101-32x8d-FPN	Mask	2x	1	7.7	0.975	97.5	0.240 + 0.016	41.7	36.9	-	-	36762092	model | boxes | masks

Notes:

For these models, RPN and the detector are trained jointly and end-to-end.
Inference time is fully image-to-detections, including proposal generation.
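
When comparing rows, it can help to turn the "X + Y" inference-time cells into a single per-image latency or throughput figure. The helper below is a small sketch with hypothetical function names; treating a cell with a single number as GPU-only time (zero CPU time) is an assumption made here for convenience.

```python
def parse_inference_time(cell):
    """Split an inference-time cell into (gpu_seconds, cpu_seconds).

    Cells look like "0.167 + 0.003" (GPU + CPU) or a single number such as
    "0.113", which is treated here as GPU time with zero CPU time.
    """
    parts = [float(p) for p in cell.split("+")]
    if len(parts) == 1:
        return parts[0], 0.0
    if len(parts) == 2:
        return parts[0], parts[1]
    raise ValueError("unexpected inference time cell: %r" % cell)


def images_per_second(cell):
    """Throughput implied by the total (GPU + CPU) time per image."""
    gpu, cpu = parse_inference_time(cell)
    return 1.0 / (gpu + cpu)
```

For example, the R-50-C4 Faster 1x entry "0.167 + 0.003" corresponds to 0.170 s per image, or a little under 6 images per second on the hardware described above.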

RetinaNet Baselines

backbone	type	lr schd	im/gpu	train mem (GB)	train time (s/iter)	train time total (hr)	inference time (s/im)	box AP	mask AP	kp AP	prop. AR	model id	download links
R-50-FPN	RetinaNet	1x	2	6.8	0.483	12.1	0.125	35.7	-	-	-	36768636	model | boxes
R-50-FPN	RetinaNet	2x	2	6.8	0.482	24.1	0.127	35.7	-	-	-	36768677	model | boxes
R-101-FPN	RetinaNet	1x	2	8.7	0.666	16.7	0.156	37.7	-	-	-	36768744	model | boxes
R-101-FPN	RetinaNet	2x	2	8.7	0.666	33.3	0.154	37.8	-	-	-	36768840	model | boxes
X-101-64x4d-FPN	RetinaNet	1x	2	12.6	1.613	40.3	0.341	39.8	-	-	-	36768875	model | boxes
X-101-64x4d-FPN	RetinaNet	2x	2	12.6	1.625	81.3	0.339	39.2	-	-	-	36768907	model | boxes
X-101-32x8d-FPN	RetinaNet	1x	2	12.7	1.343	33.6	0.277	39.5	-	-	-	36769563	model | boxes
X-101-32x8d-FPN	RetinaNet	2x	2	12.7	1.340	67.0	0.276	38.6	-	-	-	36769641	model | boxes

Notes: none

Mask R-CNN with Bells & Whistles

backbone	type	lr schd	im/gpu	train mem (GB)	train time (s/iter)	train time total (hr)	inference time (s/im)	box AP	mask AP	kp AP	prop. AR	model id	download links
X-152-32x8d-FPN-IN5k	Mask	s1x	1	9.6	1.188	85.8	12.100 + 0.046	48.1	41.5	-	-	37129812	model | boxes | masks
[above without test-time aug.]							0.325 + 0.018	45.2	39.7	-	-

Notes:

A deeper backbone architecture is used: ResNeXt-152-32x8d-FPN
The backbone ResNeXt-152-32x8d model was trained on ImageNet-5k (not the usual ImageNet-1k)
Training uses multi-scale jitter over scales {640, 672, 704, 736, 768, 800}
Row 1: test-time augmentations are multi-scale testing over {400, 500, 600, 700, 900, 1000, 1100, 1200} and horizontal flipping (on each scale)
Row 2: same model as row 1, but without any test-time augmentation (i.e., same as the common baseline configuration)
Like the other results, this is a single model result (it is not an ensemble of models)

Keypoint Detection Baselines

Common Settings for Keypoint Detection Baselines (That Differ from Boxes and Masks)

Our keypoint detection baselines differ from our box and mask baselines in a few details:

Due to less training data for the keypoint detection task compared with boxes and masks, we enable multi-scale jitter during training for all keypoint detection models. (Testing is still without any test-time augmentations by default.)
Models are trained only on images from coco_2014_train union coco_2014_valminusminival that contain at least one person with keypoint annotations (all other images are discarded from the training set).
Metrics are reported for the person class only (still run on the entire coco_2014_minival dataset).
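
The training-set filter described above (keep only images with at least one person that has keypoint annotations) can be sketched against a COCO-style annotation dictionary. This is an illustrative approximation, not Detectron's own filtering code: the function name is hypothetical, and it assumes the standard COCO keypoint fields, where the person category has id 1 and each annotation carries a num_keypoints count of labeled keypoints.

```python
def images_with_person_keypoints(coco, person_category_id=1):
    """Return image entries having at least one person with labeled keypoints.

    `coco` is a dict loaded from a COCO keypoint annotation file, with
    "images" and "annotations" lists. Annotations without a num_keypoints
    field are treated as having no labeled keypoints.
    """
    keep_ids = {
        ann["image_id"]
        for ann in coco["annotations"]
        if ann["category_id"] == person_category_id
        and ann.get("num_keypoints", 0) > 0
    }
    return [img for img in coco["images"] if img["id"] in keep_ids]
```

Loading the annotation file with json.load and passing the result to this function yields the reduced image list; evaluation, as noted above, still runs on the entire coco_2014_minival dataset.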

Person-Specific RPN Baselines

backbone	type	lr schd	im/gpu	train mem (GB)	train time (s/iter)	train time total (hr)	inference time (s/im)	box AP	mask AP	kp AP	prop. AR	model id	download links
R-50-FPN	RPN	1x	2	6.4	0.391	9.8	0.082	-	-	-	64.0	35998996	model | props: 1, 2, 3
R-101-FPN	RPN	1x	2	8.1	0.504	12.6	0.109	-	-	-	65.2	35999521	model | props: 1, 2, 3
X-101-64x4d-FPN	RPN	1x	2	11.5	1.394	34.9	0.289	-	-	-	65.9	35999553	model | props: 1, 2, 3
X-101-32x8d-FPN	RPN	1x	2	11.6	1.104	27.6	0.224	-	-	-	66.2	36760438	model | props: 1, 2, 3

Notes:

Metrics are for the person category only.
Inference time only includes RPN proposal generation.
"prop. AR" is proposal average recall at 1000 proposals per image.
Proposal download links ("props"): "1" is coco_2014_train; "2" is coco_2014_valminusminival; and "3" is coco_2014_minival. These include all images, not just the ones with valid keypoint annotations.

Keypoint-Only Mask R-CNN Baselines Using Precomputed RPN Proposals

backbone	type	lr schd	im/gpu	train mem (GB)	train time (s/iter)	train time total (hr)	inference time (s/im)	box AP	mask AP	kp AP	prop. AR	model id	download links
R-50-FPN	Kps	1x	2	7.7	0.533	13.3	0.081 + 0.087	52.7	-	64.1	-	37651787	model | boxes | kps
R-50-FPN	Kps	s1x	2	7.7	0.533	19.2	0.080 + 0.085	53.4	-	65.5	-	37651887	model | boxes | kps
R-101-FPN	Kps	1x	2	9.4	0.668	16.7	0.109 + 0.080	53.5	-	65.0	-	37651996	model | boxes | kps
R-101-FPN	Kps	s1x	2	9.4	0.668	24.1	0.108 + 0.076	54.6	-	66.0	-	37652016	model | boxes | kps
X-101-64x4d-FPN	Kps	1x	2	12.8	1.477	36.9	0.288 + 0.077	55.8	-	66.7	-	37731079	model | boxes | kps
X-101-64x4d-FPN	Kps	s1x	2	12.9	1.478	53.4	0.286 + 0.075	56.3	-	67.1	-	37731142	model | boxes | kps
X-101-32x8d-FPN	Kps	1x	2	12.9	1.215	30.4	0.219 + 0.084	55.4	-	66.2	-	37730253	model | boxes | kps
X-101-32x8d-FPN	Kps	s1x	2	12.9	1.214	43.8	0.218 + 0.071	55.9	-	67.0	-	37731010	model | boxes | kps

Notes:

Metrics are for the person category only.
Each row uses precomputed RPN proposals from the corresponding table row above that uses the same backbone.
Inference time excludes proposal generation.

End-to-End Keypoint-Only Mask R-CNN Baselines

backbone	type	lr schd	im/gpu	train mem (GB)	train time (s/iter)	train time total (hr)	inference time (s/im)	box AP	mask AP	kp AP	prop. AR	model id	download links
R-50-FPN	Kps	1x	2	9.0	0.832	20.8	0.097 + 0.092	53.6	-	64.2	-	37697547	model | boxes | kps
R-50-FPN	Kps	s1x	2	9.0	0.828	29.9	0.096 + 0.089	54.3	-	65.4	-	37697714	model | boxes | kps
R-101-FPN	Kps	1x	2	10.6	0.923	23.1	0.124 + 0.084	54.5	-	64.8	-	37697946	model | boxes | kps
R-101-FPN	Kps	s1x	2	10.6	0.921	33.3	0.123 + 0.083	55.3	-	65.8	-	37698009	model | boxes | kps
X-101-64x4d-FPN	Kps	1x	2	14.1	1.655	41.4	0.302 + 0.079	56.3	-	66.0	-	37732355	model | boxes | kps
X-101-64x4d-FPN	Kps	s1x	2	14.1	1.731	62.5	0.322 + 0.074	56.9	-	66.8	-	37732415	model | boxes | kps
X-101-32x8d-FPN	Kps	1x	2	14.2	1.410	35.3	0.235 + 0.080	56.0	-	66.0	-	37792158	model | boxes | kps
X-101-32x8d-FPN	Kps	s1x	2	14.2	1.408	50.8	0.236 + 0.075	56.9	-	67.0	-	37732318	model | boxes | kps

Notes:

Metrics are for the person category only.
For these models, RPN and the detector are trained jointly and end-to-end.
Inference time is fully image-to-detections, including proposal generation.
