
The SSD for object detection on Fluid. #7402

Closed
qingqing01 opened this issue Jan 10, 2018 · 2 comments

qingqing01 commented Jan 10, 2018

The details of the SSD algorithm for object detection are not introduced here.

The implementation comparison among Paddle, Caffe, and TensorFlow.

First, compare the SSD implementations of the three frameworks; the correspondence is as follows. The TensorFlow Object Detection API is fine-grained and flexible, but perhaps a little complex. In it, SSD, Faster R-CNN and R-FCN share some implementation, such as the box encoder/decoder, non_max_suppression, the region similarity calculator and so on. The loss implementation in Caffe and Paddle is a coarse-grained operator, which makes it a little hard to read the code and understand the overall algorithm from it. So it may be better to split the loss into several sub-operations.

| Paddle | Caffe | TensorFlow |
| --- | --- | --- |
| PriorBoxLayer | PriorBoxLayer | anchor_generators |
| MultiBoxLossLayer | MultiBoxLossLayer, Transpose, Flatten, Concat | 1. box_coder_builder: box encoder and box decoder<br>2. matcher_builder: argmax_matcher / bipartite_matcher<br>3. region_similarity_calculator_builder: iou/ioa/neg_sq_dist_similarity<br>4. losses_builder: softmax_loss, hard_example_miner, target_assigner, smoothL1, tf.image.non_max_suppression, tf.gather and so on |
| DetectionOutputLayer | DetectionOutputLayer, Transpose, Flatten, Concat | post_processing_builder: box decoder, batch_multiclass_non_max_suppression, multiclass_non_max_suppression, tf.image.non_max_suppression |

SSD on Fluid.

  • 1). anchor_box_op: generate anchor boxes on the fly for one CNN layer.

    • Input:
      • Input(1): the input image with shape [N, C1, H1, W1]
      • Input(2): the layer from which to generate anchor boxes, with shape [N, C2, H2, W2]
    • Attr: min_size(int), max_size(int), aspect_ratio(int), variance(int), flip(bool), clip(bool)
    • Output: anchor boxes with shape [2, H, W, M, 4]. H * W * M is the total number of anchor boxes for Input(2).
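As a rough sketch of what anchor generation for one feature map does (this is plain NumPy, not Fluid code; the function name, argument names and the 2-box-plus-aspect-ratios layout are illustrative assumptions following Section 2.2 of the SSD paper):

```python
import numpy as np

def anchor_boxes_for_layer(feat_h, feat_w, img_h, img_w,
                           min_size, max_size, aspect_ratios):
    # Hypothetical sketch: for each feature-map cell, emit one box of
    # min_size, one of sqrt(min_size * max_size), and one per aspect ratio,
    # all normalized to [0, 1] image coordinates.
    boxes = []
    step_x, step_y = img_w / feat_w, img_h / feat_h
    for i in range(feat_h):
        for j in range(feat_w):
            cx, cy = (j + 0.5) * step_x, (i + 0.5) * step_y
            sizes = [(min_size, min_size),
                     (np.sqrt(min_size * max_size),) * 2]
            for ar in aspect_ratios:
                sizes.append((min_size * np.sqrt(ar), min_size / np.sqrt(ar)))
            for w, h in sizes:
                boxes.append([(cx - w / 2) / img_w, (cy - h / 2) / img_h,
                              (cx + w / 2) / img_w, (cy + h / 2) / img_h])
    # shape [H * W * M, 4], with M = 2 + len(aspect_ratios)
    return np.array(boxes)
```

With a 2x2 feature map and one extra aspect ratio, this yields 2 * 2 * 3 = 12 boxes, matching the H * W * M count above.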
  • 2). Python API of prior_box_op: must handle multiple CNN layers.

    • Args: the arguments in Section 2.2 of the SSD paper
      • a list of CNN layers from which to generate anchor boxes
      • min_ratio, max_ratio, aspect_ratios, anchor_box_variance
      • the minimum dimension of the input image
    • Output:
      • anchor boxes, a Tensor with shape [Np, 4]; Np is the total number of anchor boxes over the multiple CNN layers.
      • the variance of the anchor boxes, a Tensor with shape [Np, 4]
  • 3). iou_similarity_op: compute similarity based on the Intersection over Union (IOU) metric.

    • Input:
      • Input(1): the ground-truth boxes, a LoDTensor with shape [Ng, 4]; Ng is the total number of ground-truth boxes in the batch.
      • Input(2): the generated anchor boxes, a LoDTensor with shape [Np, 4]
    • Output: the IOU metric, a LoDTensor with shape [Ng, Np]
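The IOU computation itself is standard; a minimal NumPy sketch (boxes as [xmin, ymin, xmax, ymax], function name illustrative):

```python
import numpy as np

def iou_similarity(gt, anchors):
    # Broadcast [Ng, 1, 4] against [1, Np, 4] to get an [Ng, Np] IOU matrix.
    gt = np.asarray(gt, dtype=float)[:, None, :]
    an = np.asarray(anchors, dtype=float)[None, :, :]
    # Intersection width/height, clamped at zero for disjoint boxes.
    ix = np.maximum(0, np.minimum(gt[..., 2], an[..., 2]) - np.maximum(gt[..., 0], an[..., 0]))
    iy = np.maximum(0, np.minimum(gt[..., 3], an[..., 3]) - np.maximum(gt[..., 1], an[..., 1]))
    inter = ix * iy
    area_g = (gt[..., 2] - gt[..., 0]) * (gt[..., 3] - gt[..., 1])
    area_a = (an[..., 2] - an[..., 0]) * (an[..., 3] - an[..., 1])
    return inter / (area_g + area_a - inter)  # [Ng, Np]
```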
  • 4). bipartite_match_op

    • Input:
      • Input(1): the IOU metric, a LoDTensor with shape [Ng, Np]
      • Input(2): ground-truth boxes, a LoDTensor with shape [Ng, 4]
    • Output:
      • Output(1): the matched indices, a LoDTensor with shape [N, Np]; N is the batch size and Ng >= N.
      • Output(2): the matched IOU metric, a LoDTensor with shape [N, Np]
      • Output(3): the matched target labels, a LoDTensor with shape [N, Np]. Output(1) saves the ground-truth box index, not the label; this output saves the ground-truth label.
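The greedy bipartite matching used in the Caffe SSD implementation can be sketched for one image as follows (NumPy, not Fluid; -1 marks unmatched anchors, mirroring Output(1) above):

```python
import numpy as np

def bipartite_match(sim):
    # sim: [Ng, Np] IOU matrix for one image. Repeatedly take the globally
    # largest entry, record the match, then exclude that row and column.
    sim = np.asarray(sim, dtype=float).copy()
    ng, npr = sim.shape
    match = -np.ones(npr, dtype=int)  # per-anchor matched gt index, -1 = none
    for _ in range(min(ng, npr)):
        g, p = np.unravel_index(np.argmax(sim), sim.shape)
        if sim[g, p] <= 0:
            break
        match[p] = g
        sim[g, :] = -1  # each gt box matches at most one anchor here
        sim[:, p] = -1  # each anchor matches at most one gt box
    return match
```

(The full SSD matcher additionally matches any anchor whose IOU exceeds a threshold; that step is omitted here for brevity.)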
  • 5). box_coder_op

    • Supports encoding and decoding. Here the inputs are anchor boxes and ground-truth boxes.
    • The output is a LoDTensor with shape [Ng, Np, 4]
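The SSD-style center/size encoding (with per-coordinate variance) that such an operator would implement can be sketched for a single box pair; decode inverts encode (NumPy sketch, function names and default variance are illustrative):

```python
import numpy as np

def encode_box(gt, anchor, variance=(0.1, 0.1, 0.2, 0.2)):
    # Encode one gt box against one anchor (boxes as [xmin, ymin, xmax, ymax]).
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]
    acx, acy = anchor[0] + aw / 2, anchor[1] + ah / 2
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    gcx, gcy = gt[0] + gw / 2, gt[1] + gh / 2
    return np.array([(gcx - acx) / aw / variance[0],
                     (gcy - acy) / ah / variance[1],
                     np.log(gw / aw) / variance[2],
                     np.log(gh / ah) / variance[3]])

def decode_box(code, anchor, variance=(0.1, 0.1, 0.2, 0.2)):
    # Exact inverse of encode_box.
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]
    acx, acy = anchor[0] + aw / 2, anchor[1] + ah / 2
    cx = code[0] * variance[0] * aw + acx
    cy = code[1] * variance[1] * ah + acy
    w = np.exp(code[2] * variance[2]) * aw
    h = np.exp(code[3] * variance[3]) * ah
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
```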
  • 6). softmax_with_loss_op: compute the confidence loss for each prior classification prediction

    • Input:
      • Input(1): Classification prediction input with shape [N, Np, Nc], Nc is the class number.
      • Input(2): matched target label with shape [N, Np]
    • Output:
      • classification loss [N, 1]
  • 7). mine_hard_examples_op

    • Input:
      • Input(1): the classification loss with shape [N, Np]
      • Input(2): the localization loss, if needed. For now, the default demos in Caffe and TensorFlow only use the classification loss.
      • Input(3): the matched indices
    • Output:
      • the negative indices, a LoDTensor with shape [Neg, 1]
      • The matched indices will also be changed: the hard example indices will be labeled -1.
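Hard negative mining selects, per image, the unmatched anchors with the largest classification loss, capped at a fixed ratio to the number of positives. A NumPy sketch for one image (function name and the 3:1 default ratio are illustrative; Caffe's default is also 3):

```python
import numpy as np

def mine_hard_examples(cls_loss, match_indices, neg_pos_ratio=3.0):
    # cls_loss: [Np] per-anchor classification loss for one image.
    # match_indices: [Np] matched gt index per anchor, -1 = unmatched.
    cls_loss = np.asarray(cls_loss, dtype=float)
    match = np.asarray(match_indices)
    neg_mask = match < 0
    num_pos = int((match >= 0).sum())
    num_neg = min(int(neg_pos_ratio * num_pos), int(neg_mask.sum()))
    # Rank the unmatched anchors by loss, descending, and keep the hardest.
    neg_ids = np.where(neg_mask)[0]
    hardest = neg_ids[np.argsort(-cls_loss[neg_ids])][:num_neg]
    return np.sort(hardest)  # selected negative anchor indices
```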
  • 8). target_assign_op

    • Input:
      • Input(1): the localization predictions
      • Input(2): the matched indices after mine_hard_examples_op
      • Input(3): the encoded ground-truth boxes with shape [Ng, Np, 4]
      • Input(4): the variance of the anchor boxes, [Np, 4]
    • Output:
      • The encoded ground-truth boxes for each localization offset prediction, a LoDTensor with shape [N, Np, 4]
  • 9). smooth_l1_op

    • Input:
      • Input(1): the localization offset predictions
      • Input(2): the encoded ground-truth boxes for each localization offset prediction
    • Output:
      • the localization loss, a Tensor with shape [N, 1]
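For reference, the element-wise smooth L1 function (as used for SSD localization loss, here with the common sigma parameterization) is:

```python
import numpy as np

def smooth_l1(pred, target, sigma=1.0):
    # 0.5 * (sigma * x)^2          if |x| < 1 / sigma^2
    # |x| - 0.5 / sigma^2          otherwise
    x = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    return np.where(x < 1.0 / sigma ** 2,
                    0.5 * (sigma * x) ** 2,
                    x - 0.5 / sigma ** 2)
```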
  • 10). batch_multiclass_nms_op

    • Input:
      • Input(1): the decoded localization predictions after box_coder_op.
      • Input(2): the classification predictions
    • Output:
      • a LoDTensor with shape [Ng, 6] (label, score, xmin, ymin, xmax, ymax)
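The single-class greedy NMS at the core of this operator can be sketched as follows (NumPy; the 0.45 default threshold matches the Caffe SSD demo but is an assumption here):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.45):
    # Greedy NMS for one class: keep the highest-scoring box, drop all
    # remaining boxes overlapping it above the threshold, repeat.
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(-np.asarray(scores, dtype=float))
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        ix = np.maximum(0, np.minimum(boxes[i, 2], boxes[rest, 2]) - np.maximum(boxes[i, 0], boxes[rest, 0]))
        iy = np.maximum(0, np.minimum(boxes[i, 3], boxes[rest, 3]) - np.maximum(boxes[i, 1], boxes[rest, 1]))
        inter = ix * iy
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_threshold]
    return keep
```

The batch/multiclass version runs this per class per image and concatenates the results, which is where the [Ng, 6] LoDTensor output shape comes from.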
  • 11). transpose_op, concat_op, softmax_with_loss_op and smooth_l1_op: these operators have already been implemented.

Data Structure

In Caffe, since each input image may have a different number of ground-truth boxes, for convenience of calculation the input Blobs (similar to tensors) for the ground-truth boxes and anchor boxes are converted into std::map<int, vector<NormalizedBBox> > and std::vector<NormalizedBBox>. NormalizedBBox is a struct.

But in Fluid we have LoDTensor, so Tensor (or LoDTensor) in / Tensor (or LoDTensor) out for each operator may be enough.
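To illustrate the LoDTensor idea for variable-length ground truth (a plain-NumPy analogy, not the actual Fluid implementation): a batch of 3 images with 2, 0 and 3 ground-truth boxes is stored as one flat [5, 4] tensor plus an offset vector.

```python
import numpy as np

# Flat storage for all ground-truth boxes in the batch.
gt_boxes = np.array([[0.1, 0.1, 0.3, 0.3],   # image 0
                     [0.5, 0.5, 0.9, 0.9],   # image 0
                     [0.2, 0.2, 0.4, 0.4],   # image 2
                     [0.0, 0.0, 0.5, 0.5],   # image 2
                     [0.6, 0.1, 0.8, 0.3]])  # image 2
lod = [0, 2, 2, 5]  # image i owns rows lod[i]:lod[i + 1]

def boxes_of(i):
    # Recover the variable-length box list for one image.
    return gt_boxes[lod[i]:lod[i + 1]]
```

This avoids the per-image std::map / std::vector conversion Caffe needs.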

If there is any problem with the above descriptions, please help correct it. Thank you.


wanghaox commented Jan 10, 2018

  1. The output shape of anchor_box_op is [2, H, W, M, 4].
  2. Which OP is used to generate the negative indices in Fluid?
  3. Should box_coder_op be designed as an OP, or just implemented as a function?
  4. I think LoDTensor is OK for ground-truth boxes; I used it in my code.
  5. How should DetectionOutputLayer be implemented in Fluid?
  6. What are the roles of mine_hard_examples_op and target_assign_op?

@qingqing01

@wanghaox Thanks for your review. I have updated the descriptions above and added more comments.

  1. Which OP is used to generate the negative indices in Fluid?
  2. How should DetectionOutputLayer be implemented in Fluid?
  3. Should box_coder_op be designed as an OP, or just implemented as a function?
  • The mine_hard_examples_op is used to generate the negative indices.
  • The box_coder_op and batch_multiclass_nms_op are used to get the detection outputs.
  • The smooth_l1_op is used to compute the localization loss, and one of its inputs is the encoded ground-truth boxes. The batch_multiclass_nms_op is used to get the detection outputs, and one of its inputs is the decoded localization predictions. So a dedicated box_coder operator is probably better. Of course, we can reconsider this during implementation.

Thank you.
