Train YOLOv3-SPP from scratch to 62.6 mAP@0.5 #310
@Aurora33 the mAPs reported in https://github.com/ultralytics/yolov3#map are using the original darknet weights files. We are still trying to determine the correct loss function and optimal hyperparameters for training in pytorch. There are a few issues open on this, such as #205 and #12. A couple things of note:
I also get 38% mAP up to 170 epochs on the COCO dataset, and the mAP doesn't really change much after that.
@majuncai can you post your results? Did you use
@glenn-jocher
@majuncai I see. The main thing I noticed is that your LR scheduler has not taken effect, since it only kicks in at 80% and 90% of the epoch count. The mAP typically increases significantly after this. Also, multi-scale training has a large effect. Lastly, we reinterpreted the darknet training settings, so we believe you only need 68 epochs for full training. All of these changes have been applied in the last few days. I recommend you
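For readers wondering what that scheduler behaviour looks like, here is a minimal pure-Python sketch (not the repo's actual code; the epoch count, base LR, and gamma are illustrative assumptions) of an LR schedule that drops 10x at 80% and 90% of training:

```python
# Sketch of an LR schedule with 10x drops at 80% and 90% of training.
# Before the first milestone the LR is flat, which is why mAP often
# appears to plateau and then jumps late in training.
def lr_at_epoch(epoch, epochs=68, lr0=0.01, gamma=0.1):
    milestones = [int(epochs * 0.8), int(epochs * 0.9)]  # e.g. 54 and 61
    drops = sum(epoch >= m for m in milestones)          # milestones passed
    return lr0 * gamma ** drops

print(lr_at_epoch(0))    # ~0.01   (before any drop)
print(lr_at_epoch(55))   # ~0.001  (after the 80% milestone)
print(lr_at_epoch(65))   # ~0.0001 (after the 90% milestone)
```

In PyTorch the same behaviour would come from `torch.optim.lr_scheduler.MultiStepLR` with those milestones.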
Hi Glenn, thanks for your great work. For the classification loss, the YOLOv3 paper says they don't use softmax, but nn.CrossEntropyLoss() actually contains a softmax. The paper also uses an ignore threshold. Could these affect the overall mAP?
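To illustrate the distinction being asked about: softmax makes class scores compete (they sum to 1), while independent sigmoids, as in the YOLOv3 paper's multi-label formulation, do not. `nn.CrossEntropyLoss` applies softmax internally, whereas `nn.BCEWithLogitsLoss` treats each class independently. A small pure-Python illustration (the logits are made-up numbers):

```python
import math

def softmax(logits):
    # exponentiate and normalize: classes become mutually exclusive
    exps = [math.exp(x) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid(x):
    # independent per-class probability, as in YOLOv3's multi-label setup
    return 1 / (1 + math.exp(-x))

logits = [2.0, 2.0, -1.0]          # two strong classes, one weak
p_soft = softmax(logits)            # probabilities sum to 1, classes compete
p_sig = [sigmoid(x) for x in logits]

print(sum(p_soft))                  # ~1.0 -> softmax forces a single winner
print(p_sig[0], p_sig[1])           # both ~0.88 -> multi-label is possible
```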
@XinjieInformatik yes, these could affect the mAP greatly, in particular the ignore threshold. For some reason we've gotten reduced mAP with
@glenn-jocher Hi, I am curious how the provided yolov3.pt was obtained. Was it converted from yolov3.weights or trained with the code you shared?
@fereenwong
@glenn-jocher Just finished an experiment training on the full COCO dataset from scratch, using the default hyperparameter values. My model was YOLOv3-320, and I trained to 200 epochs with multi-scale training on and rectangular training off. After running test.py, I managed to get 47.4 mAP, which unfortunately is not the 51.5 corresponding to pjreddie's experiment. I can try training again but this time to 273 epochs, although it seems that each stage before the learning rate decreased per the scheduler had already plateaued, so I don't think it would benefit much. Is there a comprehensive TODO list that you think will improve mAP? I notice you mentioned the 0.7 ignore threshold. What do you mean by this? I searched through the darknet repo and didn't get any relevant 0.7 hits.
@ktian08 ah, thanks for the update! This is actually a bit better than I was expecting at this point, since we are still tuning hyperparameters. The last time we trained fully, we trained 320 also to 100 epochs and ended up at 0.46 mAP (no multi-scale, no rectangular training), using older hyps from about a month ago. The 0.70 is a threshold darknet uses to not punish anchors which aren't the best match but still have an IoU > 0.7 with a target. We use a slightly different method in this repo. Yes, I also agree that training seems to be plateauing too quickly. This could be because our hyp tuning is based on epoch 0 results only, so it may be favoring settings that aggressively increase mAP early on, which may not be best for later epochs. Our hyp evolution code is:
Ah, I see. My repo currently does have the reject boolean set to True, so it is thresholding by iou_t, just with a different value. Are you saying darknet uses 0.7 for this value? I have not begun evolving hyperparameters yet, as the ones I've used were the defaults for yolov3-spp, I believe. However, I've modified my train script to evolve every opt.epochs, because that's how I interpreted the script, rather than evolving based on the first epoch. To accomplish this, I've also changed train to output the best_result (based on 0.5 * mAP + 0.5 * F1) rather than the result from the last epoch, so print_mutations has the correct value. I'll try evolving the hyperparameters based on a smaller number of epochs > 1 and let you know if I get better results.
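As a rough illustration of the iou_t-style rejection being discussed (a simplified sketch, not the repo's build_targets code; the anchor sizes, label box, and threshold are illustrative numbers):

```python
# Anchor-label matching by width/height IoU: anchors whose shape IoU
# with a ground-truth box falls below iou_t are rejected from the loss.
def wh_iou(wh1, wh2):
    # IoU of two boxes sharing the same center, given only (w, h)
    inter = min(wh1[0], wh2[0]) * min(wh1[1], wh2[1])
    union = wh1[0] * wh1[1] + wh2[0] * wh2[1] - inter
    return inter / union

anchors = [(10, 13), (16, 30), (33, 23)]  # example anchor sizes in pixels
label_wh = (14, 28)                       # example ground-truth box size
iou_t = 0.5                               # rejection threshold

kept = [a for a in anchors if wh_iou(a, label_wh) > iou_t]
print(kept)  # [(16, 30)] -> only the well-matched anchor survives
```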
@ktian08 ah, excellent. Hmm, your results are very different from the ones I posted. The more recent results should show almost 0.15 mAP starting at epoch 0, whereas yours start around 0.01 at epoch 0 and increase slowly from there. Clearly my plots show faster short-term results, but I don't know if they are plateauing lower or higher than yours; it's hard to tell. No, the 0.7 value corresponds to a different type of IoU thresholding in darknet than in this repo. Unfortunately we are resource constrained, so we can't evolve as much as we'd like. Ideally you'd probably want to run the evolution off the results of, say, the first 10 or 20 epochs, but we are running it off epoch 0 results, which allows us to evolve many more generations, even though it's unclear whether epoch 0 success correlates 100% with epoch 273 success. Also beware that we have added the augmentation parameters to the
Right, I think you start at 0.15 mAP because you load in the darknet weights as your default setting. I modified the code so I'm truly training COCO from scratch. I'll pull the new hyperparameters and try evolving within 10-20 epochs then. Thanks!
@ktian08 ah yes, this makes sense then; we are comparing apples to oranges. Regarding fitness, I set it as the average of mAP and F1, because I saw that when I set it as only mAP, the evolution would favor high R and low P to reach the highest mAP, so I added the F1 in an attempt to balance it. (See lines 342 to 343 in df4f25e.)
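The fitness described here, assuming a standard F1 definition, can be sketched as follows (the precision/recall/mAP values are hypothetical, just to show why averaging in F1 penalizes high-R/low-P runs):

```python
# Fitness = average of mAP and F1, so evolution cannot chase high
# recall / low precision runs that inflate mAP alone.
def f1(precision, recall, eps=1e-12):
    return 2 * precision * recall / (precision + recall + eps)

def fitness(map50, precision, recall):
    return 0.5 * map50 + 0.5 * f1(precision, recall)

# A skewed high-R/low-P run vs. a balanced run with the same mAP:
skewed = fitness(map50=0.50, precision=0.10, recall=0.90)
balanced = fitness(map50=0.50, precision=0.55, recall=0.55)
print(skewed < balanced)  # True: the F1 term penalizes the skewed run
```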
If you are doing lots of training BTW, you should install Nvidia Apex for mixed precision if you haven't already. This repo will automatically use it if it detects it.
@ktian08 I've added a new issue #392 which illustrates our hyperparameter evolution efforts in greater detail. As I mentioned, with unlimited resources you would ideally evolve the full training results:

`python3 train.py --data data/coco.data --img-size 320 --epochs 273 --batch-size 64 --accumulate 1 --evolve`

But since we are resource constrained we evolve epoch 0 results instead, under the assumption that what's good for epoch 0 is good for full training. This may or may not be true; we simply are not sure at this point.

`python3 train.py --data data/coco.data --img-size 320 --epochs 1 --batch-size 64 --accumulate 1 --evolve`
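The evolve loop amounts to mutate, retrain briefly, and keep the mutation if fitness improves. The following is a toy sketch only: the mutation scale and the stand-in `train_fitness` function are invented for illustration and are not the repo's actual code.

```python
import random

random.seed(0)  # deterministic for the demo

hyp = {'lr0': 0.01, 'momentum': 0.9, 'iou_t': 0.2}  # example hyps

def mutate(hyp, sigma=0.2):
    # multiply each value by a random factor around 1.0
    return {k: v * (1 + sigma * (random.random() * 2 - 1))
            for k, v in hyp.items()}

def train_fitness(hyp):
    # stand-in for a short training run; fictional peak at lr0 = 0.012
    return 1.0 - abs(hyp['lr0'] - 0.012)

best, best_fit = hyp, train_fitness(hyp)
for generation in range(20):
    candidate = mutate(best)
    fit = train_fitness(candidate)
    if fit > best_fit:                 # greedy selection
        best, best_fit = candidate, fit

print(best_fit >= train_fitness(hyp))  # True: never worse than the start
```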
OK, I'll try installing Apex and evolving for as many epochs as I can. Earlier, I made a mistake calculating the mAP for my experiment, as I didn't pass the --img-size parameter to test.py, so my model was tested on size-416 images. My newly calculated mAP is 47.4.
@ktian08 ah I see. I forgot to mention that you should use the
Yep, already using --save-json and best.pt!
@ktian08 I updated the code a bit to add an `--img-weights` option. This seems to show better mAP, at least during the first few epochs, both when training from darknet53 as well as when training with no backbone (0.020 to 0.025 mAP first epoch at 416 without backbone). I don't know what effect it will have long term, however. I am currently training a 416 model to 273 epochs using all the default settings with the
@glenn-jocher Does training seem to improve using --img-weights based on your experiments so far? I am currently retraining on new hyperparameters I got from evolving, but despite the promising mAPs obtained during evolution, I see that the mAP for my new experiment is pretty much the same as my control experiment ~60 epochs in.
@ktian08 I might be seeing a similar effect. It's possible that the first few epochs are much more sensitive to the hyperparameters, and that small changes in them eventually converge to the same result after 50-100 epochs. I'm not sure what conclusion to draw from this, other than that hyperparameter searches based on quick results (epoch 0, epoch 1 results, etc.) may not be as useful as they appear. Oddly enough, I also saw almost no change in mAP at the baseline
@glenn-jocher Hmm... I trained to 20 epochs while tuning hyperparameters, but even when I compare at the 20th epoch I don't see the 7 mAP increase that I should've seen. Maybe --img-size reduces AP for the classes doing well during training, so that the mAP in the end is the same regardless. I noticed that pjreddie describes multi-scale training much differently in the YOLO9000 paper than what is implemented here (scaling from /1.5 to *1.5). He says that every 10 batches he chose a new dimension from 320 to 608, as long as it was divisible by 32, allowing a much larger range for YOLOv3-320, which might help. Does his YOLOv3 repo also implement it like this, or is it your way?
@ktian08 here is the current comparison using all default settings (416, no multi-scale, etc.). I'll update daily (about 40 epochs/day). Training both the full 273 epochs.
@ktian08 this should implement it as closely as possible to darknet. Every 10 batches (i.e. every 640 images) the img-size is randomly rescaled from 1/1.5 to 1.5 times the base size, rounded to the nearest 32-multiple, so from 288-640 for img-size 416 or from 224-480 for img-size 320. I've run AlexeyAB/darknet to verify, and it does the same.
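That rule can be sketched directly (a simplified model of the behaviour described, not the repo's actual training loop):

```python
import random

def scale_range(base, grid=32):
    # img-size bounds: base/1.5 to base*1.5, rounded to nearest 32-multiple
    lo = round(base / 1.5 / grid) * grid
    hi = round(base * 1.5 / grid) * grid
    return lo, hi

def maybe_resize(batch_index, current, base=416, grid=32, every=10,
                 rng=random):
    # every `every` batches, pick a new size from the allowed 32-multiples
    if batch_index % every == 0:
        lo, hi = scale_range(base, grid)
        current = rng.randrange(lo, hi + 1, grid)
    return current

print(scale_range(416))  # (288, 640), matching the numbers above
print(scale_range(320))  # (224, 480)
print(maybe_resize(0, 416) % 32 == 0)  # True: always a 32-multiple
```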
@glenn-jocher Yes, we use Apex for mixed-precision training, but I have never compared against single-GPU training for mAP results due to the long training time such a comparison requires, so I'm unsure whether DistributedSampler influences mAP results.
@glenn-jocher Meanwhile, I don't like using kmeans to search for optimized anchors; as mentioned in the YOLOv4 paper, optimized anchors only improve mAP by 0.9%, but they depend highly on the dataset.
@HwangQuincy our function uses kmeans for an initial guess (but you can also supply your own initial guess), then the anchors are optimized with a genetic algorithm for 1000 generations. But if you don't want to use it, then just supply whatever you prefer in the cfg. There's typically no such thing as 'anchor free'; I'm pretty sure that's just marketing buzz. When you see that, it typically means the anchors are correlated to the output-layer stride rather than to the training labels. Ok, got it on the training. Can you submit a PR for mp.spawn then, and we will review?
I'm pretty sure that CenterNet (Objects as Points) is an anchor-free approach. Most any keypoint-style detector would work similarly, and they do not rely on anchor priors.
@HamsterHuey ah, that's a good point. I haven't played around with CenterNet so I'm not too familiar with their method. The main thing about 'anchors' is that they are simply a normalization of the outputs. All neural networks perform best when the inputs and outputs are normalized (i.e. mean of 0 and standard deviation of 1), not only CNNs but even the simplest regression networks, so this practice is really widespread across all of ML.
@glenn-jocher - While it is true that anchors help regress out bboxes, they are also a bit of a hack (imo) to overcome another aspect of YOLO and similar anchor-based approaches: their output feature map is substantially lower dimensional (spatially) than the input image. In YOLOv2 the "grid cell" was 32x smaller than the input image dimensions, so a 416 x 416 image resulted in a 13 x 13 output set of grid cells. The problem with grid cells is that each one needs to be able to make predictions for all objects whose center lies within its cell, and oftentimes there are several objects within a grid cell due to the large reduction factor / network stride associated with each cell. So you have anchor boxes to help regress out box dimensions, and a pretty complicated set of logic in the loss function that determines which anchor gets assigned a ground-truth annotation for a given grid cell, how to handle loss calcs for the other non-assigned anchors, etc. Keypoint-styled networks like CenterNet discard a lot of this because their output feature map is much larger (only 4x smaller than the input image), so each grid cell can directly be responsible for a single class detection. Since each cell is responsible for a single detection, there is no need for anchor boxes or any complex assignment of ground-truth annotations to the correct anchor for loss calculations. It simplifies the loss function drastically, and most surprisingly, the bbox width and height can be directly regressed out via an L1 loss. Like @HwangQuincy, I've never been a big fan of anchor boxes, as they have always seemed more of a band-aid to fix a limitation of the underlying approach to detection: you end up with a coupling between your trained network weights and the dataset you trained on.

The implications for transfer learning are also not often noted: if you need to tweak the number of anchors for your final dataset of interest, you are still stuck with the COCO anchors if you wish to rely on the pretrained COCO weights. All that being said, I don't think customized anchors make a huge difference one way or the other. It has always struck me as a bit of a hacky way to overfit to a specific dataset to eke out max performance, but that's just my opinion 😃 Btw, nice job keeping up with maintaining this repo and bringing some sanity to the world of PyTorch forks of YOLO 😄
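The stride arithmetic in the discussion above is easy to check: grid cells per side is just input size divided by network stride, so a stride-4 head has 64x as many cells as a stride-32 head.

```python
# Grid size = input size / network stride (both heads assume a 416 input).
def grid_cells(img_size, stride):
    return img_size // stride

print(grid_cells(416, 32))  # 13  -> YOLOv2-style 13 x 13 grid
print(grid_cells(416, 4))   # 104 -> CenterNet-style head: 64x more cells
```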
@HamsterHuey yes, this is a pretty good summary. Actually, the original YOLO was anchor free, regressing the width and height directly, as you say the keypoint detectors do. Do the keypoint detectors do the regression in a normalized space, or in actual pixel coordinates? The strides in YOLOv3 are 32, 16, and 8, though you could add additional layers for, say, stride 64 or stride 4. It would be nice to unify the outputs somehow into a more uniform strategy as you say, i.e. simply use the stride 8 output and a single anchor per grid cell for example, but there are tradeoffs involved in these other approaches, which is probably why we are where we are today with the anchors. I looked at the CenterNet repo BTW, and it seems to produce lower mAP than this repo at a much slower inference time, so it's possible you are seeing some of those tradeoffs in practice there. I'm open to new ideas of course. Our mission is to produce the most robust and usable training and detection repos across all possible custom datasets, so anchor robustness is part of that. BTW, we have been developing a new repo these last couple of months that should be released publicly in the next few weeks. It aims to improve training robustness and user friendliness across the training process, and of course improve training results as well. This new repo includes an anchor analysis before training to inform you of the best possible recall (BPR) etc. given the supplied settings, a step which is missing now.
Nice to hear about your new repo. Out of curiosity, do you have documentation somewhere of the differences between regular YOLOv3-SPP and the ultralytics approach that helped achieve the higher results listed in the main Readme? I'd love to get a sense of the main learnings you've had on that front. I think comparing detectors is always hard, as the YOLOv4 paper also shows, because there are so many pieces to maximizing performance on a given dataset. As the jump from YOLOv3 -> YOLOv3-SPP -> YOLOv3-SPP-Ultralytics shows, there are many tweaks one can make to eke out more performance. I don't necessarily think CenterNet is better, though it also has not been optimized nearly as much as some of the work that has gone on both here with your tireless efforts and @AlexeyAB's efforts on his darknet fork. To your question regarding the loss function in CenterNet, the width and height are directly regressed out in grid-cell space (so a 4x-reduced coordinate space compared to the input image). Having spent a lot of time with a custom YOLOv2 implementation and now with CenterNet, I will say that some of the things I like about the latter are that the loss function is astonishingly simple, easy to understand, and cheap to compute, which results in fewer bottlenecks when training; it also does pretty well at inference without the need for NMS, which can be nice in compute-limited situations. But I think there are always tradeoffs in any approach, so I'm always curious to see how different approaches do and to learn of new findings that improve performance.
@HamsterHuey ah, that's very interesting, I didn't realize you could skip the NMS there. Yes, it is true that the loss function is very complicated here, in particular with regard to constructing the targets appropriately and matching anchors to labels, as you mentioned. And yes, you are also correct that the implementation matters a lot, as even identical architectures can return greatly different results.
@HamsterHuey and to answer your question, we do not have a consolidated changelist unfortunately. With limited resources, the documentation unfortunately takes a backseat to the development work. Hopefully at some point later this year we can write a paper to document the new repo.
@glenn-jocher I have found the reason for the mAP difference: I set accumulate = num_gpus * original accumulate while setting batch_size = original batch_size / num_gpus (the batch-size division is correct when we use DistributedSampler, as in the PyTorch examples).
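One way to sanity-check such settings is to track the effective batch size per optimizer step. Assuming gradients are averaged across GPUs and accumulated steps (a simplified model, not the repo's code), the arithmetic looks like this:

```python
# Images contributing to one optimizer step across all GPUs. With
# DistributedSampler each GPU already sees a distinct shard, so dividing
# the per-GPU batch AND multiplying accumulate double-counts the scaling.
def effective_batch(batch_size, accumulate, num_gpus):
    return batch_size * accumulate * num_gpus

base = effective_batch(batch_size=64, accumulate=1, num_gpus=1)

# divide the batch across 4 GPUs, keep accumulate unchanged
split = effective_batch(batch_size=64 // 4, accumulate=1, num_gpus=4)

# divide the batch AND multiply accumulate by the GPU count
scaled = effective_batch(batch_size=64 // 4, accumulate=1 * 4, num_gpus=4)

print(base, split, scaled)  # 64 64 256 -> the last differs 4x from baseline
```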
Sorry, I have modified your original code, and the modifications include the loss computation, target (positive sample) generation, and network model definition (without .cfg) for faster introduction of new models. Thus my code with mp.spawn can't be immediately submitted. When I have some idle time, I will submit a PR with mp.spawn and DistributedSampler.
@HamsterHuey Anchor-free methods include keypoint-based and center-based ones. As is well known, YOLOv1 is the first center-based anchor-free method. I prefer the center-based anchor-free methods, and I guess there is no performance gap between keypoint-based and center-based anchor-free methods provided the annotation is in the form of a bounding box. As stated in ATSS (https://arxiv.org/pdf/1912.02424.pdf), there is no significant performance gap between center-based anchor-free and anchor-based methods.
Hi, any suggestions for finding the code for the genetic algorithm that optimizes the anchors?
Lines 655 to 660 in d6d6fb5
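In the same spirit as the referenced anchor-evolution code, here is a toy mutate-and-select sketch. The labels, initial anchors, mutation range, and fitness metric (mean best-anchor IoU) are invented for illustration; this is not the repo's actual kmeans-seeded implementation.

```python
import random

random.seed(0)  # deterministic for the demo

def wh_iou(a, b):
    # shape IoU of two (w, h) boxes sharing a center
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def mean_best_iou(anchors, labels):
    # for each label, IoU with its best-matching anchor, averaged
    return sum(max(wh_iou(a, l) for a in anchors) for l in labels) / len(labels)

labels = [(12, 14), (30, 22), (15, 28), (60, 45), (11, 13)]  # toy boxes
anchors = [(10, 13), (33, 23)]    # initial guess (e.g. from kmeans)

best, best_fit = anchors, mean_best_iou(anchors, labels)
for _ in range(1000):             # generations
    mutated = [(w * random.uniform(0.9, 1.1), h * random.uniform(0.9, 1.1))
               for w, h in best]
    fit = mean_best_iou(mutated, labels)
    if fit > best_fit:            # keep only improving mutations
        best, best_fit = mutated, fit

print(best_fit >= mean_best_iou(anchors, labels))  # True: never worse
```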
@HwangQuincy maybe I'll try to implement mp.spawn on my own, as the improvement does seem worth the effort. Can you point me to the best example you know of to get started?
@glenn-jocher Sorry to reply so late. The best example that I know of is https://github.com/pytorch/examples/blob/master/imagenet/main.py, where mp.spawn is used for distributed training.
Ok got it, thanks! I'll try it out this week.
@feixiangdekaka yes, your observation is correct: a number multiplied by zero will be zero.
@glenn-jocher I found that you set the parameter world_size=1 during the initialization of distributed training. What is the special significance of this? Why not use the default parameters so that all available processes are used automatically?
@TAOSHss I believe these are the PyTorch defaults from tutorial code that was used for multi-GPU training. We've received comments recently that mp.spawn performs better in multi-GPU contexts, btw.
@HwangQuincy is there a chance you might be able to help us with our multi-GPU strategy for https://github.com/ultralytics/yolov5? A couple of users are working on a DDP PR (ultralytics/yolov5#401) with mixed results. Since you got an mp.spawn implementation working so well here, I thought you might be the best expert to consult on this. Our current multi-GPU strategy with YOLOv5 is little changed from what we have here; it is DDP with world size 1. It's producing acceptable results, but I don't want to leave any improvements on the table if we can incorporate them.
Hi,
Thanks for sharing your work!
I would like to know what configuration you used to train yolov3.cfg to get 55% mAP.
We tried 100 epochs but got an mAP (35%) that doesn't really change much after that, and the test loss starts to diverge a little.
Why do you give such a high gain to the confidence loss?
Thanks in advance for your reply.