
Automated Test / Benchmark research #87

Open
tdurand opened this issue Jun 4, 2019 · 10 comments

Comments

@tdurand
Member

tdurand commented Jun 4, 2019

Add a simple way to test opendatacam so we can avoid regressions when upgrading.

@tdurand
Member Author

tdurand commented Jun 7, 2019

  • count how many distinct ids there are and compare with reality

@tdurand
Member Author

tdurand commented Jun 13, 2019

@b-g, I spent a bit of time thinking about this and here are a couple of things I'd like feedback / input on to be able to move forward. I'm very much looking for an external view here, as I may be overcomplicating this... but it would be useless to spend time building a benchmark that isn't meaningful.

Our initial idea is to compare the number of distinct ids produced by the tracker with a "reality" number that we would have manually determined beforehand on some given footage.

Let's take this "scene" as an example:

[Screenshot from 2019-06-13: example scene]

Observation:

YOLO / the tracker is only able to detect a fraction of the items in this scene; for example, nothing is detected in the upper left corner.

If we were to manually count the number of distinct ids in this scene, we would count everything we see with our eyes, including those cars in the upper left corner.

This isn't a problem; in fact it is good, because it would give us a "score" against reality and enable us to compare different YOLO weights files that might pick up more objects... But the results could be skewed for another reason.

Problem

The problem with the "number of distinct ids" metric is the id reassignment that happens... For example, we could get these results:

  • Reality: 100 different ids for the footage
  • Perfect tracking (no id switches) within the capacity of the object detector: as it is only capable of tracking part of the total number of items, let's say the perfect tracker gets 70 ids, thus an accuracy of 70%
  • Medium tracking: with id reassignments happening, the tracker gets 40 ids + 40 reassignments = 80 ids, so 80% accuracy using our metric, when in reality it should be ~40%
  • Real-world tracking: what is happening right now is even worse than the previous point; the areas where lots of id reassignments happen (where YOLO detects really badly) can generate tens of ids per frame as the tracker constantly loses them... so we could end up with more ids than reality... and have a > 100% accuracy (the short snippet below illustrates this)
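
To make the distortion concrete, a tiny bit of arithmetic on the numbers above (the "real world" figure is just an illustrative stand-in for "more ids than reality"):

```python
# Naive metric: accuracy = distinct tracker ids / distinct ground-truth ids.
# Numbers from the example above; the "real world" value is made up to show
# how id reassignment can push the naive score past 100%.
reality_ids = 100
tracker_runs = {
    "perfect": 70,           # no id switches
    "medium": 40 + 40,       # 40 real tracks + 40 reassignments
    "real world": 40 + 90,   # heavy reassignment in badly detected areas
}
for name, ids in tracker_runs.items():
    print(f"{name}: naive accuracy = {ids / reality_ids:.0%}")
# perfect: 70%, medium: 80% (should be ~40%), real world: 130% (> 100%!)
```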

Ideas of what we can do

I think we need to associate time with the ids... and manually label reality with: "this item is tracked for 10s", etc.

So then we would get (very simple example with 3 items):

Reality:

  • id: 1, timeTracked: 10s
  • id: 2, timeTracked: 7s
  • id: 3, timeTracked: 20s

Tracker results:

  • id: 1, timeTracked: 0.4s (quickly reassigned)
  • id: 2, timeTracked: 0.2s (quickly reassigned)
  • id: 3, timeTracked: 0.8s (quickly reassigned)
  • id: 4, timeTracked: 0.4s (quickly reassigned)
  • id: 5, timeTracked: 0.2s (quickly reassigned)
  • id: 6, timeTracked: 0.8s (quickly reassigned)
  • id: 7, timeTracked: 0.4s (quickly reassigned)
  • id: 8, timeTracked: 0.2s (quickly reassigned)
  • id: 9, timeTracked: 0.8s (quickly reassigned)
  • id: 10, timeTracked: 3s (corresponds to id 1 in reality, but reassigned another time)
  • id: 11, timeTracked: 7s (corresponds to id 1 in reality, under its second id after being lost by the tracker)
  • id: 12, timeTracked: 5s (may correspond to id 2 in reality, picked up and well tracked later on)
  • id: 13, timeTracked: 18s (may correspond to id 3 in reality, picked up and well tracked later on)

Then the question is how to compute a meaningful score from those results; I don't have much of an idea how to do it yet...
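
One possible direction, as a very rough sketch rather than a real metric: assuming we could establish which tracker id corresponds to which ground-truth id (e.g. via bbox overlap), we could reward time coverage but only count the longest uninterrupted fragment per ground-truth id, so that id switches hurt the score. Everything below is hypothetical, nothing from the opendatacam codebase:

```python
# Hypothetical "track continuity" score (sketch only).
# reality: ground-truth id -> timeTracked in seconds (example above)
reality = {1: 10.0, 2: 7.0, 3: 20.0}

# tracker: tracker id -> (corresponding ground-truth id, timeTracked in seconds)
# The correspondence is assumed to be known, e.g. derived from bbox overlap.
tracker = {
    10: (1, 3.0),   # fragment of ground-truth id 1
    11: (1, 7.0),   # second fragment of ground-truth id 1
    12: (2, 5.0),
    13: (3, 18.0),
}

score = 0.0
for gt_id, gt_time in reality.items():
    fragments = [t for (g, t) in tracker.values() if g == gt_id]
    if not fragments:
        continue  # object never tracked at all
    # only the longest uninterrupted fragment counts, so reassignments hurt
    score += min(max(fragments) / gt_time, 1.0)

print(f"continuity score: {score / len(reality):.2f}")  # 1.0 would be perfect
```

This is basically a poor man's version of the "mostly tracked" / fragmentation ideas that the MOT metrics formalise properly (see the next comment).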

More

  • We will also need to take into account the "class" of the items in the benchmark, along with the ids and the time... Right now we are only reasoning in terms of a single class.

  • One thing to do is to investigate how challenges like MOT (https://motchallenge.net/) are doing it... they must face the same issues when benchmarking the submitted tracker algorithms.

@tdurand
Member Author

tdurand commented Jun 13, 2019

MOT Challenge tracker benchmark / evaluation framework

I looked this up; they combine multiple metrics such as false negatives, false positives, id switches, fragmented trajectories, etc.

[Screenshot from 2019-06-13: MOT Challenge metrics table]

There is no clear explanation of how this works, but there is a paper: https://arxiv.org/pdf/1906.04567.pdf , and the code is available: https://bitbucket.org/amilan/motchallenge-devkit (in MATLAB 🤓).

I tried to get some insights from the paper; they have a nice diagram explaining some of the metrics:

[Screenshot from 2019-06-13: diagram of the MOT metrics from the paper]

I didn't fully understand everything, but it seems that for every frame they compare each bbox of the tracker output with the ground truth... Exactly how they do it isn't clear, I would need to investigate the code... But ultimately they seem to extract their metrics from this: false positives, false negatives, etc.

One idea could be to make our tracker compatible with the MOT in/out format and evaluate it there 😉

Pymot

Googling around led me to this GitHub repo: https://github.com/Videmo/pymot , which implements a MOT evaluation in Python (similar to the one of the MOT challenge) and takes JSON as input. There is also a schema that explains the metrics better (dashed lines represent tracker hypotheses, whereas the big circles are the ground truth):

[Diagram: mot-tracks schema from the pymot README]

The README is pretty great and explains:

  • how to format the ground truth and the hypotheses, which is a JSON format very close to what our tracker outputs...

  • that this format is compatible with a tool called sloth that helps to annotate videos (we will need to annotate the ground truth)

  • once we have the ground truth and the tracker output, running the evaluation is just a command: `pymot.py [-h] -a GROUNDTRUTH -b HYPOTHESIS`

  • it also gives insights into how the evaluation is implemented (which I was struggling to understand in the paper), and it's the comeback of the famous Hungarian algorithm: https://github.com/Videmo/pymot#implementation-notes (see the rough sketch below)
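
For intuition, here is a rough sketch of the per-frame matching step these evaluations appear to be built on (this is not pymot's actual code, and the boxes at the bottom are made up): compute an overlap cost between ground-truth boxes and tracker hypotheses, solve the assignment with the Hungarian algorithm, and read misses and false positives off whatever stays unmatched.

```python
# Sketch of per-frame ground-truth <-> hypothesis matching via the Hungarian
# algorithm. Not pymot's implementation, just the general idea.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def match_frame(gt_boxes, hyp_boxes, iou_threshold=0.5):
    """Return (matches, misses, false_positives) for a single frame."""
    cost = np.ones((len(gt_boxes), len(hyp_boxes)))
    for i, g in enumerate(gt_boxes):
        for j, h in enumerate(hyp_boxes):
            cost[i, j] = 1.0 - iou(g, h)            # low cost = good overlap
    rows, cols = linear_sum_assignment(cost)          # Hungarian algorithm
    matches = [(i, j) for i, j in zip(rows, cols)
               if cost[i, j] <= 1.0 - iou_threshold]  # reject weak overlaps
    misses = len(gt_boxes) - len(matches)             # ground truth left unmatched
    false_positives = len(hyp_boxes) - len(matches)   # hypotheses left unmatched
    return matches, misses, false_positives

# Made-up boxes for one frame:
gt = [(100, 100, 50, 80), (300, 120, 60, 90)]
hyp = [(105, 98, 48, 82), (500, 400, 40, 40)]
print(match_frame(gt, hyp))  # ([(0, 0)], 1, 1): one match, one miss, one false positive
```

Counting id switches would additionally require remembering the previous frame's correspondences, which is the part the implementation notes linked above describe.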

If we were to go with this, it seems that most of the work would be annotating the ground truth: taking some test footage, manually labeling all the bboxes in each frame and assigning them ids, and repeating this for XX frames...

Then processing our tracker output into the format expected by this tool would be pretty simple.

Conclusion

This quick research reinforced my opinion that benchmarking the tracker isn't a simple task... but the good news is that there is already plenty of literature / tooling out there... Let's discuss how to move forward on this, to take the project from an empirical "it seems to work nicely" to a more robust evaluation framework 😉

@tdurand
Member Author

tdurand commented Jun 13, 2019

A quick follow-up with some other findings

py-motmetrics

The lowest-effort way to benchmark our tracker would be to test it with the MOT challenge data (for example this video 🤯: https://motchallenge.net/vis/CVPR19-03/gt/ ), as we would not have to annotate ground truth videos ourselves...

I found that there is a Python implementation of the evaluation framework: https://github.com/cheind/py-motmetrics (the official one is in MATLAB).

With this, the only thing to do would be to write a script to convert our JSON output to the MOT challenge input format, which seems pretty straightforward: https://motchallenge.net/instructions/

format:

<frame>, <id>, <bb_left>, <bb_top>, <bb_width>, <bb_height>, <conf>, <x>, <y>, <z>

 example:

  1, 3, 794.27, 247.59, 71.245, 174.88, -1, -1, -1, -1
  1, 6, 1648.1, 119.61, 66.504, 163.24, -1, -1, -1, -1
  1, 8, 875.49, 399.98, 95.303, 233.93, -1, -1, -1, -1
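
As an illustration of how small that conversion script could be, here is a hedged sketch; the shape of our tracker's JSON output is assumed (per-frame objects with an id and a bbox), so the field names, and whether x/y is the top-left corner or the box centre, would have to be adapted to what node-moving-things-tracker actually emits:

```python
# Sketch: convert a hypothetical per-frame tracker JSON output into the MOT
# challenge text format:
# <frame>, <id>, <bb_left>, <bb_top>, <bb_width>, <bb_height>, <conf>, <x>, <y>, <z>
import json

def tracker_json_to_mot(json_path, output_path):
    # Assumed input shape (hypothetical, adapt to the real tracker output):
    # {"1": [{"id": 3, "x": 794, "y": 247, "w": 71, "h": 174}, ...], "2": [...]}
    with open(json_path) as f:
        frames = json.load(f)

    with open(output_path, "w") as out:
        for frame_number in sorted(frames, key=int):
            for obj in frames[frame_number]:
                out.write("{}, {}, {}, {}, {}, {}, -1, -1, -1, -1\n".format(
                    frame_number, obj["id"], obj["x"], obj["y"], obj["w"], obj["h"]))

# tracker_json_to_mot("tracker-output.json", "hypotheses.txt")  # placeholder file names
```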

This would give us a way to rank our tracker within those results, for example: https://motchallenge.net/results/MOT17/

[Screenshot from 2019-06-13: MOT17 results ranking]

@b-g
Member

b-g commented Jun 16, 2019

Hi @tdurand, many thanks for the brilliant write-up! Sounds like the best possible plan to compare our results with those of the MOT challenge. 👍

Q:

  • Can we use the MOT Challenge data without license issues? (I guess we either want to include a test clip or ask the user to download it and put it in the /test folder.)
  • Is there also MOT data that is more car-centric? Nothing wrong with pedestrians, but I think cars are still a bit more important for the project.

(Happy to jump on a quick call on Monday or Tuesday evening if that's faster to discuss.)

@tdurand
Member Author

tdurand commented Jun 17, 2019

Thanks!

  • The MOT challenge license is CC BY-NC-SA 3.0, so I think there is no problem as long as we mention them somewhere.

  • As for a more car-centric dataset, I also thought about this, and unfortunately no... all the datasets are pedestrian-centric... that's a big downside of this idea... otherwise we would need to annotate one video ourselves... to have some ground truth.

Let's discuss how to go forward on this on our next call. I don't plan to make any progress on it this week, I'll be travelling and have other issues I need to work on anyway (web version...).

@shams3049
Contributor

I can initiate a sub-project at my institute and outsource the video annotation.

Are there any tools and sample videos that you recommend?

@tdurand
Member Author

tdurand commented Jun 17, 2019

Hello @shams3049, thanks for this... I will get back to you on it next week 😉

@tdurand
Member Author

tdurand commented Jun 26, 2019

Things to discuss at our next meeting later today @b-g:

  • Whether it makes sense to spend a bit of time (I think a day) to rank our tracker on the MOT challenge footage for v2: there is no car dataset, we still advertise opendatacam v2 as very much car-centric, and it might be hard to make much sense of what our ranking would mean (difficult to communicate clearly on this).

  • Should we already think about organizing the ground-truth annotation of one / several car-centric videos (not for v2, as we won't have time for it)?

  • Or should we only mention this issue / benchmark somewhere in the README of v2, as something we are working on for the next releases?

Bonus:

The V-IOU tracker (something really similar to our approach) ranked 3 in the overall MOT 2019 challenge: bochinski/iou-tracker#7 , and the author will publish a Python implementation soon. This is a good indicator that we took the right approach for tracking, and it will give us more input on how to improve it further.

@tdurand
Member Author

tdurand commented Aug 8, 2019

Some updates on this:

Making node-moving-things-tracker compatible with MOT challenge input / output

This is done, still in a separate branch but pretty much ready to release. I've written a little documentation about it: https://github.com/opendatacam/node-moving-things-tracker/blob/mot/documentation/BENCHMARK.md

Added a new mode in the command line tool:

node main.js --mode motchallenge --input benchmark/MOT17/MOT17-04-DPM/det/det.txt

The work lives in the mot branch: https://github.com/opendatacam/node-moving-things-tracker/pull/13/files

Evaluating node-moving-things-tracker in the MOT Challenge

Using https://github.com/cheind/py-motmetrics , a Python version of the official MATLAB evaluation code of the MOT challenge.

Understanding how to use this tool was a bit tricky, so I documented it: https://github.com/opendatacam/node-moving-things-tracker/blob/mot/documentation/BENCHMARK.md
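
For reference, the core of that evaluation fits in a few lines with py-motmetrics (a sketch following its README; file paths are placeholders, the exact steps I used are in the BENCHMARK.md above):

```python
# Sketch: score a MOT-format tracker output against MOT-format ground truth
# with py-motmetrics. Paths are placeholders.
import motmetrics as mm

gt = mm.io.loadtxt("MOT17-04-DPM/gt/gt.txt", fmt="mot15-2D", min_confidence=1)
ts = mm.io.loadtxt("MOT17-04-DPM/tracker-output.txt", fmt="mot15-2D")

# Match boxes frame by frame (IoU-based, 0.5 threshold) and accumulate events
acc = mm.utils.compare_to_groundtruth(gt, ts, "iou", distth=0.5)

mh = mm.metrics.create()
summary = mh.compute(acc, metrics=mm.metrics.motchallenge_metrics, name="MOT17-04-DPM")
print(mm.io.render_summary(summary, formatters=mh.formatters,
                           namemap=mm.io.motchallenge_metric_names))
```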

Some learnings to discuss:

  • Out of the box, the py-motmetrics tool only gives a way to benchmark the tracker, not the detections (YOLO)... but it seems possible to use it to benchmark the detections as well (the MOT challenge has two distinct rankings, one for detections and one for tracking).

  • We can't rank node-moving-things-tracker against other results of the MOT challenge because we do not have access to the ground truth data for the "test" dataset, or maybe I just didn't find it... (It makes total sense not to publish it before the end of the competition, but afterwards...) So I think the way to rank against the other competitors would be to create an account on the MOT challenge and see if we can submit our results for some past competition...

  • I ran it on only a single training dataset of MOT17: https://motchallenge.net/vis/MOT17-04-DPM , feeding the provided detections to our tracker and producing a compatible result.

FYI, I got this result:

              IDF1   IDP   IDR  Rcll  Prcn GT MT PT ML    FP    FN IDs   FM  MOTA  MOTP
MOT17-04-DPM 28.6% 34.4% 24.5% 42.8% 60.0% 83  8 43 32 13558 27210 355  549 13.5% 0.224
OVERALL      28.6% 34.4% 24.5% 42.8% 60.0% 83  8 43 32 13558 27210 355  549 13.5% 0.224

I can't rank against other competitors as the ranking is done on the test dataset and not the training dataset... That said, our MOTA score seems quite low (13.5%) ;-).

  • But the good thing is that this gives us a way to implement an automated benchmark for the tracker that would run when we make changes to the algorithm; at least we can easily see if the MOTA score changes a lot from one change to another (a rough sketch of such a check is below)...
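
As a sketch of what that automated check could look like (the baseline value, tolerance, and wiring are made up for illustration; the real thing would live in the repo's test task):

```python
# Sketch of a MOTA regression guard: fail when the score drops noticeably
# below a recorded baseline. Numbers are illustrative only.
import sys

BASELINE_MOTA = 0.135   # last recorded score on MOT17-04-DPM (13.5%)
TOLERANCE = 0.01        # allow small run-to-run fluctuations

def check_mota(current_mota: float) -> int:
    if current_mota < BASELINE_MOTA - TOLERANCE:
        print(f"Regression: MOTA {current_mota:.3f} < baseline {BASELINE_MOTA:.3f}")
        return 1
    print(f"OK: MOTA {current_mota:.3f} (baseline {BASELINE_MOTA:.3f})")
    return 0

if __name__ == "__main__":
    # the current MOTA would come from the py-motmetrics run described above
    sys.exit(check_mota(float(sys.argv[1])))
```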

Next steps

I think the next steps would be to:

  • Try to create an account on the MOTChallenge to see if we can submit and rank against other algorithms

  • Make an npm run test task in the node-moving-things-tracker repo that tests the tracker against some MOTChallenge dataset (MOT17 or another) and tracks improvements / regressions.

  • Start improving the tracker based on the new V-IOU paper (Tracker potential improvements node-moving-things-tracker#10) and the CVPR_2019 results: https://motchallenge.net/results/CVPR_2019_Tracking_Challenge/ , where V-IOU ranked 4th.

  • Maybe produce a dataset with more cars, which would make more sense for the current Opendatacam use case.

  • Maybe also benchmark the detections given by different YOLO weights (I think this may be out of the scope of opendatacam... as we do not create our own weights and use the public ones).

@tdurand tdurand modified the milestones: v2.1, v3 Oct 14, 2019
@tdurand tdurand removed this from the v3 milestone Apr 28, 2020
@tdurand tdurand changed the title Automated Test / Benchmark Automated Test / Benchmark research Apr 28, 2020