mAP is wrong if all scores are equal (=not providing a score) #46
Hi! Thanks for the feedback. I try to keep the library up to date and introduce many new changes; pycocotools has lost its relevance in both speed and behavior, and my library tries to support all *coco standards (lvis, coco, objects365). Let's think together about how to change the code so that the metric is computed the way you want. Here is approximately the method where the calculation identical to pycocotools is made. Do you suggest not sorting by the score value at all, taking the detections "as is"? Or sorting in some other way?

faster_coco_eval/csrc/faster_eval_api/coco_eval/cocoeval.cpp, lines 364 to 398 in 1caac51
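For reference, here is a simplified Python sketch (not the referenced C++ code) of the accumulation step that pycocotools and cocoeval.cpp perform: detections are sorted by descending score, and the precision/recall points come from cumulative TP/FP counts. Matching, IoU thresholds and ignore handling are omitted; the function name and inputs are illustrative.

```python
import numpy as np

def pr_curve_sorted_by_score(scores, is_tp, num_gt):
    """Simplified sketch of the pycocotools/cocoeval.cpp accumulation step:
    detections are sorted by descending score, then precision/recall are
    built from cumulative TP/FP counts, one detection at a time.
    (Matching, IoU thresholds and ignore handling are omitted here.)"""
    order = np.argsort(-np.asarray(scores), kind="mergesort")  # stable sort, like pycocotools
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(fp)
    recall = tp_cum / num_gt
    precision = tp_cum / (tp_cum + fp_cum)
    return recall, precision

# With all scores equal, the "sort" keeps the (arbitrary) input order,
# so the curve is still built one prediction at a time.
r, p = pr_curve_sorted_by_score([1.0, 1.0, 1.0], [1, 0, 1], num_gt=2)
print(list(zip(r, p)))  # three intermediate points, although no threshold separates them
```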
@voegtlel Is it still relevant? Can you check whether the code from the PR is suitable for your tasks?
@voegtlel Is it still relevant?
Hi @MiXaiLL76, I am sorry for not responding, I was out a bit longer. It is still relevant for us; thank you very much for the quick response and even an implementation! We're still discussing how the PR curve, and thus the AP, should look for predictions collapsed to one point (i.e. all scores equal) and what the AP should be. It seems non-trivial to derive this.

A bit of background on the discussion: the original formulation (Definition 1) of AP states

$$\mathrm{AP} = \int_0^1 p(r)\,dr,$$

discretizing it to

$$\mathrm{AP} \approx \sum_n (R_n - R_{n-1})\,P_n.$$

Now, with multiple predictions collapsing to the same score, that sum does not hold. Only two points on that curve are certain. For consistency with the existing implementation, your version would probably be suited. We discussed several options for how the PR curve, and thus the AP, should be defined beyond those two certain points.
In general, one could argue that with only one point on that curve, approximating the integral by that sum does not make sense at all, and it somewhat contradicts the initial purpose of mAP. Still, to stay comparable, it should give a useful value. I'd be interested to hear your opinion, and even more so other opinions.
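To make the problem concrete, a small self-contained numpy example (not library code; the helper below is illustrative and mirrors only the interpolation scheme, not the matching logic) shows how the cumulative-sum AP depends on the arbitrary order of tied-score detections:

```python
import numpy as np

def coco_style_ap(is_tp, num_gt):
    """AP from a fixed detection order: cumulative TP/FP counts, precision
    made monotonically non-increasing, then sampled at 101 recall thresholds
    (mirroring the scheme pycocotools uses, without the matching logic)."""
    tp = np.cumsum(np.asarray(is_tp, dtype=float))
    fp = np.cumsum(1.0 - np.asarray(is_tp, dtype=float))
    recall = tp / num_gt
    precision = tp / (tp + fp)
    # enforce monotonicity from the right, as pycocotools does
    for i in range(len(precision) - 1, 0, -1):
        precision[i - 1] = max(precision[i - 1], precision[i])
    rec_thrs = np.linspace(0.0, 1.0, 101)
    idx = np.searchsorted(recall, rec_thrs, side="left")
    sampled = np.zeros_like(rec_thrs)
    valid = idx < len(precision)
    sampled[valid] = precision[idx[valid]]
    return sampled.mean()

# One ground-truth object, two detections with identical scores:
# one true positive, one false positive. Only their (arbitrary) order differs.
print(coco_style_ap([1, 0], num_gt=1))  # TP first -> AP = 1.0
print(coco_style_ap([0, 1], num_gt=1))  # FP first -> AP = 0.5
```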
Looks interesting; I will try to implement this in my library. I am even ready to implement separate functions in C++ for this task and provide the ability to call them via parameters. My current implementation, which I showed in the PR, is just a prototype, but based on it I believe we will be able to implement what we need.
Ah sorry, I didn't make my point clear enough: one of those variants should be correct, but we cannot settle on which one it is. Also referencing this paper: https://arxiv.org/abs/2403.13064 — in G.4, Fig. 14, they arrive at a plot which resembles one of the variants. So, right now, I still cannot tell you which PR curve is theoretically correct. I guess we should really settle that from a theoretical standpoint first, and then fix the functions. Maybe others will find this issue and have a better argument for one of the variants.
My colleague writes:
In general, this can be done, and it is a pretty logical option: colleagues from torchmetrics ran into such a problem during tests (they submitted the same score everywhere) and received different results. Perhaps it is worth informing the user when such a situation occurs. But what do we do about the fact that your model does not produce a score? How will you test it? Do you want me to help integrate these test metrics into my library?
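As a hedged illustration of such a user warning (a hypothetical helper, not an existing faster-coco-eval API):

```python
import warnings
import numpy as np

def check_scores(scores):
    """Hypothetical pre-check (not part of faster-coco-eval): warn the user
    when every detection carries the same score, since the resulting
    PR curve / mAP then depends on an arbitrary ordering."""
    scores = np.asarray(scores, dtype=float)
    if scores.size > 1 and np.all(scores == scores[0]):
        warnings.warn(
            "All detection scores are identical; the score-based ranking is "
            "arbitrary and the reported AP may be misleading.",
            UserWarning,
        )

check_scores([1.0, 1.0, 1.0])  # emits the warning
```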
(Copying this bug report from the main COCO metrics repository: cocodataset/cocoapi#678)
Hi there,
Describe the bug
Our detector does not output scores, so we set them all to `1`, which gives wrong results with the COCO metrics. We know the metrics are written under the assumption that scores exist, but I believe the docs should clearly state that the mAP is not correct if the scores are not set.

To Reproduce
More details and an analysis of the cause follow:
Example with source code
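A minimal sketch of such a reproduction, assuming pycocotools and an illustrative in-memory dataset (the image size, boxes, and category below are made up for illustration, not the original example):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Illustrative ground truth: one image with two boxes (made-up values).
gt = {
    "images": [{"id": 1, "width": 100, "height": 100}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [10, 10, 20, 20], "area": 400, "iscrowd": 0},
        {"id": 2, "image_id": 1, "category_id": 1,
         "bbox": [60, 60, 20, 20], "area": 400, "iscrowd": 0},
    ],
    "categories": [{"id": 1, "name": "object"}],
}
coco_gt = COCO()
coco_gt.dataset = gt
coco_gt.createIndex()

# Detections from a detector that produces no scores: all scores set to 1.0.
# Two detections match the ground truth, one is a false positive.
dets = [
    {"image_id": 1, "category_id": 1, "bbox": [10, 10, 20, 20], "score": 1.0},
    {"image_id": 1, "category_id": 1, "bbox": [60, 60, 20, 20], "score": 1.0},
    {"image_id": 1, "category_id": 1, "bbox": [0, 0, 5, 5], "score": 1.0},
]
coco_dt = coco_gt.loadRes(dets)

ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
ev.evaluate()
ev.accumulate()
ev.summarize()  # the reported AP depends on the order of the tied detections
```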
Output of the example source code
Output will be:
The cause is the following: to compute the AP, a discrete precision-recall curve is built prediction by prediction, sorted by score. But since the score is the same for all predictions, they should actually be considered all at once: there is no score threshold that includes one of the tied predictions but excludes another, so the result should be independent of their order.
Thus, the resulting PR-curves are different and not correct:
Reference code for plotting
The cause for this issue lies here: https://github.com/cocodataset/cocoapi/blob/8c9bcc3cf640524c4c20a9c40e89cb6a2f2fa0e9/PythonAPI/pycocotools/cocoeval.py#L378-L379
where the `tp_sum` and `fp_sum` are computed as cumulative sums. This is wrong if the scores are equal: the cumulative sum should then include all tied predictions at once. It may only increment when the score changes from one prediction to the next; otherwise all tied positions must hold the same value, or, for efficiency, be collapsed.
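A rough sketch of that collapsing step, assuming the detections are already sorted by descending score (the helper name is illustrative, not library code):

```python
import numpy as np

def collapsed_cumsum(values, scores):
    """Sketch of the collapsing idea described above: for detections already
    sorted by descending score, every position inside a run of equal scores
    takes the cumulative sum at the END of that run, so tied predictions
    contribute as one block instead of one-by-one."""
    values = np.asarray(values, dtype=float)
    scores = np.asarray(scores, dtype=float)
    csum = np.cumsum(values)
    out = np.empty_like(csum)
    start = 0
    # boundaries where the score actually changes, plus the final position
    ends = np.flatnonzero(np.diff(scores) != 0).tolist() + [len(scores) - 1]
    for end in ends:
        out[start:end + 1] = csum[end]
        start = end + 1
    return out

scores = np.array([1.0, 1.0, 1.0])    # all detections share one score
tps = np.array([1, 0, 1])
print(np.cumsum(tps))                 # [1 1 2]      -> three intermediate PR points
print(collapsed_cumsum(tps, scores))  # [2. 2. 2.]   -> one operating point
```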
Expected behavior

If there is no score, the PR curve reduces to a single (precision, recall) point computed over all predictions, since there is no score separating the predictions. The Average Precision then equals the Precision.
Effectively, this fix could be added on top of the current implementation (e.g. via a switch that allows for equal scores) so the existing code does not have to be modified.
I'm not sure whether faster-coco-eval aims to be equivalent to cocoapi and whether this is even an option. But cocoapi seems to be stale: lots of open PRs and no changes in four years.
Thanks!