mAP is wrong if all scores are equal (=not providing a score) #46

Open · voegtlel opened this issue Oct 31, 2024 · 11 comments

voegtlel commented Oct 31, 2024

(Copying this bug report from the main COCO metrics repo: cocodataset/cocoapi#678)

Hi there,

Describe the bug

Our detector does not output scores, so we set them all to 1, which gives wrong results with the COCO metrics. We know the metrics are written assuming that scores exist, but I believe the docs should clearly state that the mAP is not correct if the scores are not set.

To Reproduce

More details and an analysis of the cause follow:

Example with source code

import faster_coco_eval

# Replace pycocotools with faster_coco_eval
faster_coco_eval.init_as_pycocotools()

import json
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

if __name__ == "__main__":
    # GT
    gt = {
        "categories": [
            {"id": 1, "name": "a"},
        ],
        "annotations": [
            {"image_id": 1, "bbox": [0, 0, 10, 10], "category_id": 1, "id": 1, "iscrowd": 0, "area": 100, "segmentation": []},
            {"image_id": 1, "bbox": [20, 20, 30, 30], "category_id": 1, "id": 3, "iscrowd": 0, "area": 100, "segmentation": []},
            {"image_id": 1, "bbox": [30, 30, 40, 40], "category_id": 1, "id": 4, "iscrowd": 0, "area": 100, "segmentation": []},
        ],
        "images": [
            {"id": 1, "file_name": "image.jpg"},
        ],
    }
    with open("gt.json", "w") as f:
        json.dump(gt, f, indent=2)

    # Pred 1
    pred = [
        {"image_id": 1, "bbox": [0, 0, 10, 10], "category_id": 1, "score": 1, "id": 1, "segmentation": []},
        {"image_id": 1, "bbox": [10, 10, 20, 20], "category_id": 1, "score": 1, "id": 2, "segmentation": []},
        {"image_id": 1, "bbox": [20, 20, 30, 30], "category_id": 1, "score": 1, "id": 3, "segmentation": []},
    ]
    with open("pred1.json", "w") as f:
        json.dump(pred, f, indent=2)

    # Pred 2
    pred = [
        {"image_id": 1, "bbox": [0, 0, 10, 10], "category_id": 1, "score": 1, "id": 1, "segmentation": []},
        {"image_id": 1, "bbox": [20, 20, 30, 30], "category_id": 1, "score": 1, "id": 2, "segmentation": []},  # Swapped this box with the next
        {"image_id": 1, "bbox": [10, 10, 20, 20], "category_id": 1, "score": 1, "id": 3, "segmentation": []},
    ]
    with open("pred2.json", "w") as f:
        json.dump(pred, f, indent=2)

    coco = COCO("gt.json")

    pred = coco.loadRes("pred1.json")
    eval = COCOeval(coco, pred, 'bbox')
    eval.evaluate()
    eval.accumulate()
    eval.summarize()

    pred = coco.loadRes("pred2.json")
    eval = COCOeval(coco, pred, 'bbox')
    eval.evaluate()
    eval.accumulate()
    eval.summarize()

Output of the example source code

Output will be:

 [...]
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.663
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.663
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.663
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.663
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.333
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.667
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.667
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.667
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
 [...]
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.554
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.554
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.554
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.554
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.333
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.667
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.667
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.667
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000

The cause is as follows: to compute the AP, a discrete precision-recall curve is built prediction by prediction, sorted by score. But since the score is the same for all predictions, they should be considered all at once, because no score threshold can include one of them while excluding another (the result should be independent of their order).

Thus, the resulting PR-curves differ between the two orderings and are not correct:

[images: PR curves for pred1 and pred2]
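For reference, here is a minimal, self-contained sketch (not the pycocotools/faster-coco-eval code; the function name and the interpolation are my own simplification) showing how the 101-point interpolated AP changes with the order of the tied detections in this example. Which ordering corresponds to pred1 or pred2 depends on the tie-breaking of the internal sort, but the two values match the outputs above:

import numpy as np


def interpolated_ap(flags, num_gt, rec_thrs=np.linspace(0, 1, 101)):
    # flags: True for a matched (TP) detection, False for an unmatched (FP) one,
    # already in the order produced by the score sort.
    flags = np.asarray(flags, dtype=bool)
    tp = np.cumsum(flags)
    fp = np.cumsum(~flags)
    recall = tp / num_gt
    precision = tp / (tp + fp)
    # COCO-style interpolation: precision at a recall threshold is the maximum
    # precision achieved at any recall >= that threshold.
    q = np.zeros_like(rec_thrs)
    for i, r in enumerate(rec_thrs):
        mask = recall >= r
        q[i] = precision[mask].max() if mask.any() else 0.0
    return q.mean()


print(interpolated_ap([True, False, True], num_gt=3))  # ~0.554
print(interpolated_ap([True, True, False], num_gt=3))  # ~0.663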

Reference code for plotting

import matplotlib.pyplot as plt
import numpy as np


def plot_pr_curves(eval_results, cats, output_dir="."):
    """
    Function to plot Precision-Recall curves based on the accumulated results from COCOeval.
    """
    # Extract the necessary evaluation parameters
    params = eval_results['params']
    precision = eval_results['precision']
    #recall = eval_results['recall']
    iouThrs = params.iouThrs  # IoU thresholds
    catIds = params.catIds    # Category IDs
    areaRngLbl = params.areaRngLbl  # Labels for area ranges
    recThrs = np.array(params.recThrs)  # Recall thresholds
    maxDets = params.maxDets  # Max detections

    k = 0  # category = a
    a = 0  # area range = all
    m = 2  # max detections = 100
    t = 0  # IoU threshold = 0.5

    pr = precision[t, :, k, a, m]

    # Create the plot
    plt.figure()
    plt.plot(recThrs, pr, marker='o', label=f"IoU={iouThrs[t]:.2f}")
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title(f"Precision-Recall Curve\nCategory: {cats[catIds[k]]['name']}, Area: {areaRngLbl[a]}, MaxDets: {maxDets[m]}")
    plt.legend()

    # Create a unique filename based on category, IoU, area, and maxDet
    plt.savefig(f"{output_dir}/PR_Curve_cat{cats[catIds[k]]['name']}_iou{iouThrs[t]:.2f}_area{areaRngLbl[a]}_maxDet{maxDets[m]}.png")
    plt.close()


if __name__ == "__main__":
    ...
    plot_pr_curves(eval.eval, coco.cats, "./")

The cause for this issue lies here: https://github.com/cocodataset/cocoapi/blob/8c9bcc3cf640524c4c20a9c40e89cb6a2f2fa0e9/PythonAPI/pycocotools/cocoeval.py#L378-L379

where tp_sum and fp_sum are computed as cumulative sums. This is wrong when the scores are equal: the cumulative sum should then already contain all of the tied predictions. It should only advance when the score changes from one prediction to the next; within a block of equal scores, all entries must take the same value (or, for efficiency, be collapsed into one).
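As a rough illustration of that idea (a sketch with my own naming, not the pycocotools implementation), the cumulative counts could be broadcast to the end of each tie block so that every tied detection shares a single (precision, recall) point:

import numpy as np


def pr_with_ties(scores, is_tp, num_gt):
    # Cumulative TP/FP counts that treat equal scores as a single operating point.
    scores = np.asarray(scores, dtype=float)
    is_tp = np.asarray(is_tp, dtype=bool)
    order = np.argsort(-scores, kind="mergesort")  # sort by descending score
    s, tp_flags = scores[order], is_tp[order]
    tp_cum = np.cumsum(tp_flags)
    fp_cum = np.cumsum(~tp_flags)
    # For each detection, take the cumulative count at the END of its tie block;
    # searchsorted on the ascending negated scores finds that end index.
    block_end = np.searchsorted(-s, -s, side="right") - 1
    tp_cum, fp_cum = tp_cum[block_end], fp_cum[block_end]
    recall = tp_cum / num_gt
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1)
    return recall, precision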

Expected behavior

If there is no score, the PR-curve reduces to a single (precision, recall) point computed over all predictions, since no score separates them. Thus, the average precision equals the precision.
Effectively, this fix could be added on top of the current implementation (e.g. a switch that allows equal scores) so that the existing code does not have to be modified.
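A minimal sketch of what such a switch could route to (the function name is hypothetical, not an existing faster-coco-eval option): when all scores are tied, report the single operating point directly instead of the score-ranked accumulation.

def single_point_ap(num_tp, num_fp, num_gt):
    # With all scores tied there is only one operating point, so the proposal
    # is AP = precision and AR = recall of that point. A flag in the evaluator
    # could route to this instead of the score-ranked accumulation.
    precision = num_tp / max(num_tp + num_fp, 1)
    recall = num_tp / max(num_gt, 1)
    return precision, recall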

I am not sure whether faster-coco-eval aims to be equivalent to cocoapi and whether this is even an option. But cocoapi seems to be stale: lots of open PRs and no changes in four years.

Thanks!

MiXaiLL76 (Owner) commented Oct 31, 2024

Hi! Thanks for the feedback. I try to keep the library up to date and introduce many new changes.

pycocotools has lost its relevance in both speed and behavior; my library tries to support all *coco standards (LVIS, COCO, Objects365).

Let's think together about how to change the code so that the metrics are calculated the way you want.

Here is roughly the place where the calculation identical to pycocotools is done. Do you suggest not sorting by the score value? Taking the detections "as is"? Or sorting them in some other way?

for (auto detection_sorted_index : detection_sorted_indices)
{
  const ImageEvaluation &evaluation =
      evaluations[evaluation_indices[detection_sorted_index]];
  const auto num_detections =
      evaluation.detection_matches.size() / num_iou_thresholds;
  const auto detection_index = iou_threshold_index * num_detections +
      image_detection_indices[detection_sorted_index];
  assert(evaluation.detection_matches.size() > detection_index);
  assert(evaluation.detection_ignores.size() > detection_index);
  const int64_t detection_match =
      evaluation.detection_matches[detection_index];
  const bool detection_ignores =
      evaluation.detection_ignores[detection_index];
  const auto true_positive = detection_match > 0 && !detection_ignores;
  const auto false_positive = detection_match == 0 && !detection_ignores;
  if (true_positive)
  {
    ++true_positives_sum;
  }
  if (false_positive)
  {
    ++false_positives_sum;
  }
  const double recall =
      static_cast<double>(true_positives_sum) / num_valid_ground_truth;
  recalls->emplace_back(recall);
  const int64_t num_valid_detections =
      true_positives_sum + false_positives_sum;
  const double precision = num_valid_detections > 0
      ? static_cast<double>(true_positives_sum) / num_valid_detections
      : 0.0;
  precisions->emplace_back(precision);
}

MiXaiLL76 (Owner) commented:

You probably expect the output to be something like this:

[images: expected PR curves]

Overall, if this is the case, it's a small fix that I could include in the next update, with a flag, or with a check that automatically enables it when all the scores are the same.

[image]

MiXaiLL76 (Owner) commented:

@voegtlel Please check out PR #47 and see whether it does what you wanted.

MiXaiLL76 (Owner) commented:

@voegtlel Is this still relevant? Can you check whether the code from the PR is suitable for your tasks?

MiXaiLL76 (Owner) commented:

@voegtlel Is this still relevant?

voegtlel (Author) commented Nov 7, 2024

Hi @MiXaiLL76, sorry for not responding; I was out for a while. It is still relevant for us, and thank you very much for the quick response and even an implementation! We're still discussing how the PR-curve, and thus the AP, should look for predictions collapsed to one point (i.e. all scores equal) and what the AP should be. It seems non-trivial to derive.

A bit of background on the discussion: The original formulation (Definition 1) of AP states:

$AP = \frac{1}{1-0} \int_0^1 p(r)dr$

discretizing it to

$AP = \sum_{i=1}^n p(i) \Delta r(i)$

Now, with multiple predictions collapsing to the same score, that sum is no longer well defined. Two points on the curve are certain:
$(recall=0, precision=1)$ and $(recall=R, precision=P)$ (where $P$ is the precision over all those predictions and $R$ is the recall over all of them).

For consistency with the existing implementation, your version would probably be the suitable one (i.e. $precision=1$ for $recall<R$, $0$ for $recall>R$). But that would mean $AP=R$, which is very strange.

We believe that for $recall<R$, $precision=P$. The question is what holds for $recall>R$.

We discussed several options for $recall>R$:

  • $precision=0$ (which means $AP=P\cdot R$)
  • $precision=P$ (which means $AP=P$)
  • $precision$ is undefined (so the integral can only be computed up to $R$, which also means $AP=P$, because the average is then taken over the smaller interval)

In general, one could argue that with only one point on the curve, approximating the integral by that sum does not make sense at all; it also somewhat contradicts the original purpose of mAP. Still, to remain comparable, it should give a useful value.
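To make the three options concrete, here is a tiny numeric illustration using the pred1 example from above (2 TP, 1 FP, 3 ground truths, so $P = R = 2/3$); the variable names are just labels for the options listed above:

# Illustrative only: P and R from the example above (2 TP, 1 FP, 3 GT).
P, R = 2 / 3, 2 / 3

ap_option_1 = P * R   # precision = 0 beyond R                 -> AP = P*R ~= 0.444
ap_option_2 = P       # precision = P beyond R                 -> AP = P   ~= 0.667
ap_option_3 = P       # precision undefined beyond R; integrate
                      # only up to R and divide by R           -> AP = P
ap_variant_above = R  # precision = 1 below R, 0 beyond        -> AP = R   ~= 0.667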

I'd be interested to hear about your or even more other opinions.

MiXaiLL76 (Owner) commented Nov 7, 2024

Looks interesting; I would try to implement this in my library.
Could you give more examples for each specific case? (I am not a mathematician by profession, and it is much easier for me to understand formulas and problems when there are data examples.)
This will help me with testing. They can be in the same format as given above.

In general, I am even ready to implement separate C++ functions for this task and provide the ability to call them via parameters.

My current implementation, which I showed in the PR, is just a prototype, but based on it I believe that we will be able to implement what we need.

voegtlel (Author) commented Nov 8, 2024

Ah sorry, I didn't make my point clear enough: one of those variants should be correct, but we cannot settle on which one it is.

Also referencing this paper: https://arxiv.org/abs/2403.13064. In G.4, Fig. 14, they arrive at a plot that resembles $AP=P\cdot R$.

So, right now, I still cannot tell you which PR-curve is theoretically correct. I guess we should really settle that from a theoretical standpoint first, and then fix the functions. Maybe others will find this issue and have a better argument for one of the variants.

MiXaiLL76 (Owner) commented:

My colleague writes:

It seems to me that it is simply incorrect to assign score = 1 to all detections here.
I will justify it. Imagine we are comparing two algorithms, one with scores and one without. Say we have chosen a point on the recall-precision graph of the algorithm that has scores. How can we compare it at this point with an algorithm that has no scores for its detections? Since an algorithm without scores has only one recall-precision pair, it seems obvious to take that value for the comparison. It follows that at any point on the recall-precision graph we compare against the same recall-precision value, which means we see a rectangular area on the graph. The area of the rectangle is the product of its sides, in our case Recall * Precision. Something like that.

There is also the option of computing a static precision/recall from the numbers of TP, FP and FN and, instead of AP, reporting F1. With this F1 you can, for example, average over the IoU threshold grid. I think this is a perfectly reasonable option for the case where there are no confidence scores.
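A small sketch of that suggestion (the helper name and the counts are illustrative, not part of any existing API): compute F1 from the TP/FP/FN counts at each IoU threshold and average over the grid.

import numpy as np


def mean_f1_over_iou(tp, fp, fn):
    # tp, fp, fn: per-IoU-threshold counts from matching without scores.
    tp, fp, fn = (np.asarray(x, dtype=float) for x in (tp, fp, fn))
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    denom = np.maximum(precision + recall, np.finfo(float).eps)
    f1 = 2 * precision * recall / denom
    return f1.mean()


# Example: identical counts at the 10 IoU thresholds 0.50, 0.55, ..., 0.95.
print(mean_f1_over_iou(tp=[2] * 10, fp=[1] * 10, fn=[1] * 10))  # 0.666...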

voegtlel (Author) commented Nov 14, 2024

Thanks for this opinion!

I agree in general that the mAP of a model that over-predicts boxes but provides scores is not easily comparable to that of a model that predicts only the boxes directly, without scores.

In terms of the PR-curve, my personal opinion is that it should look the way your colleague describes, i.e. a rectangle with corners $(0, 0), (0, P), (R, P), (R, 0)$ (with $P, R$ being the precision and recall of the single computed point). But I'd say that for $recall>R$ the value is actually not defined, which effectively breaks the PR-curve. One could assume $precision=0$ for $recall>R$ (which is what COCO-mAP does), but imho that's not correct. It also does not feel correct to assume NaN and exclude it from the mean (that would mean $mAP=P \cdot R / R=P$, because you only integrate over the defined range).

I experimented a bit with the comparability of $mAP$ and $mP$, and to me it does not seem to compare fairly, no matter how you compute the $mAP$ without scores. E.g. when plotting the PR-curve, we see that our P/R point lies far outside the comparison curve, but as soon as we introduce artificial scores it gets complicated: because of averaging over classes, each of which has its own PR point, the curve is dragged towards 0 in steps, one per class, since there are no values for $recall>R_{i}$. Averaging over IoU thresholds then worsens that effect even more. So effectively, comparability is lost.

To summarize: I now believe it is better to completely disallow missing or all-equal scores. I suggest that the COCO metrics should assert that the scores are all different when computing, and otherwise raise an exception.
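A minimal sketch of the kind of guard I mean (hypothetical, not an existing faster-coco-eval API): refuse to compute a score-ranked AP when the scores carry no ranking information.

import numpy as np


def check_scores(detections):
    scores = np.array([d["score"] for d in detections], dtype=float)
    if scores.size > 1 and np.unique(scores).size == 1:
        raise ValueError(
            "All detection scores are equal; the score-ranked PR curve and "
            "hence mAP are not well defined. Provide real confidence scores "
            "or use a score-free metric such as precision/recall/F1."
        )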

What do you think? And what does your colleague think about this?

More explanation as pictures:

This plot shows an imaginary PR-plot for a single class. Red is a predictor that does not over-predict boxes, with artificially introduced scores just to get a curve; its last point is the actual precision/recall when taking all boxes. Blue is a comparable predictor that over-predicts boxes and provides scores for choosing a threshold.
[image: PR plot for the first class]

This is an imaginary PR-plot for another class:
[image: PR plot for the second class]

Now, if you average over these classes, you get something like this (the star reflects the average of the PR for the classes of the red predictor):
[image: class-averaged PR plot]
(Sorry for my bad paint skills)

Computing the AP of that will imho not reflect the quality of each predictor (the red one can be much better than the blue one), but using the average PR point of the red predictor also seems unfair.

MiXaiLL76 (Owner) commented:

To summarize: I now believe it is better to completely disallow missing or all-equal scores. I suggest that the COCO metrics should assert that the scores are all different when computing, and otherwise raise an exception.
What do you think? And what does your colleague think about this?

In general, this can be done, and it is a fairly logical option: colleagues from torchmetrics ran into exactly this problem in their tests (they submitted the same score everywhere) and got different results.

Perhaps it is worth informing the user about such a problem if it occurs.

But what do we do about the fact that your model does not produce scores? How will you evaluate it? Do you want me to help integrate such test metrics into my library?
