A repository to understand ranking metrics as described by Musgrave et al. (2020).
Resources used:
Other ranking metrics are described in "Assessing ranking metrics in top-N recommendation" by Valcarce et al. (2020). These are largely not used here, but they give a good introduction to established ranking metrics. In this repository we only use recall@k, since it combines well with confidence values, for example in an ERC (error vs. reject curve). Researchers additionally use MAP@R in combination with the ERC.
The following examples replicate the toy example from Musgrave et al.'s "A Metric Learning Reality Check". The plots are generated by running the tests in test_reality_check.py. The examples show how MAP@R rewards well-clustered embedding spaces.
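For reference, Musgrave et al. define MAP@R per query as the average precision over the first $R$ retrieved neighbours, where $R$ is the number of references that share the query's class:

$$
\text{MAP@R} = \frac{1}{R} \sum_{i=1}^{R} P(i), \qquad
P(i) = \begin{cases} \text{precision at } i & \text{if the } i\text{-th nearest neighbour has the query's class} \\ 0 & \text{otherwise} \end{cases}
$$

The reported score is the mean over all queries. It only reaches 1 when all $R$ same-class samples are ranked ahead of everything else, which is why a well-clustered embedding space scores higher than one that is merely separable.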
The code for calculating the metrics can be found in embed_metrics.py; thanks to faiss it is quite short, since faiss takes care of finding the nearest neighbors for each query.
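As a rough sketch of how this can be done (illustrative only, not the actual code in embed_metrics.py; the function names and the handling of self-matches are assumptions), with an (N, D) float array of embeddings and an (N,) array of integer class labels:

```python
import faiss
import numpy as np

def knn_labels(embeddings: np.ndarray, labels: np.ndarray, k: int) -> np.ndarray:
    """Labels of the k nearest neighbours of every sample, with the self-match removed."""
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings.astype(np.float32))
    _, idx = index.search(embeddings.astype(np.float32), k + 1)  # +1: each query finds itself first
    return labels[idx[:, 1:]]

def recall_at_1(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of queries whose single nearest neighbour shares their label."""
    neighbour_labels = knn_labels(embeddings, labels, k=1)
    return float(np.mean(neighbour_labels[:, 0] == labels))

def map_at_r(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """Mean average precision at R, with R = number of other samples of the query's class."""
    counts = np.bincount(labels)
    neighbour_labels = knn_labels(embeddings, labels, k=int(counts.max()) - 1)
    scores = []
    for query_label, ranked in zip(labels, neighbour_labels):
        r = counts[query_label] - 1
        if r == 0:
            continue  # singleton classes have no valid query
        rel = (ranked[:r] == query_label).astype(np.float64)
        precision_at_i = np.cumsum(rel) / np.arange(1, r + 1)
        scores.append(float(np.sum(precision_at_i * rel) / r))
    return float(np.mean(scores))
```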
As mentioned above, recall@1 and MAP@R can be used to study the effect of model confidences or uncertainties. The underlying assumption is: if a model can properly predict confidence values on ambiguous inputs, then excluding low-confidence samples will increase the metric. This can be shown with an error vs. reject curve (ERC).
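A minimal sketch of such an error vs. reject curve, assuming one confidence value per sample and a per-sample error signal (for example, whether the nearest neighbour has the wrong class); the exact error definition and the rejection grid are assumptions, not the repository's implementation:

```python
import numpy as np

def error_vs_reject_curve(per_sample_error: np.ndarray, confidence: np.ndarray,
                          steps: int = 100):
    """Reject the least confident samples first and report the mean error on the rest."""
    order = np.argsort(-confidence)                  # most confident first
    sorted_error = per_sample_error[order].astype(np.float64)
    reject_rates = np.linspace(0.0, 0.99, steps)
    errors = np.array([
        sorted_error[:max(int(round(len(sorted_error) * (1.0 - r))), 1)].mean()
        for r in reject_rates
    ])
    return reject_rates, errors
```

If the confidences are informative, the curve decreases as more low-confidence samples are rejected; a flat or rising curve means the confidences tell us nothing about where the errors are.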
The following example shows three different metrics, all using confidences as an indicator of the embedding space's cluster quality.
In the plots above, the opacity corresponds to the simulated confidences. A few errors were injected, for which we control the confidence, so we can see how the scores behave when we change the confidences of these erroneous samples.
As we increase the model's confidence along the x-axis, the scores drop and the errors increase, because the model becomes increasingly confident on the erroneous samples. All other confidences are kept the same. For more detail on the illustrations, see the tests in test_uncertainty.py.
Please note that Confidence vs. Recall@1 only works with confidences in the probabilistic range $ c \in [0, 1] $. ERCs will still work, since they only sort the samples by confidence (regardless of range).
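To make the difference concrete: a Confidence vs. Recall@1 curve, as sketched below, evaluates recall only on queries above fixed absolute thresholds, which presumes $ c \in [0, 1] $, whereas the ERC above relies only on the ordering of the confidences. Here `nn_correct` marks whether each query's nearest neighbour has the correct class, and the threshold grid is an assumption:

```python
import numpy as np

def recall_at_1_vs_confidence(nn_correct: np.ndarray, confidence: np.ndarray,
                              thresholds=None):
    """Recall@1 restricted to queries whose confidence exceeds each absolute threshold.
    Only meaningful when the confidences live in [0, 1]."""
    if thresholds is None:
        thresholds = np.linspace(0.0, 0.9, 10)
    recalls = np.array([
        nn_correct[confidence >= t].mean() if np.any(confidence >= t) else np.nan
        for t in thresholds
    ])
    return thresholds, recalls
```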
In probabilistic embeddings, models estimate posterior distributions in the embedding space. Credible intervals over the confidence parameter can then be used to show how well models retrieve similar data (same class or same instance).
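One possible reading of this, sketched under the assumption that the model outputs a predictive mean and standard deviation for each query's confidence (the interval construction below is an assumption, not the repository's implementation):

```python
import numpy as np

def recall_within_credible_interval(nn_correct: np.ndarray, conf_mean: np.ndarray,
                                    conf_std: np.ndarray, z: float = 1.96,
                                    level: float = 0.5) -> float:
    """Recall@1 over the queries whose lower credible bound (mean - z * std)
    exceeds `level`, i.e. queries the model is reliably confident about.
    z = 1.96 corresponds to a ~95% interval under a Gaussian assumption."""
    lower_bound = conf_mean - z * conf_std
    kept = nn_correct[lower_bound >= level]
    return float(kept.mean()) if kept.size else float("nan")
```

Sweeping `level` (or `z`) then yields a curve over the confidence parameter, analogous to the threshold curve above.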