Hi @AbhilashaRavichander,

I would like to ask a question regarding the F1 evaluation metric used in your paper (similar to #3). The paper mentions that the "average of the maximum F1 from each n−1 subset" is used to compute the F1 metric. I am not entirely sure how this works, but I think it could mean the following:
For each classification output, compare the predicted label against the labels from the annotators. Compute the maximum F1 per sample (which should be the same as accuracy), as shown in the example below:
| Sample | Predicted Label | Ann1 | Ann2 | Ann3 | Maximum F1 |
|--------|-----------------|------|------|------|------------|
| 1      | Relevant        | Irrelevant | None | Irrelevant | 0 |
| 2      | Relevant        | Relevant | Relevant | Relevant | 1 |
| 3      | Irrelevant      | None | Irrelevant | Relevant | 1 |
Take the average of all maximum F1 scores: (0 + 1 + 1) / 3 = 2/3 ≈ 0.67
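
In code, my interpretation would look roughly like the sketch below. This is just my reading of the metric, not the paper's reference implementation; it assumes that the per-sample F1 against a single annotator label reduces to exact match (1.0 on a match, 0.0 otherwise), with the maximum taken over annotators and the result averaged over samples.

```python
def max_f1_per_sample(predicted: str, annotator_labels: list[str]) -> float:
    # With one predicted label and one gold label, F1 is 1.0 on a match and
    # 0.0 otherwise, so the per-sample maximum is effectively "any match".
    return max(1.0 if predicted == gold else 0.0 for gold in annotator_labels)

def average_max_f1(predictions: list[str], annotations: list[list[str]]) -> float:
    # Average the per-sample maxima over the whole evaluation set.
    scores = [max_f1_per_sample(p, golds) for p, golds in zip(predictions, annotations)]
    return sum(scores) / len(scores)

# Values from the example table above.
predictions = ["Relevant", "Relevant", "Irrelevant"]
annotations = [
    ["Irrelevant", "None", "Irrelevant"],
    ["Relevant", "Relevant", "Relevant"],
    ["None", "Irrelevant", "Relevant"],
]
print(average_max_f1(predictions, annotations))  # (0 + 1 + 1) / 3 ≈ 0.67
```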
Is my understanding of the evaluation metric correct?
Thank you for your time.