518 fix AP Scorer #519
Conversation
@worleydl @nathancday would love your input on this as you guys have poked a bit at AP as well.
Do we still need/want #516 if we have this?
}, k);

var score = total / k;
// count up the total number of relevant (not judged) documents
what does "(not judged)" mean here?
The set of judged documents include those that have been judged irrelevant. The divisor is |R|, where R is the set of all documents d such that judgment(d) > 0.
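The divisor described above can be sketched as follows. This is an illustrative snippet, not Quepid's actual code; the `countRelevant` helper and the judgment map are hypothetical names used only to show the |R| rule:

```javascript
// Sketch of the divisor |R|: count every document whose judgment is
// greater than 0. Documents judged 0 are in the judged set but are
// NOT relevant, so they do not contribute to |R|.
function countRelevant(judgments) {
  // judgments: map of docId -> numeric judgment value
  return Object.values(judgments).filter((j) => j > 0).length;
}

// d2 and d4 were judged, but judged irrelevant (0), so |R| = 2 here.
const judgments = { d1: 2, d2: 0, d3: 1, d4: 0 };
console.log(countRelevant(judgments)); // 2
```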
The comment is confusing to me: how is a doc known to be relevant if it is not judged?
A doc in the set of all judged documents may have a score of 0, which is not relevant. This is strictly an issue with the data structure in Quepid.
is this something else that we need to fix then?
No. Counting how many relevant documents there are is only necessary when computing AP or recall. It is not expensive. Having the documents with judgments of 0 in the set of all judged documents is just fine (and probably used in other parts of the UI).
I like this PR for the trec_eval AP, thanks David; I left some code nits. I think the TREC version should be the default AP. I'd like to keep the legacy implementation though for side by side; we can name it something else, maybe
'eachDoc(function(doc, i) {',
'if (hasDocRating(i) && (docRating(i)) > 0) {',
'count++;',
'total += count/(i+1)',
can you use total += avgRating(i+1) and not need the count var?
No. avgRating, which uses baseAvg, sums up the judgment values. AP computed on multivalued relevance judgments converts those judgments to binary values. Using avgRating would not compute AP.
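A small sketch of the distinction being made here. This is illustrative code, not Quepid's scorer; it only contrasts the binary count AP needs against an avgRating-style sum of raw judgment values:

```javascript
// Multivalued judgments at ranks 1..3. AP binarizes these: any
// judgment > 0 counts as relevant (value 1), regardless of magnitude.
const ratings = [3, 0, 1];

// What AP accumulates: precision at each rank holding a relevant doc.
let count = 0;
let apTotal = 0;
ratings.forEach((r, i) => {
  if (r > 0) {
    count++;
    apTotal += count / (i + 1); // precision at this rank
  }
});

// An avgRating-style helper would instead sum the raw judgment values.
let sum = 0;
ratings.forEach((r) => { sum += r; });

console.log(apTotal); // 1/1 + 2/3 ≈ 1.6667 (before dividing by |R|)
console.log(sum);     // 4 — a sum of grades, not a precision
```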
Good point. I was assuming it would always be a binary scale, which would be the same as P@k.
I'm not sure I understand what you mean by AP rolling. What is that metric supposed to be modeling? The evaluation metric community has produced a variety of metrics to model different use cases, such as reciprocal rank to evaluate known-item (one-shot) search, and the various gain metrics for information-gathering searches. AP's intent is to capture the performance at 100% recall, and it is used to enable interpolating precision values at fixed recall percentages, allowing us to predict, at 50% recall, what the precision of the system will be. The math for all of the evaluation metrics assumes that R, the set of relevant documents, is complete. All unjudged documents are included in the set of irrelevant documents, with a judgment value of 0. Pooling to gather relevance judgments is a mechanism for maximizing the utility of the judges. It does not have anything to do with the expected size of the set R. When used, it does make it more likely that the systems being evaluated, for the specific TREC track submission set that is being judged, will have good representation across the retrieved documents for all systems.
My goal with the rolling-AP idea was to have an inherent positional discount for binary rating scales. I wrote this code before I learned about trec_eval and incorrectly assumed average precision was the average of precision values leading up to @k. I understood the goal of pooling to be about capturing all relevant documents by combining diverse rankers, with the assumption that anything not captured by any ranker is not relevant. I'm skeptical about getting this exhaustive level of judgments on minimal collections for clients. I'm also skeptical about assuming unjudged docs are irrelevant in these minimal collections; there are various methods published in that area. But I agree most IR metrics assume exhaustive recall. Thanks for bringing Quepid into line with TREC for AP. It looks great, looking forward to learning with you.
Description
Fix computation of AP to match trec_eval.
Update the ScorerFactory default scorer to use the new implementation.
Motivation and Context
AP is the sum of the precisions at each rank where a relevant document occurs, divided by the total number of relevant documents (any not in the ranked list are treated as occurring at rank infinity with a precision of 0).
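The definition above can be illustrated with a short sketch. This mirrors the trec_eval convention described in this PR but is not the actual scorer code; `averagePrecision` and its arguments are illustrative names:

```javascript
// AP per the description above: sum precision at each rank holding a
// relevant document, then divide by the TOTAL number of relevant docs.
function averagePrecision(relevantAtRank, totalRelevant) {
  let count = 0;
  let total = 0;
  relevantAtRank.forEach((rel, i) => {
    if (rel) {
      count++;
      total += count / (i + 1); // precision at the rank of this relevant doc
    }
  });
  // Relevant docs missing from the ranked list contribute a precision
  // of 0 (rank infinity), so the divisor is still totalRelevant.
  return total / totalRelevant;
}

// Relevant docs at ranks 1 and 3; a third relevant doc was never retrieved.
console.log(averagePrecision([true, false, true], 3)); // (1/1 + 2/3) / 3 ≈ 0.5556
```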
closes #518
How Has This Been Tested?
Manually verified AP computation in Quepid compared to the output of trec_eval.