
518 fix AP Scorer #519
Merged (9 commits) · Jul 18, 2022

Conversation

@david-fisher (Contributor) commented May 25, 2022

Description

Fix computation of AP to match trec_eval.
Update the ScorerFactory default scorer to use the new implementation.

Motivation and Context

AP is the sum of the precisions at each rank where a relevant document occurs, divided by the total number of relevant documents (any relevant documents not in the ranked list are treated as occurring at rank infinity, i.e. with a precision of 0).

closes #518
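
For illustration, a rough sketch of the AP computation described above (hedged JavaScript; this is not the actual scorer code added in this PR, and the function and parameter names are made up):

    // A sketch of trec_eval-style AP. `rankedDocIds` is the ranked result list and
    // `judgments` maps doc id -> judgment value; a doc counts as relevant when its
    // judgment is > 0, and unjudged docs are treated as having a judgment of 0.
    function averagePrecision(rankedDocIds, judgments) {
      var totalRelevant = 0;
      Object.keys(judgments).forEach(function (id) {
        if (judgments[id] > 0) { totalRelevant++; }
      });
      if (totalRelevant === 0) { return 0; }

      var relevantSeen = 0;
      var sumOfPrecisions = 0;
      rankedDocIds.forEach(function (id, i) {
        if (judgments[id] > 0) {
          relevantSeen++;
          sumOfPrecisions += relevantSeen / (i + 1); // precision at this rank
        }
      });
      // Relevant docs missing from the ranked list add nothing to the sum, which is
      // equivalent to placing them at rank infinity with a precision of 0.
      return sumOfPrecisions / totalRelevant;
    }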

How Has This Been Tested?

Manually verified the AP computation in Quepid against the output of trec_eval.

Screenshots or GIFs (if appropriate):

Types of changes

  • [x] Bug fix (non-breaking change which fixes an issue)
  • [ ] Improvement (non-breaking change which improves existing functionality)
  • [ ] New feature (non-breaking change which adds new functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • [x] My code follows the code style of this project.
  • [ ] My change requires a change to the documentation.
  • [ ] I have updated the documentation accordingly.
  • [x] I have read the CONTRIBUTING document.
  • [ ] I have added tests to cover my changes.
  • [x] All new and existing tests passed.

@david-fisher added the "bug (Something isn't working)" label May 25, 2022
@david-fisher requested a review from epugh May 25, 2022 12:15
@epugh requested reviews from nathancday and worleydl May 25, 2022 15:36
@epugh (Member) commented May 25, 2022

@worleydl @nathancday would love your input on this as you guys have poked a bit at AP as well....

@epugh (Member) commented May 25, 2022

Do we still need/want #516 if we have this???

}, k);

var score = total / k;
// count up the total number of relevant (not judged) documents
Member:

what does "(not judged)" mean here?

Contributor Author:

The set of judged documents includes those that have been judged irrelevant. The divisor is |R|, where R is the set of all documents d such that judgment(d) > 0.
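
For example (a tiny illustrative sketch with a made-up judgments map, not Quepid's actual data structure):

    // |R|: only judged documents with a judgment greater than 0 count as relevant.
    var judgments = { d1: 2, d2: 0, d3: 1 };   // d2 is judged but irrelevant
    var numRelevant = Object.keys(judgments).filter(function (id) {
      return judgments[id] > 0;
    }).length;                                  // |R| = 2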

Member:

I think the comment is confusing to me: how is a doc known to be relevant if it is not judged?

Contributor Author:

A doc in the set of all judged documents may have a score of 0, which means it is not relevant. This is strictly an issue with the data structure in Quepid.

Member:

is this something else that we need to fix then?

Contributor Author:

No. Counting how many relevant documents there are is only necessary when computing AP or recall, and it is not expensive. Having the documents with judgments of 0 in the set of all judged documents is just fine (and is probably used in other parts of the UI).

@nathancday (Member) commented May 25, 2022

I like this PR for the trec_eval AP, thanks David; I left some code nits. I think the TREC version should be the default AP.

I'd like to keep the legacy implementation, though, for side-by-side comparison; we can name it something else, maybe ap_rolling. IMO the TREC AP version is more useful when you can assume all of the relevant docs are labeled, which TREC does via pooling, but I don't think we can always assume that in our minimal test collections.

'eachDoc(function(doc, i) {',
'if (hasDocRating(i) && (docRating(i)) > 0) {',
'count++;',
'total += count/(i+1)',
Member:

Can you use total += avgRating(i+1) and not need the count var?

Contributor Author:

No. avgRating, which uses baseAvg, sums up the raw judgment values. AP computed on multivalued (graded) relevance judgments first converts those judgments to binary values, so using avgRating would not compute AP.
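
To illustrate the difference (a hypothetical sketch with made-up graded ratings; this is not the Quepid helper code):

    var ratings = [3, 0, 1];   // graded judgments for the top 3 ranked docs (made up)

    // avgRating-style: sums the raw judgment values, e.g. (3 + 0 + 1) / 3 = 1.33
    var sumBased = ratings.reduce(function (a, b) { return a + b; }, 0) / ratings.length;

    // AP-style: judgments are binarized first (rating > 0 means relevant), and only the
    // precisions at the relevant ranks are summed: P@1 = 1/1 and P@3 = 2/3 here.
    var binary = ratings.map(function (r) { return r > 0 ? 1 : 0; });   // [1, 0, 1]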

Member:

Good point. I was assuming it would always be a binary scale, in which case it would be the same as P@k.

@david-fisher (Contributor Author) replied:

> I like this PR for the trec_eval AP, thanks David; I left some code nits. I think the TREC version should be the default AP.
>
> I'd like to keep the legacy implementation, though, for side-by-side comparison; we can name it something else, maybe ap_rolling. IMO the TREC AP version is more useful when you can assume all of the relevant docs are labeled, which TREC does via pooling, but I don't think we can always assume that in our minimal test collections.

I'm not sure I understand what you mean by AP rolling. What is that metric supposed to be modeling? The evaluation-metric community has produced a variety of metrics to model different use cases, such as reciprocal rank to evaluate known-item (one-shot) search, and the various gain metrics for information-gathering searches. AP's intent is to capture performance at 100% recall, and it is used to interpolate precision values at fixed recall percentages, allowing us to predict what the precision of the system will be at, say, 50% recall.
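
For reference, the standard interpolated-precision rule takes, for a recall level r, the maximum precision observed at any recall >= r. A minimal sketch (hypothetical names, not Quepid code):

    // points: [{recall: 0.1, precision: 0.8}, ...] for one query.
    function interpolatedPrecision(points, r) {
      var best = 0;
      points.forEach(function (p) {
        if (p.recall >= r && p.precision > best) { best = p.precision; }
      });
      return best;
    }
    // e.g. interpolatedPrecision(points, 0.5) estimates precision at 50% recall.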

The math for all of the evaluation metrics assumes that R, the set of relevant documents, is complete. All unjudged documents are included in the set of irrelevant documents, with a judgment value of 0.

Pooling to gather relevance judgments is a mechanism for maximizing the utility of the judges' effort. It does not have anything to do with the expected size of the set R. When used, it does make it more likely that the retrieved documents of all of the systems being evaluated, for the specific TREC track submission set being judged, will be well represented among the judgments.

@nathancday (Member) commented:

My goal with the rolling-AP idea was to have an inherent positional discount for binary rating scales. I wrote this code before I learned about trec_eval and incorrectly assumed average precision was the average of the precision values leading up to @k, i.e. sum(P@1, ..., P@k) / k, rather than only the precisions at the ranks of relevant documents. I think there is similar utility in the output values.
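
To make the two definitions concrete (a hypothetical sketch over the same binary relevance list; neither snippet is the actual Quepid scorer):

    var relevant = [1, 0, 1];          // binary relevance of the top k docs (made up)
    var k = relevant.length;

    // "Rolling" AP as described above: sum(P@1, ..., P@k) / k, counting every rank.
    var hits = 0, rollingSum = 0;
    relevant.forEach(function (rel, i) {
      hits += rel;
      rollingSum += hits / (i + 1);    // P@(i+1), whether or not this rank is relevant
    });
    var rollingAP = rollingSum / k;    // (1/1 + 1/2 + 2/3) / 3 ≈ 0.72

    // trec_eval AP: sum precisions only at relevant ranks, divide by |R|.
    var seen = 0, apSum = 0;
    relevant.forEach(function (rel, i) {
      if (rel > 0) {
        seen++;
        apSum += seen / (i + 1);
      }
    });
    var trecAP = apSum / 2;            // |R| = 2 here: (1/1 + 2/3) / 2 ≈ 0.83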

I understood the goal of pooling to be capturing all of the relevant documents by combining diverse rankers, with the assumption that anything not captured by any ranker is not relevant. I'm skeptical about getting this exhaustive level of judgments on minimal collections for clients. I'm also skeptical about treating unjudged docs as irrelevant in these minimal collections; there are various methods published in that area. But I agree most IR metrics assume exhaustive recall.

Thanks for bringing Quepid into line with TREC for AP. It looks great; looking forward to learning with you.

@nathancday self-requested a review May 26, 2022 13:22
@epugh temporarily deployed to quepid-br-518-fix-ap June 3, 2022 22:19
Labels: bug (Something isn't working)

Successfully merging this pull request may close these issues: AP@10 computation is incorrect