
518 fix AP Scorer #519
Merged (9 commits) · Jul 18, 2022

Conversation

@david-fisher (Contributor) commented May 25, 2022

Description

Fix computation of AP to match trec_eval.
Update the ScorerFactory default scorer to use the new implementation.

Motivation and Context

AP is the sum of the precisions at each rank where a relevant document occurs, divided by the total number of relevant documents (any relevant documents not in the ranked list are treated as occurring at rank infinity, i.e. with a precision of 0).

closes #518
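
For illustration, a rough sketch of the AP computation described above (hedged JavaScript; this is not the actual scorer code added in this PR, and the function and parameter names are made up):

    // A sketch of trec_eval-style AP. `rankedDocIds` is the ranked result list and
    // `judgments` maps doc id -> judgment value; a doc counts as relevant when its
    // judgment is > 0, and unjudged docs are treated as having a judgment of 0.
    function averagePrecision(rankedDocIds, judgments) {
      var totalRelevant = 0;
      Object.keys(judgments).forEach(function (id) {
        if (judgments[id] > 0) { totalRelevant++; }
      });
      if (totalRelevant === 0) { return 0; }

      var relevantSeen = 0;
      var sumOfPrecisions = 0;
      rankedDocIds.forEach(function (id, i) {
        if (judgments[id] > 0) {
          relevantSeen++;
          sumOfPrecisions += relevantSeen / (i + 1); // precision at this rank
        }
      });
      // Relevant docs missing from the ranked list add nothing to the sum, which is
      // equivalent to placing them at rank infinity with a precision of 0.
      return sumOfPrecisions / totalRelevant;
    }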

How Has This Been Tested?

Manually verified the AP computation in Quepid against the output of trec_eval.

Screenshots or GIFs (if appropriate):

Types of changes

  • [x] Bug fix (non-breaking change which fixes an issue)
  • [ ] Improvement (non-breaking change which improves existing functionality)
  • [ ] New feature (non-breaking change which adds new functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • [x] My code follows the code style of this project.
  • [ ] My change requires a change to the documentation.
  • [ ] I have updated the documentation accordingly.
  • [x] I have read the CONTRIBUTING document.
  • [ ] I have added tests to cover my changes.
  • [x] All new and existing tests passed.

@david-fisher added the "bug (Something isn't working)" label May 25, 2022
@david-fisher requested a review from epugh May 25, 2022 12:15
@epugh requested reviews from nathancday and worleydl May 25, 2022 15:36
@epugh (Member) commented May 25, 2022

@worleydl @nathancday would love your input on this as you guys have poked a bit at AP as well....

@epugh (Member) commented May 25, 2022

Do we still need/want #516 if we have this???

}, k);

var score = total / k;
// count up the total number of relevant (not judged) documents
Member:

what does "(not judged)" mean here?

Contributor Author:

The set of judged documents includes those that have been judged irrelevant. The divisor is |R|, where R is the set of all documents d such that judgment(d) > 0.
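
For example (a tiny illustrative sketch with a made-up judgments map, not Quepid's actual data structure):

    // |R|: only judged documents with a judgment greater than 0 count as relevant.
    var judgments = { d1: 2, d2: 0, d3: 1 };   // d2 is judged but irrelevant
    var numRelevant = Object.keys(judgments).filter(function (id) {
      return judgments[id] > 0;
    }).length;                                  // |R| = 2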

Member:

I think the comment is confusing to me: how is a doc known to be relevant if it is not judged?

Contributor Author:

A doc in the set of all judged documents may have a score of 0, which means it is not relevant. This is strictly an issue with the data structure in Quepid.

Member:

is this something else that we need to fix then?

Contributor Author:

No. Counting how many relevant documents there are is only necessary when computing AP or recall, and it is not expensive. Having the documents with judgments of 0 in the set of all judged documents is just fine (and is probably used in other parts of the UI).

@nathancday (Member) commented May 25, 2022

I like this PR for the trec_eval AP, thanks David; I left some code nits. I think the TREC version should be the default AP.

I'd like to keep the legacy implementation, though, for side-by-side comparison; we can name it something else, maybe ap_rolling. IMO the TREC AP version is more useful when you can assume all of the relevant docs are labeled, which TREC does via pooling, but I don't think we can always assume that in our minimal test collections.

'eachDoc(function(doc, i) {',
'if (hasDocRating(i) && (docRating(i)) > 0) {',
'count++;',
'total += count/(i+1)',
Member:

Can you use total += avgRating(i+1) and not need the count var?

Contributor Author:

No. avgRating, which uses baseAvg, sums up the raw judgment values. AP computed on multivalued (graded) relevance judgments first converts those judgments to binary values, so using avgRating would not compute AP.
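
To illustrate the difference (a hypothetical sketch with made-up graded ratings; this is not the Quepid helper code):

    var ratings = [3, 0, 1];   // graded judgments for the top 3 ranked docs (made up)

    // avgRating-style: sums the raw judgment values, e.g. (3 + 0 + 1) / 3 = 1.33
    var sumBased = ratings.reduce(function (a, b) { return a + b; }, 0) / ratings.length;

    // AP-style: judgments are binarized first (rating > 0 means relevant), and only the
    // precisions at the relevant ranks are summed: P@1 = 1/1 and P@3 = 2/3 here.
    var binary = ratings.map(function (r) { return r > 0 ? 1 : 0; });   // [1, 0, 1]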

Member:

Good point. I was assuming it would always be a binary scale, in which case it would be the same as P@k.

@david-fisher (Contributor Author) replied:

> I like this PR for the trec_eval AP, thanks David; I left some code nits. I think the TREC version should be the default AP.
>
> I'd like to keep the legacy implementation, though, for side-by-side comparison; we can name it something else, maybe ap_rolling. IMO the TREC AP version is more useful when you can assume all of the relevant docs are labeled, which TREC does via pooling, but I don't think we can always assume that in our minimal test collections.

I'm not sure I understand what you mean by AP rolling. What is that metric supposed to be modeling? The evaluation-metric community has produced a variety of metrics to model different use cases, such as reciprocal rank to evaluate known-item (one-shot) search, and the various gain metrics for information-gathering searches. AP's intent is to capture performance at 100% recall, and it is used to interpolate precision values at fixed recall percentages, allowing us to predict what the precision of the system will be at, say, 50% recall.
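
For reference, the standard interpolated-precision rule takes, for a recall level r, the maximum precision observed at any recall >= r. A minimal sketch (hypothetical names, not Quepid code):

    // points: [{recall: 0.1, precision: 0.8}, ...] for one query.
    function interpolatedPrecision(points, r) {
      var best = 0;
      points.forEach(function (p) {
        if (p.recall >= r && p.precision > best) { best = p.precision; }
      });
      return best;
    }
    // e.g. interpolatedPrecision(points, 0.5) estimates precision at 50% recall.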

The math for all of the evaluation metrics assumes that R, the set of relevant documents, is complete. All unjudged documents are included in the set of irrelevant documents, with a judgment value of 0.

Pooling to gather relevance judgments is a mechanism for maximizing the utility of the judges' effort. It does not have anything to do with the expected size of the set R. When used, it does make it more likely that the retrieved documents of all of the systems being evaluated, for the specific TREC track submission set being judged, will be well represented among the judgments.

@nathancday (Member) commented:

My goal with the rolling-AP idea was to have an inherent positional discount for binary rating scales. I wrote this code before I learned about trec_eval and incorrectly assumed average precision was the average of the precision values leading up to @k, i.e. sum(P@1, ..., P@k) / k, rather than only the precisions at the ranks of relevant documents. I think there is similar utility in the output values.
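
To make the two definitions concrete (a hypothetical sketch over the same binary relevance list; neither snippet is the actual Quepid scorer):

    var relevant = [1, 0, 1];          // binary relevance of the top k docs (made up)
    var k = relevant.length;

    // "Rolling" AP as described above: sum(P@1, ..., P@k) / k, counting every rank.
    var hits = 0, rollingSum = 0;
    relevant.forEach(function (rel, i) {
      hits += rel;
      rollingSum += hits / (i + 1);    // P@(i+1), whether or not this rank is relevant
    });
    var rollingAP = rollingSum / k;    // (1/1 + 1/2 + 2/3) / 3 ≈ 0.72

    // trec_eval AP: sum precisions only at relevant ranks, divide by |R|.
    var seen = 0, apSum = 0;
    relevant.forEach(function (rel, i) {
      if (rel > 0) {
        seen++;
        apSum += seen / (i + 1);
      }
    });
    var trecAP = apSum / 2;            // |R| = 2 here: (1/1 + 2/3) / 2 ≈ 0.83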

I understood the goal of pooling to be capturing all of the relevant documents by combining diverse rankers, with the assumption that anything not captured by any ranker is not relevant. I'm skeptical about getting this exhaustive level of judgments on minimal collections for clients. I'm also skeptical about treating unjudged docs as irrelevant in these minimal collections; there are various methods published in that area. But I agree most IR metrics assume exhaustive recall.

Thanks for bringing Quepid into line with TREC for AP. It looks great; looking forward to learning with you.

@nathancday self-requested a review May 26, 2022 13:22
@epugh temporarily deployed to quepid-br-518-fix-ap June 3, 2022 22:19
Labels: bug (Something isn't working)

Successfully merging this pull request may close these issues: AP@10 computation is incorrect