Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ExpressionFunctionValues should cache per-hit value [LUCENE-8574] #9620

Closed
asfimport opened this issue Nov 25, 2018 · 21 comments
Closed

ExpressionFunctionValues should cache per-hit value [LUCENE-8574] #9620

asfimport opened this issue Nov 25, 2018 · 21 comments

Comments

@asfimport
Copy link

asfimport commented Nov 25, 2018

The original version of ExpressionFunctionValues had a simple per-hit cache, so that nested expressions that reference the same common variable would compute the value for that variable the first time it was referenced and then use that cached value for all subsequent invocations, within one hit.  I think it was accidentally removed in #8660?

This is quite important if you have non-trivial expressions that reference the same variable multiple times.

E.g. if I have these expressions:

x = c + d
c = b + 2 
d = b * 2

Then evaluating x should only cause b's value to be computed once (for a given hit), but today it's computed twice.  The problem is combinatoric if b then references another variable multiple times, etc.

I think to fix this we just need to restore the per-hit cache?


Migrated from LUCENE-8574 by Michael McCandless (@mikemccand), resolved Jul 09 2020
Attachments: LUCENE-8574.patch, unit_test.patch
Pull requests: apache/lucene-solr#1560, apache/lucene-solr#1613

@asfimport
Copy link
Author

Robert Muir (@rmuir) (migrated from JIRA)

+1, I remember jack added this explicitly to prevent the situation where the same subexpr gets called multiple times for the same document and it wastefully recomputes the whole thing more than once.

@asfimport
Copy link
Author

Alan Woodward (@romseygeek) (migrated from JIRA)

There's a top-level cache in ExpressionValueSource but I think I missed the per-hit cache.  It should be easy enough to add a caching DoubleValues implementation and wrap each variable.

@asfimport
Copy link
Author

Robert Muir (@rmuir) (migrated from JIRA)

I don't think we need any caching DoubleValues implementation. We can just simply put the code back.

@asfimport
Copy link
Author

Robert Muir (@rmuir) (migrated from JIRA)

Hopefully this is enough? If we have to handle the case where advanceExact() is called over and over again we have to add 'currentDoc' back too so it works more like jack's original cache. But we should avoid adding abstractions here that will make things harder on the compiler.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

+1, patch looks good.

@asfimport
Copy link
Author

asfimport commented Jun 12, 2020

ASF subversion and git services (migrated from JIRA)

Commit 2991acf in lucene-solr's branch refs/heads/master from Patrick Zhai
https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2991acf

#10431: Upgrade HPPC to 0.8.2 (#1560)

@asfimport
Copy link
Author

asfimport commented Jun 12, 2020

Patrick Zhai (@zhaih) (migrated from JIRA)

Sry, wrong commit message pointed to here, the correct issue should be #10431.

BTW, is this patch ever merged?

@asfimport
Copy link
Author

Patrick Zhai (@zhaih) (migrated from JIRA)

I've checked the current release and couldn't see this patch merged. And I think there's no other changes introducing similar functionality (not so sure). Probably we should merge this?

@asfimport
Copy link
Author

Patrick Zhai (@zhaih) (migrated from JIRA)

I've attached a unit test showing a case that current code could not handle. And it seems the patch attached to this issue could not handle it as well (since DoubleValues generated for the same LeafReaderContext is not the same, we still get tons of DoubleValues created).

@asfimport
Copy link
Author

Robert Muir (@rmuir) (migrated from JIRA)

please, lets keep the boolean and not bring NaN into this.

@asfimport
Copy link
Author

Patrick Zhai (@zhaih) (migrated from JIRA)

Ah, yes will use boolean instead of NaN. I'm just verifying whether the patch works so quickly inserted few lines of code without much thinking.

But how should we fix this issue correctly? Since the easy fix patch seems not solving the problem.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

please, lets keep the boolean and not bring NaN into this.

+1

And it seems the patch attached to this issue could not handle it as well (since DoubleValues generated for the same LeafReaderContext is not the same, we still get tons of DoubleValues created).

Hmm, good catch!  So we somehow need to ensure that we use the same DoubleValues instance per-segment per-binding?  But how can we safely do that, i.e. we can't know that this current caller will consume the same DoubleValues in the same docid progression?

@asfimport
Copy link
Author

Patrick Zhai (@zhaih) (migrated from JIRA)

Made a PR(apache/lucene-solr#1613) about this issue

Basically making a new DoubleValuesSource class which pass in valueCache to a custom getValues to enforce only one DoubleValue per name along the whole generation process.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

@zhaih does this fix your combinatoric adversarial test case (testFibonacciExpr)?

I.e. before the fix, that test runs for way too long (days), but after, it runs quickly?

I left some small comments on the PR.  Thanks for tackling this!

@asfimport
Copy link
Author

Patrick Zhai (@zhaih) (migrated from JIRA)

@mikemccand Yes it fixes the test case. Before the fix the test will cause OutOfMemoryException and after the fix it finishes in a reasonable time (like < 1s when I run it solely)

Thanks for reviewing that, I've addressed those comments!

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks @zhaih I like that test case!  Fibonacci should be simple computation :)

I think the PR looks great.  Since the pre-existing ArrayList issue is now fixed we just need to rebase and let the tests run again.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks @zhaih, the PR looks great, and now passes precommit.  I'll commit soon!

@asfimport
Copy link
Author

ASF subversion and git services (migrated from JIRA)

Commit 60e0d8a in lucene-solr's branch refs/heads/master from Michael McCandless
https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=60e0d8a

LUCENE-8574: the DoubleValues for dependent bindings for an expression are now cached and reused and no longer inefficiently recomputed per hit

@asfimport
Copy link
Author

ASF subversion and git services (migrated from JIRA)

Commit 1791d7e in lucene-solr's branch refs/heads/branch_8x from Michael McCandless
https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=1791d7e

LUCENE-8574: the DoubleValues for dependent bindings for an expression are now cached and reused and no longer inefficiently recomputed per hit

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks @zhaih!

@asfimport
Copy link
Author

Adrien Grand (@jpountz) (migrated from JIRA)

Closing after the 9.0.0 release

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment