Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ESQL: Support for String and IP and boolean in the TOP aggregate function #110346

Open
2 of 3 tasks
nik9000 opened this issue Jul 1, 2024 · 2 comments
Open
2 of 3 tasks
Assignees
Labels
:Analytics/ES|QL AKA ESQL >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@nik9000
Copy link
Member

nik9000 commented Jul 1, 2024

The ESQL aggregate function TOP supports numerics and dates. We'd love to get support for ip, boolean, and keyword/text fields. These are all unique and different.

  • ip fields are a fixed length string type. We could use exactly the same technique to support ip.
  • boolean fields are single bit fields and their sorting is funky. In this case I think having a small a small counter for the number of falses and the number of trues you've seen is probably fine. That's kind of how VALUES works for booleans - though it deduplicates so instead of a counter it's just two boolean values.
  • text/keyword fields have a variable length. They are super useful to support, but they are probably the most difficult. Maybe we'd store "ordinal" values in the min heap and we'd use a side car structure to assign ordinals. Or something.
@nik9000 nik9000 added >enhancement needs:triage Requires assignment of a team area label :Analytics/ES|QL AKA ESQL labels Jul 1, 2024
@elasticsearchmachine elasticsearchmachine added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) and removed needs:triage Requires assignment of a team area label labels Jul 1, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@nik9000
Copy link
Member Author

nik9000 commented Jul 1, 2024

An example of how keyword would be useful is,

FROM logs
| STATS TOP(hostname, 10, "desc") BY service

This gets you a list of which hosts are hosting what service. Except if there are more than 10 hosts with the service, then it just gets you the first 10. Which is probably good for a UI that'll display a bunch of services. It can just say host1, host2, host3, ..... Then if you need all the hosts for that service it can run a second query.

In this case we really don't care too much about the sorting aspect of TOP, if there were an aggregate function that just gave us the 10 values that it ran into the fastest, that'd be even better. But we don't have that. So we'd use TOP.

@ivancea ivancea self-assigned this Jul 4, 2024
elasticsearchmachine pushed a commit that referenced this issue Jul 10, 2024
- Added support for Booleans on Max and Min
- Added some helper methods to BitArray (`set(index, value)` and `fill(from, to, value)`). This way, the container is more similar to other BigArrays, and it's easier to work with

Part of #110346, as Max
and Min are dependencies of Top.
ivancea added a commit that referenced this issue Aug 20, 2024
Support Version, Keyword and Text in Max an Min aggregations.

The current implementation of both max and min does:

For non-grouping:
- Store a BytesRef
- When there's a max/min, copy it to the internal array. Grow it if needed

For grouping:
- Keep an array of BytesRef (null by default: there's no "initial/default value" here, as there's no "MAX" value for a string)
- Each BytesRef stores their own array, which will be grown as needed to copy the new max/min

Some notes:
- It's not shrinking the arrays, as to avoid having to copy, and potentially grow it again
- It's using raw arrays. But maybe it should use BigArrays to compute in the circuit breaker?

Part of #110346
lkts pushed a commit to lkts/elasticsearch that referenced this issue Aug 20, 2024
Support Version, Keyword and Text in Max an Min aggregations.

The current implementation of both max and min does:

For non-grouping:
- Store a BytesRef
- When there's a max/min, copy it to the internal array. Grow it if needed

For grouping:
- Keep an array of BytesRef (null by default: there's no "initial/default value" here, as there's no "MAX" value for a string)
- Each BytesRef stores their own array, which will be grown as needed to copy the new max/min

Some notes:
- It's not shrinking the arrays, as to avoid having to copy, and potentially grow it again
- It's using raw arrays. But maybe it should use BigArrays to compute in the circuit breaker?

Part of elastic#110346
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Sep 4, 2024
Support Version, Keyword and Text in Max an Min aggregations.

The current implementation of both max and min does:

For non-grouping:
- Store a BytesRef
- When there's a max/min, copy it to the internal array. Grow it if needed

For grouping:
- Keep an array of BytesRef (null by default: there's no "initial/default value" here, as there's no "MAX" value for a string)
- Each BytesRef stores their own array, which will be grown as needed to copy the new max/min

Some notes:
- It's not shrinking the arrays, as to avoid having to copy, and potentially grow it again
- It's using raw arrays. But maybe it should use BigArrays to compute in the circuit breaker?

Part of elastic#110346
davidkyle pushed a commit to davidkyle/elasticsearch that referenced this issue Sep 5, 2024
Support Version, Keyword and Text in Max an Min aggregations.

The current implementation of both max and min does:

For non-grouping:
- Store a BytesRef
- When there's a max/min, copy it to the internal array. Grow it if needed

For grouping:
- Keep an array of BytesRef (null by default: there's no "initial/default value" here, as there's no "MAX" value for a string)
- Each BytesRef stores their own array, which will be grown as needed to copy the new max/min

Some notes:
- It's not shrinking the arrays, as to avoid having to copy, and potentially grow it again
- It's using raw arrays. But maybe it should use BigArrays to compute in the circuit breaker?

Part of elastic#110346
@ivancea ivancea assigned nik9000 and unassigned ivancea Sep 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)
Projects
None yet
Development

No branches or pull requests

3 participants