Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

idea: Optimize Bytes to String conversion in the Value type #1021

Open
bruceg opened this issue Sep 6, 2024 · 0 comments
Open

idea: Optimize Bytes to String conversion in the Value type #1021

bruceg opened this issue Sep 6, 2024 · 0 comments
Labels
type: enhancement A value-adding code change that enhances its existing functionality vrl: typedef Changes to the type system

Comments

@bruceg
Copy link
Member

bruceg commented Sep 6, 2024

The VRL type enum Value uses bytes::Bytes as its store of string values, since the originating data may not in fact be valid UTF-8 strings. However, we frequently want to use these bytes as a string, which leads to repeated calls to String::from_utf8_lossy on these bytes value. It may be preferable to avoid repeating this conversion by caching the result in some form in a new Bytes wrapper type.

While discussing this idea, we came up with several possible ways of doing this:

  1. Add an additional Value::String variant for when we can prove the source data is a valid String. This, unfortunately, breaks all kinds of assumptions about the Bytes variant being the only place string data is stored, and so is most likely to lead to bugs.
  2. Make a new Bytes enum and parse the incoming bytes when it is created to determine if it is valid UTF-8 or not. Conversion from str values would, of course, not need parsing and could be stored directly. Replace calls to String::from_utf8_lossy(&bytes) with new member functions on that type that intelligently return without re-parsing the data.
  3. Make a new Bytes struct containing a OnceLock or ArcSwap sidecar to store the cached value of the conversion that will be initialized on demand instead of at creation time (except for when the source is statically known to be a valid string, of course). Replace calls as above. Note that we can't use a straight Option to cache the conversion as that would require any function using Bytes to borrow the value mutably which will require far too many code changes and may even be impossible where multiple immutable borrows are required.

Since most values that will be stored in the Bytes variant will be valid UTF-8 strings, there was some debate over whether storing the resulting conversion in the case of invalid bytes would be worth the overhead of the additional data size. The Bytes is already the largest variant in the Value type and any addition to its size will grow memory requirements, and so must be considered carefully.

It would be good to spend some time profiling how often this conversion happens and benchmarking to ensure the benefits are not outweighed by the additional memory overheads. However, the string conversions are scattered throughout the code bases, making it very difficult to identify any of them as a hot spot. One thought for profiling would be to set up counters for how often Bytes are created and converted and then log these values before exit.

Work on this idea has started in the bruceg/vrl-bytes-idea branch. Expect this to be rebased frequently if any further progress is made.

@bruceg bruceg added type: enhancement A value-adding code change that enhances its existing functionality vrl: typedef Changes to the type system labels Sep 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: enhancement A value-adding code change that enhances its existing functionality vrl: typedef Changes to the type system
Projects
None yet
Development

No branches or pull requests

1 participant