Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reserve memory for the identifiers hashtable. #4107

Merged
merged 3 commits into from
Jul 18, 2024

Conversation

chandlerc
Copy link
Contributor

@chandlerc chandlerc commented Jul 3, 2024

This uses a heuristic reserve to greatly reduce hashtable growth of the
identifiers hashtable. The design of the hashtable itself is optimized
around compact memory use and is especially slow to grow and so this has
an outsized impact.

The heuristic was computed using scripts/source_stats.py and looking
at C++ codebases. We may want to periodically re-evaluate it as Carbon
code emerges and we have better data on its distributions of tokens.

This also required fixing the Reserve method on CanonicalValueStore
that wasn't actually used anywhere and so didn't even compile correctly.
I added it to the relevant unit test so it is at least compiled locally
to its definition.

This uses a heuristic reserve to greatly reduce hashtable growth of the
identifiers hashtable. The design of the hashtable itself is optimized
around compact memory use and is especially slow to grow and so this has
an outsized impact.

The heuristic was computed using `scripts/source_stats.py` and looking
at C++ codebases. We may want to periodically re-evaluate it as Carbon
code emerges and we have better data on its distributions of tokens.

This also required fixing the `Reserve` method on `CanonicalValueStore`
that wasn't actually used anywhere and so didn't even compile correctly.
I added it to the relevant unit test so it is at least compiled locally
to its definition.
Copy link
Contributor

@jonmeow jonmeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, although I wonder if it's worth some time to shorten the comment.

Comment on lines 634 to 671
// Once we have scanned the source text to build up our line data structures,
// and before we begin lexing, use the data about the line count in the source
// to make rough estimated reservations of memory in the hot data structures
// used by the lexer. In practice, scanning for lines is one of the easiest
// parts of the lexer to accelerate, and we can use its results to minimize
// the cost of incrementally growing data structures during the hot path of
// the lexer.
//
// Hashtables are an especially useful data structure to reserve as they are
// particularly expensive to *grow*. However, we don't want to waste memory on
// small input files by over reserving. The far and away hottest hash table is
// for interning identifiers. Fortunately, when analyzing several C++
// codebases we have found a roughly normal distribution of unique identifiers
// in the file centered at 0.5 * lines, and in the vast majority of cases
// bounded below 1.0 * lines. For example, here is LLVM's distribution
// computed with `scripts/source_stats.py` and rendered in an ASCII-art
// histogram:
//
// # Unique IDs per 10 lines ## (median: 5)
// 1 ids [ 29] ▍
// 2 ids [ 282] ███▊
// 3 ids [1492] ███████████████████▉
// 4 ids [2674] ███████████████████████████████████▌
// 5 ids [3011] ████████████████████████████████████████
// 6 ids [2267] ██████████████████████████████▏
// 7 ids [1549] ████████████████████▋
// 8 ids [ 817] ██████████▉
// 9 ids [ 301] ████
// 10 ids [ 98] █▎
//
// (Trimmed to only cover 1 - 10 unique IDs per 10 lines of code.)
//
// Based on this distribution, we reserve space for as many identifiers as
// lines. This will typically reserve *more* than enough space due to the load
// factor and requirement to have a power-of-two size, but we expect this
// hashtable to be worth over-shooting and thus having a low load factor and
// lower latency for access even at the cost of memory.
buffer_.value_stores_->identifiers().Reserve(buffer_.line_infos_.size());
Copy link
Contributor

@jonmeow jonmeow Jul 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the comment here is detailed in order to provide as much information as possible in order to discourage changing this, but where are you viewing the balance when it comes to reading the function?

For me, these really long comments can make the surrounding code more difficult to read; instead of being able to skim the logic with comments filling in details I'm missing, I think it's now mainly this one comment with some other code on the side, and it takes more effort to see later code as not related to this comment.

Perhaps some details could be moved to the PR description / commit message, rather than including the full design in the code itself? As one possible rewording for you to consider:

Use the number of lines to reserve identifier space, because the line number count is easy to accelerate and offers a good heuristic. Reserving the hashtable now avoids expensive re-hashing on resize later. The cost of over-shooting memory is worth the benefits of low latency and load factors.

One unique identifier per line should be roughly double the median, and cover around 90% of files. For the analysis, see http://github.com/carbon-language/pull/4107.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not about discouraging change, but helping future changes understand how we got here and what data.

In this specific case, I think the result also seems hard to believe, and so I wanted to provide more detail about what was going on here. I'm also hesitant to reference PR descriptions or other things that seem less durable than comments in the source code to carry this kind of information.

That said, the disruption to the function body makes lots of sense, and we don't need to inline this. What do you think about moving the specific justification to a comment on a function to compute the reservation size, and then here we can just explain the idea of reserving before we begin lexing?

// 9 ids [ 301] ████
// 10 ids [ 98] █▎
//
// (Trimmed to only cover 1 - 10 unique IDs per 10 lines of code.)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth having a ">10" entry in the histogram instead, to show how long the long tail is?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script doesn't build the histogram for me in a way that's easy to do that... I can mention the number of things in the tail though?

Comment on lines 645 to 646
// for interning identifiers. Fortunately, when analyzing several C++
// codebases we have found a roughly normal distribution of unique identifiers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For background, can these codebases be mentioned somewhere, such as in the PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't provide a lot of detail, but I'll add what I can (basically, internal codebases I have access to at Google).

Copy link
Contributor Author

@chandlerc chandlerc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL, hopefully addressed comment length and also now rebased and ready to land.

Comment on lines 634 to 671
// Once we have scanned the source text to build up our line data structures,
// and before we begin lexing, use the data about the line count in the source
// to make rough estimated reservations of memory in the hot data structures
// used by the lexer. In practice, scanning for lines is one of the easiest
// parts of the lexer to accelerate, and we can use its results to minimize
// the cost of incrementally growing data structures during the hot path of
// the lexer.
//
// Hashtables are an especially useful data structure to reserve as they are
// particularly expensive to *grow*. However, we don't want to waste memory on
// small input files by over reserving. The far and away hottest hash table is
// for interning identifiers. Fortunately, when analyzing several C++
// codebases we have found a roughly normal distribution of unique identifiers
// in the file centered at 0.5 * lines, and in the vast majority of cases
// bounded below 1.0 * lines. For example, here is LLVM's distribution
// computed with `scripts/source_stats.py` and rendered in an ASCII-art
// histogram:
//
// # Unique IDs per 10 lines ## (median: 5)
// 1 ids [ 29] ▍
// 2 ids [ 282] ███▊
// 3 ids [1492] ███████████████████▉
// 4 ids [2674] ███████████████████████████████████▌
// 5 ids [3011] ████████████████████████████████████████
// 6 ids [2267] ██████████████████████████████▏
// 7 ids [1549] ████████████████████▋
// 8 ids [ 817] ██████████▉
// 9 ids [ 301] ████
// 10 ids [ 98] █▎
//
// (Trimmed to only cover 1 - 10 unique IDs per 10 lines of code.)
//
// Based on this distribution, we reserve space for as many identifiers as
// lines. This will typically reserve *more* than enough space due to the load
// factor and requirement to have a power-of-two size, but we expect this
// hashtable to be worth over-shooting and thus having a low load factor and
// lower latency for access even at the cost of memory.
buffer_.value_stores_->identifiers().Reserve(buffer_.line_infos_.size());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not about discouraging change, but helping future changes understand how we got here and what data.

In this specific case, I think the result also seems hard to believe, and so I wanted to provide more detail about what was going on here. I'm also hesitant to reference PR descriptions or other things that seem less durable than comments in the source code to carry this kind of information.

That said, the disruption to the function body makes lots of sense, and we don't need to inline this. What do you think about moving the specific justification to a comment on a function to compute the reservation size, and then here we can just explain the idea of reserving before we begin lexing?

// 9 ids [ 301] ████
// 10 ids [ 98] █▎
//
// (Trimmed to only cover 1 - 10 unique IDs per 10 lines of code.)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script doesn't build the histogram for me in a way that's easy to do that... I can mention the number of things in the tail though?

Comment on lines 645 to 646
// for interning identifiers. Fortunately, when analyzing several C++
// codebases we have found a roughly normal distribution of unique identifiers
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't provide a lot of detail, but I'll add what I can (basically, internal codebases I have access to at Google).

@chandlerc chandlerc requested a review from jonmeow July 3, 2024 18:02
@chandlerc
Copy link
Contributor Author

Ping? Just looking for confirmation that moving the comment out-of-line to a separate area and cleaning up things was enough that this LG for landing.

@jonmeow
Copy link
Contributor

jonmeow commented Jul 18, 2024

I believe we'd discussed this in person a couple weeks ago, I'm fine with you merging.

@jonmeow jonmeow added this pull request to the merge queue Jul 18, 2024
Merged via the queue into carbon-language:trunk with commit 44c85e0 Jul 18, 2024
7 checks passed
template <typename IdT>
auto CanonicalValueStore<IdT>::Reserve(size_t size) -> void {
// Compute the resulting new insert count using the size of values -- the
// set doesn't have a fast to compute current size.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a word missing after "fast"?

Copy link
Contributor

@jonmeow jonmeow Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's just "fast-to-compute", or "quickly computable"

brymer-meneses pushed a commit to brymer-meneses/carbon-lang that referenced this pull request Aug 15, 2024
This uses a heuristic reserve to greatly reduce hashtable growth of the
identifiers hashtable. The design of the hashtable itself is optimized
around compact memory use and is especially slow to grow and so this has
an outsized impact.

The heuristic was computed using `scripts/source_stats.py` and looking
at C++ codebases. We may want to periodically re-evaluate it as Carbon
code emerges and we have better data on its distributions of tokens.

This also required fixing the `Reserve` method on `CanonicalValueStore`
that wasn't actually used anywhere and so didn't even compile correctly.
I added it to the relevant unit test so it is at least compiled locally
to its definition.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants