-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reserve memory for the identifiers hashtable. #4107
Conversation
This uses a heuristic reserve to greatly reduce hashtable growth of the identifiers hashtable. The design of the hashtable itself is optimized around compact memory use and is especially slow to grow and so this has an outsized impact. The heuristic was computed using `scripts/source_stats.py` and looking at C++ codebases. We may want to periodically re-evaluate it as Carbon code emerges and we have better data on its distributions of tokens. This also required fixing the `Reserve` method on `CanonicalValueStore` that wasn't actually used anywhere and so didn't even compile correctly. I added it to the relevant unit test so it is at least compiled locally to its definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, although I wonder if it's worth some time to shorten the comment.
toolchain/lex/lex.cpp
Outdated
// Once we have scanned the source text to build up our line data structures, | ||
// and before we begin lexing, use the data about the line count in the source | ||
// to make rough estimated reservations of memory in the hot data structures | ||
// used by the lexer. In practice, scanning for lines is one of the easiest | ||
// parts of the lexer to accelerate, and we can use its results to minimize | ||
// the cost of incrementally growing data structures during the hot path of | ||
// the lexer. | ||
// | ||
// Hashtables are an especially useful data structure to reserve as they are | ||
// particularly expensive to *grow*. However, we don't want to waste memory on | ||
// small input files by over reserving. The far and away hottest hash table is | ||
// for interning identifiers. Fortunately, when analyzing several C++ | ||
// codebases we have found a roughly normal distribution of unique identifiers | ||
// in the file centered at 0.5 * lines, and in the vast majority of cases | ||
// bounded below 1.0 * lines. For example, here is LLVM's distribution | ||
// computed with `scripts/source_stats.py` and rendered in an ASCII-art | ||
// histogram: | ||
// | ||
// # Unique IDs per 10 lines ## (median: 5) | ||
// 1 ids [ 29] ▍ | ||
// 2 ids [ 282] ███▊ | ||
// 3 ids [1492] ███████████████████▉ | ||
// 4 ids [2674] ███████████████████████████████████▌ | ||
// 5 ids [3011] ████████████████████████████████████████ | ||
// 6 ids [2267] ██████████████████████████████▏ | ||
// 7 ids [1549] ████████████████████▋ | ||
// 8 ids [ 817] ██████████▉ | ||
// 9 ids [ 301] ████ | ||
// 10 ids [ 98] █▎ | ||
// | ||
// (Trimmed to only cover 1 - 10 unique IDs per 10 lines of code.) | ||
// | ||
// Based on this distribution, we reserve space for as many identifiers as | ||
// lines. This will typically reserve *more* than enough space due to the load | ||
// factor and requirement to have a power-of-two size, but we expect this | ||
// hashtable to be worth over-shooting and thus having a low load factor and | ||
// lower latency for access even at the cost of memory. | ||
buffer_.value_stores_->identifiers().Reserve(buffer_.line_infos_.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume the comment here is detailed in order to provide as much information as possible in order to discourage changing this, but where are you viewing the balance when it comes to reading the function?
For me, these really long comments can make the surrounding code more difficult to read; instead of being able to skim the logic with comments filling in details I'm missing, I think it's now mainly this one comment with some other code on the side, and it takes more effort to see later code as not related to this comment.
Perhaps some details could be moved to the PR description / commit message, rather than including the full design in the code itself? As one possible rewording for you to consider:
Use the number of lines to reserve identifier space, because the line number count is easy to accelerate and offers a good heuristic. Reserving the hashtable now avoids expensive re-hashing on resize later. The cost of over-shooting memory is worth the benefits of low latency and load factors.
One unique identifier per line should be roughly double the median, and cover around 90% of files. For the analysis, see http://github.com/carbon-language/pull/4107.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not about discouraging change, but helping future changes understand how we got here and what data.
In this specific case, I think the result also seems hard to believe, and so I wanted to provide more detail about what was going on here. I'm also hesitant to reference PR descriptions or other things that seem less durable than comments in the source code to carry this kind of information.
That said, the disruption to the function body makes lots of sense, and we don't need to inline this. What do you think about moving the specific justification to a comment on a function to compute the reservation size, and then here we can just explain the idea of reserving before we begin lexing?
toolchain/lex/lex.cpp
Outdated
// 9 ids [ 301] ████ | ||
// 10 ids [ 98] █▎ | ||
// | ||
// (Trimmed to only cover 1 - 10 unique IDs per 10 lines of code.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be worth having a ">10" entry in the histogram instead, to show how long the long tail is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The script doesn't build the histogram for me in a way that's easy to do that... I can mention the number of things in the tail though?
toolchain/lex/lex.cpp
Outdated
// for interning identifiers. Fortunately, when analyzing several C++ | ||
// codebases we have found a roughly normal distribution of unique identifiers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For background, can these codebases be mentioned somewhere, such as in the PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't provide a lot of detail, but I'll add what I can (basically, internal codebases I have access to at Google).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL, hopefully addressed comment length and also now rebased and ready to land.
toolchain/lex/lex.cpp
Outdated
// Once we have scanned the source text to build up our line data structures, | ||
// and before we begin lexing, use the data about the line count in the source | ||
// to make rough estimated reservations of memory in the hot data structures | ||
// used by the lexer. In practice, scanning for lines is one of the easiest | ||
// parts of the lexer to accelerate, and we can use its results to minimize | ||
// the cost of incrementally growing data structures during the hot path of | ||
// the lexer. | ||
// | ||
// Hashtables are an especially useful data structure to reserve as they are | ||
// particularly expensive to *grow*. However, we don't want to waste memory on | ||
// small input files by over reserving. The far and away hottest hash table is | ||
// for interning identifiers. Fortunately, when analyzing several C++ | ||
// codebases we have found a roughly normal distribution of unique identifiers | ||
// in the file centered at 0.5 * lines, and in the vast majority of cases | ||
// bounded below 1.0 * lines. For example, here is LLVM's distribution | ||
// computed with `scripts/source_stats.py` and rendered in an ASCII-art | ||
// histogram: | ||
// | ||
// # Unique IDs per 10 lines ## (median: 5) | ||
// 1 ids [ 29] ▍ | ||
// 2 ids [ 282] ███▊ | ||
// 3 ids [1492] ███████████████████▉ | ||
// 4 ids [2674] ███████████████████████████████████▌ | ||
// 5 ids [3011] ████████████████████████████████████████ | ||
// 6 ids [2267] ██████████████████████████████▏ | ||
// 7 ids [1549] ████████████████████▋ | ||
// 8 ids [ 817] ██████████▉ | ||
// 9 ids [ 301] ████ | ||
// 10 ids [ 98] █▎ | ||
// | ||
// (Trimmed to only cover 1 - 10 unique IDs per 10 lines of code.) | ||
// | ||
// Based on this distribution, we reserve space for as many identifiers as | ||
// lines. This will typically reserve *more* than enough space due to the load | ||
// factor and requirement to have a power-of-two size, but we expect this | ||
// hashtable to be worth over-shooting and thus having a low load factor and | ||
// lower latency for access even at the cost of memory. | ||
buffer_.value_stores_->identifiers().Reserve(buffer_.line_infos_.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not about discouraging change, but helping future changes understand how we got here and what data.
In this specific case, I think the result also seems hard to believe, and so I wanted to provide more detail about what was going on here. I'm also hesitant to reference PR descriptions or other things that seem less durable than comments in the source code to carry this kind of information.
That said, the disruption to the function body makes lots of sense, and we don't need to inline this. What do you think about moving the specific justification to a comment on a function to compute the reservation size, and then here we can just explain the idea of reserving before we begin lexing?
toolchain/lex/lex.cpp
Outdated
// 9 ids [ 301] ████ | ||
// 10 ids [ 98] █▎ | ||
// | ||
// (Trimmed to only cover 1 - 10 unique IDs per 10 lines of code.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The script doesn't build the histogram for me in a way that's easy to do that... I can mention the number of things in the tail though?
toolchain/lex/lex.cpp
Outdated
// for interning identifiers. Fortunately, when analyzing several C++ | ||
// codebases we have found a roughly normal distribution of unique identifiers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't provide a lot of detail, but I'll add what I can (basically, internal codebases I have access to at Google).
Ping? Just looking for confirmation that moving the comment out-of-line to a separate area and cleaning up things was enough that this LG for landing. |
I believe we'd discussed this in person a couple weeks ago, I'm fine with you merging. |
template <typename IdT> | ||
auto CanonicalValueStore<IdT>::Reserve(size_t size) -> void { | ||
// Compute the resulting new insert count using the size of values -- the | ||
// set doesn't have a fast to compute current size. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a word missing after "fast"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's just "fast-to-compute", or "quickly computable"
This uses a heuristic reserve to greatly reduce hashtable growth of the identifiers hashtable. The design of the hashtable itself is optimized around compact memory use and is especially slow to grow and so this has an outsized impact. The heuristic was computed using `scripts/source_stats.py` and looking at C++ codebases. We may want to periodically re-evaluate it as Carbon code emerges and we have better data on its distributions of tokens. This also required fixing the `Reserve` method on `CanonicalValueStore` that wasn't actually used anywhere and so didn't even compile correctly. I added it to the relevant unit test so it is at least compiled locally to its definition.
This uses a heuristic reserve to greatly reduce hashtable growth of the
identifiers hashtable. The design of the hashtable itself is optimized
around compact memory use and is especially slow to grow and so this has
an outsized impact.
The heuristic was computed using
scripts/source_stats.py
and lookingat C++ codebases. We may want to periodically re-evaluate it as Carbon
code emerges and we have better data on its distributions of tokens.
This also required fixing the
Reserve
method onCanonicalValueStore
that wasn't actually used anywhere and so didn't even compile correctly.
I added it to the relevant unit test so it is at least compiled locally
to its definition.