Reserve memory for the identifiers hashtable. #4107

chandlerc · 2024-07-03T04:24:47Z

This uses a heuristic reserve to greatly reduce hashtable growth of the
identifiers hashtable. The design of the hashtable itself is optimized
around compact memory use and is especially slow to grow and so this has
an outsized impact.

The heuristic was computed using scripts/source_stats.py and looking
at C++ codebases. We may want to periodically re-evaluate it as Carbon
code emerges and we have better data on its distributions of tokens.

This also required fixing the Reserve method on CanonicalValueStore
that wasn't actually used anywhere and so didn't even compile correctly.
I added it to the relevant unit test so it is at least compiled locally
to its definition.

This uses a heuristic reserve to greatly reduce hashtable growth of the identifiers hashtable. The design of the hashtable itself is optimized around compact memory use and is especially slow to grow and so this has an outsized impact. The heuristic was computed using `scripts/source_stats.py` and looking at C++ codebases. We may want to periodically re-evaluate it as Carbon code emerges and we have better data on its distributions of tokens. This also required fixing the `Reserve` method on `CanonicalValueStore` that wasn't actually used anywhere and so didn't even compile correctly. I added it to the relevant unit test so it is at least compiled locally to its definition.

jonmeow

Looks good, although I wonder if it's worth some time to shorten the comment.

jonmeow · 2024-07-03T16:03:39Z

toolchain/lex/lex.cpp

+  // Once we have scanned the source text to build up our line data structures,
+  // and before we begin lexing, use the data about the line count in the source
+  // to make rough estimated reservations of memory in the hot data structures
+  // used by the lexer. In practice, scanning for lines is one of the easiest
+  // parts of the lexer to accelerate, and we can use its results to minimize
+  // the cost of incrementally growing data structures during the hot path of
+  // the lexer.
+  //
+  // Hashtables are an especially useful data structure to reserve as they are
+  // particularly expensive to *grow*. However, we don't want to waste memory on
+  // small input files by over reserving. The far and away hottest hash table is
+  // for interning identifiers. Fortunately, when analyzing several C++
+  // codebases we have found a roughly normal distribution of unique identifiers
+  // in the file centered at 0.5 * lines, and in the vast majority of cases
+  // bounded below 1.0 * lines. For example, here is LLVM's distribution
+  // computed with `scripts/source_stats.py` and rendered in an ASCII-art
+  // histogram:
+  //
+  //   # Unique IDs per 10 lines ## (median: 5)
+  //   1 ids   [  29]  ▍
+  //   2 ids   [ 282]  ███▊
+  //   3 ids   [1492]  ███████████████████▉
+  //   4 ids   [2674]  ███████████████████████████████████▌
+  //   5 ids   [3011]  ████████████████████████████████████████
+  //   6 ids   [2267]  ██████████████████████████████▏
+  //   7 ids   [1549]  ████████████████████▋
+  //   8 ids   [ 817]  ██████████▉
+  //   9 ids   [ 301]  ████
+  //   10 ids  [  98]  █▎
+  //
+  //   (Trimmed to only cover 1 - 10 unique IDs per 10 lines of code.)
+  //
+  // Based on this distribution, we reserve space for as many identifiers as
+  // lines. This will typically reserve *more* than enough space due to the load
+  // factor and requirement to have a power-of-two size, but we expect this
+  // hashtable to be worth over-shooting and thus having a low load factor and
+  // lower latency for access even at the cost of memory.
+  buffer_.value_stores_->identifiers().Reserve(buffer_.line_infos_.size());


I assume the comment here is detailed in order to provide as much information as possible in order to discourage changing this, but where are you viewing the balance when it comes to reading the function?

For me, these really long comments can make the surrounding code more difficult to read; instead of being able to skim the logic with comments filling in details I'm missing, I think it's now mainly this one comment with some other code on the side, and it takes more effort to see later code as not related to this comment.

Perhaps some details could be moved to the PR description / commit message, rather than including the full design in the code itself? As one possible rewording for you to consider:

Use the number of lines to reserve identifier space, because the line number count is easy to accelerate and offers a good heuristic. Reserving the hashtable now avoids expensive re-hashing on resize later. The cost of over-shooting memory is worth the benefits of low latency and load factors.

One unique identifier per line should be roughly double the median, and cover around 90% of files. For the analysis, see http://github.com/carbon-language/pull/4107.

Not about discouraging change, but helping future changes understand how we got here and what data.

In this specific case, I think the result also seems hard to believe, and so I wanted to provide more detail about what was going on here. I'm also hesitant to reference PR descriptions or other things that seem less durable than comments in the source code to carry this kind of information.

That said, the disruption to the function body makes lots of sense, and we don't need to inline this. What do you think about moving the specific justification to a comment on a function to compute the reservation size, and then here we can just explain the idea of reserving before we begin lexing?

jonmeow · 2024-07-03T16:10:13Z

toolchain/lex/lex.cpp

+  //   9 ids   [ 301]  ████
+  //   10 ids  [  98]  █▎
+  //
+  //   (Trimmed to only cover 1 - 10 unique IDs per 10 lines of code.)


Would it be worth having a ">10" entry in the histogram instead, to show how long the long tail is?

The script doesn't build the histogram for me in a way that's easy to do that... I can mention the number of things in the tail though?

jonmeow · 2024-07-03T16:10:42Z

toolchain/lex/lex.cpp

+  // for interning identifiers. Fortunately, when analyzing several C++
+  // codebases we have found a roughly normal distribution of unique identifiers


For background, can these codebases be mentioned somewhere, such as in the PR?

I can't provide a lot of detail, but I'll add what I can (basically, internal codebases I have access to at Google).

chandlerc

PTAL, hopefully addressed comment length and also now rebased and ready to land.

chandlerc · 2024-07-03T17:31:35Z

toolchain/lex/lex.cpp

+  // Once we have scanned the source text to build up our line data structures,
+  // and before we begin lexing, use the data about the line count in the source
+  // to make rough estimated reservations of memory in the hot data structures
+  // used by the lexer. In practice, scanning for lines is one of the easiest
+  // parts of the lexer to accelerate, and we can use its results to minimize
+  // the cost of incrementally growing data structures during the hot path of
+  // the lexer.
+  //
+  // Hashtables are an especially useful data structure to reserve as they are
+  // particularly expensive to *grow*. However, we don't want to waste memory on
+  // small input files by over reserving. The far and away hottest hash table is
+  // for interning identifiers. Fortunately, when analyzing several C++
+  // codebases we have found a roughly normal distribution of unique identifiers
+  // in the file centered at 0.5 * lines, and in the vast majority of cases
+  // bounded below 1.0 * lines. For example, here is LLVM's distribution
+  // computed with `scripts/source_stats.py` and rendered in an ASCII-art
+  // histogram:
+  //
+  //   # Unique IDs per 10 lines ## (median: 5)
+  //   1 ids   [  29]  ▍
+  //   2 ids   [ 282]  ███▊
+  //   3 ids   [1492]  ███████████████████▉
+  //   4 ids   [2674]  ███████████████████████████████████▌
+  //   5 ids   [3011]  ████████████████████████████████████████
+  //   6 ids   [2267]  ██████████████████████████████▏
+  //   7 ids   [1549]  ████████████████████▋
+  //   8 ids   [ 817]  ██████████▉
+  //   9 ids   [ 301]  ████
+  //   10 ids  [  98]  █▎
+  //
+  //   (Trimmed to only cover 1 - 10 unique IDs per 10 lines of code.)
+  //
+  // Based on this distribution, we reserve space for as many identifiers as
+  // lines. This will typically reserve *more* than enough space due to the load
+  // factor and requirement to have a power-of-two size, but we expect this
+  // hashtable to be worth over-shooting and thus having a low load factor and
+  // lower latency for access even at the cost of memory.
+  buffer_.value_stores_->identifiers().Reserve(buffer_.line_infos_.size());


Not about discouraging change, but helping future changes understand how we got here and what data.

In this specific case, I think the result also seems hard to believe, and so I wanted to provide more detail about what was going on here. I'm also hesitant to reference PR descriptions or other things that seem less durable than comments in the source code to carry this kind of information.

That said, the disruption to the function body makes lots of sense, and we don't need to inline this. What do you think about moving the specific justification to a comment on a function to compute the reservation size, and then here we can just explain the idea of reserving before we begin lexing?

chandlerc · 2024-07-03T17:32:19Z

toolchain/lex/lex.cpp

+  //   9 ids   [ 301]  ████
+  //   10 ids  [  98]  █▎
+  //
+  //   (Trimmed to only cover 1 - 10 unique IDs per 10 lines of code.)


The script doesn't build the histogram for me in a way that's easy to do that... I can mention the number of things in the tail though?

chandlerc · 2024-07-03T17:32:46Z

toolchain/lex/lex.cpp

+  // for interning identifiers. Fortunately, when analyzing several C++
+  // codebases we have found a roughly normal distribution of unique identifiers


I can't provide a lot of detail, but I'll add what I can (basically, internal codebases I have access to at Google).

chandlerc · 2024-07-18T08:57:53Z

Ping? Just looking for confirmation that moving the comment out-of-line to a separate area and cleaning up things was enough that this LG for landing.

jonmeow · 2024-07-18T15:41:17Z

I believe we'd discussed this in person a couple weeks ago, I'm fine with you merging.

josh11b · 2024-07-18T22:51:57Z

toolchain/base/value_store.h

+template <typename IdT>
+auto CanonicalValueStore<IdT>::Reserve(size_t size) -> void {
+  // Compute the resulting new insert count using the size of values -- the
+  // set doesn't have a fast to compute current size.


Is there a word missing after "fast"?

I think it's just "fast-to-compute", or "quickly computable"

This uses a heuristic reserve to greatly reduce hashtable growth of the identifiers hashtable. The design of the hashtable itself is optimized around compact memory use and is especially slow to grow and so this has an outsized impact. The heuristic was computed using `scripts/source_stats.py` and looking at C++ codebases. We may want to periodically re-evaluate it as Carbon code emerges and we have better data on its distributions of tokens. This also required fixing the `Reserve` method on `CanonicalValueStore` that wasn't actually used anywhere and so didn't even compile correctly. I added it to the relevant unit test so it is at least compiled locally to its definition.

github-actions bot requested a review from jonmeow July 3, 2024 04:25

github-actions bot added the toolchain label Jul 3, 2024

jonmeow approved these changes Jul 3, 2024

View reviewed changes

improve comments

d77ede7

chandlerc force-pushed the reserve-lexer2 branch from 9b2235f to d77ede7 Compare July 3, 2024 17:46

more comment improvements

37ce127

chandlerc commented Jul 3, 2024

View reviewed changes

chandlerc requested a review from jonmeow July 3, 2024 18:02

jonmeow added this pull request to the merge queue Jul 18, 2024

Merged via the queue into carbon-language:trunk with commit 44c85e0 Jul 18, 2024
7 checks passed

josh11b reviewed Jul 18, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reserve memory for the identifiers hashtable. #4107

Reserve memory for the identifiers hashtable. #4107

chandlerc commented Jul 3, 2024 •

edited

Loading

jonmeow left a comment

jonmeow Jul 3, 2024 •

edited

Loading

chandlerc Jul 3, 2024

jonmeow Jul 3, 2024

chandlerc Jul 3, 2024

jonmeow Jul 3, 2024

chandlerc Jul 3, 2024

chandlerc left a comment

chandlerc Jul 3, 2024

chandlerc Jul 3, 2024

chandlerc Jul 3, 2024

chandlerc commented Jul 18, 2024

jonmeow commented Jul 18, 2024

josh11b Jul 18, 2024

jonmeow Jul 19, 2024 •

edited

Loading

		// for interning identifiers. Fortunately, when analyzing several C++
		// codebases we have found a roughly normal distribution of unique identifiers

Reserve memory for the identifiers hashtable. #4107

Reserve memory for the identifiers hashtable. #4107

Conversation

chandlerc commented Jul 3, 2024 • edited Loading

jonmeow left a comment

Choose a reason for hiding this comment

jonmeow Jul 3, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chandlerc left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chandlerc commented Jul 18, 2024

jonmeow commented Jul 18, 2024

Choose a reason for hiding this comment

jonmeow Jul 19, 2024 • edited Loading

Choose a reason for hiding this comment

chandlerc commented Jul 3, 2024 •

edited

Loading

jonmeow Jul 3, 2024 •

edited

Loading

jonmeow Jul 19, 2024 •

edited

Loading