use proto.Buffer API for protobuf codec and cache proto.Buffer structs #1010

Merged: 27 commits merged into grpc:master on Apr 13, 2017

Conversation

@apolcyn commented Dec 3, 2016

Replaces #955, which was overly invasive.

#955 used "per-transport" pools only as a way to reduce contention on the proto.Buffer caches.

If we instead use a simple global cache of proto.Buffer, the change doesn't have to touch anything outside of protoCodec, but experimentally contention goes up and the streaming QPS benefit disappears. Sharding that global pool appears to fix the contention: streaming QPS gets the slightly-greater-than-15% improvement it got earlier with the "per-transport" pools.

@apolcyn commented Dec 9, 2016

The first commit here used a non-atomic global counter to index into one of the sharded pools (a token was used only so that alloc and free happen on the same pool).
As far as I can see, the non-thread-safety of the earlier commit shouldn't be a correctness problem, but it trips the automatic race detector. The latest update changes the counter to an atomic one, which appears to cost about 3% QPS; the profile shows roughly that much new CPU being spent there:

3.02% 2.83% worker worker [.] sync/atomic.AddUint32

It still gives an overall QPS improvement of over 10%; there's a small new cost, but the earlier version was breaking the race detection test.
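As a rough illustration of the approach being discussed, here is a minimal sketch of a sharded proto.Buffer cache indexed by an atomic round-robin counter. The names, shard count, and locking scheme are assumptions for illustration, not the PR's actual code.

package grpc

import (
	"sync"
	"sync/atomic"

	"github.com/golang/protobuf/proto"
)

const numBufferShards = 8

type bufferShard struct {
	mu   sync.Mutex
	bufs []*proto.Buffer
}

var (
	bufferShards   [numBufferShards]bufferShard
	bufferShardIdx uint32 // round-robin counter; atomic so the race detector is satisfied
)

// getBuffer picks a shard via an atomic round-robin counter and pops a cached
// proto.Buffer from it, allocating a fresh one if the shard is empty. The
// returned index is the "token" used to free back into the same shard.
func getBuffer() (*proto.Buffer, uint32) {
	i := atomic.AddUint32(&bufferShardIdx, 1) % numBufferShards
	s := &bufferShards[i]
	s.mu.Lock()
	defer s.mu.Unlock()
	if n := len(s.bufs); n > 0 {
		b := s.bufs[n-1]
		s.bufs = s.bufs[:n-1]
		return b, i
	}
	return &proto.Buffer{}, i
}

// putBuffer resets the buffer and returns it to the shard it was taken from.
func putBuffer(b *proto.Buffer, i uint32) {
	b.Reset()
	s := &bufferShards[i]
	s.mu.Lock()
	s.bufs = append(s.bufs, b)
	s.mu.Unlock()
}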

@apolcyn commented Dec 28, 2016

On top of all of the changes referenced in #1031, the latest two commits reduce object alloc counts by about 15-20% and reduce allocated space by roughly 300MB.
They fix the alloc seen in proto.Buffer.enc_len_struct by adding 4 to the pre-calculated marshalled size (to prevent an alloc in the builtin append).
They also remove a slice alloc in protoCodec.Marshal that appears unnecessary.

QPS isn't improved much beyond noise.

@apolcyn commented Dec 28, 2016

More on what the last two commits do:
With all of the changes referenced in #1031, the streaming QPS benchmark's alloc_space profile looks like this (showing functions with flat allocations in the profile):

2548.58MB of 2574.68MB total (98.99%)
Dropped 78 nodes (cum <= 12.87MB)
      flat  flat%   sum%        cum   cum%
 1365.56MB 53.04% 53.04%  1365.56MB 53.04%  golang.org/x/net/http2.parseDataFrame
  462.51MB 17.96% 71.00%   462.51MB 17.96%  github.com/golang/protobuf/proto.(*Buffer).enc_len_thing
  460.01MB 17.87% 88.87%   935.51MB 36.34%  google.golang.org/grpc.encode
  245.50MB  9.54% 98.40%   245.50MB  9.54%  google.golang.org/grpc/transport.(*http2Server).handleData
      13MB   0.5% 98.91%   475.51MB 18.47%  google.golang.org/grpc.protoCodec.Marshal
       1MB 0.039% 98.95%   941.51MB 36.57%  google.golang.org/grpc.(*Server).processStreamingRPC
       1MB 0.039% 98.99%   940.51MB 36.53%  google.golang.org/grpc/benchmark.(*testServer).StreamingCall

After these changes, it becomes:

2298.58MB of 2325.21MB total (98.85%)
Dropped 79 nodes (cum <= 11.63MB)
      flat  flat%   sum%        cum   cum%
 1440.57MB 61.95% 61.95%  1440.57MB 61.95%  golang.org/x/net/http2.parseDataFrame
     226MB  9.72% 71.67%      226MB  9.72%  google.golang.org/grpc/transport.(*http2Server).handleData
  162.50MB  6.99% 78.66%   479.01MB 20.60%  google.golang.org/grpc.(*serverStream).SendMsg
  159.50MB  6.86% 85.52%   160.50MB  6.90%  google.golang.org/grpc.protoCodec.Marshal
     155MB  6.67% 92.19%      155MB  6.67%  google.golang.org/grpc.(*parser).recvMsg
     154MB  6.62% 98.81%   314.50MB 13.53%  google.golang.org/grpc.encode
       1MB 0.043% 98.85%   635.01MB 27.31%  google.golang.org/grpc.(*Server).processStreamingRPC

The 462.51MB (17.96% flat) in github.com/golang/protobuf/proto.(*Buffer).enc_len_thing appears to come from an alloc when appending the length of the protobuf struct to the marshalled proto, in https://github.com/golang/protobuf/blob/62e782f47e4c57935994133e44f1716d281504e4/proto/encode.go#L1311. (Previously the buffer was set to the size returned from proto.Size, but it appears that this length isn't enough.)
cc @carl-mastrangelo: it seems kind of strange that sizing a proto.Buffer's buffer to proto.Size isn't enough?

(As for the removal of the second make([]byte) in protoCodec.Marshal, it was simply unnecessary and straightforward to remove.)
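As a concrete sketch of what these two commits amount to (assumed shape, not the exact diff): the buffer's backing slice is pre-sized with proto.Size plus 4 bytes of headroom for the length prefix appended in enc_len_thing, and the encoded bytes are returned directly instead of copying into a second freshly made slice.

package grpc

import "github.com/golang/protobuf/proto"

// protoSizeFieldLength is the headroom added on top of proto.Size so that
// appending the length prefix inside proto.Buffer does not force append to
// grow (and reallocate) the backing slice.
const protoSizeFieldLength = 4

// marshalInto marshals protoMsg into the supplied (pooled) buffer, whose
// backing slice is pre-sized, and returns the encoded bytes without an
// extra output slice.
func marshalInto(buffer *proto.Buffer, protoMsg proto.Message) ([]byte, error) {
	buffer.SetBuf(make([]byte, 0, proto.Size(protoMsg)+protoSizeFieldLength))
	if err := buffer.Marshal(protoMsg); err != nil {
		return nil, err
	}
	return buffer.Bytes(), nil
}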

@carl-mastrangelo commented:

@apolcyn Perhaps I don't understand this change: it appears to make the marshaller/unmarshaller just reuse the proto.Buffer objects, rather than reuse the underlying slices. Is that correct? IIRC it's the slices that are actually expensive to allocate, not the protobuf wrappers around them.

I'm curious how this improves performance. The only way I could imagine this helping at all is when encoding proto2, which allocates new ints on the heap, and caches them inside the proto.Buffer. Even then, it only helps with encoding, not decoding, and they disappear after a GC. That would hurt most of the gains from this PR.

Am I missing something?

@apolcyn commented Jan 3, 2017

Re: the previous comment: we discussed offline, but the main memory-saving benefit seems to come from caching the proto.Buffer objects rather than the slices (the buffers appear much more expensive than the slices, at least on small messages).

@apolcyn commented Jan 3, 2017

Heads up on another change for this PR (cc @carl-mastrangelo): revisiting the use of sync.Pool rather than the custom cache created in this PR shows sync.Pool doing about as well (sometimes better, but the improvements are within noise).

A small micro-benchmark comparison of the two pools (repeatedly doing concurrent marshals/unmarshals) shows similar results, with maybe a slight (within 10%) improvement from the custom cache here. But if anything the overall effect seems better with sync.Pool.

This can probably be made a lot simpler by using just sync.Pool.
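A minimal sketch of the sync.Pool variant being discussed (assumed shape; it would replace the hand-rolled sharded cache sketched earlier, and no shard token is needed):

package grpc

import (
	"sync"

	"github.com/golang/protobuf/proto"
)

var protoBufferPool = &sync.Pool{
	New: func() interface{} { return &proto.Buffer{} },
}

func getProtoBuffer() *proto.Buffer {
	return protoBufferPool.Get().(*proto.Buffer)
}

func putProtoBuffer(b *proto.Buffer) {
	b.SetBuf(nil) // drop the reference so a pooled buffer doesn't pin a large slice
	protoBufferPool.Put(b)
}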

@apolcyn changed the title from "use a global sharded pool of proto.Buffer caches in protoCodec" to "use proto.Buffer API for protobuf codec and cache proto.Buffer structs" on Jan 5, 2017
@apolcyn assigned menghanl and MakMukhi and unassigned iamqizhao and menghanl on Feb 7, 2017
@MakMukhi left a review:

Ending this review on these comments so far. Some of them apply to the rest of the code too.

codec.go Outdated
protoSizeFieldLength = 4
)

var protoMsg = v.(proto.Message)
Reviewer: can get away with protoMsg := v.(proto.Message)

@apolcyn: done

codec.go Outdated

var protoMsg = v.(proto.Message)
// adding 4 to proto.Size avoids an extra allocation when appending the 4 byte length
// field in 'proto.Buffer.enc_len_thing'
Reviewer: Same nitpick about the comment; it should read as a complete sentence.

@apolcyn: done

codec.go Outdated

func (p protoCodec) Marshal(v interface{}) ([]byte, error) {
const (
protoSizeFieldLength = 4
Reviewer: Could this be a global const? That would save allocating the const every time the function is called.

@apolcyn: I think this should be getting compiled in. Keeping it since it's only used locally? (e.g., go bench memory stats show no difference.)

Reviewer: You can confirm by looking at the highly readable assembly produced.

@apolcyn: I actually noticed that Go proto objects in micro benchmarks aren't doing the realloc in enc_len_thing that happens in the large full gRPC benchmarks. I took this out for now. Is this only needed for certain protos, or fixed in newer versions of the Go proto compiler?

codec.go Outdated

buffer := protoBufferPool.Get().(*proto.Buffer)

newSlice := make([]byte, sizeNeeded)
Reviewer: Since sizeNeeded isn't used anywhere else, it might make more sense to just remove that variable:
newSlice := make([]byte, proto.Size(protoMsg) + protoSizeFieldLength)

@apolcyn: done

codec.go Outdated
err := buffer.Marshal(protoMsg)
if err != nil {
return nil, err
}
Reviewer:
if err := buffer.Marshal(protoMsg); err != nil {
return nil, err
}

Reviewer: I imagine that if there were an error while marshaling, we'd still want to put the buffer back in the pool?

@apolcyn: Whoops, thanks, good catch; I missed that. I put all of the cleanup into a defer.
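A sketch of the defer-based cleanup being described (assumed shape, reusing the protoBufferPool from the sync.Pool sketch above): the deferred function runs on both the error and the success path, so a failed Marshal no longer leaks the buffer out of the pool.

func (p protoCodec) Marshal(v interface{}) ([]byte, error) {
	protoMsg := v.(proto.Message)
	buffer := protoBufferPool.Get().(*proto.Buffer)
	defer func() {
		// Cleanup runs regardless of which return path is taken.
		buffer.SetBuf(nil)
		protoBufferPool.Put(buffer)
	}()

	if err := buffer.Marshal(protoMsg); err != nil {
		return nil, err
	}
	return buffer.Bytes(), nil
}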

codec_test.go Outdated
}

func marshalAndUnmarshal(protoCodec Codec, expectedBody []byte, t *testing.T) {
original := &codec_perf.Buffer{}
Reviewer:
original := &codec_perf.Buffer {
Body: expectedBody
}

Also, can we reuse this variable instead of creating the variable deserialized? That's just a nitpick again.

@apolcyn: done

codec_test.go Outdated

result := deserialized.GetBody()

if bytes.Compare(result, expectedBody) != 0 {
Reviewer: Since result is not used anywhere else, maybe it's better to use deserialized.GetBody() inline here.

@apolcyn: done

codec_test.go Outdated

// Try to catch possible race conditions around use of pools
func TestConcurrentUsage(t *testing.T) {
const numGoRoutines = 100
Reviewer:
const (
numGoRoutines = ..
...
)

@apolcyn: done

}

var wg sync.WaitGroup
codec := protoCodec{}
Reviewer: This is inconsistent with how the previous function initializes protoCodec (via newProtoCodec). However, this looks cleaner. Maybe remove the function newProtoCodec() altogether?

@apolcyn: done

codec_test.go Outdated

// This tries to make sure that buffers weren't stomped on
// between marshals on codecs taking from the same pool.
func TestStaggeredMarshalAndUnmarshalUsingSamePool(t *testing.T) {
Reviewer: This function could also use the styling comments mentioned above.

@apolcyn: reworded this

codec.go Outdated

protoMsg := v.(proto.Message)
buffer := protoBufferPool.Get().(*proto.Buffer)
defer func() {
Reviewer:
You might consider rewriting this without the defers. They have become faster recently, but they aren't free.

Also benchmark it.

@apolcyn: At least the local protoCodec micro benchmarks showed no difference beyond the usual noise with and without defers. If that sounds OK, I'm thinking we should keep them in?

codec.go Outdated

// Adding 4 to proto.Size avoids an extra allocation when appending the 4 byte length
// field in 'proto.Buffer.enc_len_thing'.
newSlice := make([]byte, proto.Size(protoMsg)+protoSizeFieldLength)
Reviewer: Size is really expensive to calculate. It may be faster to not call it and just guess.

@apolcyn: I replaced the proto.Size calculation with a guess based on the length of buffer.Bytes() from the last call. Micro benchmarks showed around a 20% improvement! E.g., under SetParallelism(1), typical times went from ~130-140ns to ~100-110ns.
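Roughly, the guess looks like this (assumed shape, using sync/atomic and the protoBufferPool sketched above; at this stage the last size is a package-level value updated atomically, matching the review snippet further down, before it later moves into the pooled struct): the previous message's encoded length becomes the starting capacity, and append simply grows the slice when the guess is too small.

// lastMarshaledSize remembers the encoded size of the previous message.
var lastMarshaledSize uint32 = 16

func (p protoCodec) Marshal(v interface{}) ([]byte, error) {
	protoMsg := v.(proto.Message)
	buffer := protoBufferPool.Get().(*proto.Buffer)
	defer func() {
		buffer.SetBuf(nil)
		protoBufferPool.Put(buffer)
	}()

	// The capacity guess replaces the expensive proto.Size call.
	buffer.SetBuf(make([]byte, 0, int(atomic.LoadUint32(&lastMarshaledSize))))
	if err := buffer.Marshal(protoMsg); err != nil {
		return nil, err
	}
	out := buffer.Bytes()
	atomic.StoreUint32(&lastMarshaledSize, uint32(len(out)))
	return out, nil
}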

codec_test.go Outdated
}

func BenchmarkProtoCodec10Goroutines(b *testing.B) {
benchmarkProtoCodecConcurrentUsage(10, b)
Reviewer: This can be done using sub benchmarks. See https://golang.org/pkg/testing/#B.Run

Reviewer: Also, for that matter: b.SetParallelism

@apolcyn: done

codec_test.go Outdated
codec := &protoCodec{}
var wg sync.WaitGroup
for i := 0; i < goroutines; i++ {
wg.Add(1)
Reviewer: nit: Add is a synchronous call, which may make the benchmarks noisier than they could be. Consider calling it just once.

@apolcyn: all of this was changed to use the RunParallel framework.
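Putting the suggestions together, the benchmark shape after this change looks roughly like the following (hypothetical benchmark name and message contents, assuming the codec_perf test proto already used in codec_test.go): one b.Run sub-benchmark per parallelism level, b.SetParallelism to scale the goroutine count, and b.RunParallel driving the codec.

package grpc

import (
	"fmt"
	"testing"

	"google.golang.org/grpc/test/codec_perf"
)

func BenchmarkProtoCodecParallel(b *testing.B) {
	for _, p := range []int{1, 10, 100} {
		p := p // capture the loop variable for the sub-benchmark closure
		b.Run(fmt.Sprintf("SetParallelism(%d)", p), func(b *testing.B) {
			codec := protoCodec{}
			protoStruct := &codec_perf.Buffer{Body: []byte{0, 1, 2, 3}}
			b.SetParallelism(p)
			b.RunParallel(func(pb *testing.PB) {
				for pb.Next() {
					if _, err := codec.Marshal(protoStruct); err != nil {
						b.Errorf("protoCodec.Marshal(_) returned an error: %v", err)
					}
				}
			})
		})
	}
}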

codec_test.go Outdated
func fastMarshalAndUnmarshal(protoCodec Codec, protoStruct interface{}, b *testing.B) {
marshalledBytes, err := protoCodec.Marshal(protoStruct)
if err != nil {
b.Fatalf("protoCodec.Marshal(_) returned an error")
Reviewer: Errorf.

@apolcyn: done

codec_test.go Outdated
i += 2
b.Run(fmt.Sprintf("BenchmarkProtoCodec_SetParallelism(%v)", p), func(b *testing.B) {
codec := &protoCodec{}
b.SetParallelism(p)
Reviewer: This is racy, no? p is modified across the loop.

@apolcyn (Feb 14, 2017): A bit stuck: isn't p loop-local? Or at least closed over each iteration?

@apolcyn: Scratch the "closed over" part :) but I thought it should be loop-local. I also moved it inside the inner function.

codec_test.go Outdated
var err error

if m1, err = codec1.Marshal(&proto1); err != nil {
t.Fatalf("protoCodec.Marshal(%v) failed", proto1)
Reviewer: Errorf. If any conditional here doesn't depend on the previous one, it should be Errorf.

@apolcyn: fixed

codec.go Outdated
func (p protoCodec) Unmarshal(data []byte, v interface{}) error {
buffer := protoBufferPool.Get().(*proto.Buffer)
buffer.SetBuf(data)
err := buffer.Unmarshal(v.(proto.Message))
Reviewer: I don't know if it is somehow possible, but Go will check the type every time. You might consider benchmarking without this interface punning to see if it helps at all. If it does, you could modify the code to somehow avoid the type pun.

@apolcyn: I tried this one, by the way, but unfortunately the differences between runs were within the usual noise.

codec.go Outdated
return nil, err
}
out := buffer.Bytes()
atomic.StoreUint32(&lastMarshaledSize, uint32(len(out)))
Reviewer: This will be a contention point, right? If we're marshalling hundreds of thousands of RPCs per second, this will contend heavily.

@apolcyn: The latest update wraps each cached proto.Buffer into a larger struct that also carries a cached size, so atomics are no longer used.
I also tried sync.Pool; benchmark speeds weren't reduced, but the mutex contention profiler seemed to show much higher contended-mutex counts and times with the separate pools than with a single pool (something like 2 minutes vs. 1 minute).

The results I'm getting are a bit noisy and hard to make complete sense of, but the latest updates definitely seem to improve the micro benchmarks over the previous proto.Size calls.
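Taken together, this converges on roughly the following shape (assumed details, consistent with the review snippets and the commit list at the end; per that list, the final version also adds an int32 cap on the cached size and later drops the defer): the pooled object wraps a proto.Buffer together with the last marshaled size, so the capacity guess is per-buffer state and no package-level atomic is needed.

package grpc

import (
	"sync"

	"github.com/golang/protobuf/proto"
)

type protoCodec struct{}

// cachedProtoBuffer bundles a proto.Buffer with the size of the last message
// it marshaled, so the size guess travels with the buffer instead of living
// in a shared atomic.
type cachedProtoBuffer struct {
	lastMarshaledSize uint32
	proto.Buffer
}

var protoBufferPool = &sync.Pool{
	New: func() interface{} {
		return &cachedProtoBuffer{lastMarshaledSize: 16}
	},
}

func (p protoCodec) Marshal(v interface{}) ([]byte, error) {
	cb := protoBufferPool.Get().(*cachedProtoBuffer)
	// Start from the previous message's size rather than calling proto.Size.
	cb.SetBuf(make([]byte, 0, cb.lastMarshaledSize))
	if err := cb.Marshal(v.(proto.Message)); err != nil {
		cb.SetBuf(nil)
		protoBufferPool.Put(cb)
		return nil, err
	}
	out := cb.Bytes()
	cb.lastMarshaledSize = uint32(len(out))
	// Drop the reference to the returned slice before pooling the wrapper.
	cb.SetBuf(nil)
	protoBufferPool.Put(cb)
	return out, nil
}

func (p protoCodec) Unmarshal(data []byte, v interface{}) error {
	cb := protoBufferPool.Get().(*cachedProtoBuffer)
	cb.SetBuf(data)
	err := cb.Unmarshal(v.(proto.Message))
	cb.SetBuf(nil)
	protoBufferPool.Put(cb)
	return err
}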

@apolcyn commented Feb 15, 2017

It seems I ended up needing to go back to not using testing.B.Run(), since it was failing tests here on Go versions <= 1.6.

@apolcyn commented Feb 22, 2017

FTR, the latest update removed the use of defer. Re-running benchmarks shows a small but seemingly noticeable difference, something like 95ns -> 90ns, and removing it is simple.

@apolcyn commented Mar 28, 2017

FTR, the previous Travis run seemed to have hit a flake in TestFlowControlLogicRace on go-1.5.

But Travis is now green: this PR was rebased on master, and go-1.5 builds no longer run.

@apolcyn merged commit 0e8b58d into grpc:master on Apr 13, 2017
dfawley pushed a commit to dfawley/grpc-go that referenced this pull request on Apr 17, 2017 (grpc#1010). The commit message lists the squashed commits:

* use a global sharded pool of proto.Buffer caches in protoCodec

* fix goimports

* make global buffer pool index counter atomic

* hack to remove alloc in encode_len_struct

* remove extra slice alloc in proto codec marshal

* replace magic number for proto size field length with constant

* replace custom cache with sync.Pool

* remove 1 line functions in codec.go and add protoCodec microbenchmarks

* add concurrent usage test for protoCodec

* fix golint, gofmt, goimports checks

* fix issues in codec.go and codec_test.go

* use go parallel benchmark helpers

* replace proto.Codec with a guess of size needed

* update Fatalf -> Errorf in tests

* wrap proto.Buffer along with cached last size into larger struct for pool use

* make wrapped proto buffer only a literal

* fix style and imports

* move b.Run into inner function

* reverse micro benchmark op order to unmarshal-marshal and fix benchmark setup-in-test bug

* add test for large message

* remove use of defer in codec.marshal

* revert recent changes to codec benchmarks

* move sub-benchmarks into >= go-1.7 only file

* add comment for marshaler and tweak benchmark subtests for easier usage

* move build tag for go1.7 on benchmarks to inside file

* move build tag to top of file

* comment Codec, embed proto.Buffer into cached struct and add an int32 cap
@menghanl added the 1.3 and "Type: Performance" (Performance improvements: CPU, network, memory, etc.) labels and removed the "Type: Bug" label on Apr 18, 2017
The lock bot locked this conversation as resolved and limited it to collaborators on Jan 19, 2019