Improve sharding of inequality filters #63

Conversation

MattFaus
Contributor

As described in the commit messages, this improves sharding for jobs that filter on property inequalities. We have seen dramatic performance improvements for mapreduce jobs that do time-based filtering, since the natural peaks and troughs of web traffic make uniform range splits very uneven. Tony's screenshots were lost in our code review system, but they showed the classic "wave pattern" to "straight line" progression.

Tony Liu added 2 commits June 11, 2015 12:12
Summary:
This improves the sharding algorithm when inequality filters are used with DatastoreInputReader. Right now, inequality filters are handled by splitting the property range uniformly, which results in very uneven sharding in many cases.

Note that the datastore limits us to one property on which we can use inequality filters. If we shard via the `__scatter__` property, we implicitly use this up on the keys (an illustration follows the links)! More info:
https://code.google.com/p/appengine-mapreduce/wiki/ScatterPropertyImplementation
https://developers.google.com/appengine/docs/python/datastore/queries#Python_Restrictions_on_queries
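
To make the restriction concrete: scatter-based sharding filters each shard to a key range, which consumes the one allowed inequality property. A minimal db-style illustration (the model, property names, and values are hypothetical, not from this patch):

```python
from google.appengine.ext import db

q = db.Query(MyModel)              # MyModel is a hypothetical kind
q.filter('__key__ >=', start_key)  # key-range sharding uses up the one
q.filter('__key__ <', end_key)     #   allowed inequality property...
q.filter('timestamp >=', cutoff)   # ...so a second inequality property
                                   #   makes the datastore reject the query
```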

For more detail, see the emails here: https://groups.google.com/a/khanacademy.org/forum/#!topic/analytics-blackhole/6_MLbJbL10o

Now, we split the property range into 16 times as many sub-ranges as there are shards (the factor is configurable) and redistribute the work across shards. For instance, if we have 256 shards, this artificially creates 256 * 16 = 4096 ranges and spreads them out. The shards get the following range numbers (see the sketch after the list):

- Shard 0: gets 0, 256, 512, ..., 3840
- Shard 1: gets 1, 257, 513, ..., 3841
- ...
- Shard 255: gets 255, 511, 767, ..., 4095
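
A minimal sketch of the interleaved assignment (not the actual patch; the names and the uniform split over a numeric range are assumptions for illustration):

```python
OVERSAMPLING_FACTOR = 16  # the tunable "16" mentioned above

def split_range(lo, hi, n):
    """Split the numeric range [lo, hi) into n equal sub-ranges."""
    step = (hi - lo) / float(n)
    return [(lo + i * step, lo + (i + 1) * step) for i in range(n)]

def assign_subranges(lo, hi, shard_count):
    """Deal the sub-ranges out round-robin, one list per shard."""
    subranges = split_range(lo, hi, shard_count * OVERSAMPLING_FACTOR)
    return [subranges[i::shard_count] for i in range(shard_count)]

# With 256 shards, assign_subranges(...)[0] holds sub-ranges
# 0, 256, 512, ..., 3840, matching the list above.
```

Because each shard's sub-ranges are spread evenly across the whole property range, busy and quiet regions (e.g. traffic peaks and troughs) average out per shard without ever sampling the data.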

I chose this approach because it is not data-dependent. Other approaches would typically require splitting the ranges non-uniformly, which would in turn require sampling the data. The implementation chains the per-sub-range iterators together and implements the appropriate JSON serialization so shard state can be persisted and resumed. I found this pretty elegant, but I'm open to other ideas! A sketch of the chaining follows.
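
A hedged sketch of the iterator chaining (db-style query API; `make_query`, `prop`, and `subranges` are illustrative names, not from the patch):

```python
import itertools

def iter_shard(make_query, prop, subranges):
    """Yield entities from each (lo, hi) sub-range assigned to one shard."""
    def iter_subrange(lo, hi):
        q = make_query()  # a fresh db.Query for this sub-range
        q.filter('%s >=' % prop, lo)
        q.filter('%s <' % prop, hi)
        return q.run()
    return itertools.chain.from_iterable(
        iter_subrange(lo, hi) for lo, hi in subranges)
```

Serializing such a reader then roughly amounts to recording the remaining sub-ranges plus the position within the current one.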

After lots of debugging, this works locally, i.e. learning gain pipelines produce the same results as before. I'm going to run it on a znd to check that the sharding actually improves.

Test Plan: Run locally (for correctness) and on znd (for sharding)

Reviewers: mattfaus, tom

CC: jascha, eliana, jace, benjaminhaley

Differential Revision: http://phabricator.khanacademy.org/D6147
Summary:
This compresses the JSON-serialized input readers with zlib.

A follow-up to: http://phabricator.khanacademy.org/D6147

Per @tom's comments, I'm going to leave `json_util.py` as-is to minimize changes, since removing the spaces around the `,` and `:` separators doesn't gain us much. Simple enough! A sketch of the compression follows.
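
A minimal sketch of the idea, assuming the reader state serializes to a JSON string (function names are illustrative):

```python
import json
import zlib

def serialize_reader_state(state):
    """Compress the JSON-serialized reader state before persisting it."""
    return zlib.compress(json.dumps(state))

def deserialize_reader_state(blob):
    """Invert serialize_reader_state."""
    return json.loads(zlib.decompress(blob))
```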

Test Plan: Run locally and on znd

Reviewers: tom, mattfaus

Reviewed By: mattfaus

CC: jace, eliana, tom

Differential Revision: http://phabricator.khanacademy.org/D6215
@sophiebits

I found in my email some of the original graphs for these commits:

[image: shard progress graphs]

tkaitchuck added a commit that referenced this pull request on Jun 24, 2015
tkaitchuck merged commit 73d03be into GoogleCloudPlatform:master on Jun 24, 2015