Improve sharding of inequality filters #63

Conversation

MattFaus
Contributor

As described in the commit messages, this improves sharding for jobs that filter on property inequalities. We have seen dramatic performance improvements for mapreduce jobs that do time-based filtering, since the natural peaks and troughs of web traffic make uniform range splits very uneven. Tony's screenshots were lost in our code review system, but they showed the classic "wave pattern" to "straight line" progression.

Tony Liu added 2 commits June 11, 2015 12:12
Summary:
This improves the sharding algorithm when inequality filters are used with DatastoreInputReader. Right now, inequality filters are handled by splitting the property range uniformly, which results in very uneven sharding in many cases.

Note that the datastore limits us to one property on which we can use inequality filters. If we shard via the `__scatter__` property, we implicitly use this up on the keys (an illustration follows the links)! More info:
https://code.google.com/p/appengine-mapreduce/wiki/ScatterPropertyImplementation
https://developers.google.com/appengine/docs/python/datastore/queries#Python_Restrictions_on_queries
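
To make the restriction concrete: scatter-based sharding filters each shard to a key range, which consumes the one allowed inequality property. A minimal db-style illustration (the model, property names, and values are hypothetical, not from this patch):

```python
from google.appengine.ext import db

q = db.Query(MyModel)              # MyModel is a hypothetical kind
q.filter('__key__ >=', start_key)  # key-range sharding uses up the one
q.filter('__key__ <', end_key)     #   allowed inequality property...
q.filter('timestamp >=', cutoff)   # ...so a second inequality property
                                   #   makes the datastore reject the query
```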

For more detail, see the emails here: https://groups.google.com/a/khanacademy.org/forum/#!topic/analytics-blackhole/6_MLbJbL10o

Now, we split the property range into 16 times as many sub-ranges as there are shards (the factor is configurable) and redistribute the work across shards. For instance, if we have 256 shards, this artificially creates 256 * 16 = 4096 ranges and spreads them out. The shards get the following range numbers (see the sketch after the list):

- Shard 0: gets 0, 256, 512, ..., 3840
- Shard 1: gets 1, 257, 513, ..., 3841
- ...
- Shard 255: gets 255, 511, 767, ..., 4095
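
A minimal sketch of the interleaved assignment (not the actual patch; the names and the uniform split over a numeric range are assumptions for illustration):

```python
OVERSAMPLING_FACTOR = 16  # the tunable "16" mentioned above

def split_range(lo, hi, n):
    """Split the numeric range [lo, hi) into n equal sub-ranges."""
    step = (hi - lo) / float(n)
    return [(lo + i * step, lo + (i + 1) * step) for i in range(n)]

def assign_subranges(lo, hi, shard_count):
    """Deal the sub-ranges out round-robin, one list per shard."""
    subranges = split_range(lo, hi, shard_count * OVERSAMPLING_FACTOR)
    return [subranges[i::shard_count] for i in range(shard_count)]

# With 256 shards, assign_subranges(...)[0] holds sub-ranges
# 0, 256, 512, ..., 3840, matching the list above.
```

Because each shard's sub-ranges are spread evenly across the whole property range, busy and quiet regions (e.g. traffic peaks and troughs) average out per shard without ever sampling the data.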

I chose this approach because it is not data-dependent. Other approaches would typically require splitting the ranges non-uniformly, which would in turn require sampling the data. The implementation chains the per-sub-range iterators together and implements the appropriate JSON serialization so shard state can be persisted and resumed. I found this pretty elegant, but I'm open to other ideas! A sketch of the chaining follows.
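
A hedged sketch of the iterator chaining (db-style query API; `make_query`, `prop`, and `subranges` are illustrative names, not from the patch):

```python
import itertools

def iter_shard(make_query, prop, subranges):
    """Yield entities from each (lo, hi) sub-range assigned to one shard."""
    def iter_subrange(lo, hi):
        q = make_query()  # a fresh db.Query for this sub-range
        q.filter('%s >=' % prop, lo)
        q.filter('%s <' % prop, hi)
        return q.run()
    return itertools.chain.from_iterable(
        iter_subrange(lo, hi) for lo, hi in subranges)
```

Serializing such a reader then roughly amounts to recording the remaining sub-ranges plus the position within the current one.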

After lots of debugging, this works locally, i.e. learning gain pipelines produce the same results as before. I'm going to run it on a znd to check that the sharding actually improves.

Test Plan: Run locally (for correctness) and on znd (for sharding)

Reviewers: mattfaus, tom

CC: jascha, eliana, jace, benjaminhaley

Differential Revision: http://phabricator.khanacademy.org/D6147
Summary:
This compresses the JSON-serialized input readers with zlib.

A follow-up to: http://phabricator.khanacademy.org/D6147

Per @tom's comments, I'm going to leave `json_util.py` as-is to minimize changes, since removing the spaces around the `,` and `:` separators doesn't gain us much. Simple enough! A sketch of the compression follows.
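
A minimal sketch of the idea, assuming the reader state serializes to a JSON string (function names are illustrative):

```python
import json
import zlib

def serialize_reader_state(state):
    """Compress the JSON-serialized reader state before persisting it."""
    return zlib.compress(json.dumps(state))

def deserialize_reader_state(blob):
    """Invert serialize_reader_state."""
    return json.loads(zlib.decompress(blob))
```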

Test Plan: Run locally and on znd

Reviewers: tom, mattfaus

Reviewed By: mattfaus

CC: jace, eliana, tom

Differential Revision: http://phabricator.khanacademy.org/D6215
@sophiebits

I found in my email some of the original graphs for these commits:

[image: shard progress graphs]

tkaitchuck added a commit that referenced this pull request on Jun 24, 2015
tkaitchuck merged commit 73d03be into GoogleCloudPlatform:master on Jun 24, 2015