Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PLAY-2375] -- Sanitize Storage URIs #54

Merged
merged 27 commits into from
Apr 9, 2020

Conversation

shroudofthurin
Copy link
Contributor

Add the ability for each backend to sanitize it's URI in such a way that is most appropriate for the way it encodes secret data.

@ustudio/reviewers for your consideration.

Copy link
Member

@spiralman spiralman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple things. Also, I only happened to remember the download_url_key query parameter in the Swift/CloudFiles backends. There might be other query parameters or URI components that are sensitive in those or other implementations. We should read through the implementations and ensure we have all sensitive data (including optional things) removed in the sanitized URL for each one.

Also, when we finish this, we need to remember to back-port it to the 0.12 branch for legacy Python 2 projects.

parsed_url = parsed_uri._replace(
netloc=":".join((parsed_uri.hostname, str(parsed_uri.port))))
else:
parsed_url = parsed_uri._replace(netloc=parsed_uri.hostname)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might be clearer, and less redundant, if it were rewritten along the lines of:

new_netloc = parsed_uri.hostname

if parsed_uri.port is not None:
    new_netloc = ":".join((new_netloc, str(parsed_uri.port)))

# _replace, etc...

tests/test_swift_storage.py Show resolved Hide resolved
tests/test_cloudfiles_storage.py Show resolved Hide resolved
@@ -129,6 +129,10 @@ def get_download_url(self, seconds: int = 60, key: Optional[str] = None) -> str:
raise NotImplementedError(
"{} does not implement 'get_download_url'".format(self._class_name()))

def get_sanitized_uri(self) -> str:
raise NotImplementedError(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure how I feel about this, but, since almost all of the implementations ended up being the same, should we put a default implementation for this in the base class?

One danger with doing that is that it, if a particular backend has sensitive data elsewhere in the URL (besides the netloc) and the implementor forgets to override this implementation, it's possible that the default sanitized URL could return sensitive data.

One solution to that would be to have the default implementation be overly restrictive (say, only returning the scheme, hostname, port and path, no query parameters, etc), and then have implementations override it which want to include extra values in the URL.

Again, I'm on the fence about which of these options is best (including just leaving it as it's currently implemented). Perhaps @tjensen or @joshmarshall has an opinion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@spiralman @shroudofthurin I think I'm okay with the default implementation being in the base class and even making it overly restrictive.

One thing we could do is have the base class take an optional whitelist of identifiers indicating URI parts that are safe to include in the sanitized URI. Then, subclasses that want to include the query parameters and username can tell the base to include them. e.g.

class DerivedStorage(Storage):
    def __init__(self):
         super().__init__(sanitize_whitelist=[USERNAME, QUERY])

...could result in storage URIs that look like: proto://username@netloc/path?query=parameters

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tjensen @spiralman @shroudofthurin While I agree that the current implementations are very similar, this is explicit boilerplate I believe to be a Good Thing[tm], since it's such an important consideration for any project using storage URIs. Thus my preference would be that it should raise, as it does in the code now (although I might prefer renaming the "sanitized_uri" helper to "sanitize_basic_auth" or something more specific.) If we do use a default behavior instead of raising, I recommend that we at least log a warning when a storage class with the "default" sterilization behavior is imported / registered.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely agree with renaming the sanitize_uri helper to something more explicit (basic_auth feels a little confusing, because we're not actually using it for HTTP Basic Auth). The spec calls it "user information", so maybe remove_user_info?

Anyway, I'm 50/50 on whether there should be a default implementation or it should raise, so it seems like we have a hung jury. :-) I guess I'd recommend @shroudofthurin make the deciding vote, since it's her PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@spiralman I think I will go with your original solution:

One solution to that would be to have the default implementation be overly restrictive (say, only returning the scheme, hostname, port and path, no query parameters, etc), and then have implementations override it which want to include extra values in the URL.

@shroudofthurin
Copy link
Contributor Author

@spiralman I think I've addressed all of your comments. Would you mind taking another look at this PR when you get the chance? Thanks 🍡

[PLAY-2375]
Copy link
Member

@spiralman spiralman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple things.


new_query = dict(parse_qsl(parsed_uri.query))

if "download_url_key" in new_query:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is specific to Swift/CloudFiles, and should be in those implementations, not here.

@@ -267,3 +267,20 @@ def test_local_storage_get_download_url_returns_none_on_empty_base(self) -> None

def test_local_storage_rejects_multiple_query_values_for_download_url_key_setting(self) -> None:
self.assert_rejects_multiple_query_values("/foo/bar/object.mp4", "download_url_base")

def test_local_storage_get_sanitized_uri(self) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test name should indicate what the expected behavior is. In this case, there's nothing "secret" in a local file:// URI, so there is nothing to remove.

return new_uri.geturl()


def sanitize_filepath(parsed_uri: ParseResult) -> str:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused about the behvior of this function. It seems like it removes the username/password and all query parameters from the URL. However, that doesn't really match what the name says (I would expect it to remove sensitive information from the path part of the URI, based on the name).

@shroudofthurin
Copy link
Contributor Author

@spiralman back to you

spiralman
spiralman previously approved these changes Apr 7, 2020
Copy link
Member

@spiralman spiralman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@spiralman
Copy link
Member

@shroudofthurin I realized that this is missing a version number bump. Normally, we bump the minor version, for a new feature, since this is still pre-1.0. That'd make the next version 0.14.0, but if you and @joshmarshall prefer to bump the bugfix version, that's fine with me (SemVer is a little goofy before 1.0).

I'm not making this a blocking comment so that you can get it merged during the day in your timezone.

@tjensen
Copy link
Member

tjensen commented Apr 8, 2020

@spiralman If we bump the master version from 0.13.3 to 0.14, what should we bump the Python 2-only v0.12 branch version to?

@spiralman
Copy link
Member

@tjensen Yeah, that's where it's kinda goofy. My feeling is only bump the bugfix number in the 0.12 branch, but bump the minor version in master: since Python 2 is no longer officially supported, I think we'll need to maintain the 0.12 branch on an "as needed" basis, which means SemVer won't work out very well. Given that usage of it should drop off, hopefully that's OK?

I'm fine with other ideas for bumping the version. Since we haven't ever moved to 1.0, SemVer is somewhat hazy, so I suspect it mostly just matters that we bump it to a bigger number.

Copy link
Contributor

@joshustudio joshustudio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shroudofthurin Just a few remaining documentation questions!

#### `get_sanitized_uri()` ####

Removes the username/password, as well as all query paramters, form the URL.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"parameters"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to add documentation for remove_user_info() too, since other users can create and register their own storage providers?

README.md Show resolved Hide resolved
…ble documentation, included the latest Python 2.7 release as well, as per reviewer's request.

[PLAY-2375]
@shroudofthurin
Copy link
Contributor Author

@joshustudio I've made the requested changes 🍡

@shroudofthurin
Copy link
Contributor Author

@joshustudio Added documentation for the url_parser module. Back to you.

Copy link
Contributor

@joshustudio joshustudio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@shroudofthurin shroudofthurin merged commit 28349fb into master Apr 9, 2020
@shroudofthurin shroudofthurin deleted the play-2375-sanitize-storage-uris branch April 9, 2020 07:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants