Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for sharing state between pathlib subclasses #100479

Closed
barneygale opened this issue Dec 23, 2022 · 10 comments
Closed

Support for sharing state between pathlib subclasses #100479

barneygale opened this issue Dec 23, 2022 · 10 comments
Labels
topic-pathlib type-feature A feature request or enhancement

Comments

@barneygale
Copy link
Contributor

barneygale commented Dec 23, 2022

Feature or enhancement

This enhancement proposes that we allow state to be shared between related instances of subclasses of pathlib.PurePath and pathlib.Path.

Pitch

Now that #68320 is resolved, users can subclass pathlib.PurePath and pathlib.Path directly:

import pathlib

class MyPath(pathlib.Path):
    def my_custom_method(self):
        pass

etc = MyPath('/etc')
etc_hosts = etc / 'hosts'
etc_hosts.my_custom_method()

However, some user implementations of classes - such as TarPath or S3Path - would require underlying tarfile.TarFile or botocore.Resource objects to be shared between path objects (etc and etc_hosts in the example above). Such sharing of resources is presently rather difficult, as there's no single instance method used to derive new path objects.

This feature request proposes that we add a new PurePath.makepath() method, which is called whenever one path object is derived from another, such as in joinpath(), iterdir(), etc. The default implementation of this method looks something like:

def makepath(self, *args):
    return type(self)(*args)

Users may redefine this method in a subclass, in conjunction with a customized initializer:

class SessionPath(pathlib.Path):
    def __init__(self, *args, session_id):
        super().__init__(*args)
        self.session_id = session_id

    def makepath(self, *args):
        return type(self)(*args, session_id=self.session_id)

etc = SessionPath('/etc', session_id=42)
etc_hosts = etc / 'hosts'
print(etc_hosts.session_id)  # 42

I propose the name "makepath" for this method due to its close relationship with the existing "joinpath" method: a.joinpath(b) == a.makepath(a, b).

Performance

edit: this change has been taken care of elsewhere, and so implementing this feature request should no longer have much affect on performance.

This change will affect the performance of some pathlib operations, because it requires us to remove the _from_parsed_parts() constructor, which is an internal optimization used in cases where path parsing and normalization can be skipped (for example, in the parents sequence). I suggest that, within the standard library, pathlib is not a particularly performance-sensitive module -- few folks reach to pathlib for reason of speed. Within pathlib itself, the savings from optimizing these "pure" methods are usually drowned out by the I/O costs of "impure" methods. With the appeal of this feature in mind, I believe the performance cost is justified.

However, if the performance degradation is considered unacceptable, there's a possible alternative: add a normalize keyword argument to the path initializer and to makepath(). This would require some serious internal surgery to make work, and might be difficult to communicate to users, so at this stage it's not my preferred route forward.

Previous discussion

https://discuss.python.org/t/make-pathlib-extensible/3428/47 (and replies)

Linked PRs

@barneygale barneygale added the type-feature A feature request or enhancement label Dec 23, 2022
@barneygale
Copy link
Contributor Author

This will be needed to implement zipfile.Path using pathlib. Observe that it already has a _next() method that works exactly like what I'm proposing here:

def _next(self, at):
return self.__class__(self.root, at)

@AlexWaygood
Copy link
Member

AlexWaygood commented Dec 24, 2022

Exactly how big would the performance regression be? Have you tried running some benchmarks?

@barneygale
Copy link
Contributor Author

Not yet; I'll do this before I mark the PR as 'ready' and share the numbers.

@AlexWaygood
Copy link
Member

My instinct is to say that it would be a shame to make pathlib performance worse, since it's already got a reputation in some quarters for having surprisingly bad performance in some ways compared to os.path. But it's hard to evaluate without knowing how big the performance regression would be, and in what situations you'd hit the regression :)

few folks reach to pathlib for reason of speed.

In my opinion, we should be trying to fix this rather than exacerbating the problem :)

@barneygale
Copy link
Contributor Author

barneygale commented Dec 24, 2022

Absolutely fair!

Do you know of particular cases of (or complaints about) pathlib being slow? I'm aware that recursive globbing in pathlib can be slow, but not much else. We're increasingly calling os.path functions, some of which have C implementations, which should improve performance a little.

I'm of course keen for pathlib to be as fast as possible, but I don't want to prevent the future addition of an AbstractPath class either.

@AlexWaygood
Copy link
Member

AlexWaygood commented Dec 24, 2022

Do you know of particular cases of (or complaints about) pathlib being slow?

I have no idea how many of these complaints, if any, still apply to the current implementation of pathlib, nor how many of these are realistically resolvable :)

@barneygale
Copy link
Contributor Author

Mega, thank you. I'll study these carefully and write benchmarks.

Some initial thoughts:

  • The performance cost of constructing Path objects (vs constructing strings) stands out. This would surely benefit from an optimization pass, though it's unlikely we'll ever approach str construction performance.
  • The change to the parents implementation in GH-100479: Add optional blueprint argument to pathlib.PurePath #100481 may improve performance for some of these - I'll check.
  • A C implementation of pathlib would be wonderful but I suspect quite difficult! C implementations of select os.path functions would also help.
  • Caching stat() results would break user code, but there's a possible alternative: we could introduce something like a status() method that returns a new object with exists, is_dir, etc, attributes. Users could call path.status() once and check its attributes, rather than calling path.is_dir(), path.is_symlink(), etc, each of which would trigger their own stat() call.

@barneygale
Copy link
Contributor Author

Hey @AlexWaygood - I've addressed the performance aspects of this elsewhere (#101664, #101665, #101667, #102789), so addressing this feature request is pretty much performance-neutral. Would you be up for reviewing #100481? Thank you!

@AlexWaygood
Copy link
Member

I'll try to take a look soon!

barneygale added a commit to barneygale/cpython that referenced this issue Apr 9, 2023
barneygale added a commit to barneygale/cpython that referenced this issue Apr 12, 2023
barneygale added a commit to barneygale/cpython that referenced this issue Apr 13, 2023
barneygale added a commit to barneygale/cpython that referenced this issue May 2, 2023
barneygale added a commit to barneygale/cpython that referenced this issue May 2, 2023
barneygale added a commit to barneygale/cpython that referenced this issue May 5, 2023
barneygale added a commit that referenced this issue May 5, 2023
Add `pathlib.PurePath.with_segments()`, which creates a path object from arguments. This method is called whenever a derivative path is created, such as from `pathlib.PurePath.parent`. Subclasses may override this method to share information between path objects.

Co-authored-by: Alex Waygood <[email protected]>
@barneygale
Copy link
Contributor Author

Implemented in 3.12: #103975 / d00d942

barneygale added a commit to barneygale/cpython that referenced this issue May 5, 2023
carljm added a commit to carljm/cpython that referenced this issue May 5, 2023
* main:
  pythongh-99113: Add PyInterpreterConfig.own_gil (pythongh-104204)
  pythongh-104146: Remove unused var 'parser_body_declarations' from clinic.py (python#104214)
  pythongh-99113: Add Py_MOD_PER_INTERPRETER_GIL_SUPPORTED (pythongh-104205)
  pythongh-104108: Add the Py_mod_multiple_interpreters Module Def Slot (pythongh-104148)
  pythongh-99113: Share the GIL via PyInterpreterState.ceval.gil (pythongh-104203)
  pythonGH-100479: Add `pathlib.PurePath.with_segments()` (pythonGH-103975)
  pythongh-69152: Add _proxy_response_headers attribute to HTTPConnection (python#26152)
  pythongh-103533: Use PEP 669 APIs for cprofile (pythonGH-103534)
  pythonGH-96803: Add three C-API functions to make _PyInterpreterFrame less opaque for users of PEP 523. (pythonGH-96849)
jbower-fb pushed a commit to jbower-fb/cpython-jbowerfb that referenced this issue May 8, 2023
)

Add `pathlib.PurePath.with_segments()`, which creates a path object from arguments. This method is called whenever a derivative path is created, such as from `pathlib.PurePath.parent`. Subclasses may override this method to share information between path objects.

Co-authored-by: Alex Waygood <[email protected]>
jbower-fb pushed a commit to jbower-fb/cpython-jbowerfb that referenced this issue May 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-pathlib type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

2 participants