the earthaccess.EarthAccessFile wrapper need not subclass anything #620

Open
wants to merge 2 commits into
base: main
from


itcarroll
Collaborator

@itcarroll itcarroll commented Jun 26, 2024

Fixes #610, closes #563.

This PR removes any base class from the definition of earthaccess.EarthAccessFile (EAF). Previously, EAF inherited from fsspec.spec.AbstractBufferedFile (ABF), so it could use methods defined on ABF. But EAF also held an instance of an ABF at self.f and handed off __getattr__ requests to that object. Under this setup, self.read resolves to the inherited ABF.read whenever read is defined on ABF (and it is); only otherwise does it fall through to self.f.read. That is a bug. It was probably assumed that __getattr__ would catch all method calls, but it only handles what __getattribute__ can't find.

We've scraped by with this setup because self.f is also an ABF, and for any called method it either does not override ABF or its override does little more than call super() itself. The latter is the case for self.f.read when f is a fsspec.implementations.http.HTTPFile. It is not the case when f is a fsspec.implementations.http.HTTPStreamFile.
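For anyone who wants to see the lookup order in isolation, here is a minimal sketch. All four classes are hypothetical stand-ins (they are not the fsspec or earthaccess classes); the only thing it relies on is standard Python behavior: __getattr__ runs only after normal attribute lookup, including inherited methods, has failed.

```python
class Base:
    """Hypothetical stand-in for a base file class such as ABF."""

    def read(self):
        return "Base.read"


class Streaming(Base):
    """Hypothetical stand-in for a subclass whose read really differs."""

    def read(self):
        return "Streaming.read"

    def only_here(self):
        return "Streaming.only_here"


class InheritingWrapper(Base):
    """Wrapper that subclasses Base and also delegates via __getattr__."""

    def __init__(self, f):
        self.f = f

    def __getattr__(self, name):
        # Reached only when normal lookup (instance, class, bases) fails.
        return getattr(self.f, name)


class PlainWrapper:
    """Wrapper with no base class: every missing attribute goes to self.f."""

    def __init__(self, f):
        self.f = f

    def __getattr__(self, name):
        return getattr(self.f, name)


inner = Streaming()
buggy = InheritingWrapper(inner)
fixed = PlainWrapper(inner)

print(buggy.read())       # inherited Base.read wins; inner.read is never called
print(fixed.read())       # delegated to inner.read
print(buggy.only_here())  # not defined on Base, so __getattr__ delegates
```

In the inheriting wrapper the wrapped object's override is silently bypassed for any method the base class defines, which is the situation described above for a wrapped HTTPStreamFile.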

This PR also updates some type hints and relevant documentation.

  • the type hint on f was wrong: it is an ABF, not a fsspec.AbstractFileSystem
  • type hints previously given as ABF are now given as EAF (because EAF is no longer an instance of ABF)
  • the EAF is added to mkdocs under Modules/Store

ToDo if integration tests look okay:

  • add to changelog
  • check for any changes needed to docs

📚 Documentation preview 📚: https://earthaccess--620.org.readthedocs.build/en/620/

@mfisher87
Member

I need to play with this a little bit to better understand what's going on, but I may not have time until the next hack day.

@mfisher87 mfisher87 added the hackathon An issue we'd like to visit during a hackathon event label Jul 9, 2024
@mfisher87 mfisher87 changed the title the earthaccess.EarthAccessFile wrapper need not subclass anything the earthaccess.EarthAccessFile wrapper need not subclass anything Jul 9, 2024
@chuckwondo
Collaborator

> I need to play with this a little bit to better understand what's going on, but I may not have time until the next hack day.

@mfisher87, have you had a chance to do this?

@itcarroll or @betolink, I suppose the larger question for me is, what's the point of this class to begin with? Why do we even need it?

@itcarroll
Collaborator Author

Thanks for checking on this one @chuckwondo! My guess on the need for this class was something to do with deserializing into a useable object in case authentication timed out. A guess only though, as I don't know when EarthAccessFile.__reduce__ would be called.

@betolink
Member

betolink commented Sep 3, 2024

@chuckwondo

> I suppose the larger question for me is, what's the point of this class to begin with? Why do we even need it?

@jrbourbeau can explain in detail, but the gist of it is that this class allows a serialization trick: if we open granules from our laptop but offload an xarray operation to a Dask cluster in us-west-2, it will re-open the files in place using the s3:// URL instead of the https://cloud-front-tea URL. I think James mentioned that the speed improvement was ~2x vs. only using HTTPS.

@itcarroll
Collaborator Author

@jrbourbeau The essential question is whether removing fsspec.spec.AbstractBufferedFile from the MRO of EarthAccessFile will break the usage @betolink describes. I don't think we have a test using coiled. (We still have the file-like instance attached as the f attribute and used in the __getattr__ redirection.)
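As a rough illustration of the question, here is a sketch of a `__reduce__`-based re-open on a wrapper that has no base class. Everything in it is an assumption made for the example: `FakeFile`, `Wrapper`, `LINKS`, `ENV`, `ship`, and the signature given to `make_instance` are hypothetical and are not the earthaccess implementation. It only shows that pickle honors `__reduce__` on a plain class, so the trick itself does not depend on the wrapper's MRO; it says nothing about what Dask or coiled may additionally expect.

```python
import pickle

# Hypothetical links for one granule, and a flag standing in for
# "this process is running in-region".
LINKS = {
    "https": "https://example.invalid/granule.nc",
    "s3": "s3://example-bucket/granule.nc",
}
ENV = {"in_region": False}


class FakeFile:
    """Hypothetical file-like object that remembers the URL it was opened from."""

    def __init__(self, url):
        self.url = url

    def read(self):
        return f"bytes from {self.url}"


def make_instance(cls, links):
    # Runs on the unpickling side, so the link is chosen where the
    # object lands, not where it was created.
    url = links["s3"] if ENV["in_region"] else links["https"]
    return cls(FakeFile(url), links)


class Wrapper:
    """Wrapper with no base class; delegates everything else to self.f."""

    def __init__(self, f, links):
        self.f = f
        self.links = links

    def __getattr__(self, name):
        return getattr(self.f, name)

    def __reduce__(self):
        # Pickle stores only "call make_instance with these arguments",
        # never the open file itself.
        return make_instance, (type(self), self.links)


def ship(obj, in_region):
    """Simulate sending obj to another environment by pickling it."""
    payload = pickle.dumps(obj)
    ENV["in_region"] = in_region
    return pickle.loads(payload)


local = Wrapper(FakeFile(LINKS["https"]), LINKS)
remote = ship(local, in_region=True)
print(local.read())
print(remote.read())
```

Whether the real round trip survives the change would still need an integration test against an actual cluster.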

@chuckwondo
Collaborator

chuckwondo commented Sep 3, 2024

> I suppose the larger question for me is, what's the point of this class to begin with? Why do we even need it?

> @jrbourbeau can explain in detail, but the gist of it is that this class allows a serialization trick: if we open granules from our laptop but offload an xarray operation to a Dask cluster in us-west-2, it will re-open the files in place using the s3:// URL instead of the https://cloud-front-tea URL. I think James mentioned that the speed improvement was ~2x vs. only using HTTPS.

I didn't notice the make_instance function earlier, but now that I've looked at it, your description is clearer to me.

However, I'm questioning the whole idea of pickling the fsspec.spec.AbstractBufferedFile to begin with. Opening files from your laptop, then distributing the open files across a cluster seems like you may be inviting more trouble than it's worth. Pickling potentially oddly complex objects to distribute them to other processes can be more problematic than passing around simpler things, such as the URLs from which the files were opened, rather than the opened files themselves.

I may very well be thinking about this poorly, or simply be too inexperienced with using dask (and the like), but without seeing an example of the types of things you think this would be helpful for, offhand I would ask you why you're not simply distributing the URLs rather than the open files?

Is it so that you don't have to also potentially distribute credentials across such clusters as well?

@mfisher87
Member

> have you had a chance to do this?

I have not yet, sorry :X

Labels
hackathon An issue we'd like to visit during a hackathon event
4 participants