Support str, path object or file-like object on file read #302

Open
hsuominen opened this issue Aug 16, 2024 · 5 comments

Comments

@hsuominen

Describe the functionality you would like to see.

For a number of applications it would be preferable if file reading supported file-like objects as well as strings or paths.

@ericpre
Member

ericpre commented Aug 16, 2024

Can you elaborate on the use case please? What type of format are you thinking of?

@hsuominen
Author

Appreciate the quick response. Basically I'm hoping that rosettasciio would support a similar interface to e.g. pandas or imageio:

https://imageio.readthedocs.io/en/stable/_autosummary/imageio.v3.imread.html#imageio.v3.imread
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv

This would enable smoother use in distributed applications where the file is loaded without access to the filesystem on which it is stored, and is instead passed in as a file-like object (definition copied from the pandas docs):

By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.
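As a rough illustration of the requested interface, a reader could normalize its input with a small helper along these lines. This is only a sketch: `open_binary` and `read_header` are hypothetical names, not part of rosettasciio, and the "has a `read()` method" check simply follows the pandas definition quoted above.

```python
import io
import os
from contextlib import contextmanager


@contextmanager
def open_binary(source):
    """Yield a binary file handle for a str, path object, or file-like object.

    Hypothetical helper, not an existing rosettasciio function.
    """
    if isinstance(source, (str, os.PathLike)):
        # Path-like input: open the file ourselves and close it afterwards.
        with open(source, "rb") as handle:
            yield handle
    elif hasattr(source, "read"):
        # File-like input: the caller owns the object, so it is not closed here.
        yield source
    else:
        raise TypeError(f"Unsupported source type: {type(source).__name__}")


def read_header(source, size=4):
    """Hypothetical reader: return the first `size` bytes of the source."""
    with open_binary(source) as handle:
        return handle.read(size)


# Usage: the same reader accepts an in-memory buffer or a filename.
print(read_header(io.BytesIO(b"ABCDEFG")))
```

Leaving a caller-supplied object open mirrors how the built-in `open` handle would be managed by whoever created it; formats that need random access would additionally require the object to support `seek()`.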

@ericpre
Member

ericpre commented Aug 17, 2024

Off the top of my head, there may already be a few formats that can do that, but I suspect that rosettasciio supports a wider variety of file types than imageio and pandas, and it may behave differently depending on the type.

Here is a list of the different types of files

There should be some low-hanging fruit, as it should be easy to implement for some types.

@CSSFrancis
Member

@hsuominen is the idea that you are loading data that isn't on the computer doing the operation?

I think zarr might be a good place to start. https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.LRUStoreCache

This store implementation puts an LRU cache in front of another store, such as an S3 bucket, which might be interesting if AWS is hosting the data.

@hsuominen
Author

hsuominen commented Aug 20, 2024

> @hsuominen is the idea that you are loading data that isn't on the computer doing the operation?

yes that's right.

Our intent is to get the data out of proprietary formats and into e.g. zarr (which looks great), but we need to run this extraction on compute that doesn't have the files sitting locally. There are fairly easy workarounds (e.g. using a TempFile) but thought it would be good to get this discussion going as I can see others eventually running into similar needs.
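For reference, the temporary-file workaround mentioned above might look roughly like this. It is a sketch under assumptions: `read_via_tempfile` is a hypothetical helper, and `reader` stands for any callable that takes a filename, such as a format-specific file reader.

```python
import os
import shutil
import tempfile


def read_via_tempfile(fileobj, reader, suffix=""):
    """Copy a file-like object to a temporary file and call reader(path).

    Hypothetical helper: `reader` is any callable that accepts a filename.
    """
    # delete=False so the file can be reopened by name on every platform.
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        with tmp:
            # Stream the contents to disk without loading them all in memory.
            shutil.copyfileobj(fileobj, tmp)
        return reader(tmp.name)
    finally:
        # Clean up once the reader has returned.
        os.remove(tmp.name)
```

The `suffix` argument matters when the reader picks the format from the file extension. Because the temporary file is deleted as soon as `reader` returns, this only works for readers that load the data eagerly; a lazy reader would need the file to outlive the call.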

Looking specifically at some of the file formats we are interested in, the changes needed in some cases would be pretty trivial (as @ericpre hinted):

with open(filename, "rb") as f:
    dm = DigitalMicrographReader(f)

but likely harder in others:

file = h5py.File(filename, "r")
dictionaries = []
try:
