Hello from fsspec #579

Closed
martindurant opened this issue Jan 21, 2021 · 21 comments
Comments

@martindurant

https://filesystem-spec.readthedocs.io/en/latest/ and related packages seems to cover much of the same ground as this repo. I don't know how I didn't come across it before!

There is probably scope to share and make each-others' libraries better. While I have a look at what is here, I would kindly ask anyone interested to look over fsspec. A few key features I would point out:

  • a wider range of backends (s3, gcs, azure, ftp, local, memory, ssh/sftp, hdfs, http, webhdfs, zip, jupyter, github, git, databricks, dask-distributed, dropbox, google-drive, smb, archive, dvc)
  • serialisable filesystem and file instances, which can therefore be distributed across processes and machines
  • concurrent (async) operations on select backends
  • caching of file-listings, file contents (in memory) and whole or parts of files (on disc)
  • chainable URL inference (extreme example: access the contents of a zip file on s3, where only a remote dask worker has credentials to view, and cache them locally)

fsspec is being used by some high-profile projects such as dask and pandas.
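The chained-URL feature in the last bullet can be sketched end-to-end without any cloud credentials by layering a zip filesystem on fsspec's in-memory filesystem (the archive name and member path here are made up for illustration):

```python
import io
import zipfile

import fsspec

# Build a small zip archive inside fsspec's in-memory filesystem, so the
# example needs no cloud credentials (archive and member names are made up).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("inner.txt", "hello from inside the zip")

mem = fsspec.filesystem("memory")
with mem.open("/archive.zip", "wb") as f:
    f.write(buf.getvalue())

# Chained URL: each "::"-separated segment layers one filesystem on top of
# the previous one -- here a zip filesystem over the memory filesystem.
with fsspec.open("zip://inner.txt::memory://archive.zip", "rb") as f:
    data = f.read().decode()

print(data)
```

The same chaining pattern extends to remote stores, e.g. `zip://...::s3://...`, with caching layers such as `filecache::` prepended.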

@jdonaldson

There's not much in the way of docs over at fsspec yet. It would be great to have a way to specify a file or range of files from a location via a (pseudo) glob, and just have them stream into a local instance (ML training). Supporting a shared pseudo glob format (e.g. s3://some_bucket/*.csv) would be very useful. Right now, we're rolling our own on that front.

@martindurant
Author

There's not much in the way of docs over at fsspec yet.

The great majority of users never see fsspec directly, so the docs, such as they are, are more dev-oriented.

I wasn't entirely sure what your feature request is, but you have the following two options:

fs = fsspec.filesystem("s3", ...)
fs.get("s3://bucket/file*glob/", "/local/dir/")

to copy files to local (this works in batch/concurrently) or

ofs = fsspec.open_files("s3://bucket/file*glob", "rb", ...)

to create OpenFile instances to be used in a with context, where they become normal file-like objects (optional extra arguments here for text mode, compression, caching).
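A minimal runnable sketch of the second pattern, using the in-memory filesystem as a stand-in for S3 (bucket and file names are invented):

```python
import fsspec

# Stage two files in fsspec's in-memory filesystem as a stand-in for S3.
mem = fsspec.filesystem("memory")
for name in ("a", "b"):
    with mem.open(f"/bucket/file-{name}.csv", "wt") as f:
        f.write("col1,col2\n1,2\n")

# open_files expands the glob into OpenFile instances; each one becomes a
# normal file-like object inside a `with` block.
ofs = fsspec.open_files("memory://bucket/file-*.csv", mode="rt")
contents = []
for of in ofs:
    with of as f:
        contents.append(f.read())
```

The same call with `"s3://bucket/file-*.csv"` would work against real S3, given credentials.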

@jdonaldson

Ok, this is pretty nice. Thanks for the tips here!

I might humbly suggest to sell this library a bit better via docs. My impression was that fsspec was more of an internal library (and as such likely to change without much notice).

@martindurant
Author

Yes, indeed, that is how it has mostly been; but actually that makes it less likely to change, as it would break APIs in those other libraries.

I am the worst at keeping docs up to date and complete. I'll put it onto my list...

@jdonaldson

Well, many people understand internal libraries as "this code could change/break suddenly since it is subservient to another project".

This really isn't a "big" doc issue; there are already good API docs for fsspec. Just add a few blurbs to the readme describing the basic use cases you're handling with fsspec. Talk about where the library came from and where it's going. If you're nice, add some links back to smart_open under a list of alternative libraries, etc.

@jdonaldson

You've already done most of this, it's just buried in a subheader in the api docs : https://filesystem-spec.readthedocs.io/en/latest/features.html

@martindurant
Author

I opened an issue to reformat the docs, as many use cases ended up on the features page, which has grown steadily with time.

@martindurant
Author

If you're nice, add some links back to smart_open under a list of alternative libraries, etc.

The initial reason for this issue was to see if there was a chance to work together, reduce duplication or confusion for users wishing to pick a library.

It should be pointed out that fsspec's origins were explicitly as a layer for Dask, so we have some unique concerns, particularly around serialising file-system objects and open files (or OpenFiles).

@martindurant
Author

Sorry for not having tackled this yet - I was reminded to do so by the discussion at https://news.ycombinator.com/item?id=27523377#27523893

@martindurant
Author

@isidentical , it occurs to me that your team might be in a better position than me to write the brief overview text discussed above, if you have the time and appetite.

@jdonaldson

Thanks, I've mentioned this library to other colleagues, and they're put off by the lack of a "friendly" landing page that everyone seems to do these days.

@martindurant
Author

fsspec/filesystem_spec#674

@mpenkov mpenkov closed this as completed Apr 24, 2022
@mxmlnkn

mxmlnkn commented Apr 6, 2024

Hi there,

I wanted to give a quick greeting without opening a new issue, because I feel like there is also some overlap with ratarmount, or rather its library backend ratarmountcore. I started it motivated by the lack of performance of archivemount with TAR files. Therefore, I'd see its focus and advantages in access performance, enabled by the persisting SQLite index and by custom backends such as indexed_bzip2 and rapidgzip, which provide parallelized decompression and constant-time random access to compressed streams. Both fsspec and smart_open might also benefit from these backends.

Ratarmount is or was rather narrowly focused on tar support, hence the name, but over time support for other archives was added based on user requests. I'm also very close to finishing a libarchive backend. This further increases the overlap with fsspec. It even has some kind of filesystem interface called MountSource, but because of its focus on read-only access, it is much more terse.

The focus is also more on random access as opposed to streaming access as is the case for smart_open, and for now, there are no backends for web protocols. However, multiple users tried to stack ratarmount on top of S3 or HTTP in some way or another to achieve similar things.

Another project that might be comparable with all three mentioned projects might be NVIDIA's aistore.

@martindurant
Author

Thanks @mxmlnkn , a lot to read and think about there. In fact, I have been tracking indexed_gzip (and zstd, bzip2) as a target for kerchunk, which would indeed allow parallel decompression access within archives (including zip and tar.*); it sounds like ratarmount has something similar (or perhaps more general-purpose). I hadn't thought of that also in combination with a filesystem. Many moving parts! So we need to think about how to make these things work together, and ideally come together to build a better get-my-bytes story for all.

@piskvorky
Owner

@mxmlnkn I've used indexed_gzip and indexed_bzip2 with great success. Small world! First time I hear of ratarmount, looks cool, let me check it out :) And thanks for your continued work in this ecosystem.

@mxmlnkn

mxmlnkn commented Apr 7, 2024

Thanks @mxmlnkn , a lot to read and think about there. In fact, I have been tracking indexed_gzip (and zstd, bzip2) as a target for kerchunk, which would indeed allow parallel decompression access within archives (including zip and tar.*); it sounds like ratarmount has something similar (or perhaps more general-purpose).

Yes, it has something similar. indexed_bzip2, and its standalone tool called ibzip2, can decompress bzip2 in parallel. It works like tools such as lbzip2, i.e., it looks for magic bytes for the independent 900K-sized bzip2 blocks and decompresses them in parallel. It then gathers offset information and stores it in the index for fast seeking. This makes it work with arbitrary bz2 files without defining a new format.

rapidgzip is an iteration on that whole architecture of indexed_bzip2 and also comes with a command line tool for parallelized decompression of gzip. It is also intended as a more memory-hungry but faster replacement for indexed_gzip, so there are some tradeoffs. Because gzip has no independent blocks, the internals are a bit more complicated. A paper was published explaining it. It iterates on the work of pugz and its corresponding paper. It also has window compression similar to gztool, which improves memory usage over indexed_gzip for very large gzip files. For example, I was testing this with the ~100 GB large wikidata.json.gz dataset, which decompresses to ~1 TB. The additional metadata for seeking (32 KiB per seek point) takes up almost 10 GB and with compression ~1 GB.
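A back-of-the-envelope check of the quoted index sizes (the per-point spacing below is inferred from these figures, not stated above):

```python
# Quoted figures: ~1 TB decompressed stream, 32 KiB window per seek point,
# ~10 GB of uncompressed index metadata.
window_bytes = 32 * 1024
index_bytes = 10 * 10**9
decompressed_bytes = 10**12

seek_points = index_bytes // window_bytes            # ~305,000 seek points
spacing_mb = decompressed_bytes / seek_points / 10**6

print(f"{seek_points:,} seek points, one every ~{spacing_mb:.1f} MB decompressed")
```

So the quoted numbers imply a seek point roughly every 3 MB of decompressed data, which is consistent with the ~10x size reduction from compressing the windows.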

But to be honest, the parallelization is probably not worth it for anything residing on a disk because gzip decompression is sufficiently fast already, especially with implementations such as ISA-L, which comes close in performance to zstd. (The comparison benchmarks on the zstd site don't show that though, they show comparisons with the much slower zlib ;) ) It might be worth it for data on good SSDs or cloud access via very fast networks though. I benchmarked it in-memory on HPC systems to show scaling up to 20 GB/s gzip decompression bandwidth. It also has another caveat that the intricate algorithm for parallelization adds overhead, i.e., you will need at least 2 or 3 cores to be faster than single-core decompression. This is also partly because the generic parallel implementation cannot use ISA-L and instead uses a custom gzip implementation that lacks decades of finetuning.

In the end, if you want random access to data you compressed yourself, you are probably best off using bgzip / BGZF, which adds metadata for seeking in the gzip stream "extra" headers. Rapidgzip can detect such files and is then >2x faster than for generic gzip files. pzstd would also be an option. The normal zstd compression tool does not add information for seeking, unfortunately, and even limiting the frame size is not yet possible. In that way, zstd and xz are still worse options than bzip2 and gzip, when it comes to seekability and generic parallel decompressability.

I read a bit into kerchunk and zarr. With any new format, there always is the issue of adoption, therefore I find it important to mention improvements over existing formats. The website doesn't mention much regarding that, but the video presentation from PyData Global 2021 has a list of features that are missing in HDF5. It also sounds similar to Parquet in a way. A comparison to that might also be helpful.

@martindurant
Author

Would people here be interested in having a coordination meeting? There are a few repos and a lot of interesting code and ideas here. I'm not really sure how big a niche there is for all this: whether fsspec should explicitly integrate, or if ad-hoc solutions are enough.

I read a bit into kerchunk and zarr. With any new format, there always is the issue of adoption, therefore I find it important to mention improvements over existing formats. The website doesn't mention much regarding that, but the video presentation from PyData Global 2021 has a list of features that are missing in HDF5. It also sounds similar to Parquet in a way. A comparison to that might also be helpful.

Agreed, more motivation/documentation would be very helpful! I'll answer your immediate questions here, and perhaps that's a start. Kerchunk is not a new format, but a way of presenting various binary array formats as if they were zarr. Zarr itself is missing a "why" section ( https://zarr.readthedocs.io/en/stable/getting_started.html#highlights hmm), but it is a "cloud native" N-D array format, where the array data is chunked along each dimension, allowing for remote storage and parallel processing with the likes of dask, with the metadata stored in small JSON files. It is well integrated with xarray and some others. Zarr has been around for a while and is well established in some specific (scientific) fields like climatology and microscopy. Yes, it shares "cloud native" with parquet, but the latter is columnar and nearly always used with tabular/2D data.

So kerchunk allows HDF5 files (and grib, fits, netcdf3) to be viewed as zarr, and to get all the advantages of that without copying/recoding the original data. You can even form logical datasets out of potentially thousands of source files, so that instead of some search interface to find the right files for a job, you simply do coordinate selection/slicing of the zarr or xarray object. This trick has been done in Python only, with one working POC for JS.
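The "view as zarr" idea can be illustrated with a kerchunk-style reference set: zarr keys map either to inline metadata or to [url, offset, length] byte ranges inside the original files. The bucket, paths, offsets, and array shape below are invented for illustration:

```python
import json

# Sketch of a kerchunk-style reference set. Chunk keys point at byte ranges
# inside the source (e.g. HDF5) files, so no data is copied or recoded.
# All names and offsets here are made up.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "temp/.zarray": json.dumps({
            "shape": [720, 1440], "chunks": [360, 720],
            "dtype": "<f4", "compressor": None,
            "fill_value": None, "filters": None, "order": "C",
            "zarr_format": 2,
        }),
        # Each chunk is 360 * 720 * 4 bytes = 1,036,800 bytes of float32.
        "temp/0.0": ["s3://bucket/file1.h5", 4096, 1036800],
        "temp/0.1": ["s3://bucket/file1.h5", 1040896, 1036800],
    },
}

# A reader resolves a chunk key to a (url, offset, length) range request.
chunk_url, offset, length = refs["refs"]["temp/0.0"]
```

In fsspec, such a mapping can be served by the reference filesystem, so zarr/xarray read the original files through ordinary range requests.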

@mxmlnkn

mxmlnkn commented Apr 15, 2024

I tried to get myself an overview of the ecosystem, but I'm running out of steam, and the further away from ratarmount and raw compression it goes, the more out of my depth I am. Here is some kind of arranged chart. You can click on it to get the SVG. Please correct me if I understood or categorized anything wrong.

[figure: "ecosystem-optimized" — chart of the projects and overlaps discussed below]

Two frameworks with significant overlap, in my opinion, that haven't been mentioned here are pyfilesystem2 and fox-it/dissect. The former has received no commits for over a year, and fox-it/dissect seems to have shot out from nothing in October 2022 if you look at the star history.

It feels like everyone has started from different niches (forensics, cloud data analysis, local big-data archival) and while adding features arrived at the observed overlaps. The kinds of overlaps I observe are:

  • Cloud interfaces. Although, it seems like fsspec is filesystem-based while smart_open is file-based, so the overlap is probably somewhat limited. In general, I find smart_open to be refreshingly focused on one thing only (with smart_open() as file). The actual dependencies seem to be mostly the same, but fsspec wraps them in standalone implementations such as gcsfs and adlfs.
  • Archive and filesystem abstractions: dissect, fsspec, libarchive, pyfilesystem2, ratarmountcore.
    • Ratarmountcore initially only wanted to replace the slow archivemount. This goal creates a conscious overlap between ratarmountcore and archivemount/libarchive because the latter by design is not intended for random access or parallelization.
      • Many of the novel access methods like indexed_bzip2 and rapidgzip are separate and can be reused.
      • An exception is SQLiteIndexed(Tar), which is basically a reimplementation of devsnd/tarindexer but optimized for very large archives. Performance is the reason for the SQLite index, which enables bounded memory usage and log(N) lookup times. This is probably something that could and should be extracted into a kind of indexed_tar module. I intended ratarmountcore to be that module, but it has grown too fat for that, especially when circular dependencies should be avoided. The overlap with tarindexer is intentional because it seems to have been only a proof of concept and development halted 9 years ago. Such an indexed_tar module could then benefit fox-it/dissect and fsspec, which both use Python's tarfile module and/or libarchive directly and therefore may have scaling issues. Note that the SQLite index can also be held completely in memory by specifying the path :memory:, which is probably a good default for a backend/library, as opposed to the command line tool, for which the on-disk index is the better default.
      • There is some small overlap with kerchunk because it can also access some (uncompressed) archives via offset ranges, which are likewise stored in a JSON index, as far as I understood it. Note that I also had other solutions for the index, such as pickle and JSON, before I finally arrived at SQLite, which was sufficiently scalable. Before that, I had trouble with short-lived unnecessary memory peaks and (de)serialization times.
    • I guess PyFilesystem2 wanted to be the filesystem abstraction layer that got reinvented in ratarmountcore, fsspec, and dissect, and, according to the stars, it even has quite some traction. However, it seems to be effectively unmaintained, so fsspec seems to be a better contender for that interface.
      • Note that the fsspec interface is a bit daunting to me as a possible implementer because of the myriad of methods. It would be helpful to strictly delineate the pure virtual/abstract methods that have to be implemented from the predefined syntactic-sugar methods that simply forward to other methods.
    • As mentioned above, fsspec seems to be a good contender for the unified filesystem interface. Furthermore, the cloud implementations look good. Personally, I would like to integrate these cloud backends in ratarmountcore, so that users do not have to stack FUSE mounts like using ratarmount on top of s3fs-fuse. However, I'm not fully sure whether I'll also have to use smart_open for single-file cloud access or whether I can also do that with fsspec...
  • I would hazard a guess that fox-it/dissect existed for some longer time before publication as closed source. The focus on forensics gives it another spin and explains some support like for event logs, which I probably won't include in ratarmount anytime soon.
    • Some choices are questionable however. For example, dissect.squashfs seems to be a complete pure-Python reimplementation of SquashFS even though there should already be the exact same thing with PySquashfsImage. This is another overlap.
    • They do have some really nice and novel Python implementations for many filesystems. This is kinda cool, and I'd like to include those also in ratarmount, but as everything is AGPL-licensed, I would have to write permissively-licensed replacements in order to use them in MIT-licensed ratarmount :(
      • It would have been nicer if all those filesystem implementations had landed in libarchive, but I guess the streaming-based interface of libarchive is not a good fit anyway. There is a decade-old issue for that, but I think it would be such a major interface change that it is unlikely to happen, even though I find it crucial for libarchive to become usable as a backend for a FUSE layer or similar applications.
    • I guess fsspec might also want to wrap these dissect filesystem implementations.
  • FUSE implementations
    • All of the above archive/filesystem abstractions come with FUSE bindings. However, fusefs and archivemount seem to be dead, and even fuse-archive seems to have at least stalled. I guess it is just the obvious step to add a FUSE adaptor, and it generally isn't that much work, but it is very framework-specialized work, so it probably doesn't count as "overlap".
    • However, some complexity comes from the command line interface. I am probably partial, but I feel like ratarmount has the most functional CLI. I wouldn't call it clean, as it simply grew with features such as union mounting, bind mounting, recursive mounting, file versioning, ... The CLI is also what keeps me from releasing a 1.0 version, but I guess I'll still simply do it for the next release because it feels somewhat feature-complete at this point.
      • The other CLIs seem to be rather experimental. For testing, I use:
        • target-mount -o modules=subdir,subdir=fs rapidgzip-0.10.1.tar.gz mounted. See also this issue for the monkey patch necessary on my system. The subdir option is a trick to get rid of the super folders that get added, which are probably something forensic-specific.
        • python3 -c 'from fsspec.implementations.tar import TarFileSystem as tafs; fs = tafs("rapidgzip-0.10.1.tar"); import fsspec.fuse; fsspec.fuse.run(fs, "", "mounted")'. See also this issue I experienced on my very first try. I didn't find an actual CLI, that's why I'm calling python3 with a short script. Also, I'm not sure whether there is automatic file type detection based on extensions, or even magic bytes, that chooses the correct fsspec implementation for me. Is there such a thing?
        • Out of interest, I'd like to redo the scaling benchmarks on the ratarmount landing page with the fsspec and dissect FUSE bindings.
  • Compression interfaces. There might be some small overlap with numcodecs and the libarchive filters interface. Libarchive is separated into the filesystem implementations and several filters to undo streaming compression, and even some to convert, e.g., cpio files into rpm and such. Numcodecs seems to abstract away something similar, but specifically for Zarr.
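The SQLiteIndexedTar approach described above can be sketched with the standard library alone: scan the TAR once, record each member's data offset and size in SQLite (using the `:memory:` index mentioned above), then serve later reads with an indexed lookup plus a single seek. This is an illustration of the idea, not ratarmountcore's actual schema:

```python
import io
import sqlite3
import tarfile

# Build a small uncompressed TAR in memory for the demo.
raw = io.BytesIO()
with tarfile.open(fileobj=raw, mode="w") as tf:
    for name, text in [("a.txt", "alpha"), ("b.txt", "bravo")]:
        data = text.encode()
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))
raw.seek(0)

# One sequential scan populates the index; the PRIMARY KEY gives B-tree
# (log N) lookups and keeps memory usage bounded by SQLite, not the archive.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files (name TEXT PRIMARY KEY, offset INT, size INT)")
with tarfile.open(fileobj=raw, mode="r:") as tf:
    for m in tf:
        # offset_data is where the member's contents start inside the TAR.
        db.execute("INSERT INTO files VALUES (?, ?, ?)",
                   (m.name, m.offset_data, m.size))

def read_member(name: str) -> bytes:
    """Random access: one index lookup, one seek, one read."""
    offset, size = db.execute(
        "SELECT offset, size FROM files WHERE name = ?", (name,)).fetchone()
    raw.seek(offset)
    return raw.read(size)
```

For compressed archives the same scheme needs a seekable decompression layer underneath, which is exactly what indexed_bzip2 and rapidgzip provide.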

Would people here be interested in having a coordination meeting?

An online meeting might be interesting, but if it doesn't happen, then I guess I'll hopefully, slowly but steadily, progress as outlined above with the checkboxes whenever I have some free time and motivation. I don't feel it to be necessary or desirable to merge any project completely into another one. However, some backends as outlined above can probably be reused. A verbal meeting might also be easier to digest than my wall of text ;)

I did some similar albeit much much shorter contemplations for: mxmlnkn/ratarmount#109

@martindurant
Author

Thanks for the detailed summary - quite a lot to digest there! First things first, fsspec should definitely wrap dissect; I was totally unaware of its existence.

I agree that there's no huge motivation to try to merge projects, but it would be great if they can work together. fsspec has a history of hooking into third party libraries, providing no extra functionality but a familiar interface to users.

Actually, the set of non-FS data inputs available in dissect feels a lot like something the Intake project would be interested in, but that's another issue. I see in your issue you are also interested in data storage formats with internal compression - you might be interested in kerchunk as a way to find the encoded chunks within them.

@martindurant
Author

I should also mention that yes, fsspec has FUSE support, but it's super flaky and breaks under serious load. There haven't been enough users requesting better service to justify trying to make it better.

@martindurant
Author

Just spotted the bullet

Add the fsspec cloud backends, so that you can do stuff like: ratarmount s3://...archive.tar.gz mounted

If you can mount archives (or anything!) with fsspec as a backend, it would be valuable that way around too :)
