Use Rust implementation instead? #91

Open
milesgranger opened this issue Mar 16, 2020 · 4 comments

@milesgranger

Hi there!

I was curious whether you'd be open to using the Rust implementation of snappy instead of the C dependency, which can cause trouble, especially when using python-snappy in environments like AWS Lambda.

I've made cramjam, which does just this, and I'd be willing to attempt migrating this project in a similar way. It wouldn't require any system dependencies, and since it ships as wheels (OSX, Linux & Windows supported), users wouldn't need a compiler. As it is right now, cramjam, which includes snappy, comes to about 1.5MB for Linux wheels.

Anyway, let me know what you think and I'd be willing to start messing around with it. 👍

@martindurant
Member

xref: dask/fastparquet#488

I think this is a good idea in general. To get a good response here, it would need to show that

  • all the tests can pass, including with the framing format (i.e., files)
  • that the performance is equivalent to current or better
  • that indeed the install size is not large and the build process simple. Note that many will be installing using conda, where the size of the compiled binary snappy is <100kb ( https://anaconda.org/conda-forge/snappy/files )
  • that you can build on all platforms
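The first two points could be checked with a small harness along these lines. This is only a sketch: `check_parity` and `time_call` are hypothetical helper names, and `zlib.compress` is a stand-in on both sides so the snippet runs without either snappy binding installed. The real comparison would pass the existing C-backed function and the Rust-backed one instead.

```python
import timeit
import zlib


def check_parity(reference, candidate, samples):
    """Return the samples for which the two compressors produce different bytes."""
    return [s for s in samples if reference(s) != candidate(s)]


def time_call(func, payload, repeat=5, number=20):
    """Best-of-`repeat` wall time in seconds for `number` calls of func(payload)."""
    return min(timeit.repeat(lambda: func(payload), repeat=repeat, number=number))


# Stand-ins only (assumption): replace these with the current compress function
# and the Rust-backed candidate when running the real comparison.
reference_compress = zlib.compress
candidate_compress = lambda data: zlib.compress(data)

samples = [b"", b"hi, hello there", b"a" * 100_000, bytes(range(256)) * 64]

mismatches = check_parity(reference_compress, candidate_compress, samples)
print("mismatching samples:", len(mismatches))
print("reference best time:", time_call(reference_compress, samples[2]))
print("candidate best time:", time_call(candidate_compress, samples[2]))
```

Byte-for-byte equality is a stricter bar than round-trip correctness, so a looser variant would compare the decompressed results instead.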

@milesgranger
Author

milesgranger commented Mar 16, 2020

Seems reasonable.

I see that snappy_formats.py lists hadoop_snappy and framed as available formats. In my light reading of the snappy framing format, I can't find anything that describes a Hadoop specification.

To my understanding, there is the raw, used for streaming, and the framed (entire in-memory streams like you mentioned) formats of snappy. Can I assume the hadoop format reference is a reference to the raw format? Those are the only two formats in the Rust implementation. If this isn't the case, then I guess there is no point in starting.

Also, could you specify what is considered a "large" install size? Is ~1MB too big?

@martindurant
Member

Even though I may be a maintainer here, I don't actually follow the snappy specs... So long as the existing de/compress functions and their stream counterparts still produce identical output, I would be happy!

@milesgranger
Author

Hi there, working from home has left me with less time than expected.

I've added a new commit to cramjam which supports framed and raw use of snappy compression. So if I make a new release I can confirm that it will match what python-snappy currently does for its use of compress and stream_compress.

>>> import io
>>> import snappy
>>> import cramjam
>>> data = b'hi, hello there'
>>> raw = io.BytesIO(data)
>>> output = io.BytesIO()
>>> snappy.stream_compress(raw, output)
>>> output.seek(0)
0
>>> output.read()
b'\xff\x06\x00\x00sNaPpY\x01\x13\x00\x00\x82\x8f\x01\xb8hi, hello there'
>>> cramjam.snappy_compress(data)
b'\xff\x06\x00\x00sNaPpY\x01\x13\x00\x00\x82\x8f\x01\xb8hi, hello there'
>>> snappy.compress(data)
b'\x0f8hi, hello there'
>>> cramjam.snappy_compress_raw(data)
b'\x0f8hi, hello there'
>>> 
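For anyone comparing the two outputs above by eye, the bytes can be picked apart with the standard library alone. The sketch below follows my reading of the snappy framing and block format descriptions; `iter_chunks`, `read_varint` and `describe_raw_block` are hypothetical helper names, not part of python-snappy or cramjam, and the checksum is only displayed, not verified.

```python
CHUNK_NAMES = {0x00: "compressed", 0x01: "uncompressed", 0xFF: "stream identifier"}


def iter_chunks(stream):
    """Yield (chunk_type, body) pairs from a framed snappy byte string."""
    pos = 0
    while pos < len(stream):
        chunk_type = stream[pos]
        # The chunk length is a 3-byte little-endian integer after the type byte.
        length = int.from_bytes(stream[pos + 1 : pos + 4], "little")
        body = stream[pos + 4 : pos + 4 + length]
        if len(body) != length:
            raise ValueError("truncated chunk")
        yield chunk_type, body
        pos += 4 + length


def read_varint(buf, pos=0):
    """Decode a little-endian base-128 varint; return (value, next_position)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def describe_raw_block(block):
    """Return (uncompressed_length, first_element_kind, literal_bytes_or_None).

    Only short literals (60 bytes or fewer) are decoded; anything else is
    reported by kind without being expanded.
    """
    length, pos = read_varint(block)
    if pos == len(block):
        return length, None, None
    tag = block[pos]
    kind = ("literal", "copy1", "copy2", "copy4")[tag & 0b11]
    if kind != "literal" or (tag >> 2) >= 60:
        return length, kind, None
    literal_len = (tag >> 2) + 1
    return length, kind, block[pos + 1 : pos + 1 + literal_len]


# The two outputs quoted in the session above.
framed = b"\xff\x06\x00\x00sNaPpY\x01\x13\x00\x00\x82\x8f\x01\xb8hi, hello there"
raw = b"\x0f8hi, hello there"

for chunk_type, body in iter_chunks(framed):
    name = CHUNK_NAMES.get(chunk_type, hex(chunk_type))
    if chunk_type in (0x00, 0x01):
        # Data chunks begin with a 4-byte masked CRC-32C of the uncompressed data.
        print(name, "checksum:", body[:4].hex(), "payload:", body[4:])
    else:
        print(name, body)

print(describe_raw_block(raw))
```

The framed output is a stream identifier chunk followed by one uncompressed data chunk (the input is too short to benefit from compression), while the raw output is a length varint followed by a single literal.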

One of my concerns with making a PR to python-snappy is it will remove a lot of existing code and there are some bits in here, like https://github.com/andrix/python-snappy/blob/602e9c10d743f71bef0bac5e4c4dffa17340d7b3/snappy/snappy.py#L67 which, to be honest, I don't know what it does 😅 or how to maintain the existing UncompressError API / situations in which it should be raised.
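One possible way to keep the UncompressError surface stable while swapping the backend is to translate backend exceptions at the boundary. This is a rough sketch and not how python-snappy is implemented: the `UncompressError` class here is a local stand-in for the package's own, `translate_errors` is a hypothetical helper, and `zlib` plays the role of the backend so the snippet runs without extra installs.

```python
import functools
import zlib


class UncompressError(Exception):
    """Local stand-in (assumption) for the package's public decompression error."""


def translate_errors(backend_errors):
    """Decorator factory: re-raise any of `backend_errors` as UncompressError."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except backend_errors as exc:
                # Chain the original exception so the backend detail is not lost.
                raise UncompressError(str(exc)) from exc

        return wrapper

    return decorator


# zlib is only a stand-in backend; a real port would list the exception
# type(s) raised by the Rust binding here instead.
@translate_errors((zlib.error,))
def uncompress(data):
    return zlib.decompress(data)


print(uncompress(zlib.compress(b"hi, hello there")))

try:
    uncompress(b"definitely not compressed")
except UncompressError as err:
    print("UncompressError:", err)
```

Which inputs should raise would still need to be pinned down by the existing test suite; the wrapper only guarantees the exception type callers see.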
