Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

expose blockreader #28

Open
frrad opened this issue Mar 13, 2020 · 4 comments
Open

expose blockreader #28

frrad opened this issue Mar 13, 2020 · 4 comments

Comments

@frrad
Copy link

frrad commented Mar 13, 2020

I'm working on a library to do random access in xz files with multiple blocks. I would love to use your library to do the heavy lifting instead of reinventing the wheel. I need to use some of the internal pieces though, including the blockReader

xz/reader.go

Line 261 in 067145b

type blockReader struct {
and related parts.

Would you be open to a PR which refactors things so your original package keeps the same public interface but uses a new ulikunitz/xz/lib/xzinternals which makes public some of these currently private structs?

@ulikunitz
Copy link
Owner

The block reader is intentionally left private, so that it is possible to change it without affecting users of the package. If the details would make public I would be forced to maintain the public interface. You are always free to branch the code, but I understand that you want to keep it upstream to benefit the improvements. Doing it will require to convince me that your approach can even work.

In a pure Lempel-Ziv compressed file random access doesn't make sense, because each part of compressed file depends on the window before it. So it makes only sense if parts of the file are compressed separately that are independent of each other. Now xz is a container format, which allows you to do something like this. Actually that could also be implemented using the LZMA2 format which is used inside the xz block.

But that means you will have to create xz files as a sequence of small blocks. So you would need not only to modify the reader but also the writer, which results in the requirement to also publish the block writer interface, which is not mentioned in your request.

How do you want to deal with the creation of the files?

@frrad
Copy link
Author

frrad commented Mar 14, 2020

Hi! I considered forking and if we can't find a way to do this that works for you I still might. As you say I would prefer to upstream my changes so I can benefit from future improvements that you make.

For my use case I don't need programmatic creation of the xz file. The plan is just to create it with xv --block-size ??. For reading, I plan to read the index and use that to only decompress the necessary blocks as described in this blog post https://blastedbio.blogspot.com/2013/04/random-access-to-blocked-xz-format-bxzf.html.

I understand your concerns about having to maintain the public interface to the internals. If you're open to this change we could work together to make the package boundary more streamlined and not just the sort of sloppy make-everything-public strategy I have in #29 right now.

Alternatively, if you're interested perhaps I could add the functionality I want directly to your package? This could potentially look like a new Seek method on your Reader type or maybe a new ReadSeeker type. One potential selling point here is that I suspect that the work required to understand the index for random access will make parallel decompression relatively straightforward to implement.

@frrad frrad mentioned this issue Apr 8, 2020
@frrad
Copy link
Author

frrad commented Apr 8, 2020

Sent a PR for the addition that I described above #31. I would love to work to add this to your lib instead of having to fork if you're open to it.

@ulikunitz
Copy link
Owner

Appreciated. As I wrote it may take until the weekend to review it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants