
Add ReaderAt #31

Closed

wants to merge 14 commits into from

Conversation


@frrad frrad commented Apr 8, 2020

This is the change I proposed a while ago in #28 to add a random access xz reader.

It's functional but not super polished yet. Would you be open to an addition like this?

@frrad frrad mentioned this pull request Apr 8, 2020
@ulikunitz
Owner

Thanks for the update. It may take me until the weekend to review your code.

@ulikunitz
Owner

Hi Frederick,

Before I can review the code, I need to understand what you want to achieve. Apparently you want an xz decompressor that supports the ReaderAt interface. It would be helpful to understand the actual use case you want to support here. The questions are: how large will the file segments be that you want to read with ReadAt? Do you want to read the same file from multiple goroutines? The answers to those questions are important because they will guide the design of the solution.

The problem with Lempel-Ziv files is that they are sequential: Lempel-Ziv decompression requires the complete decompression window to be available to read the whole file. The only way around this is to split the compressed file stream into independent blocks, so we would need to change the Writer in such a way that it produces those independent blocks. We would also need to decide whether the blocking should happen at the LZMA or the xz level.

I'm also thinking it would be easier to make the normal Reader an io.Seeker. Using Seek and Read, the ReadAt function would be easy to implement.
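
For illustration, a rough sketch of that idea (readAt is just an example name, not code from this repository):

func readAt(rs io.ReadSeeker, p []byte, off int64) (int, error) {
	// seek to the requested offset, then fill p completely; the shared
	// seek position makes this unsafe for concurrent callers
	if _, err := rs.Seek(off, io.SeekStart); err != nil {
		return 0, err
	}
	return io.ReadFull(rs, p)
}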

Once we have agreed on the solution design, we can then agree on the changes that need to be made to the code.

Kind regards,

Ulrich

@frrad
Author

frrad commented Apr 12, 2020

Hi Ulrich, thanks for taking the time to review. Responses are inline.

Before I can review the code, I need to understand what you want to achieve. Apparently you want an xz decompressor that supports the ReaderAt interface. It would be helpful to understand the actual use case you want to support here.

My application needs access to a very large lookup table. Fortunately, the table compresses very well. I would like to be able to access the contents of this table without having to ever store the entire thing either on disk or in memory.

The questions are: how large will the file segments be that you want to read with ReadAt?

Since I will be creating the original compressed archive, I will be able to control this. It's a tradeoff: larger segments give a better overall compression ratio but make lookups slower. Right now I am using 500 KiB blocks, which seems to give a still-reasonable compression ratio on my data. I haven't really dialed this in yet, though.

Do you want to read the same file from multiple goroutines? The answers to those questions are important because they will guide the design of the solution.

Right now I am not reading the same file from different goroutines; however, it would be nice to have this option in the future if it doesn't add too much complexity to the design here.

The problem with Lempel-Ziv files is that they are sequential: Lempel-Ziv decompression requires the complete decompression window to be available to read the whole file. The only way around this is to split the compressed file stream into independent blocks, so we would need to change the Writer in such a way that it produces those independent blocks. We would also need to decide whether the blocking should happen at the LZMA or the xz level.

For my use case it is not important that the writer be able to create these files. I create these read-only lookup tables separately and provide them to my application. I have been creating them like this: pv index.csv | xz --block-size=500KiB -T 0 > index.csv.xz, which creates a stream of 500 KiB blocks.

I'm also thinking it would be easier to make the normal Reader an io.Seeker. Using Seek and Read, the ReadAt function would be easy to implement.

I know I suggested adding Seek to the Reader initially, but when I actually started implementing this I discovered a rather obvious problem. In order to give consumers of the decompressor the ability to do random reads, the decompressor needs the ability to do random reads on the underlying data. This means that the func NewReader(xz io.Reader) (r *Reader, err error) signature for creating this object is no longer really appropriate, and it felt better to just create a new separate type for this. The choice of ReaderAt instead of ReadSeeker for this type is debatable, of course. I chose it because it felt more natural to me, but I don't really have a strong opinion there.
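
To make that concrete, the shape I ended up with looks roughly like this; it is a sketch with illustrative names, not the exact code in the PR:

// ReaderAt provides random access to the uncompressed contents of an xz
// file made of independently compressed blocks.
type ReaderAt struct {
	xz io.ReaderAt // compressed input; must itself support random reads
	// a block index mapping uncompressed offsets to compressed block
	// offsets would be built here from the xz stream index
}

// NewReaderAt takes an io.ReaderAt for the compressed data, unlike
// NewReader(xz io.Reader), because decoding a block in the middle of the
// file requires jumping to that block's compressed offset.
func NewReaderAt(xz io.ReaderAt) (*ReaderAt, error) {
	// parse the stream footer and index to locate block boundaries ...
	return &ReaderAt{xz: xz}, nil
}

// ReadAt implements io.ReaderAt over the uncompressed data: locate the
// block containing off, decompress it, and copy the requested range into p.
func (r *ReaderAt) ReadAt(p []byte, off int64) (int, error) {
	// block lookup and decompression are elided in this sketch
	panic("sketch: not implemented")
}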

Once we have agreed on the solution design, we can then agree on the changes that need to be made to the code.

Hope this helps you understand my use case, and that you think it's an okay fit for your library. I think it could be a useful feature for other users as well.

edit: One more thought: even if the random access feature itself is not widely useful, I think it creates an opportunity to speed up decompression of files with multiple blocks. It should be easy to tweak this PR so that each block is decompressed in its own goroutine.

Fred

@ulikunitz
Owner

Now I have a lot of questions.

The file stores comma-separated values? Do you have an example of the content?

How do you know the offset to read?

How large is the file? How many records?

It is true that parallel decompression requires independent blocks. That is something I could look into.

@frrad
Author

frrad commented Apr 12, 2020

The file stores comma-separated values? Do you have an example of the content?

It is just a huge CSV, ordered by one of the columns. Here's a sample:

head index.csv
DOI,ID
10.0000//kronk.spb.ru/library/bolshakov-og-1980.htm,62292786
10.0000//www.ncbi.nlm.nih.gov/pubmed/2963548,77431839
10.0000/0,66471328
10.0000/0-hera.ugr.es.adrastea.ugr.es/generic-B58BB1205BB7,34313043
10.0000/0-hera.ugr.es.adrastea.ugr.es/generic-B7222606114D,32797621
10.0000/00000000,66696625
10.0000/002,66471330
10.0000/003,66471331
10.0000/004,66471332

How do you know the offset to read?

I am basically doing binary search on byte offset; there is a rough sketch at the end of this comment.

How large is the file? How many records?

It's only ~3GB uncompressed right now, but could grow in the future.

wc index.csv
  81381898   81382677 2952600694 index.csv
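
To make the lookup concrete, the binary search looks roughly like this. It is a simplified sketch rather than my actual code: it assumes the CSV is sorted by its first column in plain byte order, that no line is longer than maxLine, and that the header row is handled separately (it would otherwise break the ordering); recordAfter and findByDOI are just illustrative names.

import (
	"bytes"
	"io"
	"strings"
)

const maxLine = 64 * 1024 // assumption: no CSV line is longer than this

// recordAfter returns the first complete line beginning at offset 0 (when
// off == 0) or just after the first '\n' at or beyond off.
func recordAfter(r io.ReaderAt, size, off int64) (string, bool) {
	if off >= size {
		return "", false
	}
	buf := make([]byte, 2*maxLine)
	n, err := r.ReadAt(buf, off)
	if n == 0 && err != nil {
		return "", false
	}
	buf = buf[:n]
	start := 0
	if off > 0 {
		i := bytes.IndexByte(buf, '\n')
		if i < 0 || i+1 >= len(buf) {
			return "", false // no complete record starts in this window
		}
		start = i + 1
	}
	end := bytes.IndexByte(buf[start:], '\n')
	if end < 0 {
		end = len(buf) - start // final line without a trailing newline
	}
	return string(buf[start : start+end]), true
}

// findByDOI does a lower-bound binary search over byte offsets of the
// sorted, uncompressed CSV exposed through r.
func findByDOI(r io.ReaderAt, size int64, doi string) (string, bool) {
	key := func(line string) string { return strings.SplitN(line, ",", 2)[0] }
	lo, hi := int64(0), size
	for lo < hi {
		mid := lo + (hi-lo)/2
		// a missing record after mid counts as "greater than any key"
		if line, ok := recordAfter(r, size, mid); ok && key(line) < doi {
			lo = mid + 1
		} else {
			hi = mid
		}
	}
	if line, ok := recordAfter(r, size, lo); ok && key(line) == doi {
		return line, true
	}
	return "", false
}

The io.ReaderAt here would be the decompressing reader from this PR, so the whole uncompressed table never has to exist on disk or in memory at once.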

frrad (Author) commented on this part of the diff:

}
blockAmtToRead := blockEndPos - blockStartPos

r.readBlockAt(

To do this in parallel, this readBlockAt call could be changed to errgroup.Go(readBlockAt).
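
Roughly the shape I have in mind, as a sketch rather than the code in this PR (blockInfo and decodeBlock are stand-in names, and it uses golang.org/x/sync/errgroup):

import "golang.org/x/sync/errgroup"

// blockInfo stands in for whatever describes one independent block.
type blockInfo struct {
	uncompOff int64 // offset of the block's data in the uncompressed stream
	uncompLen int64 // uncompressed size of the block
}

// decompressBlocks decodes independent blocks concurrently, each into its
// own slice of the output buffer; decodeBlock stands in for the existing
// single-block decoding path.
func decompressBlocks(blocks []blockInfo, out []byte,
	decodeBlock func(b blockInfo, dst []byte) error) error {

	var g errgroup.Group
	for _, b := range blocks {
		b := b // capture the loop variable for the goroutine (pre-Go 1.22)
		dst := out[b.uncompOff : b.uncompOff+b.uncompLen]
		g.Go(func() error {
			return decodeBlock(b, dst)
		})
	}
	return g.Wait() // first non-nil error, if any
}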

@ulikunitz
Owner

Thanks for the answers; I understand better now. For a local program you might be better off seeking the uncompressed file directly, but if you want to run it in a web service or distribute the program, using the compressed file would of course be better.

@frrad
Author

frrad commented May 2, 2020

Hi Ulrich,

Just wanted to check in and see how you're feeling about this change. No rush if you're still thinking about it.

@ulikunitz
Owner

Frederick, the software will have ReaderAt, and I would like to implement it together with parallel processing. I'm currently working on better LZ sequencing, including a slow but optimal sequence generator, so there are things happening that are not visible right now.

@frrad
Author

frrad commented May 5, 2020

Thanks for the update! Let me know if I can help with anything.

@ulikunitz ulikunitz closed this Nov 29, 2020