Ability to work with zip file in memory #36

axman6 · 2017-06-08T06:14:33Z

We have an app that downloads many small zip files from the internet and processes them in memory. Currently we're using zip-archive, but I was curious about using a pure Haskell implementation and wondered whether we could use zip. Right now it looks impossible without first downloading the files to disk and then processing them (which makes cleanup much harder). The README for the project lists this as a negative feature of zip-archive, but IMO it's an incredibly useful one, and its absence is what stops us from using zip.

Curious to hear your thoughts on the subject.

mrkkrp · 2017-06-08T07:09:38Z

Currently the idea from #20 seems to be the most promising. Here is a relevant PR: #22, but it has shown no signs of activity since August 2016.

In-memory processing is certainly a very useful feature in some cases. I'll see if I can find the time to implement the ideas from #20 myself and release a new major version.

mrkkrp · 2017-06-08T19:20:30Z

Writing to a given Handler instead of a file seems to be quite doable. Would this be useful to you?

Also, what do you mean by "a pure Haskell implementation"? AFAIK zip-archive is quite pure. Both zip-archive and zip use zlib though (in the case of zip it happens indirectly, via conduit), which is C wrapped by Haskell.

axman6 · 2017-06-09T01:00:43Z

Do you mean "Writing to a given Handle"? I want to avoid handles and actual files as much as possible. I don't understand how the linked issues/PRs address that.

mrkkrp · 2017-06-09T04:49:52Z

> Do you mean "Writing to a given Handle"?

Oh yeah, sorry.

@robertLeeGDM proposes using something like an anonymous file: you write the archive to it, then seek to its beginning and make a Source out of it, as shown here.

Not that I like the approach very much, but the package is file-centered at the moment and that's the only way to avoid dealing with files.
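
To make the scratch-file idea concrete, here is a minimal sketch that uses only `base`, `bytestring` and `directory`. It is not the code from the linked proposal and does not use the zip API: `viaScratchFile` is a hypothetical helper, the `writer` argument stands in for whatever produces the archive, and a named temporary file is used as an approximation of a truly anonymous one. It returns a strict `ByteString` rather than a conduit `Source` to keep the example small.

```haskell
import Control.Exception (bracket)
import qualified Data.ByteString as B
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO

-- | Hypothetical helper: run a file-oriented writer against a scratch
-- file, then rewind and return everything it wrote. The file is closed
-- and removed afterwards, even if the writer throws.
viaScratchFile :: (Handle -> IO ()) -> IO B.ByteString
viaScratchFile writer = do
  tmpDir <- getTemporaryDirectory
  bracket (openBinaryTempFile tmpDir "archive.zip") cleanup $ \(_, h) -> do
    writer h                     -- the writer may seek around freely
    hSeek h AbsoluteSeek 0       -- seek back to the beginning
    size <- hFileSize h
    B.hGet h (fromIntegral size) -- read the whole file back into memory
  where
    cleanup (path, h) = hClose h >> removeFile path
```

The caller never sees a path, but the data still goes through the file system, which is the trade-off discussed above.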

The main motivation for me was to make it possible to work with very large archives efficiently and in constant memory, so it basically tries to stream everything directly to the resulting file. The problem with just streaming to a given sink is that we need to write the sizes of various blocks. We do that by writing dummy zeros first, then writing the content, then calculating the size of the content by comparing the positions in the file before and after writing it. Then we go back and overwrite the zeros with the correct value.

We cannot know the size in advance because we often get the contents of a particular entry as a Source, which we must consume to discover its size.
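
The placeholder-and-overwrite step can be sketched as follows. This is an illustration of the technique only, not the package's actual code: `writeSizedBlock` and `demo` are hypothetical names, the size field is assumed to be a single 4-byte little-endian word, and a list of chunks stands in for a `Source`. It shows why a seekable target is needed: the final `hSeek` back to the size field has no equivalent on a plain sink.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word32)
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO

-- | Write a 4-byte size placeholder, stream the chunks, then seek back
-- and overwrite the placeholder with the number of bytes written.
writeSizedBlock :: Handle -> [B.ByteString] -> IO Word32
writeSizedBlock h chunks = do
  sizePos <- hTell h                      -- where the size field lives
  putWord32 0                             -- dummy zeros
  start <- hTell h
  mapM_ (B.hPut h) chunks                 -- content of unknown length
  end <- hTell h
  let size = fromIntegral (end - start) :: Word32
  hSeek h AbsoluteSeek sizePos            -- go back to the placeholder
  putWord32 size                          -- overwrite it with the real size
  hSeek h AbsoluteSeek end                -- continue after the content
  pure size
  where
    putWord32 = BL.hPut h . BB.toLazyByteString . BB.word32LE

-- | Write one block to a temporary file and return the raw bytes.
demo :: IO B.ByteString
demo = do
  tmpDir <- getTemporaryDirectory
  (path, h) <- openBinaryTempFile tmpDir "block.bin"
  _ <- writeSizedBlock h [BC.pack "hello, ", BC.pack "zip"]
  hSeek h AbsoluteSeek 0
  bytes <- B.hGetContents h               -- reads to EOF and closes the handle
  removeFile path
  pure bytes
```

In `demo` the content is 10 bytes long, so the first four bytes of the result encode 10 in little-endian order, followed by the content itself.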

axman6 · 2017-06-12T06:23:46Z

Right, makes sense, though I guess that means this isn't the package for us. Looks good nonetheless, thanks for taking the time to look into this!

mrkkrp added enhancement feature-request labels Jun 8, 2017

axman6 closed this as completed Jun 12, 2017
