Support index creation for unseekable file objects #103

epicfaace · 2022-08-17T14:57:55Z

Fixes #102.

pauldmccarthy · 2022-08-19T08:40:55Z

Hi @epicfaace, would you be able to clarify your use case (in code)? I'm guessing your code looks something like this?

f = <some-file-like>
fidx = <some-file-like>
gzf = indexed_gzip.IndexedGzipFile(f)
gzf.build_full_index()
gzf.export_index(fileobj=fidx)

The CI failures are on the py27 jobs - it looks like seekable() is only present in python 3.x. But it should be straightforward to implement a hacky replacement, e.g.:

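def seekable(fobj):
    # Fallback for Python 2, where file objects have no seekable() method
    try:
        fobj.seek(fobj.tell())
        return True
    # On Python 2, IOError is not a subclass of OSError, so catch both
    except (OSError, IOError):
        return False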
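
As a rough illustration of how such a fallback behaves, the sketch below runs a variant of the helper (taking the file object as an argument) against a seekable in-memory buffer and against a stream that refuses seek/tell. `UnseekableStream` is a hypothetical stand-in written for this example; it is not part of indexed_gzip.

```python
import io


def seekable(fobj):
    """Best-effort seekability check for objects without seekable()."""
    try:
        fobj.seek(fobj.tell())
        return True
    # io.UnsupportedOperation is an OSError subclass on Python 3
    except (OSError, IOError):
        return False


class UnseekableStream:
    """Hypothetical forward-only stream: read() works, seek/tell do not."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, size=-1):
        return self._buf.read(size)

    def tell(self):
        raise io.UnsupportedOperation("tell")

    def seek(self, offset, whence=0):
        raise io.UnsupportedOperation("seek")


# A BytesIO can be repositioned, the forward-only stream cannot.
buffer_result = seekable(io.BytesIO(b"data"))
stream_result = seekable(UnseekableStream(b"data"))
```

Note that the probe is harmless on seekable objects, because it only seeks to the position the object already reports.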

pauldmccarthy · 2022-08-19T09:50:48Z

@epicfaace Hmm, your unit test is not actually triggering the seekable logic - the IndexedGzipFile constructor detects that the python file object is backed by a real file descriptor, and so is passing that into the zran library, instead of the python object.

There are a few other changes that would be required in order to make this work - one option (possibly the simplest) would be to pass the compressed size at creation, so that the zran code wouldn't need to call seek/tell. Would this work for you, i.e. do you know the size of your compressed data?
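
One way to see why a test built on a regular file might bypass the Python-object code path: a real file exposes an OS-level descriptor via `fileno()`, while a purely in-memory file-like does not. The helper below is only an assumed approximation of that distinction, written for illustration; it is not the check that the IndexedGzipFile constructor actually performs.

```python
import io
import tempfile


def has_file_descriptor(fobj):
    """Return True if fobj appears to be backed by a real OS file descriptor.

    Hypothetical helper for illustration only.
    """
    try:
        fobj.fileno()
        return True
    # BytesIO raises io.UnsupportedOperation (an OSError subclass);
    # objects with no fileno() at all raise AttributeError.
    except (AttributeError, OSError):
        return False


with tempfile.TemporaryFile() as real_file:
    real_file_result = has_file_descriptor(real_file)

in_memory_result = has_file_descriptor(io.BytesIO(b"data"))
```

Under this assumption, a unit test that wants to exercise the pure-Python path would need a file-like object for which the helper returns False.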

pauldmccarthy · 2022-08-19T10:19:16Z

I've pushed a couple of commits with my proposal to pass in compressed_size when an IndexedGzipFile is created. I'd like to play around with this a little more to try and understand the ramifications of the change, but it seems to work.

epicfaace · 2022-08-19T18:34:18Z

@pauldmccarthy thanks for looking into this! Yes, the code block you wrote is essentially how I'm using it, though I'm interfacing with ratarmount.

In my use case, I'm essentially gzip-streaming a file and want to export an index from that stream, so I don't in fact know the compressed size in advance. However, it looks like the only time compressed_size is used is when reading from an index (as opposed to writing to an index) -- unless I'm mistaken -- so that's why I think it should still be possible to add that fix into this library.

Also, if it's helpful, see here for more context on my complete use case! mxmlnkn/ratarmount#95 (comment)
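
The streaming situation described here can be sketched with the standard library alone: a gzip stream is consumed with forward reads only, and the compressed size becomes known once the stream ends rather than up front. This uses `zlib`, not indexed_gzip or zran, and `ForwardOnlyReader` and `measure_stream` are hypothetical names introduced for the example.

```python
import gzip
import io
import zlib


class ForwardOnlyReader:
    """Hypothetical file-like offering only read(), like a network stream."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, size=-1):
        return self._buf.read(size)


def measure_stream(fobj, chunk_size=16384):
    """Decompress a single-member gzip stream without seek() or tell().

    Returns (compressed_size, uncompressed_size), both in bytes.
    """
    decomp = zlib.decompressobj(wbits=31)  # 31 selects the gzip container
    compressed = 0
    uncompressed = 0
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            break
        compressed += len(chunk)
        uncompressed += len(decomp.decompress(chunk))
    uncompressed += len(decomp.flush())
    return compressed, uncompressed


payload = b"hello world " * 1000
blob = gzip.compress(payload)
sizes = measure_stream(ForwardOnlyReader(blob))
```

Whether the index-building code in zran can work the same way, discovering the size as it goes, is exactly the open question in this thread.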

epicfaace · 2022-09-07T12:54:52Z

@pauldmccarthy Actually, this PR as is should serve my purposes. Even though I don't know the compressed size of the file in advance, I can just pass in a dummy value (such as 0) to compressed_size when the index is built, which will prevent indexed_gzip from seeking on the fileobj. Since compressed_size isn't otherwise used upon index creation, the dummy value should do no harm.

pauldmccarthy · 2022-09-07T16:41:35Z

Hi @epicfaace, sorry for the delay, I've not found the time to look at this again. So have you tested that you can build an index from a stream, by passing a dummy non-0 value for the compressed size?

The compressed size is used in a few locations (do a grep for ->compressed_size in zran.c), so I'd like to audit the code in order to figure out whether it is really necessary (I struggle to remember code that I wrote 3 months ago, let alone 7 years!).

paulmccarthy · 2022-09-07T16:45:30Z

> @paulmccarthy Actually, this PR as is should serve my purposes. Even though I don't know the compressed size of the file in advance, I can just pass in a dummy value (such as 0) to compressed_size when the index is built, which will prevent indexed_gzip from seeking on the fileobj. Since compressed_size isn't otherwise used upon index creation, the dummy value should do no harm.

@epicfaace You've tagged the wrong username. @pauldmccarthy is the name you meant to tag. Can you remove me from this conversation please?

epicfaace · 2022-09-07T18:29:53Z

@paulmccarthy sorry! You will need to click "Unsubscribe" to remove yourself from the conversation.

epicfaace · 2022-09-07T18:30:50Z

> So have you tested that you can build an index from a stream, by passing a dummy non-0 value for the compressed size?

Not yet!

pauldmccarthy · 2022-09-09T13:34:07Z

Moved this over to #105

epicfaace · 2022-09-10T01:32:43Z

closing in favor of #105

Support index creation for unseekable file objects

d56d374

pauldmccarthy mentioned this pull request Aug 19, 2022

Can't create indexes from un-seekable fileobj's #102

Closed

MNT: Support seekable() on py2

8a9f101

pauldmccarthy added 2 commits August 19, 2022 11:14

RF: Option to pass in compressed size at creation - required for unss…

9f1d261

…ekable file-likes

TEST: Adjust zran_init calls, fix unseekable test

fe39310

pauldmccarthy added 2 commits August 19, 2022 14:41

RF: Need to clear error state after failed attempt to call tell()

ee0a158

BF: Only clear error if an error occurred - could otherwise clear an …

ee65d7c

…unrelated error

epicfaace mentioned this pull request Sep 7, 2022

Bypass server upload large file is slow because of generating index file codalab/codalab-worksheets#4201

Open

Update ctest_indexed_gzip.pyx

84cc49f

pauldmccarthy mentioned this pull request Sep 9, 2022

Support unseekable streams #105

Merged

epicfaace closed this Sep 10, 2022
