Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support index creation for unseekable file objects #103

Closed
wants to merge 7 commits into from

Conversation

epicfaace
Copy link
Contributor

Fixes #102.

@pauldmccarthy
Copy link
Owner

pauldmccarthy commented Aug 19, 2022

Hi @epicfaace, would you be able to clarify your use case (in code)? I'm guessing your code looks something like this?

f = <some-file-like>
fidx = <some-file-like>
gzf = indexed_gzip.IndexedGzipFile(f)
gzf.build_full_index()
gzf.export_index(fileobj=fidx)

The CI failures are on the py27 jobs - it looks like seekable() is only present in python 3.x. But it should be straightforward to implement a hacky replacement, e.g.:

def seekable():
    try:
        fobj.seek(fobj.tell())
        return True
    except OSError:
        return False

@pauldmccarthy
Copy link
Owner

pauldmccarthy commented Aug 19, 2022

@epicfaace Hmm, your unit test is not actually triggering the seekable logic - the IndexedGzipFile constructor detects that the python file object is backed by a real file descriptor, and so is passing that into the zran library, instead of the python object.

There are a few other changes that would be required in order to make this work - one option (possibly the simplest) would be to pass the compressed size at creation, so that the zran code wouldn't need to call seek/tell. Would this work for you, i.e. do you know the size of your compressed data?

@pauldmccarthy
Copy link
Owner

I've pushed a couple of commits with my proposal to pass in compressed_size when an IndexedGzipFile is created. I'd like to play around with this a little more to try and understand the ramifications of the change, but it seems to work..

@epicfaace
Copy link
Contributor Author

@pauldmccarthy thanks for looking into this! Yes, the code block you wrote is essentially how I'm using it, though I'm interfacing with ratarmount.

In my use case, I'm essentially gzip-streaming a file and want to export an index from that stream, so I don't in fact know the compressed size in advance. However, it looks like the only time compressed_size is used is when reading from an index (as opposed to writing to an index) -- unless I'm mistaken -- so that's why I think it should still be possible to add that fix into this library.

Also, if it's helpful, see here for more context on my complete use case! mxmlnkn/ratarmount#95 (comment)

@epicfaace
Copy link
Contributor Author

epicfaace commented Sep 7, 2022

@pauldmccarthy Actually, this PR as is should serve my purposes. Even though I don't know the compressed size of the file in advance, I can just pass in a dummy value (such as 0) to compressed_size when the index is built, which will prevent indexed_gzip from seeking on the fileobj. Since compressed_size isn't otherwise used upon index creation, the dummy value should do no harm.

1 similar comment
@epicfaace
Copy link
Contributor Author

epicfaace commented Sep 7, 2022

@pauldmccarthy Actually, this PR as is should serve my purposes. Even though I don't know the compressed size of the file in advance, I can just pass in a dummy value (such as 0) to compressed_size when the index is built, which will prevent indexed_gzip from seeking on the fileobj. Since compressed_size isn't otherwise used upon index creation, the dummy value should do no harm.

@pauldmccarthy
Copy link
Owner

Hi @epicfaace, sorry for the delay, I've not found the time to look at this again. So have you tested that you can build an index from a stream, by passing a dummy non-0 value for the compressed size?

The compressed size is used in a few locations (do a grep for ->compressed_size in zran.c), so I'd like to audit the code in order to figure out whether it is really necessary (I struggle to remember code that I wrote 3 months ago, let alone 7 years!).

@paulmccarthy
Copy link

@paulmccarthy Actually, this PR as is should serve my purposes. Even though I don't know the compressed size of the file in advance, I can just pass in a dummy value (such as 0) to compressed_size when the index is built, which will prevent indexed_gzip from seeking on the fileobj. Since compressed_size isn't otherwise used upon index creation, the dummy value should do no harm.

@epicfaace You've tagged the wrong username. @pauldmccarthy is the name you meant to tag. Can you remove me from this conversation please?

@epicfaace
Copy link
Contributor Author

@paulmccarthy sorry! You will need to click "Unsubscribe" to remove yourself from the conversation.

image

@epicfaace
Copy link
Contributor Author

So have you tested that you can build an index from a stream, by passing a dummy non-0 value for the compressed size?

Not yet!

@pauldmccarthy
Copy link
Owner

Moved this over to #105

@epicfaace
Copy link
Contributor Author

closing in favor of #105

@epicfaace epicfaace closed this Sep 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Can't create indexes from un-seekable fileobj's
3 participants