Possible optimization? #12

Closed
Vadiml1024 opened this issue Feb 27, 2023 · 6 comments
Labels
duplicate (This issue or pull request already exists), question (Further information is requested)

Comments

@Vadiml1024

I've stumbled on the following scenario:

I'm mounting a .zip archive with ratarmount.
The archive contains 10 .tar files, each 10G in size.
I have a 3rd-party antivirus program which scans the mount point and which is EXTREMELY slow (relative to others).
So I've analyzed its behavior with strace. It seems that it tries to determine the file size using the following (or similar) code:

       long pos = ftell(fp);
       fseek(fp, 0, SEEK_END);
       long size = ftell(fp);   /* fseek returns 0 on success, not the offset */
       fseek(fp, pos, SEEK_SET);

Of course, the first fseek causes ratarmount to fully decompress the member of the .zip file, which takes a LOT of time.
So I wonder: is it possible to make ratarmount postpone the actual seek until a read or write operation?
Could the seek(..., SEEK_END) call just set the virtual offset to the value retrieved from the associated struct stat?
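
The deferred-seek idea proposed here can be sketched in Python as follows. This is only an illustration under stated assumptions: `LazySeekFile` and its `real_seeks` counter are hypothetical names, not part of ratarmount, and the size is assumed to be known up front (for example from `stat` or the zip directory).

```python
import io


class LazySeekFile:
    """Hypothetical wrapper: seeks only update a virtual offset; the
    underlying object is repositioned when data is actually read."""

    def __init__(self, raw, size):
        self._raw = raw        # underlying file object that may be slow to seek
        self._size = size      # assumed to be known in advance, e.g. from stat
        self._offset = 0       # virtual position reported to the caller
        self.real_seeks = 0    # how often the underlying object was repositioned

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            target = offset
        elif whence == io.SEEK_CUR:
            target = self._offset + offset
        elif whence == io.SEEK_END:
            target = self._size + offset
        else:
            raise ValueError(f"unsupported whence: {whence}")
        if target < 0:
            raise ValueError("negative seek position")
        self._offset = target  # no work on the underlying object yet
        return self._offset

    def tell(self):
        return self._offset

    def read(self, size=-1):
        # Only here is the (possibly expensive) real seek performed.
        if self._raw.tell() != self._offset:
            self._raw.seek(self._offset)
            self.real_seeks += 1
        data = self._raw.read(size)
        self._offset += len(data)
        return data


if __name__ == "__main__":
    payload = b"x" * 1000
    file = LazySeekFile(io.BytesIO(payload), len(payload))
    pos = file.tell()
    size = file.seek(0, io.SEEK_END)  # size query, as in the C snippet
    file.seek(pos, io.SEEK_SET)
    print(size, file.real_seeks)
```

With such a wrapper, the tell/seek-to-end/seek-back pattern never touches the underlying stream; the cost is paid only if the caller later reads from a position that differs from the stream's current one.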

@mxmlnkn
Owner

mxmlnkn commented Feb 27, 2023

I think this is a duplicate of mxmlnkn/ratarmount#105. Pragzip is not yet used for zip files. Your case should work after it has been integrated.

mxmlnkn added the duplicate and question labels Feb 27, 2023
@Vadiml1024
Author

IMHO it is related but not the same.
In the .tar.gz case, ratarmount has to inflate the whole file to build its index. (By the way, it might be interesting to implement some sort of lazy mode in which the whole file is not inflated at mount time, so that mounting is fast and inflating is postponed until an actual read/write/readdir.)

In the case of a .zip file, there is no need at all to inflate anything upon mount....
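
The point about zip archives can be illustrated with Python's standard `zipfile` module, which reads member names and sizes from the archive's central directory. This is a minimal sketch using an in-memory archive; the member name `large` is just a placeholder, and nothing here reflects how ratarmount itself is implemented.

```python
import io
import zipfile

# Build a small zip in memory so the example is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("large", b"A" * 100_000)

# infolist() is populated from the central directory when the archive is
# opened; no member data is decompressed to obtain the sizes.
with zipfile.ZipFile(buffer) as archive:
    sizes = {
        info.filename: (info.file_size, info.compress_size)
        for info in archive.infolist()
    }

for name, (file_size, compress_size) in sizes.items():
    print(f"{name}: {file_size} bytes uncompressed, {compress_size} bytes stored")
```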

@mxmlnkn
Owner

mxmlnkn commented Feb 27, 2023

But didn't you say you were mounting a ".zip archive"?

For .tar.gz, I don't see any way around inflating the whole file. The metadata for each file can be anywhere inside the TAR, so I have to go over it once to collect all file names. Not inflating the whole file would mean that some file names would be missing in the mount point. I simply skip over the file contents during metadata gathering, but gzip does not allow skipping data; skipping would itself require an index. I could try to start decoding in the middle of a gzip file, but it would never be guaranteed to work, and I wouldn't even know which decompressed offset I am at. To determine that, I need to have decoded all the data before it.
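
To illustrate why the whole TAR has to be traversed, here is a small sketch with Python's standard `tarfile` module. It builds an uncompressed archive in memory and prints where each member's header and data sit; the member names and sizes are made up for the example.

```python
import io
import tarfile

# Create a small uncompressed TAR in memory with three members.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as archive:
    for name, size in [("a.bin", 3000), ("b.bin", 5000), ("c.bin", 100)]:
        info = tarfile.TarInfo(name)
        info.size = size
        archive.addfile(info, io.BytesIO(b"\0" * size))

# Walk the archive: each header sits directly in front of its file data,
# so finding all names means visiting every header position in the stream.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r:") as archive:
    layout = [(m.name, m.offset, m.offset_data, m.size) for m in archive]

for name, header_offset, data_offset, size in layout:
    print(f"{name}: header at {header_offset}, data at {data_offset}, {size} bytes")
```

On an uncompressed TAR the file contents between headers can be skipped with a seek. Inside a gzip stream there is no such shortcut, which is the limitation described above.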

In the case of a .zip file, there is no need at all to inflate anything upon mount....

That is correct but in your original post you were talking about seeking to the end of zip members ...

@mxmlnkn
Owner

mxmlnkn commented Mar 19, 2023

I've stumbled on the following scenario:

I'm mounting a .zip archive with ratarmount. The archive contains 10 .tar files, each 10G in size. I have a 3rd-party antivirus program which scans the mount point and which is EXTREMELY slow (relative to others). So I've analyzed its behavior with strace. It seems that it tries to determine the file size using the following (or similar) code:

       long pos = ftell(fp);
       fseek(fp, 0, SEEK_END);
       long size = ftell(fp);   /* fseek returns 0 on success, not the offset */
       fseek(fp, pos, SEEK_SET);

Of course, the first fseek causes ratarmount to fully decompress the member of the .zip file, which takes a LOT of time. So I wonder: is it possible to make ratarmount postpone the actual seek until a read or write operation? Could the seek(..., SEEK_END) call just set the virtual offset to the value retrieved from the associated struct stat?

I was not able to reproduce your observed behavior. I have tried:

base64 /dev/urandom | head -c $(( 8 * 1024 * 1024 * 1024 )) > large
zip large.zip large
ratarmount large.zip mounted
python3 -c 'import io; file=open("mounted/10k-1MiB-files.tar", "rb"); file.seek(0, io.SEEK_END); print(file.tell())'

Getting the file size this way completes in 19 ms, which indicates that the member is not actually decompressed. The call simply seeks to the end and returns the size without any decompression.

I'm closing it for now.

Please provide a bash script to reproduce the issue. Also, it should probably be an issue in the ratarmount repository, not here in the pragzip repository.

Are you by chance using ratarmount --lazy --recursive? In that case, it would make some sense that the index is built when accessing the file. To be precise, it should be built when accessing the parent directory.

@mxmlnkn mxmlnkn closed this as completed Mar 19, 2023
@Vadiml1024
Author

base64 /dev/urandom | head -c $(( 8 * 1024 * 1024 * 1024 )) > large
zip large.zip large
ratarmount large.zip mounted
python3 -c 'import io; file=open("mounted/10k-1MiB-files.tar", "rb"); file.seek(0, io.SEEK_END); print(file.tell())'

Are you sure this is the script you used for testing?
As quoted, it cannot work: the file mounted/10k-1MiB-files.tar will not be present in large.zip.

@mxmlnkn
Owner

mxmlnkn commented Mar 20, 2023

Yeah sorry, I wanted to make the script more generic and reproducible by changing 10k-1MiB-files.tar to a simple base64 file. The 10k-1MiB-files.tar also contains only base64 data and therefore is similarly compressible and will be added as a compressed member. I checked that with zipinfo. Here is the adjusted script:

base64 /dev/urandom | head -c $(( 8 * 1024 * 1024 * 1024 )) > large
zip large.zip large
ratarmount large.zip mounted
python3 -c 'import io; file=open("mounted/large", "rb"); file.seek(0, io.SEEK_END); print(file.tell())'

The result is the same. It takes ~20ms.
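
For anyone who wants to repeat the measurement with timings, a small Python sketch along these lines may help. The helper names `timed_size_by_seek` and `timed_size_by_stat` are made up for this example, and a temporary file stands in for `mounted/large`; substitute the real path to test against a mount point.

```python
import io
import os
import tempfile
import time


def timed_size_by_seek(path):
    """Return (size, elapsed_seconds) using the seek-to-end idiom from the thread."""
    start = time.perf_counter()
    with open(path, "rb") as file:
        file.seek(0, io.SEEK_END)
        size = file.tell()
    return size, time.perf_counter() - start


def timed_size_by_stat(path):
    """Return (size, elapsed_seconds) using file metadata only."""
    start = time.perf_counter()
    size = os.stat(path).st_size
    return size, time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder file; replace tmp.name with e.g. "mounted/large" for a real test.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * 4096)
    try:
        print("seek:", timed_size_by_seek(tmp.name))
        print("stat:", timed_size_by_stat(tmp.name))
    finally:
        os.unlink(tmp.name)
```

If the seek variant is dramatically slower than the stat variant on a mount point, that would point to the seek triggering real work in the file system layer.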
