Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bypass server upload large file is slow because of generating index file #4201

Open
wwwjn opened this issue Aug 10, 2022 · 12 comments
Open

Bypass server upload large file is slow because of generating index file #4201

wwwjn opened this issue Aug 10, 2022 · 12 comments
Assignees
Labels
p1 Do it in the next two weeks. user-question

Comments

@wwwjn
Copy link
Contributor

wwwjn commented Aug 10, 2022

According to #3370 and test result, bypass server upload need a lot of time to calculate the index file.

@epicfaace
Copy link
Member

Ashwin made a POC here: #4212

But we depend on fixing pauldmccarthy/indexed_gzip#102 and mxmlnkn/ratarmount#95 before it can work.

@epicfaace epicfaace self-assigned this Aug 24, 2022
@epicfaace
Copy link
Member

epicfaace commented Sep 7, 2022

Ashwin: get pauldmccarthy/indexed_gzip#103 to work, Ashwin will also publish a fork of indexed_gzip (indexed_gzip_fileobj_fork_epicfaace)

Jiani: fixing up #4212. Maybe try reverting to the simpler peek solution in ac5ed66. Once Ashwin publishes indexed_gzip fork, change it to use the indexed_gzip fork

import indexed_gzip_fileobj_fork_epicfaace as indexed_gzip

@percyliang percyliang added the p1 Do it in the next two weeks. label Sep 30, 2022
@wwwjn
Copy link
Contributor Author

wwwjn commented Oct 17, 2022

POC to fix this problem is here: Using peek() instead of read() in SQLiteIndexedTar for gz files

However, the current Indexed_gzip implementation will not call peek(), instead it calls read(). Simple script to reproduce this: epicfaace/test-repro#12

@wwwjn
Copy link
Contributor Author

wwwjn commented Oct 19, 2022

There still some bugs in indexed_gzip, which blocks this issue: pauldmccarthy/indexed_gzip#106

@wwwjn
Copy link
Contributor Author

wwwjn commented Oct 26, 2022

Update: Related issue for indexed_gizp github repo: pauldmccarthy/indexed_gzip#107

@leilenah
Copy link
Collaborator

will sync with ashwin

@epicfaace
Copy link
Member

epicfaace commented Dec 7, 2022

Potential solution:

  1. Skip calling peek() from ratarmount (ac5ed66#diff-ad5ad76eb55b6437b2e1aa24b324c6d11d176be1926ebea1950a006bfce4efbeR46). Assume it's always a gzip file.
  2. Fix issues around building indexes (on this line of code ac5ed66#diff-ad5ad76eb55b6437b2e1aa24b324c6d11d176be1926ebea1950a006bfce4efbeR1439-R1441). 2 options:
    1. Fix For unseekable files, can't build full index by reading the entire file pauldmccarthy/indexed_gzip#107 so that if you read an entire unseekable file, it should build an index for it.
    2. Maybe we can just change ratarmount here ac5ed66#diff-ad5ad76eb55b6437b2e1aa24b324c6d11d176be1926ebea1950a006bfce4efbeR1439-R1441 to just call build_full_index.

@wwwjn
Copy link
Contributor Author

wwwjn commented Jan 2, 2023

Potential solution:

  1. Skip calling peek() from ratarmount (ac5ed66#diff-ad5ad76eb55b6437b2e1aa24b324c6d11d176be1926ebea1950a006bfce4efbeR46). Assume it's always a gzip file.

  2. Fix issues around building indexes (on this line of code ac5ed66#diff-ad5ad76eb55b6437b2e1aa24b324c6d11d176be1926ebea1950a006bfce4efbeR1439-R1441). 2 options:

    1. Fix For unseekable files, can't build full index by reading the entire file pauldmccarthy/indexed_gzip#107 so that if you read an entire unseekable file, it should build an index for it.
    2. Maybe we can just change ratarmount here ac5ed66#diff-ad5ad76eb55b6437b2e1aa24b324c6d11d176be1926ebea1950a006bfce4efbeR1439-R1441 to just call build_full_index.

Did more tests based on this POC. And test the above potential solution. However, it does not works for me.

For step2 & Option 1, I found an interesting observation: By reading through the entire unseekable file to build index, this only works for small files (smaller than 10 MB). In other words, the solution in this commit only works for small files. For those large files, it will raise ZRAN_READ_FAIL error when reading through the file (Line 753 or line 1439).

For step 2 & Option 2, build_full_index is still buggy for me.

@wwwjn
Copy link
Contributor Author

wwwjn commented Jan 28, 2023

Solution:

  1. Skip peek() from ratarmount
  2. Fix issues around building indexes. Use build full_index() to build index from GzipStream().

@wwwjn
Copy link
Contributor Author

wwwjn commented Mar 8, 2023

Fix this, test on largest bundle uploading to dev (158GB).

@wwwjn
Copy link
Contributor Author

wwwjn commented Mar 8, 2023

  1. pass the test and deploy
  2. upload a large bundle locally, stress test locally (test a lot of small files)

@percyliang
Copy link
Collaborator

Can we check that we can upload a 150+GB file and close this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
p1 Do it in the next two weeks. user-question
Projects
Status: Blocked
Development

No branches or pull requests

4 participants