POC/Research Prototype: Use createLazyFile to mount a read only view of remote sqlite databases #49

Closed
wants to merge 14 commits into from


@nelsonjchen nelsonjchen commented Sep 24, 2022

This is more POC/WIP/curiosity than anything serious. Just wanted to make a draft PR for posterity.


Edit: the state of the art is this:

https://datasette-lite-lab.mindflakes.com/index.html?url=https://datasette-lite-lab.mindflakes.com/sdb/2022-10-02_93eff57de3573985_ca_unclaimed_property.sqlite#/2022-10-02_93eff57de3573985_ca_unclaimed_property?sql=SELECT+*+FROM+records+WHERE+records.owner_name+MATCH+%22Elon+Musk%22+ORDER+BY+CAST%28CURRENT_CASH_BALANCE+AS+FLOAT%29++DESC%3B

which is the chunked/CDN'd version described further below.


For context, I'm trying to put a 32GB FTS5 sqlite database on the internet to query. I plan to host it on Cloudflare so I do not care about BW costs.

The functionality here is a bit like https://github.com/phiresky/sql.js-httpvfs, but far lazier and less efficient. I originally thought I would have to make my own GUI, but I kind of like the GUI I saw in datasette and wondered if I could reuse it. Unfortunately, current datasette-lite seems to pull everything into memory.

It does actually work, but I don't know how viable this is. By default, Emscripten reads in 1MB chunks, but there seems to be a way to recompile it with a different chunk size:

https://github.com/emscripten-core/emscripten/blob/18bc868cb5242e6816a4b3bde74b1e1dcd6fd818/src/library_fs.js#L1713

azavea/loam#75 (comment)

Some FTS5 queries transfer 1GB while others transfer only 66MB. As I've said, I don't care about BW, but the great inefficiency harms UX.
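
To make the inefficiency concrete, here is a rough sketch of how a fixed chunk size inflates a small read. The function names and the 4096-byte page size are assumptions for illustration only; the 1 MiB default mirrors the chunk size mentioned above, and the sketch ignores that the last chunk of a file may be shorter.

```python
def chunks_touched(offset: int, length: int, chunk_size: int = 1024 * 1024) -> range:
    """Return the indices of the fixed-size chunks covering [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    return range(first, last + 1)


def bytes_fetched(offset: int, length: int, chunk_size: int = 1024 * 1024) -> int:
    """Bytes downloaded when every touched chunk is fetched whole."""
    return len(chunks_touched(offset, length, chunk_size)) * chunk_size


# Hypothetical 4096-byte SQLite page read somewhere deep inside a large file.
page_size = 4096
offset = 5 * 1024**3 + 12 * page_size
amplification = bytes_fetched(offset, page_size) / page_size
print(f"one {page_size}-byte page read downloads {amplification:.0f}x its size")
```

A query that hops across many scattered pages pays this overhead on every hop, which is one plausible reason the transfer totals vary so much between queries.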

Also, this chunked inefficiency means that I have to hack the URL to not load tables of a database as it seems to try to load the whole database when I click on a database.

I think for my goal, I might have to try to recompile pyodide with an Emscripten build that uses a smaller XHR chunk size. 😬

Any remote URL it depends on needs to have the proper CORS headers set, including the "expose" headers. If it can't read the exposed headers, or the remote file is GZIPed, it just downloads the whole file, so it does degrade gracefully. However, this does cause a nasty error message to pop up in the console saying that access to some headers was denied to Emscripten.
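
The fallback decision can be sketched roughly as follows. This is only an approximation of the kind of check involved, not Emscripten's actual code: the function name is hypothetical, and `headers` stands for whatever response headers the browser lets the script read (a header hidden by CORS is simply absent). On the server side, exposing headers is normally done with the standard `Access-Control-Expose-Headers` response header.

```python
from typing import Mapping


def can_use_range_requests(headers: Mapping[str, str]) -> bool:
    """Decide from HEAD response headers whether chunked (lazy) loading is possible.

    Returns False when the length is unknown, byte serving is not advertised,
    or the body is gzip-encoded; the caller would then download the whole file.
    """
    h = {k.lower(): v.strip().lower() for k, v in headers.items()}
    length = h.get("content-length", "")
    has_length = length.isdigit() and int(length) > 0
    byte_serving = h.get("accept-ranges") == "bytes"
    gzipped = h.get("content-encoding") == "gzip"
    return has_length and byte_serving and not gzipped


# Headers readable by the script in three hypothetical situations.
print(can_use_range_requests({"Content-Length": "1000", "Accept-Ranges": "bytes"}))
print(can_use_range_requests({"Content-Length": "1000"}))  # Accept-Ranges not exposed
print(can_use_range_requests({"Content-Length": "1000", "Accept-Ranges": "bytes",
                              "Content-Encoding": "gzip"}))
```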

Anyway, this is just for fun.

@nelsonjchen
Author

I also built a fork of some static web file server to also expose the necessary headers too:

static-web-server/static-web-server#144

@simonw
Owner

simonw commented Sep 24, 2022

Wow, I did not think this would be possible without completely replacing the Python sqlite3 module!

@nelsonjchen
Author

The experimental tuned extracted thing explodes on the example databases. Not sure why. Well, that's why this is a draft after all :D.

@simonw
Owner

simonw commented Sep 24, 2022

Also, this chunked inefficiency means that I have to hack the URL to not load tables of a database as it seems to try to load the whole database when I click on a database.

I bet that's because Datasette tries to show a count of all of the rows in each table when it shows the list on that page, which triggers a full table scan.

Would be great to have a setting that turns that feature off, which could then be exposed as a query string option for Datasette Lite.
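
A small way to see the scan behind a row count, using Python's built-in `sqlite3` module on a throwaway in-memory table. The table name `records` comes from the query earlier in this thread; the columns and rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (owner_name TEXT, cash REAL)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [("a", 1.0), ("b", 2.0), ("c", 3.0)],
)

# The last column of each EXPLAIN QUERY PLAN row is a human-readable step.
plan = conn.execute("EXPLAIN QUERY PLAN SELECT count(*) FROM records").fetchall()
details = [row[-1] for row in plan]
print(details)

row_count = conn.execute("SELECT count(*) FROM records").fetchone()[0]
print(row_count)
```

On a local file a scan like this is cheap, but over lazy HTTP reads every page it visits may turn into a network request.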

@nelsonjchen
Author

https://github.com/phiresky/sql.js-httpvfs/blob/master/src/lazyFile.ts

I think it should be possible to adapt this theoretically more efficient dynamic chunk size version to this approach as well.

@nelsonjchen
Author

Oh, I should probably explain what I'm trying to query or do. I like looking at CA unclaimed property records with SQLite since it's a much more powerful and faster full text search than what's offered on the government site:

https://www.sco.ca.gov/upd_download_property_records.html

It's about 35GB when imported into an FTS5 table, optimized, and vacuumed. I was thinking of trying to expose a similar service with something like https://github.com/wilsonzlin/edgesearch but it was a little too foreign, a little too tied to one vendor's offering and architecture, and not SQLite.

@nelsonjchen
Author

nelsonjchen commented Sep 24, 2022

@nelsonjchen
Author

Oh wow I didn't realize you were already aware of this stuff:

https://twitter.com/simonw/status/1421497663732604928

I think we could just use sql.js-httpvfs's implementation of createLazyFile as-is, and we should actually be good to go on using it with datasette-lite.

@nelsonjchen
Author

nelsonjchen commented Sep 25, 2022

@nelsonjchen
Author

nelsonjchen commented Sep 26, 2022

So this is all still research and a POC PR that is not meant to be merged in the end, but I have some ideas to keep following up on when time permits that might be interesting for others' future proper PRs/RFCs:

  • It doesn't seem possible at the moment to have a manifest. I would love to point users at a fatter README or site to describe what's going on in the data set.
  • What if we pointed datasette-lite at a manifest file of some sort, which contains the SQLite DB URL, instead of pointing it at a DB directly?
  • What if the SQLite DB URL in the manifest could instead be a manifest of split chunks? The underlying rangeMapper option of sql.js-httpvfs's lazyFile could be used to handle chunks. An example of generating the necessary split DB files is in https://github.com/phiresky/sql.js-httpvfs/blob/master/create_db.sh . I think I still want to customize it a bit though, since the filename scheme of sql.js-httpvfs's output chunks would not be cached under Cloudflare's default caching rules: https://developers.cloudflare.com/cache/about/default-cache-behavior/#default-cached-file-extensions . It's possible to work around the default caching choices with page rules, but free Cloudflare plans allow very few of them, so a friendlier distribution would avoid spending a page rule. I'm super interested in chunking because the current status quo of always missing the cache incurs a ~300ms cost for any SQLite page read. That's about 30x slower than an HDD.
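
The chunk-mapping idea in the last bullet could look roughly like this. It is a standalone sketch, not the sql.js-httpvfs API: the function name, the `url_prefix` parameter, and the `.NNNNN.bin` naming scheme are all hypothetical.

```python
from typing import List, Tuple


def map_range(start: int, end: int, chunk_size: int, url_prefix: str) -> List[Tuple[str, int, int]]:
    """Map the inclusive byte range [start, end] of the logical database onto
    (chunk_url, start_in_chunk, end_in_chunk) requests, one per chunk file."""
    if start < 0 or end < start:
        raise ValueError("invalid byte range")
    requests = []
    pos = start
    while pos <= end:
        index = pos // chunk_size
        chunk_start = index * chunk_size
        # Stop at the end of this chunk file or at the end of the requested range.
        chunk_end = min(end, chunk_start + chunk_size - 1)
        url = f"{url_prefix}.{index:05d}.bin"  # hypothetical naming scheme
        requests.append((url, pos - chunk_start, chunk_end - chunk_start))
        pos = chunk_end + 1
    return requests


# A read that straddles a chunk boundary, with 10 MB chunks.
for request in map_range(10_000_000 - 2048, 10_000_000 + 2047, 10_000_000, "db.sqlite"):
    print(request)
```

Since each chunk is an ordinary static file, a CDN can cache it whole, and a manifest would only need to carry the chunk size, the total file size, and the naming scheme.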

@nelsonjchen
Author

nelsonjchen commented Sep 27, 2022

Threw on a page rule in CF to see what happens if the sqlite database is forced to be cached.

nelsonjchen/ca_unclaimed_property_db_generator_toolkit#1 (comment)

Unfortunately, I ran into an odd bug where a range request tried to return a 200 with all of it, 28GB. At least it tried to before I killed it. Removed the rule.

https://community.cloudflare.com/t/bug-partial-requests-for-cached-files-occasionally-results-in-a-200-response-with-the-full-contents/376516

The post is both agonizing and tantalizing since the OP posted a timeline. If Cloudflare gets their stuff together, we could speed up repeated queries and/or cache a few levels of popular indirection by as much as ~10x, since hits become 30ms vs 300ms for the current status quo of always missing. It's also agonizing since CF claims the issue was fixed.

Thinking of making a periodic and continuous GitHub repo/actions setup with Playwright to test and validate the bug and running it up to CF engineering.
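
A client-side guard against the stray 200 response described above might look like this minimal sketch. The function name is hypothetical; it only relies on the standard `206 Partial Content` status and the `Content-Range: bytes start-end/total` header format.

```python
import re
from typing import Mapping

_CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+|\*)$")


def is_valid_partial_response(status: int, headers: Mapping[str, str], start: int, end: int) -> bool:
    """Return True only if the response is a 206 covering exactly bytes start..end (inclusive)."""
    if status != 206:
        # e.g. a 200 carrying the entire file: better to abort than to read the body
        return False
    h = {k.lower(): v for k, v in headers.items()}
    match = _CONTENT_RANGE.match(h.get("content-range", "").strip())
    if match is None:
        return False
    return int(match.group(1)) == start and int(match.group(2)) == end


print(is_valid_partial_response(206, {"Content-Range": "bytes 0-4095/28000000000"}, 0, 4095))
print(is_valid_partial_response(200, {}, 0, 4095))
```

The same predicate could serve as the assertion in a periodic test of the CDN behavior.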

@nelsonjchen nelsonjchen changed the title Use createLazyFile to mount a read only view of remote sqlite databases POC/Research Prototype: Use createLazyFile to mount a read only view of remote sqlite databases Oct 2, 2022
@nelsonjchen
Author

https://kevincox.ca/2021/06/04/http-range-caching/

Found this great post on caching behavior of ranges in common CDNs.

@nelsonjchen
Author

nelsonjchen commented Oct 8, 2022

Made a chunked version as a POC test. It is indeed faster!

https://datasette-lite-lab.mindflakes.com/index.html?url=https://datasette-lite-lab.mindflakes.com/sdb/2022-10-02_93eff57de3573985_ca_unclaimed_property.sqlite#/2022-10-02_93eff57de3573985_ca_unclaimed_property?sql=SELECT+*+FROM+records+WHERE+records.owner_name+MATCH+%22Elon+Musk%22+ORDER+BY+CAST%28CURRENT_CASH_BALANCE+AS+FLOAT%29++DESC%3B

Hack, with hardcoded overall file size: https://github.com/nelsonjchen/datasette-lite/tree/chunk-hack

Something like this definitely needs to have a manifest-like thing though. Upfront, a file size is needed. And I've already expressed my desire for metadata.

4096-byte pages, 10MB chunks, ~30ms hits, ~300-500ms misses. Not bad! Cloudflare does seem to expire the cache rather a lot though. At least a page refresh is fast. Though, it seems caching sometimes isn't free and it leans towards 600ms on a miss.
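
As a back-of-the-envelope illustration of why the cache hit ratio matters so much here, a simple weighted average of the rough hit and miss latencies quoted above. The function name is hypothetical and the model ignores everything except per-request latency.

```python
def expected_latency_ms(hit_ratio: float, hit_ms: float = 30.0, miss_ms: float = 300.0) -> float:
    """Average latency per chunk request for a given cache hit ratio (0.0 to 1.0)."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms


for ratio in (0.0, 0.5, 0.9, 1.0):
    print(f"hit ratio {ratio:.0%}: ~{expected_latency_ms(ratio):.0f} ms per request")
```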

On that note. I think I'm going to wind down the experimentation a bit. Hopefully someone else can use these learnings.

And I have looked you up @simonw . There is something, not a lot and not pocket change, but there is something. Hopefully the CA state controller won't give you too much trouble. You the real MVP!

@nelsonjchen nelsonjchen closed this Oct 8, 2022
@nelsonjchen
Author

nelsonjchen commented Oct 9, 2022

Note: phiresky/sql.js-httpvfs#40

There's an upstream bug in the lazyFile implementation, which I've fixed in my test hack of a lab for the chunked, cacheable database.

@nelsonjchen
Author

So, with the split database and all that, the cost per month to host and expose this 28GB database with Datasette:

https://developers.cloudflare.com/r2/platform/pricing/

$0.42

The cost can increase if there are a lot of queries but this is extremely negligible.
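
The arithmetic behind that figure, as a small sketch. The rate of $0.015 per GB-month is an assumption: it is the rate that reproduces $0.42 for 28GB, so check the pricing page linked above for current numbers. The `free_gb` parameter is likewise a hypothetical knob for any free storage allowance, and request charges are not modeled.

```python
def monthly_storage_cost_usd(size_gb: float, price_per_gb_month: float = 0.015,
                             free_gb: float = 0.0) -> float:
    """Storage-only monthly cost in USD, rounded to cents."""
    billable_gb = max(0.0, size_gb - free_gb)
    return round(billable_gb * price_per_gb_month, 2)


print(monthly_storage_cost_usd(28))               # no free allowance assumed
print(monthly_storage_cost_usd(28, free_gb=10))   # hypothetical 10 GB free allowance
```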
