Issues with large databases / large job output #192
Do you have any test results on this? |
As mentioned on #194, it might be worth converting the output column to a BLOB. |
Sure, I've got a 22 GB database around (real data; I would initially work on a copy). Actually, right now is the perfect time for some testing, as I'm currently setting up a NetBSD-based VAX pkgsrc pbulk package auto builder. Until I figure out how to do that properly, I cannot really continue with the other tests, so I've got (I guess) about a week---and holidays! But will a BLOB be stored "outside" of the rest of the row record? My impression is that sqlite will internally always load a full row, which may be like "1 KB" for all relevant metadata and (in my case) an additional "100 MB" for compressed log output. So I guess that this would only work out if sqlite put the BLOBs into separate files. OTOH, I rather think that breaking the current output storage out into a separate table (or separate files) might be the better approach. |
So let's give this a try:
Let's see how well that works. :) |
Tried that, with limited success: With the Laminar web frontend opened, freshly queued jobs show up and you can follow their live output. However, once the jobs finish (and after a frontend reload), they no longer show up. It also seems they didn't make it to the database at all. |
Hmm, I thought sqlite would just handle BLOB/TEXT interop. Could you try with this patch? |
Sorry for being slow responding... I just applied that patch, placed my converted database file in place, and now it's just reading loads and loads of data instead of "just" handing out the list of recent jobs. (The DB file is about 22 GB, around 60'000 runs. If you'd like to have either one of the two DB files (with BLOBs and without), let me know!) I've now been waiting way over a minute and am still waiting for the recent job list (along with queued jobs) to be shown. Will add another comment once I can check whether that patch actually fixes the intermediate problem of jobs getting "lost", i.e. whether their output is now being stored properly. |
After restart, it was actually quite quick displaying a job list. Probably it had all the 22 GB in cache? NB: Looking at the Web frontend, I realize that it likely crashed after storing the first finished job's output. Because that one job is still shown (with 8 others that were running at that time having vanished without having left a trace...) |
So with the new round, a number of jobs finished and compressed output was successfully stored. It feels as fast as it was before when the database was "hot" in the cache. (The host has enough RAM to keep the 22 GB in memory when non-demanding jobs are run.) The initially observed crash didn't happen again up to now. |
I've done a few more tests, and I fear there's no real speedup: As long as the whole database is cached in RAM, things are fast (with and without BLOBs.) But as soon as caches are cleared, it's slow, as it loads huge amounts of data from disk. So I think for large databases, maybe eg. the PostgreSQL work might help. But in the end, getting the (probably huge in comparison) output data out of the main table is probably what's needed. |
In those cases, I would certainly look into storing the output externally. According to a rule of thumb from SQLite upstream's own Internal Versus External BLOBs in SQLite, it's faster to store BLOBs larger than ~100kB externally, instead of internally. They also mention page sizes of 8192 or 16384 bytes might perform better with large BLOBs (instead of the 4096 bytes default nowadays). How about the following ideas:
My hunch says nr. 1. would have the largest impact on performance, and might be good enough in itself. |
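For reference, the page size of an existing database can only be changed by rebuilding it; a hedged sketch of how that might be done in the sqlite3 shell (the value is just an example, and this only works while the database is not in WAL mode):

```sql
-- Hedged sketch: rebuild an existing database with a larger page size.
-- VACUUM rewrites the whole file, so expect roughly "database size" worth
-- of I/O and the same amount of free disk space for the rebuild.
PRAGMA page_size = 16384;  -- 8192 would work the same way
VACUUM;                    -- rebuilds the file using the new page size
PRAGMA page_size;          -- verify the value now in effect
```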
Even with ~100 kB, the DB would be slow'ish to access: It seems there isn't much of an index in use. And when pulling in overview data (ie. the most recent /x/ runs to show on the greeting page), it'll always load a run's complete row. Then it makes a huge difference whether you're only loading a few hundred bytes (name, number, maybe reason, queue/start/finish timestamps) or also the zipped output. |
Sure, there might be more than one source causing the perceived low performance. Storing huge blobs inside the database and a lack of proper indexes do not exclude each other.
I understand that, and I agree reducing the amount of data to be read is a good direction to aim for.
SQLite is a real database ;) But I agree that other databases might have built-in optimizations to handle such cases better. My feeling about the likely underlying issue is that Laminar's approach around SQLite is not designed/prepared for use cases at such scale yet.
I expect using a separate storage location for the large output data is where the real win would be. As a stop-gap measure, especially given the database still fits into memory, perhaps using Memory-Mapped I/O with SQLite would help somewhat with the current status quo around large blobs (but probably not around the lack of indexes) 🤔 From a purely performance perspective, my main bet would still be on storing such large outputs as files, though. That has been proposed by at least two of us.
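For completeness, a hedged sketch of what enabling memory-mapped I/O might look like (the size is an arbitrary example; the effective value is capped by the compile-time SQLITE_MAX_MMAP_SIZE, so very large settings may be silently reduced):

```sql
-- Hedged sketch: let SQLite access up to ~30 GB of the database file via
-- mmap instead of read()/write(). The setting is per database connection.
PRAGMA mmap_size = 30000000000;
PRAGMA mmap_size;  -- returns the value actually in effect
```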
I would be interested to obtain a copy somehow to try and reproduce your use case and locally experiment with alternatives sometime. |
I can certainly make that available, though must see how I can offer that somewhere. Give me some time on that. :) |
Will probably not keep that around forever, it's 24 GB. OTOH, you'd be able to easily create some random job that creates 500 MB of Lorem Ipsum text along with time stamps (so it doesn't compress too well.) |
@jbglaw: I got a copy, thank you! |
As far as I'm aware the existing indexes are appropriate for the current schema. But it's not my specialty, so suggestions most definitely welcome!
Seems like this is the consensus for sqlite. Assuming support for postgres gets added, where should the logs be stored in that case? I think it's probably more convenient for the administrator to store the logs in postgres along with the rest of the data, despite that being different behaviour from the (proposed) sqlite mechanism, assuming there are no performance concerns? |
I also think that the indexes are correct. But sqlite either doesn't (properly?) use them, or I trigger some unexpected behavior. Maybe to sqlite it looks like a good idea to just do a table scan, ignoring the fact that a "row" in the table isn't just a few bytes (where you'd get a good number of rows per 4K read), but that my "rows" may be 100 MB in size... So before progressing in any direction, it would be good to see if eg. PostgreSQL would handle exactly this situation better than sqlite. The alternative would be to either store build output in a separate table or as external files. |
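One way to check whether sqlite picks an index or falls back to a full table scan is EXPLAIN QUERY PLAN; a hedged example against the builds schema (the exact overview query is an assumption on my part, not the one laminard actually issues):

```sql
-- Hedged example: show how SQLite intends to execute an overview-style query.
-- "SCAN builds" in the output means a full table scan; "SEARCH builds USING
-- INDEX ..." means an index is used.
EXPLAIN QUERY PLAN
SELECT name, number, result, completedAt
FROM builds
ORDER BY completedAt DESC
LIMIT 20;
```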
Using the downloaded copy, I don't seem to be able to (fully) reproduce the issue. I feel like I might be missing some crucial detail in how to reproduce it, or my setup is sufficiently different. Admittedly, I've only done read-only usage of the database, and did not try to write anything to it. I dump some info below in hopes of it being useful; feel free to ask me for more details if you want to compare your results/environment. What did I check while trying to reproduce the issue?
Questions and notes
Based on the above experiments, I have the following questions and notes: What is the current size of your database file, and how long does it take to read it in full?
What are those "overview pages" exactly?
My understanding is that the SQLite database is in the same
I've tested it with querying all fields except output:

SELECT name, number, node, queuedAt, startedAt, completedAt, result, outputLen, parentJob, parentBuild, reason FROM builds;

It finishes in about 4000 ms with cold caches. To me, that suggests SQLite does not have to fetch the whole row from the file, only the requested fields. In comparison, asking also for the output column takes considerably longer.
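In case anyone wants to repeat the comparison, the sqlite3 shell can time both variants directly with .timer, which prints the elapsed time per statement (a hedged sketch; drop the OS page cache between runs to get cold-cache numbers):

```sql
.timer on
-- Variant without the output column:
SELECT name, number, node, queuedAt, startedAt, completedAt, result,
       outputLen, parentJob, parentBuild, reason FROM builds;
-- Variant additionally fetching the compressed output:
SELECT name, number, node, queuedAt, startedAt, completedAt, result,
       output, outputLen, parentJob, parentBuild, reason FROM builds;
```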
I can confirm this for most queries, presumably due to the filesystem caching provided by the kernel. My cache usage for the 25GB database file never went above ~4-4.5GB though (and I certainly don't have enough RAM to cache it all anyway). Cold/warm caches did not seem to matter with large enough result sets (presumably when the data for the result set is not already fully in the cache).

Possible workarounds
I wonder if the effect of this can be at least partially mitigated for now by writing the jobs in a slightly different way, for example:
Further steps
It is certainly an interesting performance problem to track down, even though I feel like there might still be quite a few pieces missing to get a full answer. I believe it might be much faster to concentrate our efforts by teaming up more closely to solve this. I would be up to pair with one or more of you as part of my regular Open Source Office Hours offering for a closer look together. |
With that setup (using an NVMe disk), you can probably read the whole file within a few seconds. I've got way more RAM, but that's actually also in use. From my system:
A bit larger by now, but it's about 2 min.
Yeah! I had an attempt at using a summary table, which I hoped I could get in a materialized version. But sqlite doesn't seem to support that.
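In case it helps, one way to fake a materialized view in SQLite is a small summary table kept current by a trigger; a hedged sketch with made-up names (builds_summary is not part of Laminar's schema, the builds columns are the ones discussed in this thread):

```sql
-- Hedged sketch: a manually "materialized" per-job summary, updated on insert.
CREATE TABLE IF NOT EXISTS builds_summary (
    name            TEXT PRIMARY KEY,
    lastNumber      INT,
    lastResult      INT,
    lastCompletedAt INT
);

CREATE TRIGGER IF NOT EXISTS builds_summary_ai AFTER INSERT ON builds
BEGIN
    INSERT OR REPLACE INTO builds_summary(name, lastNumber, lastResult, lastCompletedAt)
    VALUES (NEW.name, NEW.number, NEW.result, NEW.completedAt);
END;

-- An overview page could then read the small table only:
SELECT name, lastNumber, lastResult
FROM builds_summary
ORDER BY lastCompletedAt DESC
LIMIT 20;
```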
Oh, does it? I didn't check that, but maybe I'd add it manually?
That's another issue (ie. unresponsiveness while committing a run.)
File was possibly copied live.
ACK, that totally depends on cache / availability of (all?) of the data. Would be nice to see how that looks with cleared caches and while keeping an eye on the actual disk I/O.
How was I/O for that?
That ... can happen. ;-)
An experiment in the hope of getting a materialized view that sqlite would keep up to date by itself.
The simple "start" page, ie. http://toolchain.lug-owl.de/laminar/
With your fast disk setup, 4 sec could be enough to read all the 24 GB! 24 sec sounds like there's some more processing happening.
I'm quite certain that setting aside the actual job output to a file (or maaaaybe to a separate table) would quite completely resolve any issue I see. My disks aren't nearly as fast as yours.
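For the separate-table variant, a hedged sketch of what it could look like (table and column names are made up for illustration, not an actual Laminar migration):

```sql
-- Hedged sketch: keep only small metadata in builds and move the compressed
-- log into its own table, so overview queries never touch the big blobs.
CREATE TABLE IF NOT EXISTS build_outputs (
    name   TEXT NOT NULL,
    number INT  NOT NULL,
    output BLOB,
    PRIMARY KEY (name, number)
);

-- The blob is then only fetched when a single run's log page is opened:
SELECT output FROM build_outputs WHERE name = ? AND number = ?;
```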
Sure! I'd like to join! |
Sure, NVMe may be faster, so here's the same command output for my case:
I also did some more benchmarks (note, it was not the only load on the machine at the time, though, hence some extra variability is assumed):
Yes, I believe there are no materialized views in SQLite (yet?), only materialized CTEs.
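For the record, the materialized-CTE form only lasts for the duration of a single statement, so it is no substitute for a real materialized view; a hedged example against the builds table (SQLite 3.35 or newer):

```sql
-- Hedged example: the MATERIALIZED hint asks SQLite to compute the CTE once
-- into a transient table for this statement, instead of inlining it.
WITH latest AS MATERIALIZED (
    SELECT name, MAX(number) AS lastNumber
    FROM builds
    GROUP BY name
)
SELECT b.name, b.number, b.result
FROM builds AS b
JOIN latest AS l ON l.name = b.name AND l.lastNumber = b.number;
```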
It is my own Laminar instance's production schema that is missing this index. That looks like a separate bug, though. Just mentioned it for the sake of completeness, as I believe it should be there after the migration introduced between 1.2 and 1.3 (I'm not linking issue 182 on purpose here to avoid cross-spamming threads).
It might be another issue, right, but still it's the only one I could notice with the "huge outputs" database file. My setup seems to freeze while rendering the huge outputs returned from the database with the blob patches. I made sure to apply the blob patches locally too, but the result is the same with or without the patches. That further strengthens the "this is a browser/JS/rendering issue, rather than a database one" hypothesis. I did not attempt any write to the database on purpose (in fact, I marked the file immutable to avoid even accidental modification). If committing the job output to the database fails or is slow, I expect something like this happens around there:
I doubt the problematic step is the 18 MB disk write while committing the data. The compression part feels like the more likely culprit (especially since it crashes). Though I'd expect that to be reasonably fast too, since it's all in RAM (unless it gets swapped out). It would be great to measure this part with a profiler, both for a "large but successfully committable" result and a "too large to commit" result.
Oh, I guess that explains it, thanks! 👍
With cold caches, the same query I/O load looks like this:
There's a burst of reads as expected, and the overall amount read is only ~230 MB. This again indicates that SQLite does not read the whole database, only what's requested (which admittedly might cause a lot of seek time, especially on rotational media).
I couldn't meaningfully measure it yet with vmstat/pidstat/iotop due to the very short time it takes. Judging by the time it takes, I believe it is just retrieving this 18 MB, plus perhaps some process startup overhead and buffering of the output. At this time, I don't have reason to assume it does anything else, especially things like huge disk I/O.
OK, thanks, I can make sure to look at the performance of my landing page then too.
It was a good idea to check, but it's far from that. It takes ~18 seconds for the full database to be read, as reported above with the raw full-file read.
Could be, but that doesn't change my main point here: SQLite does not seem to read all the fields of a row from disk, only the requested fields. If this query made SQLite read all fields of all rows, it could not return in ~4 seconds with cold caches, since reading the whole database takes ~18 seconds.
Based on the information available, I don't expect storing the output in a separate table would help too much. I expect limiting the amount of job output that gets stored in the database has higher chance of improving the situation in this case.
Please book a matching time slot through the above link. I expect it would be much faster to iterate over hypotheses and measurements if we are able to team up on this. Especially since the messages are getting quite long to follow easily 😅 |
Even if the log's in a separate file, the browser is going to struggle fetching and ansi-highlighting a ~900 MB log. The quick "solution" is to just cap how much of a huge log gets fetched and rendered in the browser.
I'm happy to chat but don't have a lot of spare time and am in UTC+13 so could only realistically join in your morning hours. |
Yes, I guess that's what breaks my job output pages in the browser. My system seems to be able to retrieve the data from the database quickly, but struggles with inflating it back to the raw ~900 MB version in the single-threaded process, and/or with rendering (displaying/highlighting/etc.) that much data on a single page. @jbglaw's system seems to struggle with storing that ~900 MB of raw data into the database sometimes (presumably during compression before inserting into the database), and with displaying the landing page ("initial overview query"). I look forward to learning more details about that exact setup soon together. I believe that now we're digging deeper into performance characteristics with huge datasets, we have identified at least three different topics already, so it might be wise to split the discussion at some point into separate issues. I don't expect the solution to any of those is as simple as "just switch to some other database". For now, I'd say a likely interesting candidate to investigate is the in-process single-threaded (de)compression of large output, which potentially affects two separate cases we've seen so far.
That sounds like a nice defensive approach to prevent things from escalating when huge outputs are involved 👍 I wonder if it's possible to get only the last 10 kB of a huge compressed output without decompressing the whole thing, though 🤔 To reiterate on "limiting the amount of job output that gets stored in the database", I meant one or more of the multiple options mentioned earlier (which I did not wish to repeat due to length):
If there's a fitting slot in my booking calendar, feel free to grab it. If there's no fitting slot, I'm happy to find one out of the listed ones (please feel free to reach out to me on any channels listed on my profiles). |
I took a closer look at SQLite internals, and I feel I can clarify some of our findings so far.

SQLite internals
SQLite, like many other databases, stores data in pages. The default page size is 4096 bytes since 3.12.0 (2016-03-29); it was 1024 bytes before that. These pages are arranged into a B-tree structure for fast localization (see B-tree Pages). Since the unit of lookup and retrieval is one page, it is generally a good starting point for performance reasons to match the page size to the native I/O block size of the underlying storage media and/or file system. With small rows, even multiple of them may fit into a single page. Large rows, like the ones we look at here with the large output blobs, don't fit into a single page and spill over onto a chain of overflow pages. Also, the part of a query between SELECT and FROM is called the result expression list (see Generation of the set of result rows). Each expression in the result expression list is evaluated for each row retrieved by the query.

Refined retrieval procedure
Based on the above information, it's a bit of an oversimplification from me to state that "SQLite doesn't retrieve columns that were not asked for". My understanding of what happens is more like:
For the purposes identified in this issue, though, that still feels like, and is very close to, not retrieving pages from disk (or cache) that it already knows it won't need to answer the query.

Further considerations for performance
@jbglaw: based on the info you provided about reading the whole database (roughly 2 minutes for the ~24 GB file), is that database stored on rotational disks? If yes, it would be worth checking the following:
Especially on rotational media, the actual disk layout of the database file may be interesting too. A single ~25 GB database file might not be laid out sequentially on disk, but scattered around the platter (which is less of a problem on non-rotational media). |
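If it's useful, the share of overflow pages versus regular b-tree pages (and thus how much of the file belongs to the big output blobs) can be inspected with the dbstat virtual table; a hedged sketch (requires an SQLite build with SQLITE_ENABLE_DBSTAT_VTAB, which the stock sqlite3 shell normally has):

```sql
-- Hedged sketch: page counts and on-disk bytes per table/index, plus the
-- split between internal/leaf/overflow pages of the builds table.
SELECT name, COUNT(*) AS pages, SUM(pgsize) AS bytes_on_disk
FROM dbstat
GROUP BY name
ORDER BY bytes_on_disk DESC;

SELECT pagetype, COUNT(*) AS pages, SUM(pgsize) AS bytes_on_disk
FROM dbstat
WHERE name = 'builds'
GROUP BY pagetype;
```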
After letting the topic rest for a while, I think there are at least two more aspects that I was not able to investigate fully yet.

Column order
The patched database schema has the BLOB output column at the end of the row. In contrast, the vanilla unpatched schema has the TEXT output column in the middle of the row, followed by outputLen, parentJob, parentBuild and reason. This may cause what was perceived as "SQLite reads the whole row" in the original post, but which I couldn't reproduce yet with the sample database using the BLOB patch.

Locking with rollback journals
Since Laminar does not seem to use the write-ahead log mode of SQLite, I assume the default rollback journal mode is in effect. This and the involved locking may lead to the following situations when all database clients are in the same OS process (laminard):
Perhaps enabling WAL mode, and possibly adjusting the related journaling/locking pragmas, would already help with the write side here. |
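A hedged sketch of what that could look like (journal_mode = WAL is a persistent database-level setting, busy_timeout is per connection; whether this actually helps Laminar's access pattern would need testing):

```sql
-- Hedged sketch: switch to write-ahead logging so readers are not blocked by
-- the writer, and wait briefly instead of failing when the database is locked.
PRAGMA journal_mode = WAL;    -- persistent; returns "wal" on success
PRAGMA synchronous = NORMAL;  -- common pairing with WAL, fewer fsyncs
PRAGMA busy_timeout = 5000;   -- per connection, in milliseconds
```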
I've tested this conjecture by copying all data from the "blob" database into a vanilla database (with the unpatched TEXT schema):

ATTACH "laminar.sqlite.blob" AS blob;
ATTACH "laminar.sqlite" AS text;
INSERT INTO text.builds(name, number, node, queuedAt, startedAt, completedAt, result, output, outputLen, parentJob, parentBuild, reason) SELECT name, number, node, queuedAt, startedAt, completedAt, result, output, outputLen, parentJob, parentBuild, reason FROM blob.builds;

Then I ran one of the previously used test queries:

SELECT name, number, outputLen FROM builds ORDER BY outputLen DESC LIMIT 1;

This used to cause ~230 MB of disk I/O in the earlier measurement. In the vanilla version of the database the output column sits in the middle of the row, so reaching the trailing columns means walking through the overflow pages holding the output, and the same query ends up causing far more I/O. Based on this, it should be ensured that the large output column is the last column of the builds row, as it is in the blob-patched schema.

Retrieving the largest compressed output with:

SELECT writefile('/tmp/blob', output) FROM builds WHERE name = 'netbsd-amd64-x86_64' AND number = 19;

is still very fast (~20-40 ms), reading and then writing ~18 MB from disk (uncompressed ~1 GB). Opening a large output in the UI still freezes, so my next best guess about this part is still around the single-threaded extraction and/or highlighting. The write problem is still probably related to locking and/or journal mode.

In summary, I believe we have identified at least 3 different topics, some of them with multiple options to address and investigate further – it might be time to split those into their own issues, and maybe use this one as a "tracker" issue. I would also still be happy to look at specific scenarios on specific hardware together with some of you in case that's helpful to learn more. It certainly looks like there are several opportunities to push performance bottlenecks further via built-in SQLite settings and improved database design (which is expected to be lower complexity than attempting to migrate to other database technologies). |
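If the column order does turn out to be the decisive factor, moving the blob to the end of the row means rebuilding the table; a hedged sketch (column list taken from the INSERT above; the real primary key and index definitions are omitted and would have to be carried over):

```sql
-- Hedged sketch: rebuild builds with the large output column as the last one.
BEGIN;
CREATE TABLE builds_new (
    name TEXT, number INT, node TEXT,
    queuedAt INT, startedAt INT, completedAt INT, result INT,
    outputLen INT, parentJob TEXT, parentBuild INT, reason TEXT,
    output BLOB   -- large column moved to the very end of the row
);
INSERT INTO builds_new (name, number, node, queuedAt, startedAt, completedAt,
                        result, outputLen, parentJob, parentBuild, reason, output)
    SELECT name, number, node, queuedAt, startedAt, completedAt,
           result, outputLen, parentJob, parentBuild, reason, output
    FROM builds;
DROP TABLE builds;
ALTER TABLE builds_new RENAME TO builds;
COMMIT;
```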
Hi!
With my current setup (http://toolchain.lug-owl.de/laminar/), it may or may not take a lot of time to get the overview pages.
This is a setup with nearly 2300 jobs, ~48500 runs. I've recently started to document "build rounds", thus a complete execution of all of these jobs: http://toolchain.lug-owl.de/reports/round-overview-0.html That sheds some light on the output and gzip'ed output sizes. The largest individual outputs generated are in the area of 700 to 900 MB (uncompressed) or 16 to 18 MB (compressed) for an individual (!) run.
I've not tested it yet, but it seems laminard calling into sqlite will always load a complete DB row with all the gz'ed output, even for just getting build metadata. (The DB is about some 12 GB right now.) Doing a fresh page load on a "cold" system takes quite some time; it's much faster when much of the 12 GB is already in memory.

Another observation is that (I guess!) when a run finishes, compressing/writing out the run's output will block all other laminard operation. But I guess this cannot be changed due to its nature being single-threaded.

However, for fetching a build's metadata, it may or may not be feasible to place all output data into a separate table.

(Also, I think I still saw instances where a curl that was meant to follow log output will exit too early for a still-running job. I think this happens when fresh log output is generated while previous output is sent to the client.)

I'll try to prepare a few test cases to quickly load a lot of data into the DB and maybe also for the mentioned curl issue.