
Issues with large databases / large job output #192

Open · jbglaw opened this issue Mar 27, 2023 · 28 comments

@jbglaw (Contributor) commented Mar 27, 2023

Hi!

With my current setup (http://toolchain.lug-owl.de/laminar/), loading the overview pages sometimes takes a very long time.

This is a setup with nearly 2300 jobs and ~48500 runs. I've recently started to document "build rounds", i.e. a complete execution of all of these jobs: http://toolchain.lug-owl.de/reports/round-overview-0.html That sheds some light on the raw and gzip'ed output sizes. The largest individual outputs are in the range of 700 to 900 MB (uncompressed), or 16 to 18 MB (compressed), for a single (!) run.

I haven't verified it yet, but it seems that laminard's SQLite queries always load a complete DB row, including the gzip'ed output, even when only the build metadata is needed. (The DB is about 12 GB right now.) Doing a fresh page load on a "cold" system takes quite some time; it's much faster once much of the 12 GB is already in memory.

Another observation (a guess!) is that when a run finishes, compressing and writing out the run's output blocks all other laminard operations. Perhaps that cannot easily be changed, since laminard is single-threaded.

However, to make fetching a build's metadata cheap, it might be feasible to move all output data into a separate table.

(Also, I think I still see instances where a curl that is meant to follow log output exits too early for a still-running job. My guess is that this happens when fresh log output is generated while previous output is being sent to the client.)

I'll try to prepare a few test cases to quickly load a lot of data into the DB and maybe also for the mentioned curl issue.

@ohwgiles (Owner) commented May 7, 2023

Do you have any test results on this?

@ohwgiles (Owner) commented:

As mentioned on #194, it might be worth converting the output column in the database to BLOB type. Are you able to try that out and see if it makes a noticeable difference to your workload?

@jbglaw (Contributor, Author) commented Jul 23, 2023

Sure, I've got a 22 GB database around (real data; I'd initially work on a copy). Actually, right now is the perfect time for some testing: I'm currently setting up a NetBSD-based VAX pkgsrc pbulk package auto-builder, and until I figure out how to do that properly I cannot really continue with the other tests. So I've got (I guess) about a week---and holidays!

But will a BLOB be stored "outside" of the rest of the row record? My impression is that sqlite will internally always load a full row, which may be like "1 KB" for all relevant metadata plus (in my case) an additional "100 MB" of compressed log output. So I guess this would only work out if sqlite put the BLOBs into separate files. OTOH, I rather think we should break the current builds table into two tables: builds with everything except output and outputLen, and a new output table with e.g. id, buildId (referencing builds, which would get a new id column) and the two former output and outputLen columns.
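
A rough sketch of what that split could look like (the table and column names are just placeholders taken from this description, not an agreed-upon schema):

CREATE TABLE builds(id INTEGER PRIMARY KEY,
                    name TEXT, number INT UNSIGNED, node TEXT,
                    queuedAt INT, startedAt INT, completedAt INT, result INT,
                    parentJob TEXT, parentBuild INT, reason TEXT,
                    UNIQUE (name, number));
CREATE TABLE output(buildId INTEGER REFERENCES builds(id),
                    output BLOB, outputLen INT);

The overview queries would then only ever touch builds, while the (potentially huge) blobs in output are fetched only when a log is actually requested.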

@ohwgiles (Owner) commented:

But will a BLOB be stored "outside" of the rest of the row record?

IIUC, according to @mitya57's comment, yes

@jbglaw (Contributor, Author) commented Jul 24, 2023

So let's give this a try:

root@lili:/var/lib/laminar# systemctl stop laminar.service
root@lili:/var/lib/laminar# cp laminar.sqlite test.sqlite
root@lili:/var/lib/laminar# sqlite3 test.sqlite
SQLite version 3.40.1 2022-12-28 14:03:47
sqlite> ALTER TABLE builds ADD COLUMN output2 blob;
sqlite> update builds set output2=output;
sqlite> ALTER TABLE builds drop COLUMN output;
sqlite> ALTER TABLE builds RENAME output2 TO output;
sqlite> VACUUM ;

root@lili:/var/lib/laminar# mv laminar.sqlite laminar.sqlite~bak
root@lili:/var/lib/laminar# mv test.sqlite laminar.sqlite
root@lili:/var/lib/laminar# systemctl start laminar.service

Let's see how well that works. :)

@jbglaw (Contributor, Author) commented Jul 28, 2023

Tried that, with limited success: With the Laminar web frontend opened, freshly queued jobs show up and you can follow their live output. However, once the jobs finish (and after a frontend reload), they no longer show up. It also seems they didn't make it to the database at all.

@ohwgiles (Owner) commented:

Hmm, I thought sqlite would just handle BLOB/TEXT interop. Could you try with this patch?
laminar-sqlite-blob.patch.txt

@jbglaw (Contributor, Author) commented Sep 1, 2023

Sorry for being slow to respond... I just applied that patch, put my converted laminar.sqlite file in place (as above) and started laminard again.
I then queued a number of jobs and opened the web frontend. It behaves much like before: it still shows me "Connecting...", and looking at the strace output:

pread64(6, "\r\0\0\0\1\7r\0\7r\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 1365565440) = 4096
pread64(6, "\r\0\0\0\1\5\265\0\5\265\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 1160654848) = 4096
pread64(6, "\r\0\0\0\1\1\373\0\1\373\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 211525632) = 4096
pread64(6, "\r\0\0\0\1\7v\0\7v\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 14761984) = 4096
pread64(6, "\r\0\0\0\1\7k\0\7k\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 22463127552) = 4096
pread64(6, "\r\0\0\0\2\4\v\0\t\262\4\v\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 20118360064) = 4096
pread64(6, "\r\0\0\0\3\0\325\0\v\25\5\354\0\325\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 17419665408) = 4096
pread64(6, "\r\0\0\0\2\1\363\0\t\23\1\363\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 17153323008) = 4096
pread64(6, "\r\0\0\0\2\2\357\0\t\205\2\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 14452269056) = 4096
pread64(6, "\r\0\0\0\1\0m\0\0m\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 7680790528) = 4096
pread64(6, "\r\0\0\0\6\4N\0\16\r\f\32\n'\0104\6A\4N\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 7464267776) = 4096
pread64(6, "\r\0\0\0\1\0\321\0\0\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 7250530304) = 4096
pread64(6, "\r\0\0\0\2\0`\0\10\37\0`\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 6753173504) = 4096
pread64(6, "\r\0\0\0\1\1.\0\1.\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 6336630784) = 4096
pread64(6, "\r\0\0\0\5\1\344\0\rN\n\204\7\230\4\270\1\344\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 6171406336) = 4096
pread64(6, "\r\0\0\0\2\0+\0\10\5\0+\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 6086373376) = 4096
pread64(6, "\r\0\0\0\1\1\207\0\1\207\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 5920780288) = 4096

...it's just reading loads and loads of data instead of "just" handing out the list of recent jobs. (DB file is about 22 GB, around 60'000 runs. If you'd like to have either one of the two DB files (with BLOBs and without), let me know!)

I've now been waiting well over a minute and the recent job list (along with queued jobs) still hasn't been shown. I'll add another comment once I can check whether that patch actually fixes the intermediate problem of jobs getting "lost", i.e. whether their output is now stored properly.

@jbglaw (Contributor, Author) commented Sep 1, 2023

laminard crashed (SIGSEGV), unfortunately I had no gdb attached.

After a restart, it was actually quite quick to display the job list. Probably much of the 22 GB was still in the cache?

NB: Looking at the web frontend, I realize that it likely crashed after storing the first finished job's output: that one job is still shown, while 8 others that were running at the time have vanished without leaving a trace...

@jbglaw (Contributor, Author) commented Sep 1, 2023

So with the new round, a number of jobs finished and their compressed output was successfully stored. It feels as fast as before, when the database was "hot" in the cache. (The host has enough RAM to keep the 22 GB in memory while non-demanding jobs are run.) The initially observed crash hasn't happened again so far.

@jbglaw (Contributor, Author) commented Sep 3, 2023

I've done a few runs by now. The initial crash I noticed hasn't shown up since.

I fear there's no real speed-up: as long as the whole database is cached in RAM, things are fast (with and without BLOBs). But as soon as the caches are cleared, it's slow, because huge amounts of data are loaded from disk. So for large databases, maybe e.g. the PostgreSQL work might help. But in the end, getting the (comparatively huge) output data out of the rows is the main thing that would help here.

@ferki (Contributor) commented Sep 27, 2023

The largest individual outputs generated are in the area of 700 to 900 MB (uncompressed) or 16 to 18 MB (compressed) for an individual (!) run.

In those cases, I would certainly look into storing the output externally.

According to a rule of thumb from SQLite upstream's own Internal Versus External BLOBs in SQLite, it's faster to store BLOBs larger than ~100kB externally, instead of internally.

They also mention page sizes of 8192 or 16384 bytes might perform better with large BLOBs (instead of the 4096 bytes default nowadays).
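
For reference, the page size can be inspected and, on a rollback-journal database, changed with PRAGMAs; changing it rewrites the whole file, so this is only a sketch to try on a copy while laminard is stopped:

PRAGMA page_size;          -- show the current page size (4096 by default)
PRAGMA page_size = 16384;  -- new value takes effect on the next VACUUM
VACUUM;                    -- rebuild the file with the new page size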

How about the following ideas:

  1. Set 100kB as a hard limit in code (or perhaps a compile-time setting) to decide whether the job output is stored inside the database or externally in a file (storing only the filename in the database).
  2. Since the optimal threshold potentially depends on the given hardware and use case at hand, perhaps offer a configuration option for it later if needed.
  3. Optionally allow setting the page size of the SQLite database upon creation.

My hunch is that nr. 1 would have the largest impact on performance, and might be good enough by itself.

@jbglaw (Contributor, Author) commented Sep 27, 2023

Even with a ~100 kB threshold, DB access would still be slow-ish: it seems there isn't much of an index in use. And when pulling in overview data (i.e. the most recent x runs to show on the greeting page), it always loads a run's complete row. It makes a huge difference whether you're only loading a few hundred bytes (name, number, maybe reason, queue/start/finish timestamps) or also the zipped output. I guess that a "real" database would probably perform better, but the basics would apply there as well: either keep the output in a separate table (quickly accessible via an index), or place it into external files.
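
If the front page really only needs a handful of metadata columns, one possible SQLite-level mitigation that avoids schema surgery would be a covering index, so the overview query never has to touch the huge table rows at all. The column list below is only a guess; it would have to match whatever laminard actually selects:

CREATE INDEX idx_overview ON builds(completedAt DESC, name, number, result);

Whether such an index actually gets used could then be verified with EXPLAIN QUERY PLAN.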

@ferki (Contributor) commented Sep 27, 2023

Even with ~ 100 kB, the DB would be slow'ish in access: It seems there isn't much of an index in use.

Sure, there might be more than one source of the perceived low performance. Storing huge blobs inside the database and a lack of proper indexes do not exclude each other.

And when pulling in overview data (ie. the most recent /x/ runs to show on the greeting page), it'll always load a run's complete row. Then it makes a huge difference if you're only loading a few'ish 100 bytes (name, number, maybe reason, queue/start/finish timestamps) or also the zipped output.

I understand that, and I agree reducing the amount of data to be read is a good direction to aim for.

I guess that a "real" database would probably perform better.

SQLite is a real database ;) But I agree that other databases might have built-in optimizations to handle such cases better.

My feeling about the likely underlying issue is that Laminar's approach around SQLite is not designed/prepared for use cases at such scale yet.

But I guess that the basics would apply here as well: Either keep the output in a separate table (and quickly accessible with an index), or place it into external files.

I expect using a separate outputs table would not help much since SQLite would still be just a single file (and perhaps some supporting files like WAL, etc.).

As a stop-gap measure, especially given the database still fits into memory, perhaps using Memory-Mapped I/O with SQLite would help somewhat with the current status quo around large blobs (but probably not around lack of indexes) 🤔
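
For an experiment on a copy with the sqlite3 shell, that would look roughly like this (the value is an arbitrary example; for laminard itself the PRAGMA would have to be issued from its own code or a patched build):

PRAGMA mmap_size = 1073741824;  -- request up to 1 GB of the database file to be memory-mapped
PRAGMA mmap_size;               -- shows the limit actually in effect (may be capped at compile time)
SELECT name, number, outputLen FROM builds ORDER BY outputLen DESC LIMIT 1;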

From a purely performance perspective, my main bet would still be on storing such large outputs as files, though. That has been proposed by at least the two of us.

If you'd like to have either one of the two DB files (with BLOBs and without), let me know!)

I would be interested to obtain a copy somehow to try and reproduce your use case and locally experiment with alternatives sometime.

@jbglaw (Contributor, Author) commented Sep 27, 2023

I would be interested to obtain a copy somehow to try and reproduce your use case and locally experiment with alternatives sometime.

I can certainly make that available, though I must see how I can offer it somewhere. Give me some time on that. :)

@jbglaw (Contributor, Author) commented Sep 27, 2023

wget https://jbglaw-octopus.dyn.lug-owl.de/lami/laminar.sqlite.bak

I'll probably not keep that around forever; it's 24 GB.

OTOH, you'd easily be able to create some random job that generates 500 MB of Lorem Ipsum text along with timestamps (so it doesn't compress too well).

@jbglaw (Contributor, Author) commented Oct 1, 2023

@ohwgiles @ferki Did you two fetch that database? (Or the other way around: Are you still interested in keeping that download available?)

@ferki (Contributor) commented Oct 3, 2023

Did you two fetch that database? (Or the other way around: Are you still interested in keeping that download available?)

@jbglaw: I got a copy, thank you!

@ohwgiles (Owner) commented Oct 4, 2023

lack of proper indexes

As far as I'm aware the existing indexes are appropriate for the current schema. But it's not my specialty, so suggestions most definitely welcome!

From a purely performance perspective, my main bet would still be on storing such large outputs as files, though. That has been proposed by at least the two of us.

Seems like this is the consensus for sqlite. Assuming support for postgres gets added, where should the logs be stored in that case? I think it's probably more convenient for the administrator to store the logs in postgres along with the rest of the data, despite that being different behaviour from the (proposed) sqlite mechanism, assuming there are no performance concerns?

@jbglaw (Contributor, Author) commented Oct 5, 2023

I also think that the indexes are correct, but SQLite either doesn't (properly?) use them, or I trigger some unexpected behavior. Maybe the planner decides a table scan is a good idea, on the assumption that a "row" is only a few bytes (so that a single 4K read covers quite a number of rows), while my "rows" may actually be 100 MB in size...
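
One way to check that instead of guessing is to ask SQLite for the query plan of something shaped like the front-page query (the exact statement laminard issues may differ; this is just an approximation):

EXPLAIN QUERY PLAN
SELECT name, number, result, completedAt
  FROM builds
 ORDER BY completedAt DESC
 LIMIT 20;
-- "SCAN builds" would indicate a full table scan,
-- "SCAN builds USING INDEX idx_completion_time" that the index is being used.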

So before progressing in any direction, it would be good to see whether e.g. PostgreSQL would handle exactly this situation better than sqlite. The alternative would be to store build output either in a separate table or as external files.

@ferki (Contributor) commented Oct 11, 2023

Using the downloaded copy, I don't seem to be able to (fully) reproduce the issue. I feel like I might be missing some crucial detail in how to reproduce it, or my setup is sufficiently different. Admittedly, I've only used the database read-only and did not try to write anything to it. I dump some info below in the hope that it's useful; feel free to ask me for more details if you want to compare your results/environment.

What did I check while trying to reproduce the issue?

  1. I'm running these on Gentoo, with SQLite 3.43.2, Laminar-1.3, and kernel 6.1.53, using a Btrfs filesystem on an NVME drive plus 16GB RAM.

  2. The sample database from above seems to have a different schema than the default.

    My production schema:

    CREATE TABLE IF NOT EXISTS "builds_old"(name TEXT, number INT UNSIGNED, node TEXT, queuedAt INT, startedAt INT, completedAt INT, result INT, output TEXT, outputLen INT, parentJob TEXT, parentBuild INT, reason TEXT, PRIMARY KEY (name, number));
    CREATE INDEX idx_completion_time ON "builds_old"(completedAt DESC);
    CREATE TABLE builds(name TEXT, number INT UNSIGNED, node TEXT, queuedAt INT, startedAt INT, completedAt INT, result INT, output TEXT, outputLen INT, parentJob TEXT, parentBuild INT, reason TEXT, PRIMARY KEY (name, number DESC));

    Downloaded schema:

    CREATE TABLE builds_summary (name text,
                                number int unsigned,
                                queuedAt int,
                                startedAt int,
                                completedAt int,
                                result int);
    CREATE TABLE builds(name TEXT, number INT UNSIGNED, node TEXT, queuedAt INT, startedAt INT, completedAt INT, result INT, outputLen INT, parentJob TEXT, parentBuild INT, reason TEXT, output blob, PRIMARY KEY (name, number DESC));
    CREATE INDEX idx_completion_time ON builds(completedAt DESC);

    Notably:

    • the sample schema uses BLOB type for the output field instead of TEXT (an experiment from this issue)
    • the sample schema has a (custom?) builds_summary table
    • my production schema looks like it was migrated during the upgrade to version 1.2, but is missing the idx_completion_time index, which is possibly a separate bug
  3. I copied the sample database to a local installation of laminar under /var/lib/laminar/laminar.sqlite, and set an immutable attribute to it with chattr +i to avoid any changes to it during investigation.

  4. The sample database size is 25326919680 bytes or ~25GB.

  5. Opening the local Laminar UI in my browser feels fast for most endpoints I could think of to click around:

    What seemed slow/broken is selecting a finished run with a large output. There I only got partial output up to a certain point in the UI, and the browser tab froze (both in Chrome, and a WebKit based browser). Looking at the processes it was the browser process that was stuck on 100% CPU usage, the laminard process was fine without any excessive usage.

    To me, it suggests more like a (JavaScript?) rendering issue of huge output, rather than a database querying issue.

    There was also some inconsistency in the database, for example around the netbsd-amd64-x86_64 job. There's a run number 21 in the builds table with partial data available: name, number, node, queuedAt, startedAt and parentBuild are there, but the values for completedAt, result, outputLen, and output are missing, compared to other rows of the same job's runs.

  6. I queried for the largest job output based on outputLen, with

    SELECT name,number,outputLen FROM builds ORDER BY outputLen DESC LIMIT 1;

    According to time sqlite3 laminar.sqlite ..., it took about 3700 ms, while rerunning it is around 45 ms due to file system caching.

  7. Dropped laminar.sqlite from caches with dd if=laminar.sqlite iflag=nocache count=0.

  8. Queried the largest output blob, and saved it into /tmp/blob with:

    SELECT writefile('/tmp/blob', output) FROM builds WHERE name = 'netbsd-amd64-x86_64' AND number = 19;

    The size of /tmp/blob is 18722708 bytes, or around 18MB. According to time sqlite3 laminar.sqlite ... it took about 32 ms to write the blob with cold caches, and about 21 ms with warm ones.

  9. Extracted /tmp/blob into /tmp/blob2 with unpigz -c /tmp/blob > /tmp/blob2. The extracted size is 1000860844 bytes, or about ~955MB.

Questions and notes

Based on the above experiments, I have the following questions and notes:

What is builds_summary table?

it may or may not take a lot of time to get the overview pages.

What are those "overview pages" exactly?

it seems laminard calling into sqlite

My understanding is that the SQLite database is in the same laminard process, so there's no separate database to call into. It does need to fetch the data from the database file, though.

will always load a complete DB row with all the gz'ed output, even for just getting build metadata.

I've tested it with querying all fields except output:

SELECT name, number, node, queuedAt, startedAt, completedAt, result, outputLen, parentJob, parentBuild, reason FROM builds;

It finishes in about 4000 ms with cold caches. To me, that suggests SQLite does not have to fetch the whole row from the file, only the requested fields. In comparison, adding the output field to the list of fields in the query makes it finish in a much longer time (~24000 ms).

it's much faster when much of the 12 GB is already in memory.

I can confirm this for most queries, presumably due to the filesystem caching provided by the kernel. My cache usage for the 25GB database file never went above ~4-4.5GB though (and I certainly don't have enough RAM to cache it all anyway).

Cold/warm caches did not seem to matter with large enough result sets (presumably when the data for the result set is not fully already in the cache).

Possible workarounds

I wonder if the effect of this can be at least partially mitigated for now by writing the jobs in a slightly different way, for example:

  • Would it be possible to reduce some of the output in the jobs if there are some less critical info there?
  • Would it be possible to write parts or all of the output as an artifact file instead of storing it inside the database? This artifact might even be compressed at the end of the run, and would be linked from the run view by laminar if someone needs the full output.

Further steps

It is certainly an interesting performance problem to track down, even though I feel like there might be still quite a few pieces missing to get a full answer.

I believe it might be much faster to attempt to concentrate our efforts by teaming up more closely to solve this.

I would be up to pair with one or more of you as part of my regular Open Source Office Hours offering for a closer look together.

@jbglaw (Contributor, Author) commented Oct 11, 2023

What did I check while trying to reproduce the issue?

  1. I'm running these on Gentoo, with SQLite 3.43.2, Laminar-1.3, and kernel 6.1.53, using a Btrfs filesystem on an NVME drive plus 16GB RAM.

With that setup (using an NVMe disk), you can probably read the whole file within a few seconds. I've got way more RAM, but that's actually also in use. From my system:

root@lili:/var/lib/laminar# time cat ./laminar.sqlite > /dev/null

real    2m9.859s
user    0m0.074s
sys     0m25.828s
root@lili:/var/lib/laminar# ls -l laminar.sqlite
-rw-r--r-- 1 laminar laminar 27889238016 Oct 11 17:33 laminar.sqlite

A bit larger by now, but it's about 2 min.

  1. The sample database from above seems to have a different schema than the default.
    My production schema:

    CREATE TABLE IF NOT EXISTS "builds_old"(name TEXT, number INT UNSIGNED, node TEXT, queuedAt INT, startedAt INT, completedAt INT, result INT, output TEXT, outputLen INT, parentJob TEXT, parentBuild INT, reason TEXT, PRIMARY KEY (name, number));
    CREATE INDEX idx_completion_time ON "builds_old"(completedAt DESC);
    CREATE TABLE builds(name TEXT, number INT UNSIGNED, node TEXT, queuedAt INT, startedAt INT, completedAt INT, result INT, output TEXT, outputLen INT, parentJob TEXT, parentBuild INT, reason TEXT, PRIMARY KEY (name, number DESC));

    Downloaded schema:

    CREATE TABLE builds_summary (name text,
                                number int unsigned,
                                queuedAt int,
                                startedAt int,
                                completedAt int,
                                result int);
    CREATE TABLE builds(name TEXT, number INT UNSIGNED, node TEXT, queuedAt INT, startedAt INT, completedAt INT, result INT, outputLen INT, parentJob TEXT, parentBuild INT, reason TEXT, output blob, PRIMARY KEY (name, number DESC));
    CREATE INDEX idx_completion_time ON builds(completedAt DESC);

Yeah! That was an attempt at using a summary table, which I hoped to get in a materialized version. But sqlite seems to not support that.

Notably:

  • the sample schema uses BLOB type for the output field instead of TEXT (an experiment from this issue)
  • the sample schema has a (custom?) builds_summary table
  • my production schema looks like it was migrated during the upgrade to version 1.2, but misses the idx_completion_time index, which is possibly a separate bug

Oh, does it? I didn't check that, but maybe I should add it manually?

  1. I copied the sample database to a local installation of laminar under /var/lib/laminar/laminar.sqlite, and set an immutable attribute to it with chattr +i to avoid any changes to it during investigation.

  2. The sample database size is 25326919680 bytes or ~25GB.

  3. Opening the local Laminar UI in my browser feels fast for most endpoints I could think of to click around:

What seemed slow/broken is selecting a finished run with a large output. There I only got partial output up to a certain point in the UI, and the browser tab froze (both in Chrome, and a WebKit based browser). Looking at the processes it was the browser process that was stuck on 100% CPU usage, the laminard process was fine without any excessive usage.
To me, it suggests more like a (JavaScript?) rendering issue of huge output, rather than a database querying issue.

That's another issue (i.e. unresponsiveness while committing a run).

There was also some inconsistency in the database, for example around the netbsd-amd64-x86_64 job. There's a run number 21 in the builds table with partial data available: name, number, node, queuedAt, startedAt and parentBuild is there, but the values for completedAt, result, outputLen, and output are missing, compared to other rows of the same job's runs.

File was possibly copied live.

  1. I queried for the largest job output based on outputLen, with

    SELECT name,number,outputLen FROM builds ORDER BY outputLen DESC LIMIT 1;

    According to time sqlite3 laminar.sqlite ..., it took about 3700 ms, while rerunning it is around 45 ms due to file system caching.

ACK, that totally depends on cache / availability of (all?) of the data. Would be nice to see how that looks with cleared caches and while having vmstat or iotop running.

  1. Dropped laminar.sqlite from caches with dd if=laminar.sqlite iflag=nocache count=0.

  2. Queried the largest output blob, and saved it into /tmp/blob with:

    SELECT writefile('/tmp/blob', output) FROM builds WHERE name = 'netbsd-amd64-x86_64' AND number = 19;

    The size of /tmp/blob is 18722708 bytes, or around 18MB. According to time sqlite3 laminar.sqlite ... it took about 32 ms to write the blob with cold caches, and about 21 ms with warm ones.

How was I/O for that?

  1. Extracted /tmp/blob into /tmp/blob2 with unpigz -c /tmp/blob > /tmp/blob2. The extracted size is 1000860844 bytes, or about ~955MB.

That ... can happen. ;-)

Questions and notes

Based on the above experiments, I have the following questions and notes:

What is builds_summary table?

An experiment in the hope of getting a materialized view that laminard could use to render the initial job overview (recent jobs and running jobs).

it may or may not take a lot of time to get the overview pages.

What are those "overview pages" exactly?

The simple "start" page, ie. http://toolchain.lug-owl.de/laminar/

it seems laminard calling into sqlite

My understanding is that the SQLite database is in the same laminard process, so there's no separate database to call into. It does need to fetch the data from the database file, though.

will always load a complete DB row with all the gz'ed output, even for just getting build metadata.

I've tested it with querying all fields except output:

SELECT name, number, node, queuedAt, startedAt, completedAt, result, outputLen, parentJob, parentBuild, reason FROM builds;

It finishes in about 4000 ms with cold caches. To me, that suggests SQLite does not have to fetch the whole row from the file, only the requested fields. In comparison, asking also for the output field to the list of fields in the query finishes in much longer time (~24000 ms).

With your fast disk setup, 4 sec could be enough to read all the 24 GB!

24 sec sounds like there's some more processing happening.

it's much faster when much of the 12 GB is already in memory.

I can confirm this for most queries, presumably due to the filesystem caching provided by the kernel. My cache usage for the 25GB database file never went above ~4-4.5GB though (and I certainly don't have enough RAM to cache it all anyway).

Cold/warm caches did not seem to matter with large enough result sets (presumably when the data for the result set is not fully already in the cache).

Possible workarounds

I wonder if the effect of this can be at least partially mitigated for now by writing the jobs in a slightly different way, for example:

  • Would it be possible to reduce some of the output in the jobs if there are some less critical info there?
  • Would it be possible to write parts or all of the output as an artifact file instead of storing it inside the database? This artifact might even be compressed at the end of the run, and would be linked from the run view by laminar if someone needs the full output.

I'm quite certain that setting aside the actual job output to a file (or maaaaybe to a separate table) would quite completely resolve any issue I see. My disks aren't nearly as fast as yours.

Further steps

It is certainly an interesting performance problem to track down, even though I feel like there might be still quite a few pieces missing to get a full answer.

I believe it might be much faster to attempt to concentrate our efforts by teaming up more closely to solve this.

I would be up to pair with one or more of you as part of my regular Open Source Office Hours offering for a closer look together.

Sure! I'd like to join!

@ferki (Contributor) commented Oct 12, 2023

With that setup (using a NVMe disk), you can probably read the whole file within a few seconds. I've got way more RAM, but that's actually also in use. From my system:

root@lili:/var/lib/laminar# time cat ./laminar.sqlite > /dev/null

real    2m9.859s
user    0m0.074s
sys     0m25.828s
root@lili:/var/lib/laminar# ls -l laminar.sqlite
-rw-r--r-- 1 laminar laminar 27889238016 Oct 11 17:33 laminar.sqlite

A bit larger by now, but it's about 2 min.

Sure, NVMe may be faster, so here's the same command output for my case:

$ time cat ./laminar.sqlite > /dev/null

real    0m17.143s
user    0m0.054s
sys     0m8.878s

I also did some more benchmarks (note that this was not the only load on the machine at the time, so some extra variability is expected):

# with warmup and caches kept between runs (there was around 12-12.5 GB cache available at the time)
$ hyperfine -w 1 'cat ./laminar.sqlite > /dev/null'
Benchmark 1: cat ./laminar.sqlite > /dev/null
  Time (mean ± σ):     18.236 s ±  0.397 s    [User: 0.057 s, System: 9.899 s]
  Range (min … max):   17.909 s … 19.103 s    10 runs

# with the database file dropped from cache before each run
hyperfine --prepare 'dd if=laminar.sqlite iflag=nocache count=0' 'cat ./laminar.sqlite > /dev/null'
Benchmark 1: cat ./laminar.sqlite > /dev/null
  Time (mean ± σ):     18.302 s ±  0.573 s    [User: 0.057 s, System: 9.131 s]
  Range (min … max):   17.024 s … 19.046 s    10 runs

Yeah! That was an attempt at using a summary table, which I hoped to get in a materialized version. But sqlite seems to not support that.

Yes, I believe there are no materialized views in SQLite (yet?), only materialized CTEs.
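
For reference, a materialized CTE only lives for the duration of a single statement, so it can't replace a persistent summary table; a sketch of what it looks like:

WITH recent AS MATERIALIZED (
    SELECT name, number, result, completedAt
      FROM builds
     ORDER BY completedAt DESC
     LIMIT 20
)
SELECT * FROM recent;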

  • my production schema looks like it was migrated during the upgrade to version 1.2, but misses the idx_completion_time index, which is possibly a separate bug

Oh, does it? I didn't check that, but maybe I'd add it manually?

It is my own Laminar instance's production schema that is missing this index. That looks like a separate bug, though. Just mentioned it for the sake of completeness, as I believe it should be there after the migration introduced between 1.2 and 1.3 (I'm not linking issue 182 on purpose here to avoid cross-spamming threads).

What seemed slow/broken is selecting a finished run with a large output. There I only got partial output up to a certain point in the UI, and the browser tab froze (both in Chrome, and a WebKit based browser). Looking at the processes it was the browser process that was stuck on 100% CPU usage, the laminard process was fine without any excessive usage.
To me, it suggests more like a (JavaScript?) rendering issue of huge output, rather than a database querying issue.

That's another issue (ie. unresponsiveness while committing a run.)

It might be another issue, right, but still that's the only one I could notice with the "huge outputs" database file.

My setup seems to freeze while rendering the huge outputs returned from the database with the blob patches. I made sure to apply the blob patches locally too, but the result is the same with or without them. That further strengthens the "this is a browser/JS/rendering issue, rather than a database one" hypothesis.

I did not attempt any write to the database on purpose (in fact, I marked the file immutable to avoid even accidental modification).

If committing the job output to the database fails or is slow, I expect something like this happens there:

  • the large job output (e.g. ~950 MB) ends up gradually written to memory while the job is running
  • when the job is finished, the large output gets compressed if it's above a certain threshold (default is #define COMPRESS_LOG_MIN_SIZE 1024)
  • when that finishes, the compressed result gets committed to the database (resulting in ~18 MB of disk write)

I doubt the problematic step is the 18 MB disk write while committing the data. The compression part feels like the more likely culprit (especially since it crashes). Though I'd expect that to be reasonably fast too, since it's all in RAM (unless it gets swapped out).

It would be great to measure this part with a profiler. Both for a "large but successfully committable" result, and a "too large to commit" result.

There was also some inconsistency in the database, [...]

File was possibly copied live.

Oh, I guess that explains it, thanks! 👍

  1. I queried for the largest job output based on outputLen, with

    SELECT name,number,outputLen FROM builds ORDER BY outputLen DESC LIMIT 1;
    

    According to time sqlite3 laminar.sqlite ..., it took about 3700 ms, while rerunning it is around 45 ms due to file system caching.

ACK, that totally depends on cache / availability of (all?) of the data. Would be nice to see how that looks with cleared caches and while having vmstat or iotop running.

With cold caches, the same query I/O load looks like this:

# vmstat
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 12225872     24 1597240    0    0  1131   315   33   27 26  7 67  0  0
 0  0      0 12224372     24 1598532    0    0     0   212 3077 2757  0  1 98  0  0
 0  1      0 12181736   2156 1651816    0    0 52672     0 7458 37353  1  3 89  7  0
 1  0      0 12091520   2156 1745296    0    0 93448     0 14233 59263  1  3 86 11  0
 0  1      0 12021464   2156 1815032    0    0 69616     0 13533 58716  0  2 86 11  0
 0  1      0 11954936   2156 1878044    0    0 63300   204 10869 58233  1  2 86 11  0
 0  0      0 11935784   2156 1892928    0    0 14276     0 2689 12933  1  1 96  2  0

# iotop, 2 runs, each with cold cache
$ sudo iotop -Po -qqq -u ferki -a -d 1 | grep sqlite3
25771 be/4 ferki        96.12 M      0.00 B ?unavailable?  sqlite3 laminar.sqlite SELECT name,number,outputLen FROM builds ORDER BY outputLen DESC LIMIT 1;
25771 be/4 ferki       163.94 M      0.00 B ?unavailable?  sqlite3 laminar.sqlite SELECT name,number,outputLen FROM builds ORDER BY outputLen DESC LIMIT 1;
25771 be/4 ferki       228.78 M      0.00 B ?unavailable?  sqlite3 laminar.sqlite SELECT name,number,outputLen FROM builds ORDER BY outputLen DESC LIMIT 1;
[...]
27593 be/4 ferki        87.71 M      0.00 B ?unavailable?  sqlite3 laminar.sqlite SELECT name,number,outputLen FROM builds ORDER BY outputLen DESC LIMIT 1;
27593 be/4 ferki       175.41 M      0.00 B ?unavailable?  sqlite3 laminar.sqlite SELECT name,number,outputLen FROM builds ORDER BY outputLen DESC LIMIT 1;
27593 be/4 ferki       233.32 M      0.00 B ?unavailable?  sqlite3 laminar.sqlite SELECT name,number,outputLen FROM builds ORDER BY outputLen DESC LIMIT 1;

# pidstat
$ pidstat -d -C sqlite3 1
Linux 6.1.53-gentoo-r1-x86_64 (libre)   10/12/2023      _x86_64_        (8 CPU)

09:12:48 PM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command

09:12:49 PM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
09:12:50 PM  1000      2593  92196.00      0.00      0.00       0  sqlite3

09:12:50 PM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
09:12:51 PM  1000      2593  87164.00      0.00      0.00       0  sqlite3

09:12:51 PM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
09:12:52 PM  1000      2593  63404.00      0.00      0.00       0  sqlite3

There are bursts of reads as expected, and the overall amount read is only ~230 MB. This again indicates that SQLite does not read the whole database, only what's requested (which admittedly might cause a lot of seek time, especially on rotational media).

  1. Queried the largest output blob, and saved it into /tmp/blob with:

    SELECT writefile('/tmp/blob', output) FROM builds WHERE name = 'netbsd-amd64-x86_64' AND number = 19;
    

    The size of /tmp/blob is 18722708 bytes, or around 18MB. According to time sqlite3 laminar.sqlite ... it took about 32 ms to write the blob with cold caches, and about 21 ms with warm ones.

How was I/O for that?

I couldn't meaningfully measure it yet with vmstat/pidstat/iotop due to the very short time it takes.

Judging by the time it takes, I believe it is just retrieving this 18 MB, plus perhaps some process startup overhead and output buffering. At this point, I have no reason to assume it does anything else, especially not anything like huge disk I/O.

it may or may not take a lot of time to get the overview pages.

What are those "overview pages" exactly?

The simple "start" page, ie. http://toolchain.lug-owl.de/laminar/

OK, thanks, I'll make sure to look at the performance of my landing page then, too.

will always load a complete DB row with all the gz'ed output, even for just getting build metadata.

I've tested it with querying all fields except output:

SELECT name, number, node, queuedAt, startedAt, completedAt, result, outputLen, parentJob, parentBuild, reason FROM builds;

It finishes in about 4000 ms with cold caches. To me, that suggests SQLite does not have to fetch the whole row from the file, only the requested fields. In comparison, asking also for the output field to the list of fields in the query finishes in much longer time (~24000 ms).

With your fast disk setup, 4 sec could be enough to read all the 24 GB!

It was a good idea to check, but it's far from that: it takes ~18 seconds for the full database to be read, as reported above with the time cat ... and hyperfine measurements.

24 sec sounds like there's some more processing happening.

Could be, but that doesn't change my main point here: SQLite does not seem to read all the fields of a row from disk, only the requested fields.

If this query made SQLite read all fields of all rows, it could not return in ~4 seconds with cold caches, since reading the whole database takes ~18 seconds.

  • Would it be possible to reduce some of the output in the jobs if there are some less critical info there?
  • Would it be possible to write parts or all of the output as an artifact file instead of storing it inside the database? This artifact might even be compressed at the end of the run, and would be linked from the run view by laminar if someone needs the full output.

I'm quite certain that setting aside the actual job output to a file (or maaaaybe to a separate table) would quite completely resolve any issue I see. My disks aren't nearly as fast as yours.

Based on the information available, I don't expect storing the output in a separate table would help too much.

I expect limiting the amount of job output that gets stored in the database has higher chance of improving the situation in this case.

I would be up to pair with one or more of you as part of my regular Open Source Office Hours offering for a closer look together.

Sure! I'd like to join!

Please book a matching time slot through the above link. I expect it would be much faster to iterate over hypotheses and measurements if we are able to team up on this. Especially since the messages are getting quite long to follow easily 😅

@ohwgiles (Owner) commented:

I expect limiting the amount of job output that gets stored in the database has higher chance of improving the situation in this case.

Even if the log is in a separate file, the browser is going to struggle fetching and ANSI-highlighting a ~900 MB log. The quick "solution" is to just add > $ARCHIVE/full-log.txt into your run script. A smarter long-term solution might be having laminar display the last ~10 KiB of the log and offer a button to request the full log.

I would be up to pair with one or more of you as part of my regular Open Source Office Hours offering for a closer look together.

I'm happy to chat but don't have a lot of spare time and am in UTC+13 so could only realistically join in your morning hours.

@ferki (Contributor) commented Oct 13, 2023

Even if the log's in a separate file, the browser is going to struggle fetching and ansi-highlighting a ~900Mb log.

Yes, I guess that's what breaks my job output pages in the browser. My system seems to be able to retrieve the data from the database quickly, but struggles with inflating it back to the raw ~900 MB version in the single-threaded process, and/or with rendering (displaying/highlighting/etc.) that much data on a single page.

@jbglaw's system seems to struggle with storing that ~900 MB of raw data into the database sometimes (presumably during compression before inserting into the database), and with displaying the landing page ("initial overview query"). I look forward to learning more details about that exact setup together soon.

I believe that now that we're digging deeper into performance characteristics with huge datasets, we have already identified at least three different topics, so it might be wise to split the discussion into separate issues at some point.

I don't expect the solution to any of those is as simple as "just switch to some other database". For now, I'd say a likely interesting candidate to investigate is the in-process single-threaded (de)compression of large output, which potentially affects two separate cases we've seen so far.

The quick "solution" is to just add > $ARCHIVE/full-log.txt in to your run script. A smarter long-term solution might be having laminar display the last ~10KiB of log and have a button to request the full log.

That sounds like a nice defensive approach to prevent things from escalating when huge outputs are involved 👍 I wonder if it's possible to get only the last 10 kB of a huge compressed output without decompressing the whole thing, though 🤔

To reiterate on "limiting the amount of job output that gets stored in the database", I meant one or more of multiple options mentioned earlier (which I did not wish to repeat due to length):

  • reduce output size: prevent writing things to the job output which might be OK to skip
  • store the bulk of the output elsewhere:
    • redirect to a file instead of writing to STDOUT (optionally compressed at the end of the job or in a ${job}.after post-build action)
    • save the output as an artifact instead (also optionally compressed)

I'm happy to chat but don't have a lot of spare time and am in UTC+13 so could only realistically join in your morning hours.

If there's a fitting slot in my booking calendar, feel free to grab it. If there's no fitting slot, I'm happy to find one out of the listed ones (please feel free to reach out to me on any channels listed on my profiles).

@ferki (Contributor) commented Oct 15, 2023

I took a closer look at SQLite internals, and I feel I can clarify some of our findings so far.

SQLite internals

SQLite, like many other databases, stores data in pages. The default page size has been 4096 bytes since 3.12.0 (2016-03-29); it was 1024 bytes before that. These pages are arranged into a B-tree structure for fast lookup (see B-tree Pages).

Since the unit of lookup and retrieval is one page, it is generally a good starting point for performance reasons to match the page size to the native I/O block size of underlying storage media and/or file system.

With small rows, even multiple of them may fit into a single page. Large rows, like the ones we look at here with the large output fields, might not fit into one page. In this case, the surplus is stored in one or more overflow pages, arranged in a linked list (see Cell Payload Overflow Pages).

Also, the part of a query between SELECT and FROM is called the result expression list (see Generation of the set of result rows). Each expression in the result expression list is evaluated for each row retrieved by the query.

Refined retrieval procedure

Based on the above information, it was a bit of an oversimplification for me to state that "SQLite doesn't retrieve columns that were not asked for".

My understanding about what happens is more like:

  • identify a matching row for the query
  • retrieve the page containing the matching row from disk (possibly from the OS cache)
  • extract the field(s) from the matching row that is part of the desired result expression list
    • the row in this page may contain a small portion of a large field, which then points to its overflow pages
    • if the rest of the field is stored in overflow pages, follow the linked list to retrieve each page until the whole content of the field is retrieved

For the purposes identified in this issue, though, that still amounts to, or at least comes very close to, not retrieving pages from disk (or cache) that it already knows it won't need to answer the query.
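
One way to see how much of the builds table actually lives in overflow pages, assuming the dbstat virtual table is available (it is also used by the queries further down):

SELECT pagetype, count(*) AS pages, sum(pgsize) AS bytes
  FROM dbstat
 WHERE name = 'builds'
 GROUP BY pagetype;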

Further considerations for performance

@jbglaw: based on the info you provided about reading the whole database with cat, the estimated read speed of your storage media is 27889238016 bytes / 129.859 seconds ≈ 209732 KiB/s, or ~200 MiB/s. Judging by this speed, is it a good guess that this is rotational media?

If yes, it would worth to check the following:

  • the I/O block size for the media itself, the file system, and perhaps any layer that might be there in between (LVM, LUKS, MDRAID, etc.), and compare it to the page_size of your database

  • some stats from dbstats:

    • check the ratio between the size of data in the database and the amount of disk space used to store that with SELECT sum(pgsize-unused)*100.0/sum(pgsize) FROM dbstat; (this shows how efficiently the data is packed, ideally this is close to 100%)

    • especially on rotational media, the more sequential the disk access, the better the performance, so let's check the ratio of sequential pages in the database with:

      CREATE TEMP TABLE s(rowid INTEGER PRIMARY KEY, pageno INT);
      INSERT INTO s(pageno) SELECT pageno FROM dbstat ORDER BY path;
      SELECT sum(s1.pageno+1==s2.pageno)*1.0/count(*)
        FROM s AS s1, s AS s2
       WHERE s1.rowid+1=s2.rowid;
      DROP TABLE s;

      It is also good to be near 100%, and lower values suggest it may be time to run VACUUM.

    (Above queries are from Example uses of the dbstat virtual table.)

Especially on rotational media, the actual on-disk layout of the database file may be interesting too. A single ~25 GB database file might not be laid out sequentially on disk, but scattered around the platter (which is less of a problem on non-rotational media).

@ferki (Contributor) commented Oct 19, 2023

After letting the topic rest for a while, I think there are at least two more aspects that I haven't been able to investigate fully yet.

Column order

The patched database schema has the BLOB output as the last column in the rows, and maybe that's why it's able to skip reading the overflow pages in cases where all other SELECTed columns have already been retrieved.

In contrast, the vanilla unpatched schema has the TEXT output column in the middle, which may force SQLite to follow the linked list of overflow pages in the middle of the row in order to retrieve the columns coming after it (outputLen, parentJob, parentBuild and reason).

This may be the cause of what was perceived as "SQLite reads the whole row" in the original post, but which I couldn't reproduce yet with the sample database using the BLOB patch.

Locking with rollback journals

Since Laminar does not seem to use the write-ahead log mode of SQLite, I assume the default rollback journal mode is in effect. This and the involved locking may lead to the following situations when all database clients are in the same OS process (laminard):

  • there can be at most one writer, and since we know we are writing large rows (due to large outputs) in this use case, there is a chance that a slow write causes other readers to wait until the write operation is done
  • due to huge/long read operations there might never be a good moment to acquire a write lock (at least within any retry period), leading to failed writes ("writer starvation")

Perhaps enabling WAL mode, and maybe even using BEGIN CONCURRENT transactions, would improve the situation during slow/concurrent operations.
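
A sketch of the first half of that experiment (the journal mode is persistent, so issuing it once against the database file while laminard is stopped is enough; BEGIN CONCURRENT additionally requires an SQLite built from the begin-concurrent branch, so it is not a drop-in change):

PRAGMA journal_mode = WAL;       -- switch the database to write-ahead logging
PRAGMA journal_mode;             -- should now report "wal"
PRAGMA wal_checkpoint(TRUNCATE); -- optional: checkpoint and truncate the -wal file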

@ferki (Contributor) commented Oct 26, 2023

Column order

The patched database schema has the BLOB output as the last column in the rows, and maybe that's why it's able to skip reading the overflow pages in cases where all other SELECTed columns have already been retrieved.

In contrast, the vanilla unpatched schema has the TEXT output column in the middle, which may force SQLite to follow the linked list of overflow pages in the middle of the row in order to retrieve the columns coming after it (outputLen, parentJob, parentBuild and reason).

This may be the cause of what was perceived as "SQLite reads the whole row" in the original post, but which I couldn't reproduce yet with the sample database using the BLOB patch.

I've tested this conjecture by copying all data from the "blob" database into a vanilla database (with output field as TEXT):

ATTACH "laminar.sqlite.blob" AS blob;
ATTACH "laminar.sqlite" AS text;
INSERT INTO text.builds(name, number, node, queuedAt, startedAt, completedAt, result, output, outputLen, parentJob, parentBuild, reason) SELECT name, number, node, queuedAt, startedAt, completedAt, result, output, outputLen, parentJob, parentBuild, reason FROM blob.builds;

Then I ran one of the previously used test queries:

SELECT name,number,outputLen FROM builds ORDER BY outputLen DESC LIMIT 1;

This used to cause ~230 MB disk IO according to iotop with the BLOB database, where the output field was the last column in the builds table.

In the vanilla version of the database the output TEXT field is in the middle of the row, and the same query causes ~23.45 GB of disk IO according to iotop. That essentially amounts to "reading the whole database" (~25 GB), and caching doesn't help much either (since my cache is smaller than the data to be cached).

Based on this, it should be ensured that the output field is the last column in the builds table, to prevent unnecessary reads of a potentially huge column (and it is probably still worthwhile to switch to a BLOB column). Other "column-packing" opportunities might be interesting for an optimal layout too, but probably matter less for read performance. The alternative is to store large blob values (e.g. >100 kB of compressed output) outside the builds table, or outside the database entirely (e.g. as a file, similar to artifacts).
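
A sketch of such a migration, mirroring the column order of the patched sample database (to be run on a copy while laminard is stopped):

CREATE TABLE builds_new(name TEXT, number INT UNSIGNED, node TEXT, queuedAt INT,
                        startedAt INT, completedAt INT, result INT, outputLen INT,
                        parentJob TEXT, parentBuild INT, reason TEXT, output BLOB,
                        PRIMARY KEY (name, number DESC));
INSERT INTO builds_new SELECT name, number, node, queuedAt, startedAt, completedAt, result,
                              outputLen, parentJob, parentBuild, reason, output
                         FROM builds;
DROP TABLE builds;
ALTER TABLE builds_new RENAME TO builds;
CREATE INDEX idx_completion_time ON builds(completedAt DESC);
VACUUM;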

Retrieving the largest compressed output with:

SELECT writefile('/tmp/blob', output) FROM builds WHERE name = 'netbsd-amd64-x86_64' AND number = 19;

is still very fast (~20-40 ms), reading and then writing ~18MB from disk (uncompressed ~1 GB).

Opening a large output in the UI still freezes, so my next best guess about this part is still around the single-threaded extraction and/or highlighting.

The write problem is still probably related to locking and/or journal mode.

In summary, I believe we have identified at least 3 different topics, some of them with multiple options to address and investigate further – it might be time to split those into their own issues, and maybe use this one as a "tracker" issue.

I would also still be happy to look at specific scenarios on specific hardware together with some of you in case that's helpful to learn more. It certainly looks like there are several opportunities to push performance bottlenecks farther via built-in SQLite settings and improved database design (which is expected to be lower complexity than attempting to migrate to other database technologies).
