Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High load in 1) SELECT queries when we have 7000 rows and 21000 branches 2) create extra branches when we've already have 21000 #5175

Closed
nkonev opened this issue Jan 20, 2023 · 15 comments
Assignees
Labels
bug Something isn't working performance

Comments

@nkonev
Copy link
Contributor

nkonev commented Jan 20, 2023

I saw very similar #5148 but I'm not sure if there is the same isuue.

So this test can be synthetic, but I think it worth to share the results.

Version is 0.52.8

Problem 1 - high CPU when I use sort

Reproduction:

git clone git@github.com:nkonev/companies.git
cd companies
git checkout select-high-cpu 
docker-compose up -d

Run web app in the first terminal window

./mvnw spring-boot:run

Then create 7000 companies in the another terminal

./mvnw -DenableLoadTests=true -Dtest="name.nkonev.examples.companies.VolumeTest#seven_thousand_companies" test

Then create branch for every company and add a commit three times (sequentially)

./mvnw -DenableLoadTests=true -Dtest="name.nkonev.examples.companies.VolumeTest#create_draft_for_each_company" test
./mvnw -DenableLoadTests=true -Dtest="name.nkonev.examples.companies.VolumeTest#create_draft_for_each_company" test
./mvnw -DenableLoadTests=true -Dtest="name.nkonev.examples.companies.VolumeTest#create_draft_for_each_company" test

Then run every-second requests in three terminal windows in parallel

watch -n 1 curl -Ss --url 'http://localhost:8080/company'
watch -n 1 curl -Ss --url 'http://localhost:8080/company'
watch -n 1 curl -Ss --url 'http://localhost:8080/company'

Those requests turned into SQL like

CALL DOLT_CHECKOUT ('main');
SELECT `company`.`name` AS `name`, `company`.`modified_at` AS `modified_at`, `company`.`id` AS `id`, `company`.`bank_account` AS `bank_account`, `company`.`estimated_size` AS `estimated_size` FROM `company` ORDER BY `company`.`name` ASC LIMIT 0, 20

(You can uncomment JDBC logging in application.yaml)

Now we can see 20-23% CPU of dolt in top.
Screenshot from 2023-01-20 18-44-58 - we have 3 every-second workers

If we open 10 every-second workers we will have
70-100% cpu
Screenshot from 2023-01-20 19-07-37

Note that it isn't special load tool like wrk, it's just every-second requests produced by watch.

I tried to create indexes

ALTER TABLE company
    ADD INDEX company_name USING HASH (name);

ALTER TABLE company
    ADD INDEX company_name USING BTREE (name);

but it doesn't change the CPU load.

Problem 2 - high CPU and linear time of many branch creation

Here I remained working web app and stopped all watch-based workers.
If we start creating branches again (watch'es still makes requests) CPU of dolt will be 120%

./mvnw -DenableLoadTests=true -Dtest="name.nkonev.examples.companies.VolumeTest#create_draft_for_each_company" test

Screenshot from 2023-01-20 18-54-24

Update: I noticed (with help seven_thousand_companies_by_http) that adding more commits doesn't degrade as fast as adding more branches

Update 2: In other words:
We have 7000 companies, created by seven_thousand_companies.
The very first pass of create_draft_for_each_company takes Measured time PT6M47.349344168S
The next pass - Measured time PT14M32.157333959S
The next pass - Measured time PT21M55.941600155S
The next pass - Measured time PT28M17.140720097S
I suspect that the culprit is - array of branches, which bumped into filesize limit here.

@nkonev nkonev changed the title High load in SELECT queries when we have 21000 branches High load in SELECT queries when we have 7000 rows and 21000 branches Jan 20, 2023
@nkonev nkonev closed this as completed Jan 20, 2023
@nkonev
Copy link
Contributor Author

nkonev commented Jan 20, 2023

Because of sort by non-very good selective company name the results were not good (problem 1).

@timsehn timsehn added bug Something isn't working performance labels Jan 20, 2023
@timsehn
Copy link
Contributor

timsehn commented Jan 20, 2023

@Hydrocharged might as well dig into this one as well since he fixed your other, lots of branches problem.

@timsehn timsehn reopened this Jan 20, 2023
@nkonev nkonev changed the title High load in SELECT queries when we have 7000 rows and 21000 branches High load in a) SELECT queries when we have 7000 rows and 21000 branches b) create extra branches when we've already have 21000 Jan 23, 2023
@nkonev nkonev changed the title High load in a) SELECT queries when we have 7000 rows and 21000 branches b) create extra branches when we've already have 21000 High load in 1) SELECT queries when we have 7000 rows and 21000 branches 2) create extra branches when we've already have 21000 Jan 24, 2023
@Hydrocharged
Copy link
Contributor

@nkonev I apologize for the delay, I initially had some issues working with Docker on Windows, but I've finally got some answers! There are two core issues here. The first being that your SELECT statements are consuming a lot of resources, and the second being that you experience increasing slowdown as you add more branches.

For the first issue (slow SELECT statements), there are two points of slowdown. The first is that we are doing a full table scan with those queries. Even when you add an index on company.name, we don't yet support using indexes on ORDER BY. Right now we only support indexes on filters, such as with the WHERE clause and on joins. I'll file an issue for us to add this functionality. The second point of slowdown is due to accumulated files as the queries are run. To fix this, you can run garbage collection. We have two modes of garbage collection: standard and shallow. If you run dolt gc from a terminal window, then by default it runs the standard collection, which will give the speed increase. The stored procedure DOLT_GC() performs a shallow collection, and in my testing it did give a performance increase, but only about half of the standard option, so try and perform a dolt gc from a terminal when you can (requires temporarily stopping the server). Automatic GC is on our roadmap.

For the second issue (slow branching), this deals with an implementation detail for the dolt_branch_control table. You can read a bit on how they work here: https://docs.dolthub.com/sql-reference/server/branch-permissions. In summary, the aforementioned table controls which users may modify which branches on which databases. Whenever a user creates a branch, a new entry is added to the table, granting that user admin privileges over that branch. In our case, as we're creating tens of thousands of branches, the table becomes increasingly large as we add new admin entries for the user, and therefore each query iterates over even more rows.

One feature of the table is that we do not add new entries if an existing entry would already grant the same benefits. For example, if you have an admin entry with the branch portion being 'a%', then we won't add any more entries for that user when a branch starts with the letter a. As the branches are being created by a user named root in your example, I ran the following query: INSERT INTO dolt_branch_control VALUES ('%', '%', 'root', '%', 'admin');. This resulted in:

$ dolt sql -q "SELECT * FROM dolt_branch_control;"
+----------+--------+------+------+-------------+
| database | branch | user | host | permissions |
+----------+--------+------+------+-------------+
| %        | %      | %    | %    | write       |
| %        | %      | root | %    | admin       |
+----------+--------+------+------+-------------+

With this change, the cost of adding new branches dropped significantly, with adding branches 21,000->28,000 spending a similar amount of time as adding branches 0->7,000, which essentially fixes the issue completely (at least for this test case).

I'm also going to investigate changing how the dolt_branch_control tables work. We do an array search to support pattern matching, but we could also do a tree search. For smaller branch counts (up to 100), my initial testing showed that the array search was faster (and easier to implement/maintain), but I'd expect a tree search to outperform the array search when dealing with counts over 10,000. If you're able to add that entry to the dolt_branch_control table then that would be great, otherwise I can assign a higher priority to this investigation (such as if you'll have many unknown users each creating their own branches, rather than a single user or a small set of known users).

@Hydrocharged
Copy link
Contributor

Also, I'll mention that the increasing file size of branch_control.db doesn't have much of an impact at all. I modified the dolt binary (based on v0.52.8 to keep things comparable) to do two entry comparisons and return true, similar to what it would do with the above fix, and the times were identical.

@nkonev
Copy link
Contributor Author

nkonev commented Jan 24, 2023

About the second problem:

If you're able to add that entry to the dolt_branch_control table

@Hydrocharged Thank you a lot for response. Yes, I'm able to apply your SQL. I've applied it, so I have

SELECT * FROM dolt_branch_control;

%,%,%,%,write
%,%,root,%,admin

I have the following results of create_draft_for_each_company

Measured time PT4M41.325401001S
Measured time PT7M24.038263521S
Measured time PT10M14.348827364S

The speed was increased, but there is still linear degradation. May be I'm doing something wrong. I performed insert into dolt_branch_control in migrations, they implicitly wraps all statements in transaction. I will move insert in other place. Doesn't matter where I insert into dolt_branch_control.

I didn't perform any GC.
Did you perform GC between create_draft_for_each_company test runs ?

Did you do something else?

@Hydrocharged
Copy link
Contributor

Hydrocharged commented Jan 25, 2023

Yes, you'll need to perform GC between the create_draft_for_each_company runs, and also verify that there are only two entries in dolt_branch_control via SELECT count(1) FROM dolt_branch_control; after the first run to make sure we're not creating new branch entries. If we still are, then we should change the user from root to the name of the user that is in those entries.

I spoke with @timsehn and I'm going to look into that aforementioned investigation of moving our table search to a tree search, which I'm assuming should help substantially even in a worst-case scenario.

@nkonev
Copy link
Contributor Author

nkonev commented Jan 25, 2023

Thank you, @Hydrocharged

With both INSERT INTO dolt_branch_control VALUES ('%', '%', 'root', '%', 'admin'); and dolt gc before every create_draft_for_each_company test I have
Measured time PT3M34.363319853S (0->7000)
Measured time PT4M30.713584004S (7000->14000)
Measured time PT5M11.280711257S (14000->21000)
Measured time PT6M37.215517519S (21000->28000)

It's much better result.

I'm looking forward both
a) moving our table search to a tree search
b) Automatic GC
as well as using indexes in ORDER BY

@Hydrocharged
Copy link
Contributor

Fantastic to hear! I'll close this issue for now, as I've created another issue for ORDER BY, and it looks like everything has been resolved.

@bpf120
Copy link

bpf120 commented Jan 25, 2023

Hi @nkonev, Thanks for filing this issue. We'd love to learn more about your Dolt use case. It really helps us out. Feel free to come by Discord or send me an email if you want to share.

@nkonev
Copy link
Contributor Author

nkonev commented Jan 25, 2023

@bpf120 Hi,
please check out here #5135 (comment) - there I wrote my use case

@Hydrocharged
Copy link
Contributor

Hydrocharged commented Feb 1, 2023

@nkonev Just wanted to let you know that, as of #5238, branch creation is significantly faster! These changes are in the latest release.

@nkonev
Copy link
Contributor Author

nkonev commented Feb 1, 2023

@Hydrocharged Thank you! I saw your PR. Will check without inserting in the branch control table.

@Hydrocharged
Copy link
Contributor

Hydrocharged commented Feb 1, 2023

Sounds good! I do want to mention that it will not outperform inserting that additional row, as that's the intended and designed use of the table. For those cases where users and branches are not known ahead of time and will be seemingly random, then this is a significant speed up.

@nkonev
Copy link
Contributor Author

nkonev commented Feb 1, 2023

Test with dolthub/dolt-sql-server:0.52.16, commit in the test, results are:

Measured time PT2M18.115461131S (0-7000 branches)
- call dolt_gc() -
Measured time PT2M51.115057602S (7001-14000)
- call dolt_gc() -
Measured time PT3M47.193516366S (14001-21000)
- call dolt_gc() -
Measured time PT3M49.033888737S (21001-28000)

@Hydrocharged
Copy link
Contributor

Results are far better than before! Wonderful

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working performance
Projects
None yet
Development

No branches or pull requests

4 participants