[BUG] - missing resource isolation to prefer critical tasks over query processing #5984

andrejpodzimek · 2024-09-13T02:05:12Z

Internal/External
External

Area
Other

Summary
Leader log queries impede critical validator processing and cause extreme numbers of missed slot leader checks.

Steps to reproduce

Watch the frequency of missed slot leader checks over time.
Run a demanding cardano-cli query in a loop against the validator (example below).
Watch the disaster unfold: in my case, 7% of slot leader checks were missed because of a repeated query.
Repeat the test with a regular relay node. Tip differences will run sky high (>100) while queries are being processed.
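
To compare the missed-check rate before and during the test, a small helper like the one below may be handy. It is only a sketch: `missed_pct` is a hypothetical function, and it assumes the percentage is defined as missed checks divided by all checks (missed plus performed). Where the two counters come from (for example the node's metrics endpoint or a monitoring tool) is left open here.

```shell
# Hypothetical helper: percentage of missed slot leader checks, assuming
# percentage = missed / (missed + performed). Integer arithmetic only;
# the result is truncated (not rounded) to two decimal places.
# Usage: missed_pct MISSED PERFORMED
missed_pct() {
  mp_total=$(( $1 + $2 ))
  if [ "$mp_total" -eq 0 ]; then
    # No checks recorded yet, so no meaningful percentage.
    echo "n/a"
    return 0
  fi
  mp_bp=$(( $1 * 10000 / mp_total ))   # hundredths of a percent
  printf '%d.%02d\n' $(( mp_bp / 100 )) $(( mp_bp % 100 ))
}
```

Sampling the counters once before the query loop starts and once after it ends, then calling `missed_pct` on the differences, gives the rate for the test window only.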

Expected behavior
Proper resource isolation.

Ongoing query processing never delays inward tip propagation (“height”).
Ongoing query processing on a validator never ever causes missed slot leader checks!
The fact that one should not run queries against a validator is orthogonal; a validator should either process such queries gracefully, without impediment to critical operations, or outright reject them.
Timing-critical tasks must take precedence. (They should not be timing-critical, but sadly are.)

System info (please complete the following information):

OS Name: ArchLinux
OS Version: Only bad distros have this.

Node version (output of cardano-node --version):

cardano-node 9.1.1 - linux-x86_64 - ghc-8.10
git rev 66dc08944479792b2823c9e1356914820c9ea059

CLI version (output of cardano-cli --version):

cardano-cli 9.2.1.0 - linux-x86_64 - ghc-8.10
git rev 66dc08944479792b2823c9e1356914820c9ea059

Screenshots and attachments
An example query to expose resource isolation problems:

cardano-cli query leadership-schedule \
  --socket-path /run/cardano-validator/socket \
  --genesis config/mainnet-shelley-genesis.json \
  --mainnet \
  --vrf-signing-key-file keys/mainnet/vrf.skey \
  --stake-pool-id ... \
  --next
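
To run the query above in a loop, as the steps to reproduce describe, something along these lines should do. This is a minimal sketch: `repeat_query` is a hypothetical helper, not part of cardano-cli, and the iteration count and pause are arbitrary values to adjust.

```shell
#!/bin/sh
# Hypothetical helper (not part of cardano-cli): run a command a fixed number
# of times, pausing between runs, and report failed runs on stderr.
# Usage: repeat_query COUNT PAUSE_SECONDS COMMAND [ARGS...]
repeat_query() {
  rq_count=$1
  rq_pause=$2
  shift 2
  rq_i=1
  while [ "$rq_i" -le "$rq_count" ]; do
    # The command's output is discarded; only the load on the node matters.
    "$@" > /dev/null || echo "iteration $rq_i failed" >&2
    sleep "$rq_pause"
    rq_i=$((rq_i + 1))
  done
  echo "completed $rq_count iterations"
}

# Example, commented out so that sourcing this file has no side effects:
# repeat_query 10 5 cardano-cli query leadership-schedule [same arguments as above]
```

Keeping the pause short makes the query load nearly continuous, which is the situation in which the missed slot leader checks were observed.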

RTS options:

... +RTS -N -A64m -H -Iw59 --nonmoving-gc -RTS ...

Additional context
This case could be dismissed with “use a workaround”, such as “have a separate relay node for slot leader queries only”, i.e. one that does not route to a validator. However, that is suboptimal: it increases the amount of resources a pool operator must set aside by up to 50% compared to the simplest relay + validator setup (three nodes instead of two).

The lack of proper resource isolation may have been a contributing factor to my problem of never successfully validating a block, described in this post and above.


github-actions · 2024-10-14T01:59:55Z

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

github-actions · 2024-11-14T02:08:29Z

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

karknu · 2024-11-14T08:03:03Z

@andrejpodzimek I've been working on something that may alleviate your problem. It was done for relays serving hundreds of clients but perhaps it could work here too.

https://github.com/IntersectMBO/cardano-node/tree/karknu/thread_isolation is based on 10.1.2, so it will require a chain replay if you're still on 9.2.1. It is experimental, so best to test it on your backup BP or on a testnet.

andrejpodzimek added the needs triage Issue / PR needs to be triaged. label Sep 13, 2024

spannercode mentioned this issue Sep 25, 2024

Issue Management Ticket cardanoapi/hardfork-testing#23

Closed

2 tasks

github-actions bot added the Stale label Oct 14, 2024

erikd removed the Stale label Oct 14, 2024

github-actions bot added the Stale label Nov 14, 2024

github-actions bot removed the Stale label Nov 15, 2024
