
Cache does not work unexpectedly #1588

Open
jvinolas opened this issue Nov 23, 2024 · 16 comments
Labels
bug Something isn't working


Description

The cache was working fine for some days, then we had two high-iowait incidents that left it not working. The cache is caching and we see dirty lines, but suddenly it stops working: high iowait and no core dump messages in dmesg, only hung-task info messages.

The first incident happened while the cache had a lot of dirty data and we were switching from wb to wt. We tried to switch it back to wb, but we kept the high iowait from the first switch and the cache never responded again. The only solution was to flush the cache, stop it, force-recreate it and start it again.
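
For reference, the recovery sequence looked roughly like this (a sketch with our device paths; exact flags as per casadm --help):

casadm --flush-cache --cache-id 1                        # flush the remaining dirty lines (can take a long time)
casadm --stop-cache --cache-id 1                         # stop the cache instance
casadm --start-cache --cache-device /dev/nvme0n1p1 --cache-mode wb --cache-line-size 64 --force   # force re-initialization of the cache metadata
casadm --add-core --cache-id 1 --core-device /dev/dm-4   # re-attach the core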

The second one was around 3am today, while we were rsyncing files onto this cache: suddenly it failed, with high iowait again and hung-task info messages in dmesg.

Expected Behavior

Cache behaviour should be reasonably predictable and the cache should not stop working.

Actual Behavior

High iowait and the cache no longer seems to work.

Steps to Reproduce

We could reproduce the first one:

  1. Many dirty lines, the cache being actively written to, and a switch from wb to wt (see the sketch below).
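
A sketch of the mode switch that seems to trigger it (assuming the standard casadm syntax):

casadm --set-cache-mode --cache-mode wt --cache-id 1   # switch wb -> wt while many lines are dirty
casadm --set-cache-mode --cache-mode wb --cache-id 1   # then try to switch back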

Context

As the cache mountpoint is served over NFS with clients connected to it, we had to stop all clients, stop the cache, reformat it and start everything again.

Possible Fix

Logs

This is what suddenly happened this morning while only doing rsyncs:

[screenshot image_2024-11-23_08-36-41]

[Sat Nov 23 03:00:03 2024] workqueue: blk_mq_run_work_fn hogged CPU for >10000us 5 times, consider switching to WQ_UNBOUND
[Sat Nov 23 03:38:06 2024] workqueue: do_global_cleanup [dm_bufio] hogged CPU for >10000us 7 times, consider switching to WQ_UNBOUND
[Sat Nov 23 04:44:07 2024] workqueue: blk_mq_run_work_fn hogged CPU for >10000us 7 times, consider switching to WQ_UNBOUND
[Sat Nov 23 05:32:23 2024] workqueue: do_global_cleanup [dm_bufio] hogged CPU for >10000us 11 times, consider switching to WQ_UNBOUND
[Sat Nov 23 07:11:04 2024] workqueue: do_global_cleanup [dm_bufio] hogged CPU for >10000us 19 times, consider switching to WQ_UNBOUND

And now it happened again:

[screenshot image_2024-11-23_14-56-12]

[screenshot image_2024-11-23_14-56-38]

This time with no log in dmesg.

Configuration files

Your Environment

  • OpenCAS version (commit hash or tag): c142610
  • Operating System: Ubuntu 24.04.1 LTS (Noble Numbat)
  • Kernel version: 6.9.12-060912-generic #202408281052 SMP PREEMPT_DYNAMIC
  • Cache device type (NAND/Optane/other): ram disk (wa) over nvme (wb)
  • Core device type (HDD/SSD/other): dm_vdo over an mdadm raid0 over iSCSI disks
  • Cache configuration:
# casadm -P -i 1
Cache Id                  1
Cache Size                877915904 [4KiB Blocks] / 3348.98 [GiB]
Cache Device              /dev/nvme0n1p1
Exported Object           -
Core Devices              1
Inactive Core Devices     0
Write Policy              wb
Cleaning Policy           alru
Promotion Policy          always
Cache line size           64 [KiB]
Metadata Memory Footprint 4.0 [GiB]
Dirty for                 16090 [s] / 4 [h] 28 [m] 10 [s]
Status                    Running
# casadm -P -i 2
Cache Id                  2
Cache Size                65453856 [4KiB Blocks] / 249.69 [GiB]
Cache Device              /dev/ram0
Exported Object           -
Core Devices              1
Inactive Core Devices     0
Write Policy              wa
Cleaning Policy           alru
Promotion Policy          always
Cache line size           64 [KiB]
Metadata Memory Footprint 319.3 [MiB]
Dirty for                 0 [s] / Cache clean
Status                    Running
# casadm -L
type    id   disk             status    write policy   device
cache   1    /dev/nvme0n1p1   Running   wb             -
└core   1    /dev/dm-4        Active    -              /dev/cas1-1
cache   2    /dev/ram0        Running   wa             -
└core   1    /dev/cas1-1      Active    -              /dev/cas2-1
# casadm --get-param --name cleaning-alru -i 1
╔═════════════════════════╤═══════╗
║ Parameter name          │ Value ║
╠═════════════════════════╪═══════╣
║ Wake up time [s]        │     5 ║
║ Stale buffer time [s]   │     1 ║
║ Flush max buffers       │  9750 ║
║ Activity threshold [ms] │     0 ║
╚═════════════════════════╧═══════╝
# casadm --get-param --name cleaning-alru -i 2
╔═════════════════════════╤═══════╗
║ Parameter name          │ Value ║
╠═════════════════════════╪═══════╣
║ Wake up time [s]        │    20 ║
║ Stale buffer time [s]   │   120 ║
║ Flush max buffers       │   100 ║
║ Activity threshold [ms] │ 10000 ║
╚═════════════════════════╧═══════╝
    • Cache mode: wa ramdisk over wb nvme
    • Cache line size: 64
    • Promotion policy: always
    • Cleaning policy: alru
    • Sequential cutoff policy: full
# casadm --get-param --name seq-cutoff -i 1 -j 1
╔═════════════════════════════════════════════════════╤═══════╗
║ Parameter name                                      │ Value ║
╠═════════════════════════════════════════════════════╪═══════╣
║ Sequential cutoff threshold [KiB]                   │  1024 ║
║ Sequential cutoff policy                            │ full  ║
║ Sequential cutoff promotion request count threshold │     8 ║
╚═════════════════════════════════════════════════════╧═══════╝
# casadm --get-param --name seq-cutoff -i 2 -j 1
╔═════════════════════════════════════════════════════╤═══════╗
║ Parameter name                                      │ Value ║
╠═════════════════════════════════════════════════════╪═══════╣
║ Sequential cutoff threshold [KiB]                   │  1024 ║
║ Sequential cutoff policy                            │ full  ║
║ Sequential cutoff promotion request count threshold │     8 ║
╚═════════════════════════════════════════════════════╧═══════╝
  • Other (e.g. lsblk, casadm -P, casadm -L)
@jvinolas jvinolas added the bug Something isn't working label Nov 23, 2024

jvinolas commented Nov 23, 2024

Could it be related to having the ramdisk on /dev/ram0? You recommend against using it here http://open-cas.com/guide_advanced_options.html#multi-level-caching, even though recent kernel documentation says it is not a problem.

EDIT: We removed the brd ramdisk cache device and are now running the same rsyncs; we'll see if that was the cause. Also, no errors are shown in casadm -P, so we don't know why the cache decided to 'deactivate' itself.


jvinolas commented Nov 23, 2024

Without the ram cache on top of the nvme, just the nvme in wb mode with the same parameters, it happened again:

[screenshot]

[screenshot]

No dmesg logs at all this time either.

Cache Id                  1
Cache Size                877915904 [4KiB Blocks] / 3348.98 [GiB]
Cache Device              /dev/nvme0n1p1
Exported Object           -
Core Devices              1
Inactive Core Devices     0
Write Policy              wb
Cleaning Policy           alru
Promotion Policy          always
Cache line size           64 [KiB]
Metadata Memory Footprint 4.0 [GiB]
Dirty for                 13014 [s] / 3 [h] 36 [m] 54 [s]
Status                    Running

╔══════════════════╤═══════════╤═══════╤═════════════╗
║ Usage statistics │   Count   │   %   │   Units     ║
╠══════════════════╪═══════════╪═══════╪═════════════╣
║ Occupancy        │ 877915904 │ 100.0 │ 4KiB Blocks ║
║ Free             │         0 │   0.0 │ 4KiB Blocks ║
║ Clean            │ 642185680 │  73.1 │ 4KiB Blocks ║
║ Dirty            │ 235730224 │  26.9 │ 4KiB Blocks ║
╚══════════════════╧═══════════╧═══════╧═════════════╝

╔══════════════════════╤════════╤═══════╤══════════╗
║ Request statistics   │ Count  │   %   │ Units    ║
╠══════════════════════╪════════╪═══════╪══════════╣
║ Read hits            │   1039 │   0.2 │ Requests ║
║ Read partial misses  │      0 │   0.0 │ Requests ║
║ Read full misses     │   2309 │   0.4 │ Requests ║
║ Read total           │   3348 │   0.6 │ Requests ║
╟──────────────────────┼────────┼───────┼──────────╢
║ Write hits           │ 154914 │  29.7 │ Requests ║
║ Write partial misses │     14 │   0.0 │ Requests ║
║ Write full misses    │ 345926 │  66.2 │ Requests ║
║ Write total          │ 500854 │  95.9 │ Requests ║
╟──────────────────────┼────────┼───────┼──────────╢
║ Pass-Through reads   │      0 │   0.0 │ Requests ║
║ Pass-Through writes  │  17963 │   3.4 │ Requests ║
║ Serviced requests    │ 504202 │  96.6 │ Requests ║
╟──────────────────────┼────────┼───────┼──────────╢
║ Total requests       │ 522165 │ 100.0 │ Requests ║
╚══════════════════════╧════════╧═══════╧══════════╝

╔══════════════════════════════════╤════════════╤═══════╤═════════════╗
║ Block statistics                 │   Count    │   %   │   Units     ║
╠══════════════════════════════════╪════════════╪═══════╪═════════════╣
║ Reads from core(s)               │       2561 │   0.0 │ 4KiB Blocks ║
║ Writes to core(s)                │  745094229 │ 100.0 │ 4KiB Blocks ║
║ Total to/from core(s)            │  745096790 │ 100.0 │ 4KiB Blocks ║
╟──────────────────────────────────┼────────────┼───────┼─────────────╢
║ Reads from cache                 │  644049412 │  42.2 │ 4KiB Blocks ║
║ Writes to cache                  │  881997809 │  57.8 │ 4KiB Blocks ║
║ Total to/from cache              │ 1526047221 │ 100.0 │ 4KiB Blocks ║
╟──────────────────────────────────┼────────────┼───────┼─────────────╢
║ Reads from exported object(s)    │       4661 │   0.0 │ 4KiB Blocks ║
║ Writes to exported object(s)     │  983061770 │ 100.0 │ 4KiB Blocks ║
║ Total to/from exported object(s) │  983066431 │ 100.0 │ 4KiB Blocks ║
╚══════════════════════════════════╧════════════╧═══════╧═════════════╝

╔════════════════════╤═══════╤═════╤══════════╗
║ Error statistics   │ Count │  %  │ Units    ║
╠════════════════════╪═══════╪═════╪══════════╣
║ Cache read errors  │     0 │ 0.0 │ Requests ║
║ Cache write errors │     0 │ 0.0 │ Requests ║
║ Cache total errors │     0 │ 0.0 │ Requests ║
╟────────────────────┼───────┼─────┼──────────╢
║ Core read errors   │     0 │ 0.0 │ Requests ║
║ Core write errors  │     0 │ 0.0 │ Requests ║
║ Core total errors  │     0 │ 0.0 │ Requests ║
╟────────────────────┼───────┼─────┼──────────╢
║ Total errors       │     0 │ 0.0 │ Requests ║
╚════════════════════╧═══════╧═════╧══════════╝

The pattern that seems to repeat is a failure about 3 hours after we start the rsyncs...

This is the current server status while the rsyncs are still running and the cache has suddenly stopped working:

[screenshot]

@jvinolas (Author)

Could it be due to unresponsiveness of the core device (high iowait) filling up these buffers?

[screenshot]

Some dmesg messages in that case would be helpful. We have now raised the kernel log level to debug (echo 8) to see if anything gets logged.
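
For reference, raising the log level was roughly (assuming the standard printk interface):

echo 8 > /proc/sys/kernel/printk   # send everything up to debug level to the console
dmesg -n 8                         # equivalent way to set the console log level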


jvinolas commented Nov 24, 2024

It is still reproducible, now with these cleaning parameters:

╔═════════════════════════╤═══════╗
║ Parameter name          │ Value ║
╠═════════════════════════╪═══════╣
║ Wake up time [s]        │     5 ║
║ Stale buffer time [s]   │     1 ║
║ Flush max buffers       │   500 ║
║ Activity threshold [ms] │    25 ║
╚═════════════════════════╧═══════╝

[screenshot]

@robertbaldyga (Member)

Hi @jvinolas! Do you trace cache occupancy over time? Can the failure be correlated with the moment the occupancy hits 100%?
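
Even sampling the usage statistics periodically would help, e.g. something as simple as (a rough sketch):

while true; do date; casadm -P -i 1 | grep -E 'Occupancy|Dirty'; sleep 60; done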


jvinolas commented Nov 25, 2024

Well, it seems so... what's the solution? I mean, I assume it goes pass-through, but is there any way we can avoid that? Or even any way we could activate it again without having to stop and recreate the cache?

[screenshot]

We are not using any IO class right now. Does the default IO class config solve that problem?

@jvinolas (Author)

We tried the default IO class config, also without success in evicting old data... (what we ran is sketched below)
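
What we ran was roughly the following (a sketch; the CSV path is where our install placed the example config, so treat it as an assumption):

casadm --io-class --load-config --cache-id 1 --file /etc/opencas/ioclass-config.csv   # load the default IO class config
casadm --io-class --list --cache-id 1                                                 # check which classes are active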

@jvinolas (Author)

Looking at the code, could this function have problems on multi-socket (multi-CPU) servers? Ours is dual socket and we also have numad active. We'll try disabling numad to see if it could be interfering.

[screenshot]

@robertbaldyga (Member)

@jvinolas I think we got a reproduction of that behavior. I'll let you know once we identify the root cause.

@jvinolas (Author)

Thanks. Meanwhile, is there any setup that gives us at least a write cache and does not let occupancy grow beyond the dirty data, so we don't get stuck when it reaches full occupancy?

@robertbaldyga (Member)

I think the problem is not only the occupancy, but also the fact that it's 100% dirty data, which means the cache needs to perform on-demand cleaning in order to be able to evict anything. If you need it to work as a write buffer it may be worth trying the ACP cleaning policy. For a long, continuous stream of intensive writes it will not help much, but if your workload has periodic bursts of writes, maybe ACP will be able to clean the cache between the bursts.
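
Something along these lines should work on the running cache (a sketch; exact parameter names as per casadm --set-param --help):

casadm --set-param --name cleaning --cache-id 1 --policy acp   # switch the cleaning policy to ACP
casadm --get-param --name cleaning-acp --cache-id 1            # inspect the ACP wake-up time / flush max buffers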


robertbaldyga commented Nov 29, 2024

Also, if there is any chance that your workload contains long ranges of sequential writes, setting the sequential cutoff policy to never may help.
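
For example, mirroring the get-param call from the report (a sketch):

casadm --set-param --name seq-cutoff --cache-id 1 --core-id 1 --policy never   # disable sequential cutoff for core 1 of cache 1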

@jvinolas (Author)

We'll try and see, thanks.


jvinolas commented Dec 3, 2024

With write back, at 23:10 we set ACP and seq-cutoff to never, and the dirty data stays lower, as it should. Eviction also seems to work correctly after we hit 100% occupancy while doing some rsyncs: dirty lines keep changing, so the cache is still working and not deactivated as happened with ALRU:

[screenshot]


jvinolas commented Dec 9, 2024

We found that if we switch to pt and back to wb, the ACP policy no longer does on-demand cleaning, so we have to stop and start the cache again. For now we keep it in wb.


jvinolas commented Dec 11, 2024

We are hitting more or less the same problem when occupancy reaches 100%: low performance... Would wo mode avoid keeping clean lines, at the cost of not caching reads? If we put a ram cache in wa mode in front of the nvme in wo mode, would that do the trick and keep occupancy at just the dirty lines in the underlying nvme cache, thus avoiding the bug when hitting 100% occupancy?
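
The stack we have in mind would be roughly the following (same devices as our current setup, just wo instead of wb on the nvme level; a sketch, not tested yet):

casadm --start-cache --cache-id 1 --cache-device /dev/nvme0n1p1 --cache-mode wo --cache-line-size 64
casadm --add-core --cache-id 1 --core-device /dev/dm-4       # the dm_vdo core under the nvme cache
casadm --start-cache --cache-id 2 --cache-device /dev/ram0 --cache-mode wa --cache-line-size 64
casadm --add-core --cache-id 2 --core-device /dev/cas1-1     # level-1 exported object as core of the ram cache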

This shows how the cache stops performing well after occupancy hits 100%:

[screenshot]

UPDATED: Applying seq-cutoff never (we forgot to re-apply it after the last cache stop) seems to bring back performance:

[screenshot]

UPDATE2: It worked for about 24 minutes, then performance dropped again...

[screenshot]
