ARC writes causing major system slowdown on ZFS versions after v2.1.0 (may be macOS specific) #16046
Comments
Please file these on the macOS project's tracker unless you reproduce them on not-macOS until that project gets merged here.
The issue description points at where I linked the macOS port GitHub issue and explained the rationale behind purposefully and knowingly cross-posting 😝 We don't have enough ARC code experts working on the macOS project, and this problem has been extremely difficult to debug as the system quickly becomes unresponsive. The issue appears to be somewhere within the core ARC code, so it does not appear to be a downstream problem as such, but for whatever reason it either does not affect Linux, or is not producing an obvious slowdown in the same way (or it would have been reported and fixed already). While the fact that it affects macOS severely makes it a macOS problem, the code that appears to be causing it isn't macOS specific as far as those of us testing can tell.
There are better venues to ask for people to look at things than "I'm going to post a bug". Slack, mailing lists, and IRC are right there.
But it is a bug, though; something ARC-related changed in OpenZFS between v2.1.0 and v2.1.6. If you don't want people to report bugs, don't have a bug report tool.
I'm not trying to discourage people from reporting bugs. You are saying the behavior changed between 2.1.0 and 2.1.6 on macOS, and that it is causing severe issues in the macOS port. Sure. My remark is that this does not appear to be causing problems, or even a behavior change, in the upstream project, so I wouldn't consider it a bug here absent more information pointing to it being one. Your reason for posting it here appears to be "I wanted more eyeballs than the macOS project gets", which seems less like a rationale for "this is a bug in upstream OpenZFS" and more a reason to ask people in OpenZFS or elsewhere who know the code well for their $0.02. As far as I'm aware we don't have a hard and fast policy about what is and isn't a bug, but it seemed like what you wanted would have been better served by asking in one of the other places developers hang out and communicate than by opening a bug, which is why I closed it as not a bug here and suggested communicating that way. I don't have a strong preference about this being open or closed, other than that the issue tracker is full of bugs that are years old and I'd like to cut down on that, but I think you will be better served asking in Slack or on the mailing list for people's insights than using a bug report for a behavior change that is specific to the macOS port.
#14054 seems to already be fixed? If the issue is related to xattrs, as that one was (not an unfair assumption, as macOS loves its xattrs), then this would still make it an upstream problem post-v2.1.0. In that case Linux just might not be as affected, since it doesn't dump 4+ xattrs onto almost literally every file it touches, unlike macOS, which absolutely does. But I don't have a suitable Linux test system to find out; most of my machines are Macs, plus one Windows gaming PC that actively defies dual booting (hell, it barely runs Windows). The Linux systems with ZFS that I work with sadly aren't mine to mess about with testing theories. While #15214 seems to focus on testing very large files, I'll give disabling prefetch a try when I get a chance, as I do see a fair number of references to …
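(For reference, the prefetch knob in question is the standard OpenZFS `zfs_prefetch_disable` tunable. Below is a rough sketch of toggling it on Linux; the macOS port exposes its tunables via sysctl instead, and the exact OID shown is an assumption that may vary by release.)

```sh
# Disable ZFS prefetch at runtime on Linux (standard OpenZFS module parameter).
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_prefetch_disable

# Rough macOS-port equivalent; the sysctl OID here is an assumption.
sudo sysctl kstat.zfs.darwin.tunable.zfs_prefetch_disable=1
```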
System information
Describe the problem you're observing
When running macOS ZFS versions newer than v2.1.0, system performance declines rapidly with the amount of write activity passing through the primary ARC (datasets have `primarycache=metadata` and/or `primarycache=all` set). As the primary ARC grows, the problem grows worse until the system becomes unusable or processes begin crashing. Setting `primarycache=none` on all active datasets immediately "resolves" this issue, at the cost of having no cached (meta)data.
While this issue appears to be macOS specific, I'm cross-posting here because I'd like to make sure it isn't also affecting Linux in a much less noticeable way; we also haven't thought of any obvious macOS-specific cause for the problem so far. Plus it would be nice to have more theories about the root cause from anyone more familiar than we are with the ARC internals (my knowledge is mostly limited to screwing around with tunables).
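(For anyone wanting to try the workaround described above, a minimal sketch of the property changes involved; `tank/data` is a placeholder dataset name.)

```sh
# Disable primary ARC caching on an affected dataset ("tank/data" is a placeholder).
sudo zfs set primarycache=none tank/data

# Verify the property took effect.
zfs get primarycache tank/data

# Revert to the inherited/default behaviour once finished testing.
sudo zfs inherit primarycache tank/data
```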
The issue does not affect `zfs-macOS-2.1.0-1`, which is the last known good version for macOS systems running with larger primary ARCs; systems with smaller ARC sizes are either unaffected or negligibly affected. I have two older Macs using ZFS to host a single zvol for Time Machine backups, but these have primary ARC maximum sizes of around 256 MB.
Unfortunately macOS never received releases of v2.1.1 through v2.1.5; the next version available is v2.1.6, so that doesn't help with narrowing down which change(s) might have caused this problem.
Describe how to reproduce the problem
Set `primarycache=all` (and/or `primarycache=metadata`) on at least some active (mounted) datasets, then generate moderate to heavy write activity, for example with `rsync`.
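A rough sketch of that setup (the pool/dataset name and paths are placeholders, and the mountpoint is assumed to be under /Volumes as is typical for the macOS port):

```sh
# Enable primary ARC caching on an active dataset (placeholder name "tank/data").
sudo zfs set primarycache=all tank/data

# Push sustained write activity through the dataset; the source path is a placeholder.
rsync -a /path/to/large/source/ /Volumes/tank/data/
```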
Expected Results
Performance should be similar to, or better than, v2.1.0 (due to improved support for hardware acceleration).
Actual Results
As the primary ARC size increases, performance will decline.
On my main system (a 2018 hexacore i7 Mac mini) the system begins locking up briefly at around 1 GB of primary ARC, and becomes increasingly unusable beyond roughly 4 GB of primary ARC, especially with moderate to heavy write activity.
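(For reference, a quick way to check the current primary ARC size against those thresholds, assuming the macOS port exposes the standard arcstats kstats through sysctl; on Linux the same counter lives in /proc/spl/kstat/zfs/arcstats.)

```sh
# Current ARC size in bytes. The sysctl OID is an assumption for the macOS port;
# on Linux use: awk '/^size/ {print $3}' /proc/spl/kstat/zfs/arcstats
sysctl -n kstat.zfs.misc.arcstats.size
```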
Additional Notes
This is a modified cross-post from the macOS port's GitHub issue, and you can read more about the problem in the related forum thread (I've linked to the first post where we started narrowing down what was and wasn't causing problems). However, my quick summary would be:
Disabling the primary ARC entirely (setting `primarycache=none` on all active datasets) results in an immediate, substantial improvement: performance roughly in line with v2.1.0 running without ARC, or slightly better.
We've had a hard time gathering useful statistics on affected systems; by the time the issue begins, it becomes very hard to run tests and capture results. I'm open to any suggestions of additional data to gather, though as my main affected system is a working machine it takes time before I can do so (since it requires upgrading temporarily, testing, then downgrading back to v2.1.0), and as I say, it's difficult to run tests when the problem is at its most noticeable.
I'm attaching the latest macOS spindump that I was able to gather; I grabbed this after disabling compressed ARC, and while copying data between two datasets using `rsync` (rather than using ZFS' own disgustingly efficient `send` feature 😉). macOS spindumps are basically a stack trace of all running processes at the time they are taken, and when I triggered it I was seeing `kernel_task` CPU utilisation of over 200% (though as I say, the system is so sluggish at this point that it's hard to reliably capture these quickly enough).
I'm hoping examination of the `rsync` processes might give some clues as to what ZFS is doing. To jump to the receiving side of the transfer you can search the text for "rsync [9056]".
spindump.v2.2.3rc4.compresse_arc_disabled.zip
Apologies for cross-posting an issue that is most likely macOS specific, but we're struggling to figure out why that would be, and hopefully it won't hurt to make sure it isn't something that could affect other platforms, just less noticeably.