IO against Ceph backed preservation storage is hanging indefinitely (steps to address IO problems, and follow on cleanup)

Johnathan Martin edited this page Aug 4, 2022 · 8 revisions

WIP

Restarting virtual machines to deal with indefinitely stuck Ceph IO; auditing for and remediating any resulting corruption

Starting in late June 2022, we've experienced an issue where most reads and/or writes to Ceph-backed preservation storage will hang indefinitely, neither throwing errors nor completing. See e.g. https://github.com/sul-dlss/preservation_catalog/issues/1896 and https://github.com/sul-dlss/operations-tasks/issues/3100

This page is meant to be a guide to help:

  1. determine whether that problem is occurring at any given moment
  2. alleviate the problem so that preservationIngestWF and thus accessionWF can resume (i.e. "unwedge accessioning")
  3. find and remediate any data corruption that may have been introduced

Determining whether this is necessary

The two most common ways that we become suspicious that this problem is occurring are:

  • Someone who does accessioning complains that it's taking an inordinately long time for a newly accessioned version to finish going through the pipeline.
  • An infrastructure developer or ops engineer who is watching the robot Resque console or preservation catalog's Resque console notices that preservation storage related jobs (e.g. UpdateMoab on preservation robots, or validate_moab on preservation catalog) are taking an inordinately long time (e.g. an UpdateMoab or validate_moab job for a 100 MB Moab has been running for 2 hours).

Additionally, sometimes ops will notice Ceph slow MDS request alerts, urging us to look for one of the above manifestations of the issue.

So how do I know if things are really stuck?

tip: if you're babysitting Resque workers to determine whether things are stuck/unstuck, you may want to turn on the "Live Poll" mode. There's a link in the bottom right corner of the Resque web console for this. It'll auto-refresh the page every 2 seconds, so you don't have to constantly manually refresh. Note that it will eventually fail to keep refreshing when your webauth session times out; when this happens, refresh the page manually, acknowledge 2FA if need be, and hit "Live Poll" again once the console page has reloaded.

Sometimes it does legitimately take a long time for a large object to be processed. Unfortunately, we don't have systematically collected statistics or detailed expectations about overall throughput, or about processing time for a single object given its file count and/or total size. But, for example, most objects of 100 MB or less should take no more than a minute or two to be processed by any given accessionWF or preservationIngestWF step (once picked up! it is of course possible that there's a long backlog of work during a period of heavy activity, leaving many objects sitting unworked in the queue at any given time). Similarly, it would not be surprising for an object of tens of GB to take tens of minutes at any step involving preservation storage IO, and it would be typical for a very large object (e.g. a 1 TB media object) to take many hours (even a day or two might be reasonable).

One thing you can do to help determine whether IO is truly hung is to work with ops or Andrew to check whether any file IO is actually happening: use Linux CLI tools to examine the disk and CPU activity of the relevant worker processes. For operations like TransferObject and UpdateMoab, you can also use du to see whether bytes are still being written to the target storage.
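One low-risk version of the du check is to sample the byte count under the target directory twice and see whether it grew. Below is a minimal sketch; the helper name and the druid path in the example invocation are illustrative placeholders, not part of any existing tooling:

```shell
# Report whether bytes are still being written under a path, by sampling
# `du` twice. A growing byte count means the transfer is still making
# progress; a flat one suggests the IO may be hung.
check_growth() {
  local target=$1 interval=${2:-60}
  local before after
  before=$(du -sb "$target" | cut -f1)
  sleep "$interval"
  after=$(du -sb "$target" | cut -f1)
  if [ "$after" -gt "$before" ]; then
    echo "still writing: grew by $((after - before)) bytes in ${interval}s"
  else
    echo "no growth in ${interval}s -- IO may be hung"
  fi
}

# Example (placeholder path; substitute the stuck object's druid tree path):
# check_growth /storage_root/kj/123/bc/4567/kj123bc4567 60
```

If the sysstat package is installed, `pidstat -d -p <PID> 5` and `iostat -x 5` are similarly useful for watching per-process and per-device IO activity.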

An additional confounding factor: we've recently seen a situation where many objects, often sharing the first couple of characters of their respective druids (e.g. kj123bc4567 and kj456df7890), will get stuck on an IO-heavy step (e.g. validate_moab in pres cat or UpdateMoab in pres robots) for about 15 minutes, at which point the traffic jam clears on its own and the workflows continue moving.

First things to try if it seems things are actually stuck

If none of the above (large objects taking long time to process as expected, objects sitting in a deep queue, a self-clearing 15-ish minute jam) seem to apply, it is likely that things are actually stuck. If possible, consult with ops or Andrew to see if someone who has dealt with the issue before agrees that things are stuck based on prior first-hand experience.

Rebooting is the most disruptive action and the one most likely to introduce corruption, but it's also the one most likely to get things moving if they are truly stuck. Before resorting to it, you can first try:

  • MDS failover -- this is something that you must request from ops. They will failover one or more Ceph MDS servers that are being slow to respond. This may completely alleviate the issue, in which case it is unlikely that data corruption will be introduced. After the failover(s), watch the queues and determine per the above guidance whether the blockage has been resolved.
  • killing Resque worker processes for stuck jobs -- you may do this manually from the Linux command line of the VM on which the worker process is running. Use either kill -9 <PID of the child Resque process doing the work> or kill -s TERM <PID of the parent process that forked the child actually doing the work>. You can find parent process IDs by looking at the number after the colon in the "Where" column of the Resque console for the job you're curious about. You can get the child process ID by running ps -ef | grep resque from the command line of the worker VM and finding the ID of the process that the aforementioned parent forked. NOTE: at present there is one pres robots worker VM, but there are 3 pres cat worker VMs -- the "Where" column on the Resque console should tell you which host the job you care about is running on.
    • Killing a worker may introduce data corruption regardless of which process is killed, and will almost certainly introduce data corruption if the process killed is a pres robots worker performing UpdateMoab or TransferObject. If the terminated worker was running a validate_moab job in pres cat, you can retry it from the Resque console. Because the robot system has a bit more indirection and management over robots' Resque instances, any pres robots jobs that are killed should be retried via the workflow UI in Argo, or by manually resetting workflow steps from the workflow-server-rails rails console. More detailed info on that and the rest of remediation is linked later in this document.

Sometimes an MDS failover or the targeted termination of a worker process will clear up the jam entirely. Sometimes it will clear things for a few minutes, only for things to get stuck again. You'll have to watch the Resque consoles for a few minutes after either of the above measures and determine which is the case.

What to do if it's determined that rebooting pres robots and/or pres cat is necessary

  1. warn #dlss-aaas that things are stuck (if they didn't already raise the issue)
  2. graceful shutdown of resque pool master (likely pres robots, maybe also pres cat)
  3. termination of stuck worker processes (maybe pres robots, maybe pres cat -- either way, note stuck druids and age of hung jobs)
  4. all clear to ops to reboot (likely just pres robots VM, maybe also pres cat VMs, but not the redis VM)
  5. ops indicates VM reboot(s) complete
  6. bring back pres cat workers, let them work off any backlog
  7. bring back pres robots workers, let them work off any backlog
  8. keep an eye on argo wf grid
  9. all clear (for now) to #dlss-aaas
  10. run checksum validation audits and replication audits for everything accessioned from one day before the first reboot through one day after the last reboot (spanning all reboots, if there were several within a few days of one another before auditing was done)