Do not "STATE_UNKNOWN" on unfinished backups #12

Open
wants to merge 4 commits into
base: master
from


GuillaumeFromage
Hi there,

I've made a hack so that check_borg sends "STATE_OK" when the backups aren't finished but a running process has the same BORG_REPO as the one we're checking.

The date handling that converts the ps output might be overly complicated, but it works for us.

G

check_borg Outdated
[ "$?" = 0 ] || error "Cannot list repository archives. Repo Locked?"
if [ "$?" = 0 ]
then
if not ps aux | grep "${BORG}" | grep 'create' | grep "$BORG_REPO" >> /dev/null

You may want to match the patterns in their exact order by using `grep -E 'PATTERN1.*PATTERN2'`, `awk '/PATTERN1.*PATTERN2/'`, or `sed '/PATTERN1.*PATTERN2/!d'`.
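
A minimal sketch of the ordered match suggested here, assuming the command line shows `create` before the repository path. `borg_create_running` is a hypothetical helper name, not something in check_borg, and the repository argument is treated as an extended regular expression, so paths containing special characters would need escaping.

```shell
# Hypothetical helper: reads process command lines on stdin and succeeds
# when a line contains "create" followed, later on the same line, by the
# repository. NOTE: $1 is used as an extended regex, not a literal string.
borg_create_running() {
    grep -E -q -- "create.*$1"
}

# Illustration with a made-up command line:
if printf '%s\n' 'borg create /srv/backup::today /home' |
    borg_create_running '/srv/backup'; then
    echo "a create run on this repo was found"
fi
```

A single ordered pattern replaces the chain of three `grep` calls and no longer matches lines where the words merely appear in some other order.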

check_borg Outdated
;;
esac

if [ -z "${last}" ]; then
echo "BORG CRITICAL, no archive in repository"
exit "${STATE_CRITICAL}"
if ps aux | grep "${BORG}" | grep 'create' | grep "$BORG_REPO" >> /dev/null

Same here.

check_borg Outdated
if ps aux | grep "${BORG}" | grep 'create' | grep "$BORG_REPO" >> /dev/null
then
# A process most likely on the same repo is running
hours=$(ps -xo etime,cmd | grep "${BORG}" | grep 'create' | grep -v 'python' | grep "${BORG_REPO}" | sed 's/^[ ]*//g' | cut -d ' ' -f 1)

Nearly impossible to read IMHO. Needs some tweaking.
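
One possible way to make this easier to follow is to move the elapsed-time conversion into a small function. The sketch below assumes the `etime` column has the usual `[[dd-]hh:]mm:ss` layout; `etime_to_seconds` is a hypothetical name and is not part of the PR.

```shell
# Hypothetical helper: convert a ps "etime" value ([[dd-]hh:]mm:ss)
# to a number of seconds. Fails (non-zero status) on any other layout.
etime_to_seconds() {
    printf '%s\n' "$1" | awk -F'[-:]' '
        NF == 2 { print $1 * 60 + $2; next }                           # mm:ss
        NF == 3 { print $1 * 3600 + $2 * 60 + $3; next }               # hh:mm:ss
        NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4; next }  # dd-hh:mm:ss
        { exit 1 }'
}

etime_to_seconds '1-02:03:04'
```

Using awk for the arithmetic avoids the octal pitfall that shell arithmetic has with zero-padded fields such as `08`.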

@bebehei
Owner

bebehei commented Aug 5, 2020

What about borg with-lock true?

If there's an operation running on the borg repository blocking our commands, borg with-lock will give an exact result.
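
A rough sketch of how such a probe might look, assuming `BORG` names the borg executable and that the `--lock-wait` option is accepted after the subcommand. `repo_is_locked` is a hypothetical helper, and this has not been tested against a live repository.

```shell
# Hypothetical helper: returns 0 when the repository lock could NOT be
# taken, i.e. some other borg operation presumably holds it.
# Assumptions: $BORG is the borg executable; --lock-wait 1 limits the wait.
repo_is_locked() {
    if "${BORG:-borg}" with-lock --lock-wait 1 "$1" true >/dev/null 2>&1; then
        return 1    # lock acquired and released again: nothing else is running
    fi
    return 0        # lock not acquired (or borg failed for another reason)
}
```

Note that this cannot tell a running backup apart from a stale lock or an unreachable repository, and a successful probe briefly holds the lock itself.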

@GuillaumeFromage
Author

> What about borg with-lock true?
>
> If there's an operation running on the borg repository blocking our commands, borg with-lock will give an exact result.

I tried that first, but I couldn't get with-lock to work (on borg 1.1.8). Could anyone point me to a command that would get the status of a locked backup?

@GuillaumeFromage
Author

I've at least documented the complexity, and realized the code had issues on finished backups (crashing).

@z3dm4n

z3dm4n commented Aug 12, 2020

Thanks a lot!

It's still not the best approach, but better than the first for sure. The best approach IMHO would be to use borg itself to check for locks as @bebehei already mentioned. Due to lack of time, I have not been able to check borg with-lock true, but the documentation sounds promising.

@bebehei
Owner

bebehei commented Aug 12, 2020

The current implementation of this PR is just checking if there is a process running on your local machine.

It doesn't catch if there is a process running on another machine.

But let's take a step back first and think about the design of the plugin. I've got a few thoughts on this:


- It's fine if there is a current snapshot. (borg list succeeds)
- It's problematic if there is **no current snapshot**. (borg list succeeds)
- It's fine if there is currently a backup running. (borg list fails)
- It gets problematic if the backup is **running too long**. (borg list fails)

So we've got a matrix of possibilities: we should check either that a backup is running and still within its time limit, or that the list succeeds and contains a current snapshot.
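
The matrix could be sketched roughly as below. The function name, the yes/no flags, and the mapping of "fine" to OK and "problematic" to CRITICAL are assumptions made for illustration; the plugin might prefer WARNING in some cases.

```shell
# Hypothetical mapping of the four cases to plugin states.
# Arguments are "yes"/"no" flags computed elsewhere by the plugin:
#   $1  did `borg list` succeed?
#   $2  is the newest archive recent enough?      (used when $1 = yes)
#   $3  is the running backup still within time?  (used when $1 = no)
classify_backup() {
    if [ "$1" = yes ]; then
        if [ "$2" = yes ]; then echo OK; else echo CRITICAL; fi
    else
        if [ "$3" = yes ]; then echo OK; else echo CRITICAL; fi
    fi
}

classify_backup no no yes    # list failed, but the backup is still in time
```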

We could implement this either in a Nagios-specific way or in the plugin itself.

In Nagios this would be an easy quick fix. We explicitly use the UNKNOWN state in the plugin and emit it while a backup is running. Additionally, you could use a "scheduled flexible downtime" that is set to start together with the cronjob. If the plugin then switches to UNKNOWN, the service is automatically put into downtime for a specified amount of time, so the check is ignored while the backup runs.
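
As a hedged illustration of the flexible downtime idea, the sketch below only builds a `SCHEDULE_SVC_DOWNTIME` external command line with the `fixed` field set to 0. The function name, host, service, author, and comment values are placeholders, and where the line has to be written (the Nagios command file) depends on the installation.

```shell
# Hypothetical helper: print a Nagios external command that schedules a
# flexible (fixed=0) service downtime.
#   $1 current time, $2 host, $3 service,
#   $4 window start, $5 window end (Unix seconds), $6 duration in seconds
flexible_downtime_cmd() {
    printf '[%s] SCHEDULE_SVC_DOWNTIME;%s;%s;%s;%s;0;0;%s;%s;%s\n' \
        "$1" "$2" "$3" "$4" "$5" "$6" "check_borg" "backup window"
}

# Placeholder values: a four-hour window, one hour of downtime once triggered.
now=$(date +%s)
flexible_downtime_cmd "$now" backuphost borg "$now" "$((now + 14400))" 3600
```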

If we implement this in the plugin, we have to check the timestamp of the lock. According to my research, borg currently has no interface that exposes lock information. There is only lock.roster in the repository during a backup. It's a JSON file with a timestamp. Here's a current example:

{"exclusive": [["falafel@158445875416263", 22722, 0]]}

You have to remove the last 5 characters of the timestamp to get a Unix timestamp.

So whenever a backup is running, you read the roster and check whether its timestamp is within range. Depending on the range, the plugin then reports OK/WARN/CRIT.
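
The range check could look roughly like the sketch below. It takes the lock start time as an argument and deliberately leaves open how that value is obtained, since the exact meaning of the roster fields is not confirmed here. `lock_age_state` and the threshold arguments are hypothetical.

```shell
# Nagios plugin return codes
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2

# Hypothetical helper: map the age of a running backup to a plugin state.
#   $1 lock start (Unix seconds), $2 now (Unix seconds),
#   $3 warning threshold, $4 critical threshold (both in seconds)
lock_age_state() {
    age=$(( $2 - $1 ))
    if [ "$age" -ge "$4" ]; then
        echo "$STATE_CRITICAL"
    elif [ "$age" -ge "$3" ]; then
        echo "$STATE_WARNING"
    else
        echo "$STATE_OK"
    fi
}

# A backup that has been running for 4000 seconds, warn at 1h, critical at 2h:
lock_age_state 1000 5000 3600 7200
```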

I like the second solution. It bloats up the plugin, but it works correctly. And with an interface from the actual borg executable (e.g. borg lock-info), this would be awesome.

@GuillaumeFromage
Author

Hi there,

Thanks for being so responsive!

You are right. This patch works for us because we run it as a local check on the host that pushes the backups. I sent it to y'all as a courtesy in case it helps. I'll read the code for with-lock on the borg side to figure out how it works, and maybe make a feature request to get the "running since" value from borg.

G
