Improve alarms #1336

Merged
merged 5 commits into from
Nov 26, 2024

Conversation

@yrong yrong commented Nov 20, 2024

  • Move alarm configuration out of api package
  • Make alarm evaluation variables configurable
  • Make monitoring more sensitive via SCAN_INTERVAL
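
One way to make the alarm evaluation variables configurable is to read them from the environment with validated defaults. The sketch below is illustrative only: apart from `SCAN_INTERVAL`, the variable names, the `AlarmEvaluationConfig` shape, and the default values are assumptions, not the ones this PR uses, and the unit of `SCAN_INTERVAL` is not stated in this thread.

```typescript
// Hypothetical config shape; evaluationPeriods and datapointsToAlarm mirror
// the CloudWatch alarm parameters of the same names.
interface AlarmEvaluationConfig {
    scanInterval: number
    evaluationPeriods: number
    datapointsToAlarm: number
}

type Env = Record<string, string | undefined>

// Read a positive integer from the environment, or fall back to a default.
const readPositiveInt = (env: Env, key: string, fallback: number): number => {
    const raw = env[key]
    if (raw === undefined || raw.trim() === "") {
        return fallback
    }
    const parsed = Number(raw)
    if (!Number.isInteger(parsed) || parsed <= 0) {
        throw new Error(`${key} must be a positive integer, got "${raw}"`)
    }
    return parsed
}

// Defaults are placeholders, not the values used in this PR.
// ALARM_EVALUATION_PERIODS and ALARM_DATAPOINTS_TO_ALARM are hypothetical names.
const loadAlarmConfig = (env: Env): AlarmEvaluationConfig => ({
    scanInterval: readPositiveInt(env, "SCAN_INTERVAL", 60),
    evaluationPeriods: readPositiveInt(env, "ALARM_EVALUATION_PERIODS", 3),
    datapointsToAlarm: readPositiveInt(env, "ALARM_DATAPOINTS_TO_ALARM", 2),
})
```

In a Node process this would be called as `loadAlarmConfig(process.env)`; taking the environment as a parameter keeps the parsing easy to test.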

Resolves: https://linear.app/snowfork/issue/SNO-1233

@yrong yrong marked this pull request as ready for review November 21, 2024 05:48

yrong commented Nov 21, 2024

Reinitialized the alarms with more sensitive thresholds and restarted the monitor process.

Let's just watch CloudWatch for some time to see whether that makes sense.

@yrong yrong requested a review from alistair-singh November 21, 2024 05:52

@alistair-singh alistair-singh left a comment

+1, approving. That said, I think we should move away from boolean-style metrics and just log numbers where possible. Maybe we can address that in a separate PR.

web/packages/operations/.env.example
- Value: Number(
-     channel.toEthereum.outbound < channel.toEthereum.inbound
- ),
+ Value: Number(channel.toEthereum.outbound < channel.toEthereum.inbound),
Contributor

Instead of logging booleans, maybe we should log the difference. This would then be a metric of undelivered messages.
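
A minimal sketch of this suggestion, assuming the channel exposes `outbound` and `inbound` nonces as plain numbers. The `ChannelNonces` type, the function name, and the metric name are hypothetical; the datum fields (`MetricName`, `Value`, `Unit`) follow the shape of a CloudWatch metric datum. This is not the code from the PR.

```typescript
// Hypothetical types for illustration.
interface ChannelNonces {
    outbound: number
    inbound: number
}

interface MetricDatum {
    MetricName: string
    Value: number
    Unit: "Count"
}

// Emit the signed difference instead of Number(outbound < inbound).
// A positive value is the number of messages sent but not yet delivered;
// a negative value is exactly the case the boolean metric used to flag.
const toEthereumBacklogMetric = (nonces: ChannelNonces): MetricDatum => ({
    MetricName: "ToEthereumUndeliveredMessages", // hypothetical metric name
    Value: nonces.outbound - nonces.inbound,
    Unit: "Count",
})
```

With a numeric metric, the alarm condition lives in the alarm definition rather than in the monitor, so it can be tuned without redeploying the code that emits the metric.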

Contributor Author

Addressed in 785b59b, which adds metrics for the difference.

? parseInt(process.env["CheckIntervalToEthereum"])
: status.BlockLatencyThreshold.ToEthereum) &&
metrics.bridgeStatus.toEthereum.latestPolkadotBlockOnEthereum <=
metrics.bridgeStatus.toEthereum.blockLatency > BlockLatencyThreshold.ToEthereum &&
Contributor

Maybe we should also just log blockLatency as a plain number instead of a boolean. That would make threshold management easier.
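
To illustrate why a plain number eases threshold management, here is a rough sketch that separates metric emission from the alarm definition. All names (`blockLatencyMetric`, `blockLatencyAlarm`, `breaches`, the metric name) are hypothetical; the alarm fields are modeled on CloudWatch alarm parameters, and `breaches` is only a local stand-in for the comparison CloudWatch would perform.

```typescript
interface LatencyMetric {
    MetricName: string
    Value: number
    Unit: "Count"
}

interface LatencyAlarm {
    MetricName: string
    Threshold: number
    ComparisonOperator: "GreaterThanThreshold"
    EvaluationPeriods: number
}

const METRIC_NAME = "ToEthereumBlockLatency" // hypothetical metric name

// The monitor only reports the raw latency, in blocks.
const blockLatencyMetric = (blockLatency: number): LatencyMetric => ({
    MetricName: METRIC_NAME,
    Value: blockLatency,
    Unit: "Count",
})

// The threshold lives in the alarm definition, so it can change
// without touching the code that emits the metric.
const blockLatencyAlarm = (threshold: number, evaluationPeriods: number): LatencyAlarm => ({
    MetricName: METRIC_NAME,
    Threshold: threshold,
    ComparisonOperator: "GreaterThanThreshold",
    EvaluationPeriods: evaluationPeriods,
})

// Local stand-in for the alarm's comparison, for illustration only.
const breaches = (metric: LatencyMetric, alarm: LatencyAlarm): boolean =>
    metric.Value > alarm.Threshold
```

The same emitted value can then be checked against different thresholds, for example a strict one in staging and a looser one in production.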

@yrong yrong merged commit 0d47ebe into main Nov 26, 2024
1 check passed
@yrong yrong deleted the ron/improve-alarms branch November 26, 2024 08:43

2 participants