# Validator metrics
metrics:
  - missing_blocks
  - signing_info
  - validator_status
  - delegation_shares
  - voting_power
# System metrics
metrics:
  - CPU usage
  - Memory utilization
  - Disk I/O
  - Network traffic
  - Storage capacity
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'validator'
    static_configs:
      - targets: ['localhost:26660']
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
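Once Prometheus is scraping these targets, it can help to confirm by hand that an endpoint really exposes the series the rest of this setup relies on. The following is a minimal sketch, not part of the original configuration: `metric_value` is a hypothetical helper that reads Prometheus text-format output on stdin and prints the first sample of the named metric. It assumes samples carry no trailing timestamp.

```shell
# Hypothetical helper: print the first sample value of metric $1
# from Prometheus text-format input on stdin.
# Assumes "name{labels} value" lines without a trailing timestamp.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP and TYPE comment lines
    NF >= 2 {
      metric = $1
      brace = index(metric, "{")       # strip the label set, if any
      if (brace > 0) metric = substr(metric, 1, brace - 1)
      if (metric == name) { print $NF; exit }
    }'
}
```

A possible use is `curl -s http://localhost:26660/metrics | metric_value cosmos_p2p_peers`. The metric name here is the one used in this document's alert rules; the name a particular node exposes may differ, in which case the helper prints nothing.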
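# Installation (Debian/Ubuntu package)
apt install -y prometheus-node-exporter
# Service configuration (the package's systemd unit is named prometheus-node-exporter)
systemctl enable prometheus-node-exporter
systemctl start prometheus-node-exporter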
Dashboard Components:
- Validator status
- Block production
- Resource usage
- Network health
- Alert status
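groups:
  - name: validator_alerts
    rules:
      - alert: MissedBlocks
        expr: increase(cosmos_validator_missed_blocks[5m]) > 0
      - alert: LowPeerCount
        expr: cosmos_p2p_peers < 10
      # cpu_usage_percent and disk_free_percent are not stock node_exporter
      # series; they must be supplied by recording rules or a custom exporter.
      - alert: HighCPUUsage
        expr: cpu_usage_percent > 80
      - alert: DiskSpaceLow
        expr: disk_free_percent < 20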
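groups:
  - name: warning_alerts
    rules:
      - alert: HighMemoryUsage
        expr: memory_usage_percent > 80
      - alert: PeerCountDecreasing
        # cosmos_p2p_peers is a gauge, so use delta(); rate() is meant for
        # counters and never returns a negative value.
        expr: delta(cosmos_p2p_peers[15m]) < 0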
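The resource alerts above compare a percentage against a fixed limit. When such an alert fires, a quick local cross-check can confirm the reading. The sketch below approximates the `DiskSpaceLow` condition with `df`; the function names are hypothetical, and the figure is derived from the rounded "Capacity" column, so it can differ by a point or so from whatever exporter feeds `disk_free_percent`.

```shell
# Hypothetical helper: approximate free-space percentage for the
# filesystem that holds path $1 (default "/").
disk_free_percent() {
  # df -P prints one POSIX-format line per filesystem;
  # field 5 is "Capacity", the used percentage with a trailing "%".
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Exit status 0 when free space on $1 is below limit $2 (default 20),
# mirroring the rule "disk_free_percent < 20".
disk_space_low() {
  [ "$(disk_free_percent "${1:-/}")" -lt "${2:-20}" ]
}
```

For example, `disk_space_low / 20 && echo "root filesystem below 20% free"` prints the message only when the threshold is crossed.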
┌────────────────┐
│ Validator Info │
├────────────────┤
│ Block Height   │
│ Missed Blocks  │
│ Peer Count     │
└────────────────┘
┌────────────────┐
│ System Status  │
├────────────────┤
│ CPU Usage      │
│ Memory Usage   │
│ Disk I/O       │
└────────────────┘
Priority 1 (Immediate):
- Missed blocks
- Node offline
- Security breach
Priority 2 (30 mins):
- High resource usage
- Network issues
- Peer count low
Priority 3 (4 hours):
- Performance warnings
- Minor anomalies
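The priority tiers above can be wired into notification tooling with a small lookup. This is a sketch under stated assumptions: the alert names `MissedBlocks`, `LowPeerCount`, `HighCPUUsage`, `HighMemoryUsage`, and `DiskSpaceLow` come from the alert rules in this document, while `NodeOffline` and `SecurityBreach` are hypothetical names, and the assignment of each alert to a tier is an illustrative guess rather than a documented policy.

```shell
# Hypothetical lookup: print "<tier> <response window>" for an alert name.
# The alert-to-tier assignment is an assumption made for illustration.
alert_priority() {
  case "$1" in
    MissedBlocks|NodeOffline|SecurityBreach)
      echo "1 immediate" ;;
    HighCPUUsage|HighMemoryUsage|DiskSpaceLow|LowPeerCount)
      echo "2 30m" ;;
    *)
      # Anything unrecognized is treated as a Priority 3 warning.
      echo "3 4h" ;;
  esac
}
```

Calling `alert_priority DiskSpaceLow` prints `2 30m`. Defaulting unknown alerts to the lowest tier is one possible choice; a stricter setup might default to the highest tier instead.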
- Detection
  - Alert received
  - Issue confirmed
  - Severity assessed
- Response
  - Team notified
  - Actions taken
  - Status updated
- Resolution
  - Issue fixed
  - Root cause analyzed
  - Documentation updated
# Validator logs
journalctl -u cosmosd -f
# System logs
tail -f /var/log/syslog
# Security logs
tail -f /var/log/auth.log
Important Patterns:
- Error messages
- Warning signs
- Performance issues
- Security events
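Watching a live tail works for active debugging, but the patterns listed above are easier to track as counts. The sketch below is a hypothetical helper that counts matching lines on stdin; the example patterns in the usage note are assumptions and should be adapted to what the node and sshd actually log.

```shell
# Hypothetical helper: count lines on stdin that match extended regex $1,
# ignoring case.
count_matches() {
  # grep -c prints 0 but exits non-zero when nothing matches;
  # "|| true" keeps that case from aborting scripts run with "set -e".
  grep -ciE -- "$1" || true
}
```

Possible uses are `journalctl -u cosmosd --since "1 hour ago" --no-pager | count_matches 'error|panic'` for validator errors and `count_matches 'Failed password' < /var/log/auth.log` for failed SSH logins.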
- Block Production
  - Block time
  - Block size
  - Transaction count
- Network Health
  - Peer count
  - Bandwidth usage
  - Latency
- Resource Usage
  - CPU load
  - Memory profile
  - Disk operations
# Check backup status
0 */6 * * * /scripts/verify_backup.sh
# Verify snapshot integrity
0 0 * * * /scripts/check_snapshot.sh
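The contents of `/scripts/verify_backup.sh` are not shown here. As a rough illustration of one check such a script might perform, the sketch below tests whether a backup directory holds at least one non-empty file written within the cron interval. The function name and the directory used in the usage note are hypothetical, and `find -mmin` is a GNU/BSD extension rather than POSIX.

```shell
# Hypothetical check: succeed when directory $1 contains at least one
# non-empty regular file modified within the last $2 minutes
# (default 360, matching the 6-hour cron schedule).
backup_is_fresh() {
  [ -d "$1" ] || return 2   # missing directory: distinct exit status
  [ -n "$(find "$1" -type f -size +0c -mmin "-${2:-360}" | head -n 1)" ]
}
```

Inside a cron-driven script this could be used as `backup_is_fresh /var/backups/validator 360 || echo "backup is stale or missing" >&2`, where the path is a placeholder. A freshness check says nothing about whether the backup can be restored, which is what the recovery drills below are for.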
- Weekly backup tests
- Monthly recovery drills
- Quarterly DR exercises
# Monitor SSH access
tail -f /var/log/auth.log
# Track sudo usage
journalctl _COMM=sudo
# Monitor connections
netstat -tunlp
# Track bandwidth
iftop -i eth0
- Alert handling
- Incident response
- Backup verification
- Performance tuning
- Node recovery
- Network issues
- Security incidents
- Data corruption
- Prometheus
- Grafana
- Node Exporter
- Cosmos Exporter
- Technical team
- Community resources
- Documentation
- Emergency contacts