High-availability monitor for VPN tunnels.
```
   Side A                                 Side B

+----------+          active           +----------+
|  Active  | <=======================> |  Active  |
+----------o                           x----------+
     .         o                   x         .
     .             o           x             .
     .                 o   x                 .
     .                 x   o                 .
     .   backup    x           o   backup    .
     .         x                   o         .
+----------x                           o----------+
|  Backup  | < - - - - - - - - - - - > |  Backup  |
+----------+          backup           +----------+
```
A typical setup would be:

- 2x VPN hosts on each side of the bridge.
- One host on each side is configured as `active`, the other as `standby`.
- Each host has VPN tunnels configured to both of the hosts on the other side.
- Only the `active`-`active` tunnel is used; the others are there for backup.
In the event of downtime (a tunnel is broken, or one of the active hosts is down), the monitor would:

- Promote the remaining tunnel to `active`.
- Trigger custom command scripts to adjust for the situation (to re-configure the routes, for example).
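To make the script hooks concrete, a route-reconfiguration hook might look like the fragment below. This is a hypothetical sketch: the `ip route replace` command is a site-specific choice, not something `vpnham` ships; the `default_scripts`/`bridge_activate` keys and the `${...}` placeholders are described later in this document.

```yaml
# Hypothetical example: on promotion to `active`, point the route for the
# peer CIDR at the bridge interface (the exact commands are site-specific).
default_scripts:
  bridge_activate:
    - ["sh", "-c", "ip route replace ${bridge_peer_cidr} dev ${bridge_interface}"]
```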
In order to achieve that, `vpnham` does the following:

- Regularly sends UDP datagrams to each of the peers on the other side (connections `<==>`, `< - >`, `< x >`, and `< o >` on the diagram above).
  - The peer adds its bit into the datagram and sends it back.
  - This is how `vpnham` determines the `up`/`down` status of a tunnel (it accounts for the sent probes with their sequence numbers, and expects them to come back).
- Regularly polls the partner's bridge (i.e. the `active` bridge polls the `standby` one, and vice versa; connections `< . >` on the diagram above).
  - Failure to poll means the partner is `down`.
  - If both of the tunnels are `down`, the bridge marks itself `down` as well and reports itself accordingly to its partner.
- If the `active` tunnel is `down`, the `standby` one is promoted to `active`.
- If the `active` bridge is `down`, the `standby` one is promoted to `active`.
- Once the `active` (by configuration) tunnel gets back `up`, it reclaims the `active` status (in other words, ties are broken via configuration).
  - A similar approach applies to the bridges.
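The probe accounting can be sketched roughly as follows. This is an illustrative model, not `vpnham`'s actual implementation: the `ProbeTracker` name is invented here, and the thresholds mirror the `threshold_up`/`threshold_down` settings shown in the configuration below.

```python
# Illustrative sketch (not vpnham's actual code) of how sequence-numbered
# probes can drive an up/down decision with hysteresis thresholds.

class ProbeTracker:
    def __init__(self, threshold_up=3, threshold_down=5):
        self.threshold_up = threshold_up
        self.threshold_down = threshold_down
        self.pending = set()   # sequence numbers sent but not yet returned
        self.successes = 0     # consecutive probes that came back
        self.failures = 0      # consecutive probes that did not
        self.up = False

    def probe_sent(self, seq):
        self.pending.add(seq)

    def probe_returned(self, seq):
        if seq in self.pending:
            self.pending.discard(seq)
            self.successes += 1
            self.failures = 0
            if self.successes >= self.threshold_up:
                self.up = True

    def probe_timed_out(self, seq):
        if seq in self.pending:
            self.pending.discard(seq)
            self.failures += 1
            self.successes = 0
            if self.failures >= self.threshold_down:
                self.up = False

tracker = ProbeTracker()
for seq in range(3):        # three probes come back: tunnel goes up
    tracker.probe_sent(seq)
    tracker.probe_returned(seq)
assert tracker.up
for seq in range(3, 8):     # five probes time out: tunnel goes down
    tracker.probe_sent(seq)
    tracker.probe_timed_out(seq)
assert not tracker.up
```

Note that a single lost probe does not flip the state: the failure counter has to reach `threshold_down` (and, symmetrically, the success counter has to reach `threshold_up`), which avoids flapping on transient packet loss.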
There are configurable scripts (per bridge, or globally):

- `bridge_activate` is triggered when a bridge is promoted to `active`. Recognised placeholders are:
  - `${proto}`
  - `${bridge_peer_cidr}`
  - `${bridge_interface}`
  - `${bridge_interface_ip}`
- `tunnel_activate` is triggered when a tunnel is marked `active`. Recognised placeholders are the same as for `bridge_activate`, plus:
  - `${tunnel_interface}`
  - `${tunnel_interface_ip}`
- `tunnel_deactivate` is triggered when the tunnel's `active` mark is removed. Recognised placeholders are the same as for `tunnel_activate`.
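The `${...}` placeholders follow ordinary shell-style substitution. As a sketch of the mechanics (the use of Python's `string.Template` and the sample values are assumptions for illustration, not `vpnham`'s actual code):

```python
from string import Template

# Hypothetical values; in vpnham these come from the bridge/tunnel state.
context = {
    "proto": "ipv4",
    "bridge_peer_cidr": "10.1.0.0/16",
    "bridge_interface": "eth0",
    "bridge_interface_ip": "10.0.0.2",
    "tunnel_interface": "eth1",
    "tunnel_interface_ip": "192.168.255.2",
}

command = "echo 'activate ${tunnel_interface} ${tunnel_interface_ip}'"
print(Template(command).substitute(context))
# -> echo 'activate eth1 192.168.255.2'
```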
In addition, there is a metrics endpoint where `vpnham` reports the following:

- `vpnham_bridge_active` is a gauge for the count of active bridges: `0` means no connectivity to the other side (bad); `1` means all is good (yay); `2` means both we and our partner consider themselves `active` (this means a bug).
- `vpnham_bridge_up` is a gauge for the count of online bridges (from `0` to `2`, the more the merrier).
- `vpnham_tunnel_interface_active` is a gauge for the count of active tunnels.
- `vpnham_tunnel_interface_up` is a gauge for the count of online tunnels.
Also (since we have that info at our fingertips through probing), the following metrics are exposed:

- `vpnham_probes_sent_total` is a counter for probes sent.
- `vpnham_probes_returned_total` is a counter for probes returned.
- `vpnham_probes_failed_total` is a counter for probes that failed to be sent or received.
- `vpnham_probes_latency_forward_microseconds` is a histogram of the probes' forward latency (their trip "there").
- `vpnham_probes_latency_return_microseconds` is a histogram of the probes' return latency (their trip "back").
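Splitting a probe's round trip into a "forward" and a "return" leg requires a timestamp echoed back by the peer; the sketch below illustrates the arithmetic. The function and field names are invented here, and comparing timestamps across hosts assumes their clocks are synchronised (e.g. via NTP) — otherwise only the total round-trip time is reliable.

```python
# Illustrative sketch: derive the "forward" and "return" legs of a probe's
# round trip from three timestamps carried in (or around) the datagram.

def leg_latencies_us(sent_at_us, peer_echoed_at_us, received_at_us):
    forward = peer_echoed_at_us - sent_at_us   # trip "there"
    back = received_at_us - peer_echoed_at_us  # trip "back"
    return forward, back

fwd, back = leg_latencies_us(1_000_000, 1_000_350, 1_000_900)
print(fwd, back)  # -> 350 550
```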
Example configuration:

```yaml
bridges:
  vpnham-dev-lft:                        # bridge name (must match one at the partner's side)
    role: active                         # our role (`active` or `standby`)
    bridge_interface: eth0               # interface on which the bridge connects to VPC
    peer_cidr: 10.1.0.0/16               # CIDR range of the VPC we are bridging into
    status_addr: 10.0.0.2:8080           # address where our partner polls our status
    partner_url: http://10.0.0.3:8080/   # url where we poll the status of the partner
    probe_interval: 1s                   # interval between UDP probes or status polls
    probe_location: left/active          # location label for the latency metrics
    tunnel_interfaces:
      eth1:                              # interface on which VPN tunnel is running
        role: active                     # tunnel role (`active` or `standby`)
        addr: 192.168.255.2:3003         # address where we respond to UDP probes
        probe_addr: 192.168.255.3:3003   # address where we send the UDP probes to
        threshold_down: 5                # count of failed probes/polls to mark peer/partner "down"
        threshold_up: 3                  # count of successful probes/polls to mark peer/partner "up"
      eth2:
        role: standby
        addr: 192.168.255.18:3003
        probe_addr: 192.168.255.19:3003
        threshold_down: 7
        threshold_up: 5

scripts_timeout: 5s                      # max amount of time for script commands to finish

metrics:
  listen_addr: 0.0.0.0:8000              # where we expose the metrics (at `/metrics` path)
  latency_buckets_count: 33              # count of histogram buckets for latency metrics
  max_latency_us: 1000000                # max latency bucket in [us]; the buckets are computed
                                         # exponentially, so that
                                         # max_latency == pow(min_latency, buckets_count)

default_scripts:                         # default scripts (complement the `scripts` on bridge config)
  bridge_activate:                       # script that we will run when bridge becomes `active`
    - ["sh", "-c", "echo ${bridge_interface} ${bridge_interface_ip} ${bridge_peer_cidr}"]
    - ["sleep", "15"]
  interface_activate:                    # script that we will run when tunnel becomes `active`
    - ["sh", "-c", "echo 'activate ${tunnel_interface} ${tunnel_interface_ip}'"]
  interface_deactivate:                  # script that we will run when tunnel becomes `inactive`
    - ["sh", "-c", "echo 'deactivate ${tunnel_interface} ${tunnel_interface_ip}'"]
```
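The comment about exponential buckets can be made concrete: per the stated relation `max_latency == pow(min_latency, buckets_count)`, the smallest bucket boundary is the `buckets_count`-th root of `max_latency_us`, and each subsequent boundary is the next power of that root. A sketch of that relation (not `vpnham`'s actual code):

```python
# Illustrative sketch of the exponential latency buckets described in the
# config comment: max_latency == pow(min_latency, buckets_count).

def exponential_buckets(max_latency_us, buckets_count):
    min_latency = max_latency_us ** (1.0 / buckets_count)
    return [min_latency ** i for i in range(1, buckets_count + 1)]

buckets = exponential_buckets(1_000_000, 33)
print(round(buckets[0], 2))  # smallest boundary, in microseconds
print(round(buckets[-1]))    # -> 1000000 (the configured maximum)
```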
> **Note:** See the example files in the repository for the full example; see also `make docker-compose`.
`vpnham` takes only one CLI parameter, `--config`, which should point to the YAML file with the full configuration. By default, it looks for a `.vpnham.yaml` file in the working directory.