
# High Availability
Design patterns for fault-tolerant Muti Metroo deployments.
## Overview

High availability in Muti Metroo is achieved through:

- **Redundant paths**: Multiple routes to destinations
- **Automatic failover**: Route selection based on availability
- **Reconnection**: Automatic peer reconnection with backoff
- **Load distribution**: Multiple exit points
## Pattern 1: Redundant Transit
Multiple transit paths between ingress and exit.
### Architecture

### Configuration

**Ingress Agent:**

```yaml
agent:
  display_name: "Ingress"

peers:
  # Primary transit
  - id: "${PRIMARY_TRANSIT_ID}"
    transport: quic
    address: "transit-primary.example.com:4433"
    tls:
      ca: "./certs/ca.crt"

  # Backup transit
  - id: "${BACKUP_TRANSIT_ID}"
    transport: quic
    address: "transit-backup.example.com:4433"
    tls:
      ca: "./certs/ca.crt"

socks5:
  enabled: true
  address: "127.0.0.1:1080"
```
Both transits connect to the exit, which advertises its routes. The ingress learns those routes via both paths and prefers the one with the lower hop count.
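To confirm that both paths have been learned, inspect the routing table on the ingress via its health endpoint (shown here with the `/healthz` endpoint and `jq` filter used elsewhere in this guide; the exact JSON shape of the response is implementation-specific):

```bash
# Expect the exit's routes to appear once per transit path
curl -s http://localhost:8080/healthz | jq '.routes'
```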
### Failover Behavior

1. Primary transit fails.
2. Routes learned via the primary expire (route TTL, default 5m).
3. Only the backup route remains.
4. Traffic flows through the backup.
5. The primary recovers and re-advertises its routes.
6. Traffic returns to the primary (lower metric).
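To watch this sequence happen live, poll the routing table while you stop and restart the primary transit (a simple sketch reusing the `/healthz` endpoint above):

```bash
# Re-query the routing table every 5 seconds during a failover test
watch -n 5 "curl -s http://localhost:8080/healthz | jq '.routes'"
```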
### Faster Failover

Reduce the route TTL and advertisement interval for faster failure detection:

```yaml
routing:
  route_ttl: 1m           # 1 minute TTL
  advertise_interval: 30s # Advertise every 30s
```
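Shorter TTLs and advertisement intervals detect failures sooner at the cost of more control-plane traffic; pick values that match your failover budget.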
## Pattern 2: Multiple Exits
Multiple exit points for the same routes.
### Architecture

### Configuration

**Exit A:**

```yaml
agent:
  display_name: "Exit A (DC East)"

exit:
  enabled: true
  routes:
    - "10.0.0.0/8"
  dns:
    servers:
      - "10.0.0.1:53"
```

**Exit B:**

```yaml
agent:
  display_name: "Exit B (DC West)"

exit:
  enabled: true
  routes:
    - "10.0.0.0/8"
  dns:
    servers:
      - "10.0.0.1:53"
```
The ingress learns the same route from both exits and uses the one with the lower metric (hop count).
## Pattern 3: Geographic Redundancy
Agents in multiple regions for disaster recovery.
### Architecture

### Cross-Region Peering

```yaml
# Agent A1 (Region A)
agent:
  display_name: "Region A Primary"

peers:
  # Local exit
  - id: "${EXIT_A_ID}"
    transport: quic
    address: "exit-a.region-a.internal:4433"

  # Cross-region peer
  - id: "${AGENT_B1_ID}"
    transport: quic
    address: "agent-b1.region-b.example.com:4433"
```
## Pattern 4: Active-Active Ingress
Multiple ingress points behind a load balancer.
### Architecture

### DNS Round-Robin

```
proxy.example.com. 300 IN A 192.168.1.10 ; Ingress 1
proxy.example.com. 300 IN A 192.168.1.11 ; Ingress 2
```
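DNS round-robin is simple but performs no health checking: clients may keep resolving to a failed ingress until the record's TTL expires. For automatic removal of failed backends, use an L4 load balancer as shown below.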
### L4 Load Balancer

HAProxy example:

```
frontend socks5
    bind *:1080
    mode tcp
    default_backend socks5_backends

backend socks5_backends
    mode tcp
    balance roundrobin
    server ingress1 192.168.1.10:1080 check
    server ingress2 192.168.1.11:1080 check
```
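The `check` keyword enables HAProxy's health checks; in `mode tcp` this defaults to a plain TCP connect test, so a failed ingress is removed from rotation as soon as its SOCKS5 port stops accepting connections.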
## Monitoring for HA

### Health Checks

```bash
# Check each agent
curl http://ingress1:8080/health
curl http://ingress2:8080/health
curl http://exit:8080/health

# Check peer connectivity
curl http://ingress1:8080/healthz | jq '.peers'
```
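For unattended monitoring, the same checks can be wrapped in a small script (a minimal sketch; it assumes an agent's health endpoint is unreachable or returns a non-2xx status when unhealthy):

```bash
#!/usr/bin/env bash
# Probe each agent's health endpoint; exit non-zero if any check fails.
hosts="ingress1 ingress2 exit"
status=0
for host in $hosts; do
  if ! curl -fsS --max-time 5 "http://${host}:8080/health" > /dev/null; then
    echo "UNHEALTHY: ${host}" >&2
    status=1
  fi
done
exit $status
```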
### Key Metrics

Monitor these metrics for HA:

| Metric | Alert Condition |
|---|---|
| `peers_connected` | < expected count |
| `routes_total` | < expected count |
| `peer_disconnects_total` | Spike in rate |
| `stream_errors_total` | Spike in rate |
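A scrape configuration for these metrics might look like the following (a sketch that assumes each agent exposes Prometheus metrics at `/metrics` on its health port; adjust to your deployment):

```yaml
# prometheus.yml (excerpt)
scrape_configs:
  - job_name: "muti-metroo"
    metrics_path: /metrics
    static_configs:
      - targets:
          - "ingress1:8080"
          - "ingress2:8080"
          - "exit:8080"
```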
### Prometheus Alert Rules

```yaml
groups:
  - name: muti-metroo
    rules:
      - alert: PeerDisconnected
        expr: muti_metroo_peers_connected < 2
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Muti Metroo has fewer than 2 peers connected"

      - alert: NoRoutes
        expr: muti_metroo_routes_total == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Muti Metroo has no routes in routing table"
```
## Reconnection Tuning

Configure aggressive reconnection for faster recovery:

```yaml
connections:
  reconnect:
    initial_delay: 500ms # Start fast
    max_delay: 30s       # Cap at 30s
    multiplier: 1.5      # Slower backoff
    jitter: 0.3          # 30% jitter
    max_retries: 0       # Never give up
```
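To see what schedule these settings produce, here is a small, self-contained sketch of multiplier-based backoff with jitter (an illustration of the arithmetic only; the agent's exact formula is internal and assumed here):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	initial := 500 * time.Millisecond // initial_delay
	maxDelay := 30 * time.Second      // max_delay
	multiplier := 1.5                 // multiplier
	jitter := 0.3                     // jitter

	delay := initial
	for attempt := 1; attempt <= 12; attempt++ {
		// Randomize each attempt within +/- 30% of the base delay.
		factor := 1 + jitter*(2*rand.Float64()-1)
		wait := time.Duration(float64(delay) * factor)
		fmt.Printf("attempt %2d: ~%v\n", attempt, wait.Round(time.Millisecond))

		// Grow the base delay geometrically, capped at max_delay.
		delay = time.Duration(float64(delay) * multiplier)
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}
```

With these settings the base delay grows 500ms, 750ms, 1.125s, ... and hits the 30s cap after roughly eleven attempts, so a down peer is retried at least every 30s indefinitely.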
## Best Practices

- **Minimum two paths**: Always have redundant routes
- **Geographic diversity**: Spread agents across regions
- **Independent failure domains**: Different networks, power, etc.
- **Monitor everything**: Alert before users notice
- **Test failover**: Regularly test by killing components
- **Document topology**: Know what depends on what
## Testing Failover

### Manual Testing

```bash
# Kill a transit agent
docker stop transit-primary

# Verify traffic still flows
curl -x socks5://localhost:1080 https://example.com

# Check routes updated
curl http://localhost:8080/healthz | jq '.routes'

# Bring the transit back
docker start transit-primary

# Verify the primary route is restored
sleep 120 # Wait for advertisement
curl http://localhost:8080/healthz | jq '.routes'
```
### Chaos Testing

Use the built-in chaos package for automated failover testing:

```go
// internal/chaos provides fault injection
chaosMonkey := chaos.NewChaosMonkey(agent)
chaosMonkey.InjectNetworkDelay(100 * time.Millisecond) // add latency to peer links
chaosMonkey.DisconnectRandomPeer()                     // force an unplanned failover
```
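Pair the injected faults with the manual checks above (a SOCKS request through the tunnel plus a `/healthz` inspection) to confirm the mesh recovers end to end.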
## Next Steps

- **Monitoring**: Set up monitoring
- **Troubleshooting**: Debug connectivity
- **Deployment Scenarios**: More deployment patterns