High Availability

Design patterns for fault-tolerant Muti Metroo deployments.

Overview

High availability in Muti Metroo is achieved through:

  1. Redundant paths: Multiple routes to destinations
  2. Automatic failover: Route selection based on availability
  3. Reconnection: Automatic peer reconnection with backoff
  4. Load distribution: Multiple exit points

Pattern 1: Redundant Transit

Multiple transit paths between ingress and exit.

Architecture

Configuration

Ingress Agent:

agent:
  display_name: "Ingress"

peers:
  # Primary transit
  - id: "${PRIMARY_TRANSIT_ID}"
    transport: quic
    address: "transit-primary.example.com:4433"
    tls:
      ca: "./certs/ca.crt"

  # Backup transit
  - id: "${BACKUP_TRANSIT_ID}"
    transport: quic
    address: "transit-backup.example.com:4433"
    tls:
      ca: "./certs/ca.crt"

socks5:
  enabled: true
  address: "127.0.0.1:1080"

Both transits connect to the exit, which advertises routes. The ingress learns the same routes via both paths and prefers the one with the lower hop count.
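You can confirm that routes have been learned over both paths by inspecting the ingress health endpoint, as in the monitoring examples later in this page. A minimal sketch; the exact JSON shape of the route entries may vary by version:

# List routes as seen by the ingress; while both transits are up,
# entries should reflect both paths
curl -s http://localhost:8080/healthz | jq '.routes'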

Failover Behavior

  1. Primary transit fails
  2. Routes via the primary expire (TTL, default 5m)
  3. Only the backup route remains
  4. Traffic flows through the backup
  5. Primary recovers and re-advertises its routes
  6. Traffic returns to the primary (lower metric)
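You can watch this sequence live by polling the health endpoint while stopping and restarting the primary transit. A rough sketch, assuming the ingress health endpoint is on localhost:8080 as in the examples below:

# Poll the route count every 5 seconds during a failover test
while true; do
  routes=$(curl -s http://localhost:8080/healthz | jq '.routes | length')
  echo "$(date +%T) routes=$routes"
  sleep 5
done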

Faster Failover

Reduce the route TTL for faster detection:

routing:
  route_ttl: 1m            # 1 minute TTL
  advertise_interval: 30s  # Advertise every 30s

Shorter TTLs detect failures sooner at the cost of more advertisement traffic; keep route_ttl comfortably above advertise_interval so healthy routes are refreshed before they expire.

Pattern 2: Multiple Exits

Multiple exit points for the same routes.

Architecture

Configuration

Exit A:

agent:
  display_name: "Exit A (DC East)"

exit:
  enabled: true
  routes:
    - "10.0.0.0/8"
  dns:
    servers:
      - "10.0.0.1:53"

Exit B:

agent:
  display_name: "Exit B (DC West)"

exit:
  enabled: true
  routes:
    - "10.0.0.0/8"
  dns:
    servers:
      - "10.0.0.1:53"

The ingress learns the same route from both exits and uses the one with the lower metric. If one exit disappears, its route expires and traffic shifts to the other.
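A quick way to exercise this redundancy is to fetch through the tunnel while one exit is down. A sketch only: the container names exit-a/exit-b and the internal host 10.0.0.15 are assumptions for illustration:

# Traffic to 10.0.0.0/8 should keep flowing with either exit stopped
docker stop exit-a
curl -x socks5://localhost:1080 http://10.0.0.15/   # hypothetical internal host
docker start exit-a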

Pattern 3: Geographic Redundancy

Agents in multiple regions for disaster recovery.

Architecture

Cross-Region Peering

# Agent A1 (Region A)
agent:
  display_name: "Region A Primary"

peers:
  # Local exit
  - id: "${EXIT_A_ID}"
    transport: quic
    address: "exit-a.region-a.internal:4433"

  # Cross-region peer
  - id: "${AGENT_B1_ID}"
    transport: quic
    address: "agent-b1.region-b.example.com:4433"

Pattern 4: Active-Active Ingress

Multiple ingress points behind a load balancer.

Architecture

DNS Round-Robin

proxy.example.com.  300  IN  A  192.168.1.10  # Ingress 1
proxy.example.com.  300  IN  A  192.168.1.11  # Ingress 2
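Verify that both records are being served:

# Both ingress addresses should be returned (order rotates per query)
dig +short proxy.example.com A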

L4 Load Balancer

HAProxy example:

frontend socks5
    bind *:1080
    mode tcp
    default_backend socks5_backends

backend socks5_backends
    mode tcp
    balance roundrobin
    server ingress1 192.168.1.10:1080 check
    server ingress2 192.168.1.11:1080 check
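To exercise the load balancer end to end, send a few requests through it; this sketch assumes HAProxy is reachable at proxy.example.com:1080 as configured above:

# Each request should succeed no matter which ingress is selected
for i in 1 2 3; do
  curl -s -o /dev/null -w "%{http_code}\n" \
    -x socks5://proxy.example.com:1080 https://example.com
done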

Monitoring for HA

Health Checks

# Check each agent
curl http://ingress1:8080/health
curl http://ingress2:8080/health
curl http://exit:8080/health

# Check peer connectivity
curl http://ingress1:8080/healthz | jq '.peers'

Key Metrics

Monitor these for HA:

Metric                    Alert Condition
peers_connected           < expected count
routes_total              < expected count
peer_disconnects_total    Spike in rate
stream_errors_total       Spike in rate
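For the rate-based rows, you can sanity-check the counters via the Prometheus query API before wiring up alerts. A sketch; the host prometheus:9090 and the muti_metroo_ metric prefix (taken from the alert rules below) are assumptions:

# Per-second peer-disconnect rate over the last 5 minutes
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=rate(muti_metroo_peer_disconnects_total[5m])'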

Prometheus Alert Rules

groups:
  - name: muti-metroo
    rules:
      - alert: PeerDisconnected
        expr: muti_metroo_peers_connected < 2
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Muti Metroo has fewer than 2 peers connected"

      - alert: NoRoutes
        expr: muti_metroo_routes_total == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Muti Metroo has no routes in routing table"

Reconnection Tuning

Configure aggressive reconnection for faster recovery:

connections:
  reconnect:
    initial_delay: 500ms  # Start fast
    max_delay: 30s        # Cap at 30s
    multiplier: 1.5       # Slower backoff
    jitter: 0.3           # 30% jitter
    max_retries: 0        # Never give up
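To see the retry schedule this policy implies (jitter omitted), you can print the successive delays:

# Delay grows 1.5x per attempt and is capped at max_delay (30s)
awk 'BEGIN { d = 0.5; for (i = 1; i <= 12; i++) { if (d > 30) d = 30; printf "attempt %2d: %6.2fs\n", i, d; d *= 1.5 } }'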

Best Practices

  1. Minimum two paths: Always have redundant routes
  2. Geographic diversity: Spread agents across regions
  3. Independent failure domains: Different networks, power, etc.
  4. Monitor everything: Alerts before users notice
  5. Test failover: Regularly test by killing components
  6. Document topology: Know what depends on what

Testing Failover

Manual Testing

# Kill a transit agent
docker stop transit-primary

# Verify traffic still flows
curl -x socks5://localhost:1080 https://example.com

# Check routes updated
curl http://localhost:8080/healthz | jq '.routes'

# Bring transit back
docker start transit-primary

# Verify primary route restored
sleep 120 # Wait for advertisement
curl http://localhost:8080/healthz | jq '.routes'

Chaos Testing

Use the built-in chaos package for automated testing:

// internal/chaos provides fault injection for automated failover tests
chaosMonkey := chaos.NewChaosMonkey(agent)
chaosMonkey.InjectNetworkDelay(100 * time.Millisecond) // add artificial latency to peer links
chaosMonkey.DisconnectRandomPeer()                     // drop a random peer connection
