Health Checks and Monitoring

Continuous monitoring of Magic Transit tunnels and origin infrastructure detects failures quickly, which supports high availability and enables automatic failover.

Overview

Magic Transit uses multi-layered health checks to verify tunnel connectivity, origin availability, and service health. Failed checks trigger automatic failover and alerting.

Health Check Types

Tunnel Health Checks

ICMP Probes:

  • Sent from pfSense to Cloudflare tunnel endpoint

  • 1-second intervals

  • Monitors tunnel connectivity

Current Status (MT_GRE_1_TUNNELV4):

  • Latency: 9.308ms average

  • Jitter: 0.24ms standard deviation

  • Loss: 0.0%

  • Status: Online

Current Status (MT_GRE_1_TUNNELV6):

  • Latency: 9.343ms average

  • Jitter: 0.276ms standard deviation

  • Loss: 0.0%

  • Status: Online
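
The latency, jitter, and loss figures above can be derived from raw probe results. The following is a minimal sketch, assuming each probe yields either a round-trip time in milliseconds or no reply; the function name `summarize_probes` and the use of `None` for a lost probe are illustrative choices, not part of pfSense or Cloudflare.

```python
import statistics
from typing import Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize ICMP probe results.

    Each entry is a round-trip time in milliseconds, or None when the
    probe received no reply (hypothetical convention for this sketch).
    """
    replies = [r for r in rtts_ms if r is not None]
    sent = len(rtts_ms)
    loss_pct = 100.0 * (sent - len(replies)) / sent if sent else 0.0
    if not replies:
        # No replies at all: latency and jitter are undefined.
        return {"latency_ms": None, "jitter_ms": None, "loss_pct": loss_pct}
    return {
        # Average round-trip latency over the probes that were answered.
        "latency_ms": statistics.fmean(replies),
        # Jitter reported as the standard deviation of the round-trip times.
        "jitter_ms": statistics.pstdev(replies),
        "loss_pct": loss_pct,
    }


summary = summarize_probes([9.0, 9.5, None, 9.1])
print(summary)
```

In this sample run one of four probes is lost, so the reported loss is 25%, while latency and jitter are computed only over the three probes that were answered.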

Gateway Health Checks

pfSense monitors multiple gateways for availability:

WAN Gateway (WAN_DHCP):

  • Monitor: 198.51.100.2 (example WAN monitor)

  • Source: 198.51.100.1 (pfSense example)

  • Latency: 1.3ms

  • Jitter: 0.364ms

  • Status: Online

Magic Transit Gateways:

  • IPv4 and IPv6 tunnel endpoints

  • Continuous monitoring

  • Automatic failover on failure

Origin Health Checks

Cloudflare monitors origin infrastructure health:

  • HTTP/HTTPS Checks: Web service availability

  • TCP Checks: Port connectivity

  • Custom Checks: Application-specific health

  • Frequency: Configurable (default: 60 seconds)

Monitoring Configuration

pfSense Gateway Monitoring

Gateway health checks are configured in pfSense with the following settings:

Parameters:

  • Probe Interval: 1 second

  • Loss Threshold: 20% (triggers warning)

  • Latency Threshold: 500ms (triggers warning)

  • Down Threshold: 10 consecutive failures

  • Alert Actions: Email, SNMP trap, log

Monitored Metrics:

  • Round-trip latency

  • Packet loss percentage

  • Standard deviation (jitter)

  • Gateway status (online/warning/down)
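
To illustrate how the parameters above map onto the three gateway states, here is a minimal sketch. It is an illustration of the documented thresholds only, not a description of how pfSense evaluates gateways internally; the class `GatewayState` and its methods are hypothetical names.

```python
from dataclasses import dataclass

# Thresholds taken from the parameters listed above.
LOSS_WARN_PCT = 20.0
LATENCY_WARN_MS = 500.0
DOWN_AFTER_FAILURES = 10


@dataclass
class GatewayState:
    """Tracks consecutive probe failures for one gateway (hypothetical helper)."""

    consecutive_failures: int = 0

    def record_probe(self, replied: bool) -> None:
        # Any successful reply resets the failure streak.
        self.consecutive_failures = 0 if replied else self.consecutive_failures + 1

    def status(self, latency_ms: float, loss_pct: float) -> str:
        if self.consecutive_failures >= DOWN_AFTER_FAILURES:
            return "down"
        if loss_pct >= LOSS_WARN_PCT or latency_ms >= LATENCY_WARN_MS:
            return "warning"
        return "online"


gw = GatewayState()
print(gw.status(latency_ms=9.3, loss_pct=0.0))
for _ in range(DOWN_AFTER_FAILURES):
    gw.record_probe(replied=False)
print(gw.status(latency_ms=9.3, loss_pct=0.0))
```

The first call reports the gateway as online; after ten unanswered probes in a row, the same call reports it as down until a reply resets the streak.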

Cloudflare Health Checks

Cloudflare-side health checks are configured in the Cloudflare dashboard:

Tunnel Health:

  • Automatic bidirectional checks

  • Failure detection within seconds

  • Automatic route withdrawal on failure

Origin Health:

  • Customizable check intervals

  • Multiple check types (ICMP, TCP, HTTP)

  • Regional health monitoring

  • Custom headers and expected responses

Failover Behavior

Automatic Failover

When health checks detect a failure, the following actions occur automatically:

Tunnel Failure:

  1. pfSense marks gateway as down

  2. Traffic routes to backup gateway (if configured)

  3. Cloudflare stops forwarding to failed tunnel

  4. Alert notifications sent to administrators

Origin Failure:

  1. Cloudflare health checks detect failure

  2. Traffic diverted to backup origin (if configured)

  3. Error page served to users (if no backup)

  4. Origin marked unhealthy in dashboard
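
Both failover sequences above follow the same pattern: prefer the primary target, fall back to a backup if one is configured and healthy, and otherwise report that nothing is available. A minimal sketch of that selection logic follows; the function `select_target` and the backup name `BACKUP_GW` are hypothetical and used only for illustration.

```python
from typing import Mapping, Optional, Sequence


def select_target(
    priority_order: Sequence[str], healthy: Mapping[str, bool]
) -> Optional[str]:
    """Return the first healthy target in priority order, or None if all are down."""
    for name in priority_order:
        # A target with no recorded health state is treated as unhealthy.
        if healthy.get(name, False):
            return name
    return None


order = ["MT_GRE_1_TUNNELV4", "BACKUP_GW"]  # BACKUP_GW is a hypothetical backup
print(select_target(order, {"MT_GRE_1_TUNNELV4": False, "BACKUP_GW": True}))
print(select_target(order, {"MT_GRE_1_TUNNELV4": False, "BACKUP_GW": False}))
```

A return value of None corresponds to the "no backup" case, where the caller would raise an alert or serve an error page.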

Manual Failover

Administrators can manually trigger failover:

  • Disable gateway in pfSense

  • Disable tunnel in Cloudflare dashboard

  • Adjust traffic steering policies

  • Force maintenance mode

Alerting

Alert Triggers

Alerts are generated for:

  • Tunnel health degradation

  • Gateway status changes

  • High latency or packet loss

  • Repeated health check failures

  • Complete service outages

Alert Channels

Notifications are delivered via:

  • Email to administrators

  • SNMP traps to monitoring systems

  • pfSense system logs

  • Cloudflare dashboard notifications

  • PagerDuty/Slack integrations (if configured)

Performance Baselines

Expected Performance

Normal Operation:

  • Latency: < 15ms average

  • Jitter: < 1ms standard deviation

  • Packet Loss: 0.0%

  • Availability: 99.99%+

Warning Thresholds:

  • Latency: > 50ms

  • Jitter: > 5ms

  • Packet Loss: > 1%

Critical Thresholds:

  • Latency: > 100ms

  • Packet Loss: > 5%

  • Complete Outage: > 10s
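
The warning and critical thresholds above can be expressed as a small classification function. This is a sketch under one stated assumption: measurements that fall between the normal baseline and the warning threshold (for example, 30ms latency) are reported as "normal", because the list above does not define a separate level for them. The function name `classify` is illustrative.

```python
def classify(
    latency_ms: float, jitter_ms: float, loss_pct: float, outage_s: float = 0.0
) -> str:
    """Map measurements onto the severity levels listed above."""
    # Critical thresholds: latency > 100ms, loss > 5%, outage > 10s.
    if latency_ms > 100 or loss_pct > 5 or outage_s > 10:
        return "critical"
    # Warning thresholds: latency > 50ms, jitter > 5ms, loss > 1%.
    if latency_ms > 50 or jitter_ms > 5 or loss_pct > 1:
        return "warning"
    # Assumption: anything below the warning thresholds counts as normal.
    return "normal"


print(classify(latency_ms=9.308, jitter_ms=0.24, loss_pct=0.0))
print(classify(latency_ms=60.0, jitter_ms=0.3, loss_pct=0.0))
print(classify(latency_ms=9.0, jitter_ms=0.3, loss_pct=0.0, outage_s=12.0))
```

Critical conditions are checked first so that a measurement exceeding both a warning and a critical threshold is reported at the higher severity.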

Historical Performance

Maintaining historical metrics enables:

  • Trend analysis

  • Capacity planning

  • SLA compliance verification

  • Incident investigation
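
For SLA compliance verification, historical outage records can be turned into an availability figure and compared against the 99.99% baseline. The sketch below assumes outages are recorded as durations in seconds over a fixed reporting period; both function names are hypothetical.

```python
from typing import Iterable


def availability_pct(period_s: float, outage_durations_s: Iterable[float]) -> float:
    """Percentage of the period during which the service was available."""
    downtime = sum(outage_durations_s)
    return 100.0 * (period_s - downtime) / period_s


def downtime_budget_s(period_s: float, target_pct: float) -> float:
    """Maximum downtime in seconds that still meets the availability target."""
    return period_s * (1 - target_pct / 100.0)


THIRTY_DAYS_S = 30 * 24 * 60 * 60  # 2,592,000 seconds

# A 99.99% target over 30 days allows roughly 259 seconds of downtime.
print(round(downtime_budget_s(THIRTY_DAYS_S, 99.99), 1))

# Two recorded outages of 120s and 60s (example values) stay within that budget.
print(availability_pct(THIRTY_DAYS_S, [120, 60]) >= 99.99)
```

Expressing the target as a downtime budget makes it easier to judge how much of the allowance a single incident consumed.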

Troubleshooting

High Latency

Possible Causes:

  • WAN circuit congestion

  • Upstream ISP issues

  • Cloudflare edge routing problems

  • Geographic distance to edge

Investigation Steps:

  1. Check WAN utilization

  2. Test latency to ISP gateway

  3. Review Cloudflare status page

  4. Contact Cloudflare support if the problem persists

Packet Loss

Possible Causes:

  • MTU mismatch

  • Network interface errors

  • WAN circuit problems

  • Firewall state table exhaustion

Investigation Steps:

  1. Check interface statistics for errors

  2. Verify MTU settings (1476 bytes on the GRE tunnel interface, assuming a 1500-byte WAN MTU)

  3. Review firewall logs for dropped packets

  4. Monitor state table utilization
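
The 1476-byte GRE MTU mentioned in the steps above follows from simple header arithmetic. The sketch below assumes a 1500-byte WAN MTU, GRE carried over IPv4, and a basic 4-byte GRE header with no key or sequence fields; the function names are illustrative.

```python
OUTER_IPV4_HEADER = 20  # bytes added by the outer IPv4 header
GRE_HEADER = 4          # basic GRE header, no optional key or sequence fields
TCP_HEADER = 20         # TCP header without options


def gre_tunnel_mtu(wan_mtu: int = 1500) -> int:
    """Largest inner packet that fits in one WAN packet after GRE encapsulation."""
    return wan_mtu - OUTER_IPV4_HEADER - GRE_HEADER


def tcp_mss(tunnel_mtu: int, inner_ip_header: int = 20) -> int:
    """Largest TCP payload per segment inside the tunnel (20-byte IPv4 or 40-byte IPv6 inner header)."""
    return tunnel_mtu - inner_ip_header - TCP_HEADER


mtu = gre_tunnel_mtu()
print(mtu)               # 1476
print(tcp_mss(mtu))      # 1436 for IPv4 inside the tunnel
print(tcp_mss(mtu, 40))  # 1416 for IPv6 inside the tunnel
```

If the WAN link itself has a smaller MTU (for example, a PPPoE circuit), the tunnel MTU and MSS shrink by the same amount, which is a common source of the MTU mismatch listed under possible causes.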

Tunnel Down

Possible Causes:

  • WAN connectivity loss

  • pfSense interface down

  • Cloudflare endpoint unreachable

  • Configuration error

Investigation Steps:

  1. Verify WAN interface status

  2. Check pfSense system logs

  3. Test connectivity to Cloudflare endpoint

  4. Review recent configuration changes

Best Practices

Regular Monitoring:

  • Review health check metrics daily

  • Investigate anomalies promptly

  • Maintain performance baselines

  • Document incidents and resolutions

Proactive Maintenance:

  • Test failover procedures quarterly

  • Update alert thresholds as needed

  • Review and optimize check intervals

  • Coordinate maintenance windows with Cloudflare

Documentation:

  • Document all health check configurations

  • Maintain runbooks for common issues

  • Record baseline performance metrics

  • Update procedures after incidents

Integration with Monitoring Tools

SNMP Integration:

  • Export pfSense metrics to monitoring system

  • Graph latency and packet loss trends

  • Correlate with other infrastructure metrics

Cloudflare API:

  • Programmatic access to health check data

  • Automated alerting and remediation

  • Integration with SIEM and ticketing systems
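
As a starting point for programmatic access, the sketch below builds a request that lists GRE tunnels together with their health check settings. The endpoint path, the bearer-token header, and the assumed response layout (`result.gre_tunnels`, each entry with `name` and `health_check`) are assumptions that should be verified against the current Cloudflare API reference before use; `ACCOUNT_ID` and `API_TOKEN` are placeholders, and both function names are hypothetical.

```python
import urllib.request

API_BASE = "https://api.cloudflare.com/client/v4"


def build_list_tunnels_request(account_id: str, api_token: str) -> urllib.request.Request:
    """Build (but do not send) a request listing the account's GRE tunnels."""
    # Assumed endpoint path; confirm against the Cloudflare API reference.
    url = f"{API_BASE}/accounts/{account_id}/magic/gre_tunnels"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {api_token}"},
        method="GET",
    )


def tunnel_health_settings(response: dict) -> dict:
    """Map tunnel name to its health_check settings (assumed response layout)."""
    tunnels = (response.get("result") or {}).get("gre_tunnels", [])
    return {t.get("name"): t.get("health_check") for t in tunnels}


req = build_list_tunnels_request("ACCOUNT_ID", "API_TOKEN")
print(req.get_method(), req.full_url)
```

The request can be sent with `urllib.request.urlopen(req)` and the body parsed with `json.load` before passing it to `tunnel_health_settings`. Keeping request construction separate from sending makes the code easy to test without network access.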

Custom Scripts:

  • Automated health check validation

  • Performance reporting

  • Capacity planning analysis