Health Checks and Monitoring
Continuous monitoring of Magic Transit tunnels and origin infrastructure supports high availability and enables automatic failover.
Overview
Magic Transit uses multi-layered health checks to verify tunnel connectivity, origin availability, and service health. Failed checks trigger automatic failover and alerting.
Health Check Types
Tunnel Health Checks
ICMP Probes:
Sent from pfSense to the Cloudflare tunnel endpoint
Sent at 1-second intervals
Monitor tunnel connectivity
Current Status (MT_GRE_1_TUNNELV4):
Latency: 9.308ms average
Jitter: 0.24ms standard deviation
Loss: 0.0%
Status: Online
Current Status (MT_GRE_1_TUNNELV6):
Latency: 9.343ms average
Jitter: 0.276ms standard deviation
Loss: 0.0%
Status: Online
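The latency, jitter, and loss figures above can be derived from raw probe results. The following is a minimal Python sketch, not pfSense's actual implementation. The function name summarize_probes and the convention that None marks a lost probe are assumptions made for illustration, and jitter is computed here as the population standard deviation of the round-trip times.

```python
import statistics
from typing import Dict, Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> Dict[str, Optional[float]]:
    """Summarize one window of ICMP probe results.

    Each entry is a round-trip time in milliseconds, or None for a probe
    that received no reply (a hypothetical convention for this sketch).
    """
    replies = [rtt for rtt in rtts_ms if rtt is not None]
    sent = len(rtts_ms)
    lost = sent - len(replies)
    return {
        # Average round-trip time over answered probes only.
        "latency_ms": statistics.fmean(replies) if replies else None,
        # Jitter as the standard deviation of the answered probes.
        "jitter_ms": statistics.pstdev(replies) if replies else None,
        # Share of probes that got no reply.
        "loss_pct": 100.0 * lost / sent if sent else 0.0,
    }
```

For example, summarize_probes([9.0, 9.5, None, 10.0]) reports an average latency of 9.5 ms and 25% loss, because one of the four probes went unanswered.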
Gateway Health Checks
pfSense monitors multiple gateways for availability:
WAN Gateway (WAN_DHCP):
Monitor: 198.51.100.2 (example WAN monitor)
Source: 198.51.100.1 (pfSense example)
Latency: 1.3ms
Jitter: 0.364ms
Status: Online
Magic Transit Gateways:
IPv4 and IPv6 tunnel endpoints
Continuous monitoring
Automatic failover on failure
Origin Health Checks
Cloudflare monitors origin infrastructure health:
HTTP/HTTPS Checks: Web service availability
TCP Checks: Port connectivity
Custom Checks: Application-specific health
Frequency: Configurable (default: 60 seconds)
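To illustrate what a TCP check verifies, here is a minimal Python sketch that can be run from any host to confirm port connectivity to an origin. It is not Cloudflare's implementation. The function name tcp_check and the 2-second default timeout are assumptions chosen for this example.

```python
import socket


def tcp_check(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A call such as tcp_check("192.0.2.10", 443) (a documentation placeholder address) returns False when the port is closed or the host does not answer in time.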
Monitoring Configuration
pfSense Gateway Monitoring
Gateway health checks are configured in pfSense:
Parameters:
Probe Interval: 1 second
Loss Threshold: 20% (triggers warning)
Latency Threshold: 500ms (triggers warning)
Down Threshold: 10 consecutive failures
Alert Actions: Email, SNMP trap, log
Monitored Metrics:
Round-trip latency
Packet loss percentage
Standard deviation (jitter)
Gateway status (online/warning/down)
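The way these parameters map to a gateway status can be sketched in Python. This only mirrors the thresholds listed above and is not how pfSense evaluates gateways internally. The names GatewayThresholds and gateway_status are hypothetical, and treating each threshold as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatewayThresholds:
    loss_warning_pct: float = 20.0      # Loss Threshold from the list above
    latency_warning_ms: float = 500.0   # Latency Threshold from the list above
    down_after_failures: int = 10       # Down Threshold (consecutive failures)


def gateway_status(
    latency_ms: float,
    loss_pct: float,
    consecutive_failures: int,
    thresholds: GatewayThresholds = GatewayThresholds(),
) -> str:
    """Map the monitored metrics to online, warning, or down."""
    if consecutive_failures >= thresholds.down_after_failures:
        return "down"
    if (
        loss_pct >= thresholds.loss_warning_pct
        or latency_ms >= thresholds.latency_warning_ms
    ):
        return "warning"
    return "online"
```

The down condition is checked first so that a gateway with ten consecutive failed probes is never reported as merely degraded.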
Cloudflare Health Checks
Configured in the Cloudflare dashboard:
Tunnel Health:
Automatic bidirectional checks
Failure detection within seconds
Automatic route withdrawal on failure
Origin Health:
Customizable check intervals
Multiple check types (ICMP, TCP, HTTP)
Regional health monitoring
Custom headers and expected responses
Failover Behavior
Automatic Failover
When health checks detect a failure, the following actions occur automatically:
Tunnel Failure:
pfSense marks gateway as down
Traffic routes to backup gateway (if configured)
Cloudflare stops forwarding to failed tunnel
Alert notifications sent to administrators
Origin Failure:
Cloudflare health checks detect failure
Traffic diverted to backup origin (if configured)
Error page served to users (if no backup origin is configured)
Origin marked unhealthy in dashboard
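The tunnel failover steps above amount to choosing the most preferred gateway that is not down. The Python sketch below is loosely modeled on tiered gateway groups and is for illustration only. The Gateway type, the pick_gateway function, and the backup gateway name MT_GRE_2_TUNNELV4 are hypothetical and do not come from the actual configuration.

```python
from typing import List, NamedTuple, Optional


class Gateway(NamedTuple):
    name: str
    tier: int     # lower tier is preferred
    status: str   # "online", "warning", or "down"


def pick_gateway(gateways: List[Gateway]) -> Optional[str]:
    """Return the name of the lowest-tier gateway that is not down."""
    usable = [gw for gw in gateways if gw.status != "down"]
    if not usable:
        return None  # no backup available; traffic cannot be rerouted
    return min(usable, key=lambda gw: gw.tier).name


# Example: the primary tunnel is down, so the hypothetical backup is chosen.
gateways = [
    Gateway("MT_GRE_1_TUNNELV4", tier=1, status="down"),
    Gateway("MT_GRE_2_TUNNELV4", tier=2, status="online"),
]
selected = pick_gateway(gateways)
```

Returning None when every gateway is down makes the "if configured" caveat explicit: without a healthy backup there is nothing to fail over to.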
Manual Failover
Administrators can manually trigger failover:
Disable gateway in pfSense
Disable tunnel in Cloudflare dashboard
Adjust traffic steering policies
Force maintenance mode
Alerting
Alert Triggers
Alerts are generated for:
Tunnel health degradation
Gateway status changes
High latency or packet loss
Repeated health check failures
Complete service outages
Alert Channels
Notifications are delivered via:
Email to administrators
SNMP traps to monitoring systems
pfSense system logs
Cloudflare dashboard notifications
PagerDuty/Slack integrations (if configured)
Performance Baselines
Expected Performance
Normal Operation:
Latency: < 15ms average
Jitter: < 1ms standard deviation
Packet Loss: 0.0%
Availability: 99.99%+
Warning Thresholds:
Latency: > 50ms
Jitter: > 5ms
Packet Loss: > 1%
Critical Thresholds:
Latency: > 100ms
Packet Loss: > 5%
Complete Outage: > 10s
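The warning and critical thresholds above can be applied to a measurement with a few comparisons. The following Python sketch uses a hypothetical function name, classify_sample. It treats any sample below the warning thresholds as normal, which is an assumption, since the list above does not say how latency between 15ms and 50ms should be labeled.

```python
def classify_sample(
    latency_ms: float,
    jitter_ms: float,
    loss_pct: float,
    outage_s: float = 0.0,
) -> str:
    """Classify one measurement against the documented thresholds."""
    # Critical: latency > 100ms, packet loss > 5%, or outage > 10s.
    if latency_ms > 100 or loss_pct > 5 or outage_s > 10:
        return "critical"
    # Warning: latency > 50ms, jitter > 5ms, or packet loss > 1%.
    if latency_ms > 50 or jitter_ms > 5 or loss_pct > 1:
        return "warning"
    return "normal"
```

With the tunnel figures reported earlier, classify_sample(9.308, 0.24, 0.0) falls in the normal range.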
Historical Performance
Maintaining historical metrics enables:
Trend analysis
Capacity planning
SLA compliance verification
Incident investigation
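For SLA compliance verification, an availability target translates into a downtime budget for a reporting period. The Python sketch below shows the arithmetic. The function names and the 30-day default period are assumptions chosen for illustration.

```python
SECONDS_PER_DAY = 86_400


def allowed_downtime_seconds(availability_pct: float, period_days: int = 30) -> float:
    """Downtime budget implied by an availability target over a period."""
    return period_days * SECONDS_PER_DAY * (1 - availability_pct / 100)


def measured_availability_pct(outage_seconds: float, period_days: int = 30) -> float:
    """Availability achieved given the total recorded outage time."""
    total = period_days * SECONDS_PER_DAY
    return 100.0 * (total - outage_seconds) / total
```

A 99.99% target over 30 days allows roughly 259 seconds (about 4.3 minutes) of downtime, so recorded outages can be summed from historical metrics and compared against that budget.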
Troubleshooting
High Latency
Possible Causes:
WAN circuit congestion
Upstream ISP issues
Cloudflare edge routing problems
Geographic distance to edge
Investigation Steps:
Check WAN utilization
Test latency to ISP gateway
Review Cloudflare status page
Contact Cloudflare support if the problem persists
Packet Loss
Possible Causes:
MTU mismatch
Network interface errors
WAN circuit problems
Firewall state table exhaustion
Investigation Steps:
Check interface statistics for errors
Verify MTU settings (1476 bytes for GRE over a 1500-byte WAN link)
Review firewall logs for dropped packets
Monitor state table utilization
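The 1476 figure for GRE follows from the encapsulation overhead. The Python sketch below works through the arithmetic, assuming a 1500-byte WAN MTU, an IPv4 outer header, and a base GRE header with no optional key or checksum fields. The function names are illustrative.

```python
ETHERNET_MTU = 1500       # assumed WAN link MTU in bytes
OUTER_IPV4_HEADER = 20    # outer IPv4 header added by the tunnel
GRE_BASE_HEADER = 4       # GRE header without optional fields
TCP_HEADER = 20           # TCP header without options


def gre_tunnel_mtu(link_mtu: int = ETHERNET_MTU) -> int:
    """Largest inner packet that fits in one outer packet."""
    return link_mtu - OUTER_IPV4_HEADER - GRE_BASE_HEADER


def tcp_mss(tunnel_mtu: int, inner_ipv6: bool = False) -> int:
    """Largest TCP payload that fits inside the tunnel MTU."""
    inner_ip_header = 40 if inner_ipv6 else 20
    return tunnel_mtu - inner_ip_header - TCP_HEADER


print(gre_tunnel_mtu())                   # 1476
print(tcp_mss(gre_tunnel_mtu()))          # 1436
```

If the WAN link MTU is smaller than 1500 bytes, for example on a PPPoE circuit, the tunnel MTU shrinks by the same amount, which is one way an MTU mismatch can cause packet loss.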
Tunnel Down
Possible Causes:
WAN connectivity loss
pfSense interface down
Cloudflare endpoint unreachable
Configuration error
Investigation Steps:
Verify WAN interface status
Check pfSense system logs
Test connectivity to Cloudflare endpoint
Review recent configuration changes
Best Practices
Regular Monitoring:
Review health check metrics daily
Investigate anomalies promptly
Maintain performance baselines
Document incidents and resolutions
Proactive Maintenance:
Test failover procedures quarterly
Update alert thresholds as needed
Review and optimize check intervals
Coordinate maintenance windows with Cloudflare
Documentation:
Document all health check configurations
Maintain runbooks for common issues
Record baseline performance metrics
Update procedures after incidents
Integration with Monitoring Tools
SNMP Integration:
Export pfSense metrics to monitoring system
Graph latency and packet loss trends
Correlate with other infrastructure metrics
Cloudflare API:
Programmatic access to health check data
Automated alerting and remediation
Integration with SIEM and ticketing systems
Custom Scripts:
Automated health check validation
Performance reporting
Capacity planning analysis