Health Checks and Monitoring
=============================

Continuous monitoring of Magic Transit tunnels and origin infrastructure ensures high availability and automatic failover.

Overview
--------

Magic Transit uses multi-layered health checks to verify tunnel connectivity, origin availability, and service health. Failed checks trigger automatic failover and alerting.

Health Check Types
------------------

Tunnel Health Checks
~~~~~~~~~~~~~~~~~~~~

**ICMP Probes**:

- Sent from pfSense to the Cloudflare tunnel endpoint
- 1-second intervals
- Monitor tunnel connectivity

**Current Status** (MT_GRE_1_TUNNELV4):

- **Latency**: 9.308ms average
- **Jitter**: 0.24ms standard deviation
- **Loss**: 0.0%
- **Status**: Online

**Current Status** (MT_GRE_1_TUNNELV6):

- **Latency**: 9.343ms average
- **Jitter**: 0.276ms standard deviation
- **Loss**: 0.0%
- **Status**: Online

Gateway Health Checks
~~~~~~~~~~~~~~~~~~~~~

pfSense monitors multiple gateways for availability:

**WAN Gateway** (WAN_DHCP):

- **Monitor**: 198.51.100.2 (example WAN monitor)
- **Source**: 198.51.100.1 (pfSense example)
- **Latency**: 1.3ms
- **Jitter**: 0.364ms
- **Status**: Online

**Magic Transit Gateways**:

- IPv4 and IPv6 tunnel endpoints
- Continuous monitoring
- Automatic failover on failure

Origin Health Checks
~~~~~~~~~~~~~~~~~~~~

Cloudflare monitors origin infrastructure health:

- **HTTP/HTTPS Checks**: Web service availability
- **TCP Checks**: Port connectivity
- **Custom Checks**: Application-specific health
- **Frequency**: Configurable (default: 60 seconds)

Monitoring Configuration
------------------------

pfSense Gateway Monitoring
~~~~~~~~~~~~~~~~~~~~~~~~~~

Gateway health checks are configured in pfSense:

**Parameters**:

- **Probe Interval**: 1 second
- **Loss Threshold**: 20% (triggers warning)
- **Latency Threshold**: 500ms (triggers warning)
- **Down Threshold**: 10 consecutive failures
- **Alert Actions**: Email, SNMP trap, log

**Monitored Metrics**:

- Round-trip latency
- Packet loss percentage
- Standard deviation (jitter)
- Gateway status (online/warning/down)

Cloudflare Health Checks
~~~~~~~~~~~~~~~~~~~~~~~~~

Configured in the Cloudflare dashboard:

**Tunnel Health**:

- Automatic bidirectional checks
- Failure detection within seconds
- Automatic route withdrawal on failure

**Origin Health**:

- Customizable check intervals
- Multiple check types (ICMP, TCP, HTTP)
- Regional health monitoring
- Custom headers and expected responses

Failover Behavior
-----------------

Automatic Failover
~~~~~~~~~~~~~~~~~~

When health checks detect a failure, the following actions occur automatically:

**Tunnel Failure**:

1. pfSense marks the gateway as down
2. Traffic routes to the backup gateway (if configured)
3. Cloudflare stops forwarding to the failed tunnel
4. Alert notifications are sent to administrators

**Origin Failure**:

1. Cloudflare health checks detect the failure
2. Traffic is diverted to the backup origin (if configured)
3. An error page is served to users (if no backup)
4. The origin is marked unhealthy in the dashboard

Manual Failover
~~~~~~~~~~~~~~~

Administrators can manually trigger failover:

- Disable the gateway in pfSense
- Disable the tunnel in the Cloudflare dashboard
- Adjust traffic steering policies
- Force maintenance mode

Alerting
--------

Alert Triggers
~~~~~~~~~~~~~~

Alerts are generated for:

- Tunnel health degradation
- Gateway status changes
- High latency or packet loss
- Repeated health check failures
- Complete service outages

Alert Channels
~~~~~~~~~~~~~~

Notifications are delivered via:

- Email to administrators
- SNMP traps to monitoring systems
- pfSense system logs
- Cloudflare dashboard notifications
- PagerDuty/Slack integrations (if configured)

Performance Baselines
---------------------

Expected Performance
~~~~~~~~~~~~~~~~~~~~

**Normal Operation**:

- **Latency**: < 15ms average
- **Jitter**: < 1ms standard deviation
- **Packet Loss**: 0.0%
- **Availability**: 99.99%+

**Warning Thresholds**:

- **Latency**: > 50ms
- **Jitter**: > 5ms
- **Packet Loss**: > 1%

**Critical Thresholds**:

- **Latency**: > 100ms
- **Packet Loss**: > 5%
- **Complete Outage**: > 10s

Historical Performance
~~~~~~~~~~~~~~~~~~~~~~

Maintaining historical metrics enables:

- Trend analysis
- Capacity planning
- SLA compliance verification
- Incident investigation

Troubleshooting
---------------

High Latency
~~~~~~~~~~~~

**Possible Causes**:

- WAN circuit congestion
- Upstream ISP issues
- Cloudflare edge routing problems
- Geographic distance to edge

**Investigation Steps**:

1. Check WAN utilization
2. Test latency to the ISP gateway
3. Review the Cloudflare status page
4. Contact Cloudflare support if the problem persists

Packet Loss
~~~~~~~~~~~

**Possible Causes**:

- MTU mismatch
- Network interface errors
- WAN circuit problems
- Firewall state table exhaustion

**Investigation Steps**:

1. Check interface statistics for errors
2. Verify MTU settings (1476 for GRE)
3. Review firewall logs for dropped packets
4. Monitor state table utilization

Tunnel Down
~~~~~~~~~~~

**Possible Causes**:

- WAN connectivity loss
- pfSense interface down
- Cloudflare endpoint unreachable
- Configuration error

**Investigation Steps**:

1. Verify WAN interface status
2. Check pfSense system logs
3. Test connectivity to the Cloudflare endpoint
4. Review recent configuration changes

Best Practices
--------------

**Regular Monitoring**:

- Review health check metrics daily
- Investigate anomalies promptly
- Maintain performance baselines
- Document incidents and resolutions

**Proactive Maintenance**:

- Test failover procedures quarterly
- Update alert thresholds as needed
- Review and optimize check intervals
- Coordinate maintenance windows with Cloudflare

**Documentation**:

- Document all health check configurations
- Maintain runbooks for common issues
- Record baseline performance metrics
- Update procedures after incidents

Integration with Monitoring Tools
----------------------------------

**SNMP Integration**:

- Export pfSense metrics to the monitoring system
- Graph latency and packet loss trends
- Correlate with other infrastructure metrics

**Cloudflare API**:

- Programmatic access to health check data
- Automated alerting and remediation
- Integration with SIEM and ticketing systems

**Custom Scripts**:

- Automated health check validation
- Performance reporting
- Capacity planning analysis
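
As one illustration of automated health check validation, the sketch below classifies a gateway measurement against the warning and critical thresholds listed under Performance Baselines. It is a minimal example under stated assumptions, not part of pfSense or the Cloudflare API: the ``GatewaySample`` container and the ``classify`` function are hypothetical names, and collecting the measurements (for example via SNMP) is left out.

```python
from dataclasses import dataclass

# Thresholds copied from the Performance Baselines section of this page.
WARN_LATENCY_MS = 50.0
WARN_JITTER_MS = 5.0
WARN_LOSS_PCT = 1.0
CRIT_LATENCY_MS = 100.0
CRIT_LOSS_PCT = 5.0


@dataclass(frozen=True)
class GatewaySample:
    """One gateway measurement (hypothetical container, not a pfSense type)."""
    name: str
    latency_ms: float
    jitter_ms: float
    loss_pct: float


def classify(sample: GatewaySample) -> str:
    """Return "critical", "warning", or "normal" for a single sample."""
    # Critical limits are checked first so they take precedence over warnings.
    if sample.latency_ms > CRIT_LATENCY_MS or sample.loss_pct > CRIT_LOSS_PCT:
        return "critical"
    if (sample.latency_ms > WARN_LATENCY_MS
            or sample.jitter_ms > WARN_JITTER_MS
            or sample.loss_pct > WARN_LOSS_PCT):
        return "warning"
    return "normal"


if __name__ == "__main__":
    # Values taken from the current-status figures reported on this page.
    samples = [
        GatewaySample("MT_GRE_1_TUNNELV4", 9.308, 0.24, 0.0),
        GatewaySample("MT_GRE_1_TUNNELV6", 9.343, 0.276, 0.0),
        GatewaySample("WAN_DHCP", 1.3, 0.364, 0.0),
    ]
    for s in samples:
        print(f"{s.name}: {classify(s)}")
```

The comparisons are strict, matching the "> 50ms" style of the threshold lists, so a sample sitting exactly on a threshold stays in the lower severity band. The "Complete Outage: > 10s" criterion needs a time series rather than a single sample and is deliberately not modelled here.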