Health Checks and Monitoring
=============================

Continuous monitoring of Magic Transit tunnels and origin infrastructure supports high availability and drives automatic failover.

Overview
--------

Magic Transit uses multi-layered health checks to verify tunnel connectivity, origin availability, and service health. Failed checks trigger automatic failover and alerting.

Health Check Types
------------------

Tunnel Health Checks
~~~~~~~~~~~~~~~~~~~~

**ICMP Probes**:

- Sent from pfSense to the Cloudflare tunnel endpoint
- 1-second intervals
- Monitors tunnel connectivity

**Current Status** (MT_GRE_1_TUNNELV4):

- **Latency**: 9.308ms average
- **Jitter**: 0.24ms standard deviation
- **Loss**: 0.0%
- **Status**: Online

**Current Status** (MT_GRE_1_TUNNELV6):

- **Latency**: 9.343ms average
- **Jitter**: 0.276ms standard deviation
- **Loss**: 0.0%
- **Status**: Online
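
As a rough illustration of how figures like these relate to raw probe results, the sketch below summarizes a list of round-trip times. It assumes jitter is reported as the standard deviation of the replies and that a lost probe is recorded as ``None``. The function name ``summarize_probes`` is hypothetical and is not part of pfSense.

.. code-block:: python

   import statistics


   def summarize_probes(rtts_ms):
       """Summarize ICMP probe results.

       rtts_ms holds one entry per probe: the round-trip time in
       milliseconds, or None if no reply was received.
       """
       if not rtts_ms:
           raise ValueError("need at least one probe result")
       replies = [r for r in rtts_ms if r is not None]
       lost = len(rtts_ms) - len(replies)
       return {
           # Average latency over the probes that were answered.
           "latency_ms": statistics.fmean(replies) if replies else None,
           # Jitter as the population standard deviation of the replies.
           "jitter_ms": statistics.pstdev(replies) if replies else None,
           # Share of probes that received no reply.
           "loss_pct": 100.0 * lost / len(rtts_ms),
       }

With 1-second probe intervals, a 60-entry list corresponds to roughly one minute of monitoring.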

Gateway Health Checks
~~~~~~~~~~~~~~~~~~~~~

pfSense monitors multiple gateways for availability:

**WAN Gateway** (WAN_DHCP):

- **Monitor**: 198.51.100.2 (example WAN monitor)
- **Source**: 198.51.100.1 (pfSense example)
- **Latency**: 1.3ms
- **Jitter**: 0.364ms
- **Status**: Online

**Magic Transit Gateways**:

- IPv4 and IPv6 tunnel endpoints
- Continuous monitoring
- Automatic failover on failure

Origin Health Checks
~~~~~~~~~~~~~~~~~~~~

Cloudflare monitors origin infrastructure health:

- **HTTP/HTTPS Checks**: Web service availability
- **TCP Checks**: Port connectivity
- **Custom Checks**: Application-specific health
- **Frequency**: Configurable (default: 60 seconds)
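
To make the idea of a TCP check concrete, here is a minimal sketch of a port connectivity test that can be run from any host with Python. It only illustrates what such a check verifies; it is not how Cloudflare implements its checks, and the function name ``tcp_check`` is hypothetical.

.. code-block:: python

   import socket


   def tcp_check(host, port, timeout=3.0):
       """Return True if a TCP connection to host:port succeeds in time."""
       try:
           # create_connection resolves the host and completes the handshake.
           with socket.create_connection((host, port), timeout=timeout):
               return True
       except OSError:
           # Covers refused connections, timeouts, and resolution failures.
           return False

A call such as ``tcp_check("192.0.2.10", 443)`` (a documentation address used here as a placeholder) reports whether the port accepts connections, which says nothing about application-level health.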

Monitoring Configuration
------------------------

pfSense Gateway Monitoring
~~~~~~~~~~~~~~~~~~~~~~~~~~

Gateway health checks are configured in pfSense:

**Parameters**:

- **Probe Interval**: 1 second
- **Loss Threshold**: 20% (triggers warning)
- **Latency Threshold**: 500ms (triggers warning)
- **Down Threshold**: 10 consecutive failures
- **Alert Actions**: Email, SNMP trap, log

**Monitored Metrics**:

- Round-trip latency
- Packet loss percentage
- Standard deviation (jitter)
- Gateway status (online/warning/down)
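
The interaction between these parameters can be sketched as follows. This is a simplified model of the values listed above, not pfSense's actual monitoring daemon logic; the class ``GatewayMonitor`` and its method names are hypothetical.

.. code-block:: python

   from dataclasses import dataclass


   @dataclass
   class GatewayMonitor:
       """Simplified gateway state model using the thresholds listed above."""

       loss_warn_pct: float = 20.0     # loss threshold (warning)
       latency_warn_ms: float = 500.0  # latency threshold (warning)
       down_after: int = 10            # consecutive failures before "down"
       consecutive_failures: int = 0

       def record(self, reply_received: bool) -> None:
           """Update the failure streak with the outcome of one probe."""
           if reply_received:
               self.consecutive_failures = 0
           else:
               self.consecutive_failures += 1

       def status(self, avg_latency_ms: float, loss_pct: float) -> str:
           """Return "online", "warning", or "down"."""
           if self.consecutive_failures >= self.down_after:
               return "down"
           if (loss_pct >= self.loss_warn_pct
                   or avg_latency_ms >= self.latency_warn_ms):
               return "warning"
           return "online"

With a 1-second probe interval, ten consecutive failures correspond to roughly ten seconds before the gateway is treated as down in this model.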

Cloudflare Health Checks
~~~~~~~~~~~~~~~~~~~~~~~~~

Cloudflare-side health checks are configured in the Cloudflare dashboard:

**Tunnel Health**:

- Automatic bidirectional checks
- Failure detection within seconds
- Automatic route withdrawal on failure

**Origin Health**:

- Customizable check intervals
- Multiple check types (ICMP, TCP, HTTP)
- Regional health monitoring
- Custom headers and expected responses

Failover Behavior
-----------------

Automatic Failover
~~~~~~~~~~~~~~~~~~

When health checks detect a failure, the following actions occur automatically:

**Tunnel Failure**:

1. pfSense marks gateway as down
2. Traffic routes to backup gateway (if configured)
3. Cloudflare stops forwarding to failed tunnel
4. Alert notifications sent to administrators
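
The gateway selection in steps 1 and 2 can be sketched as a priority lookup: the online gateway with the lowest tier number wins. The tuple layout, the function name ``select_gateway``, and the backup gateway name ``BACKUP_GW`` are assumptions made for illustration only.

.. code-block:: python

   def select_gateway(gateways):
       """Pick the preferred online gateway.

       gateways is a list of (name, tier, status) tuples, where a lower
       tier means higher priority. Returns the gateway name, or None if
       no gateway is online.
       """
       online = [(tier, name) for name, tier, status in gateways
                 if status == "online"]
       # min() compares tier first; ties fall back to the name.
       return min(online)[1] if online else None


   # Example: the primary tunnel is down, so the backup is selected.
   gateways = [
       ("MT_GRE_1_TUNNELV4", 1, "down"),
       ("BACKUP_GW", 2, "online"),  # hypothetical backup gateway
   ]
   active = select_gateway(gateways)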

**Origin Failure**:

1. Cloudflare health checks detect failure
2. Traffic diverted to backup origin (if configured)
3. Error page served to users (if no backup)
4. Origin marked unhealthy in dashboard

Manual Failover
~~~~~~~~~~~~~~~

Administrators can manually trigger failover:

- Disable gateway in pfSense
- Disable tunnel in Cloudflare dashboard
- Adjust traffic steering policies
- Force maintenance mode

Alerting
--------

Alert Triggers
~~~~~~~~~~~~~~

Alerts are generated for:

- Tunnel health degradation
- Gateway status changes
- High latency or packet loss
- Repeated health check failures
- Complete service outages
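
One way to turn gateway status changes into alert events is to compare each status sample with the previous one. The sketch below assumes samples arrive as ``(timestamp, status)`` pairs in time order; the function name ``status_changes`` is hypothetical.

.. code-block:: python

   def status_changes(samples):
       """Yield (timestamp, old_status, new_status) for each transition."""
       previous = None
       for timestamp, status in samples:
           # Only emit an event when the status differs from the last sample.
           if previous is not None and status != previous:
               yield timestamp, previous, status
           previous = status

Emitting events only on transitions avoids repeating the same alert on every probe while a gateway stays in one state.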

Alert Channels
~~~~~~~~~~~~~~

Notifications are delivered via:

- Email to administrators
- SNMP traps to monitoring systems
- pfSense system logs
- Cloudflare dashboard notifications
- PagerDuty/Slack integrations (if configured)

Performance Baselines
---------------------

Expected Performance
~~~~~~~~~~~~~~~~~~~~

**Normal Operation**:

- **Latency**: < 15ms average
- **Jitter**: < 1ms standard deviation
- **Packet Loss**: 0.0%
- **Availability**: 99.99%+

**Warning Thresholds**:

- **Latency**: > 50ms
- **Jitter**: > 5ms
- **Packet Loss**: > 1%

**Critical Thresholds**:

- **Latency**: > 100ms
- **Packet Loss**: > 5%
- **Complete Outage**: > 10s
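
The thresholds above can be applied mechanically to a set of measurements. The sketch below is one possible reading: it checks critical thresholds first, then warning thresholds, and treats everything else as normal, including values that fall between the normal-operation targets and the warning thresholds. The function name ``classify`` is hypothetical.

.. code-block:: python

   def classify(latency_ms, jitter_ms, loss_pct, outage_s=0.0):
       """Map measurements to "normal", "warning", or "critical"."""
       # Critical thresholds: latency > 100ms, loss > 5%, outage > 10s.
       if latency_ms > 100 or loss_pct > 5 or outage_s > 10:
           return "critical"
       # Warning thresholds: latency > 50ms, jitter > 5ms, loss > 1%.
       if latency_ms > 50 or jitter_ms > 5 or loss_pct > 1:
           return "warning"
       return "normal"


   # The IPv4 tunnel figures reported earlier in this document.
   level = classify(latency_ms=9.308, jitter_ms=0.24, loss_pct=0.0)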

Historical Performance
~~~~~~~~~~~~~~~~~~~~~~

Maintaining historical metrics enables:

- Trend analysis
- Capacity planning
- SLA compliance verification
- Incident investigation
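
For SLA compliance verification, recorded outage durations can be converted into an availability percentage and compared against a target such as 99.99%. The sketch below is a minimal example of that arithmetic; the function names are hypothetical, and it assumes outages are logged as durations in seconds.

.. code-block:: python

   def availability_pct(period_s, outage_durations_s):
       """Availability over a period, given outage durations in seconds."""
       downtime = sum(outage_durations_s)
       return 100.0 * (period_s - downtime) / period_s


   def downtime_budget_s(period_s, target_pct):
       """Maximum downtime in seconds that still meets the target."""
       return period_s * (1 - target_pct / 100.0)


   THIRTY_DAYS_S = 30 * 24 * 3600
   # A 99.99% target over 30 days allows roughly 259 seconds of downtime.
   budget = downtime_budget_s(THIRTY_DAYS_S, 99.99)

Comparing the sum of recorded outages against this budget gives a quick pass or fail view for a reporting period.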

Troubleshooting
---------------

High Latency
~~~~~~~~~~~~

**Possible Causes**:

- WAN circuit congestion
- Upstream ISP issues
- Cloudflare edge routing problems
- Geographic distance to edge

**Investigation Steps**:

1. Check WAN utilization
2. Test latency to ISP gateway
3. Review Cloudflare status page
4. Contact Cloudflare support if the problem persists

Packet Loss
~~~~~~~~~~~

**Possible Causes**:

- MTU mismatch
- Network interface errors
- WAN circuit problems
- Firewall state table exhaustion

**Investigation Steps**:

1. Check interface statistics for errors
2. Verify MTU settings (1476 for GRE)
3. Review firewall logs for dropped packets
4. Monitor state table utilization
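
The 1476-byte figure in step 2 follows from subtracting the encapsulation overhead from a standard 1500-byte link MTU. The sketch below works through that arithmetic, assuming an outer IPv4 header of 20 bytes and a base GRE header of 4 bytes with no optional fields; the function names are hypothetical.

.. code-block:: python

   ETHERNET_MTU = 1500
   OUTER_IPV4_HEADER = 20  # outer IPv4 header without options
   GRE_HEADER = 4          # base GRE header, no key/checksum/sequence


   def gre_tunnel_mtu(link_mtu=ETHERNET_MTU):
       """MTU available inside a GRE-over-IPv4 tunnel."""
       return link_mtu - OUTER_IPV4_HEADER - GRE_HEADER


   def tcp_mss(tunnel_mtu, inner_ip_header=20, tcp_header=20):
       """Largest TCP payload that fits in one packet inside the tunnel."""
       return tunnel_mtu - inner_ip_header - tcp_header


   print(gre_tunnel_mtu())                # 1476
   print(tcp_mss(gre_tunnel_mtu()))       # 1436

If the underlying link MTU is smaller than 1500 bytes, the tunnel MTU shrinks by the same amount, which is a common source of the mismatch listed above.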

Tunnel Down
~~~~~~~~~~~

**Possible Causes**:

- WAN connectivity loss
- pfSense interface down
- Cloudflare endpoint unreachable
- Configuration error

**Investigation Steps**:

1. Verify WAN interface status
2. Check pfSense system logs
3. Test connectivity to Cloudflare endpoint
4. Review recent configuration changes

Best Practices
--------------

**Regular Monitoring**:

- Review health check metrics daily
- Investigate anomalies promptly
- Maintain performance baselines
- Document incidents and resolutions

**Proactive Maintenance**:

- Test failover procedures quarterly
- Update alert thresholds as needed
- Review and optimize check intervals
- Coordinate maintenance windows with Cloudflare

**Documentation**:

- Document all health check configurations
- Maintain runbooks for common issues
- Record baseline performance metrics
- Update procedures after incidents

Integration with Monitoring Tools
----------------------------------

**SNMP Integration**:

- Export pfSense metrics to monitoring system
- Graph latency and packet loss trends
- Correlate with other infrastructure metrics

**Cloudflare API**:

- Programmatic access to health check data
- Automated alerting and remediation
- Integration with SIEM and ticketing systems

**Custom Scripts**:

- Automated health check validation
- Performance reporting
- Capacity planning analysis