Intermittent 502s and resets when connecting to leaf clusters

Issue:

Users experience intermittent 502s when attempting to connect to a leaf cluster through the root cluster.

Log(s):

WARN [PROXY:AGE] Unable to continue processing requests: heartbeat: connection closed. leaseID:4195 target:staging-proxy-internal-25hg08sgxe867gs6.elb.us-west-2.amazonaws.com:3024 reversetunnel/agent.go:358

Analysis:

The log message above indicates that the Teleport leaf proxy is attempting to communicate with the Teleport root cluster by sending a heartbeat over the reverse tunnel, but is unable to do so because the connection has been closed.

Users should review the /etc/teleport.yaml configuration parameters on both root and leaf clusters for the following settings:

keep_alive_interval:
keep_alive_count_max:

The keep_alive_interval setting determines how often Teleport sends keep-alive messages between services. The default is 5 minutes (300 seconds), which stays below the common load balancer idle timeout of 350 seconds. The keep_alive_count_max setting is the number of missed keep-alive messages allowed before the server tears down the connection to the client.
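
As a rough illustration, the two settings might appear in /etc/teleport.yaml as sketched below. The placement under auth_service and the value 3 for keep_alive_count_max are assumptions made for this example and should be checked against the configuration reference for the Teleport version in use; 5m matches the default interval described above.

```yaml
# Illustrative fragment only, not a complete teleport.yaml.
# Assumption: both keys sit under auth_service; confirm for your version.
auth_service:
  # How often keep-alive messages are sent (5m = 300 seconds,
  # below the common 350-second load balancer idle timeout).
  keep_alive_interval: 5m
  # Missed keep-alives allowed before the connection is torn down.
  # The value 3 is an example, not a recommendation.
  keep_alive_count_max: 3
```

Comparing this fragment on the root cluster and on each leaf cluster is a quick way to spot mismatched values.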

Solution(s):

All root and leaf clusters should use the same keep_alive_interval and keep_alive_count_max values so that connections do not time out or drop intermittently because services have fallen out of sync.

For cluster services deployed behind an AWS NLB, users should turn off cross-zone load balancing, if it is enabled, when matching the keep-alive settings does not resolve the issue on its own.