Teleport Auth Server rejecting all session recordings

Teleport version: 4.4.6

My single teleport auth/proxy/node server is being killed by a torrential inflow of incorrect session recordings. The docker logs look like this:

WARN [AUTH]      Rejecting session recording from 7d2e19cd-515a-4de2-801b-6b08c111bf89: server ID 834afe58-6256-4fc0-992b-7c75500f6316 not valid. System may be under attack, a node is attempting to submit events for an identity other than its own. auth/apiserver.go:1981
WARN [AUTH]      Rejecting session recording from 7d2e19cd-515a-4de2-801b-6b08c111bf89: server ID 02d82ef1-2b94-403d-a9f9-fd7f9aaa1968 not valid. System may be under attack, a node is attempting to submit events for an identity other than its own. auth/apiserver.go:1981
WARN [AUTH]      Rejecting session recording from 7d2e19cd-515a-4de2-801b-6b08c111bf89: server ID 1be4a891-78bc-49bd-aa1e-84bfaa9dafd9 not valid. System may be under attack, a node is attempting to submit events for an identity other than its own. auth/apiserver.go:1981
WARN [AUTH]      Rejecting session recording from 7d2e19cd-515a-4de2-801b-6b08c111bf89: server ID 834afe58-6256-4fc0-992b-7c75500f6316 not valid. System may be under attack, a node is attempting to submit events for an identity other than its own. auth/apiserver.go:1981

These arrive at a rate of hundreds of log lines per minute.
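To get a feel for how many distinct server IDs are being rejected, the warnings can be tallied with a small shell sketch. It assumes the log format shown above; the function name rejected_ids and the container name teleport in the usage line are made up for illustration.

```shell
# Tally the rejected server IDs found in Teleport auth log lines read from stdin.
# Only the UUID that follows the words "server ID" is counted, not the sender's ID.
rejected_ids() {
  grep -oE 'server ID [0-9a-f-]{36}' \
    | awk '{print $3}' \
    | sort \
    | uniq -c \
    | sort -rn
}

# Usage (container name is hypothetical):
#   docker logs teleport 2>&1 | rejected_ids
```

Each output line is a count followed by a server ID, most frequent first.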

I’m running the teleport server as an EC2 instance behind an internal load balancer, which I’m aware isn’t a great setup. In my teleport.yaml I’ve set:

proxy_service:
  enabled: "yes"
  listen_addr: 0.0.0.0:3023
  web_listen_addr: 0.0.0.0:3080
  tunnel_listen_addr: 0.0.0.0:3024
  public_addr: <private-ip>:3080

Which I’m not too happy about (I’d much prefer to set the actual DNS name instead of the private IP), but it’s a “legacy” system and I’m not in a position to change that easily.

Any ideas if there is a workaround for the session recording rejection logs? To be fair, I don’t really want to keep the session recordings (or even make them), but some research hasn’t shown me a way to disable making/storing them.

some related online duckduckgo’ings:

Hi Sam

If you really want to disable session recording completely, you can set this in your /etc/teleport.yaml on the auth server and restart:

auth_service:
  session_recording: none

As to why this error is happening, I’m not entirely sure - has this always been the case, or did it just start recently? Did you upgrade or change anything?

Thanks
Gus

Hi Gus,

Thanks for your response! FYI, Teleport immediately gave me an error saying that I should use off instead of none for the session_recording value. However, setting it to off still results in many thousands of session recording rejection logs. Maybe the nodes need to be reinitialized/re-added to the pool to stop this from happening?

Some more background on the situation:

  • About 3 months ago we upgraded from teleport 4.2.7 (I think) to 4.3.5;
  • Right after the upgrade, I tested the web interface by logging in, teleporting to a node, and running some basic commands (ls, cd, etc);
  • Upon logging out from the node and checking the auth-server logs, I noticed this problem.
  • We ran it for 3 months with occasional failure, which isn’t a huge deal in our case, and I could just restart the docker-compose setup easily.
  • 2 weeks ago I upgraded our auth server and all nodes to 4.4.6 (I tried 5.0.2 but that had other issues relating to DNS that I couldn’t resolve) hoping it would stop this error.
  • It didn’t.

Hey Sam,

If you’ve got the time to chat, we’d like to dig deeper over a Zoom call. Please send me an email: evan@goteleport.com

Hi,

I’ve emailed you.

After our research today I went ahead and locked down the inbound ports of the teleport auth-server EC2 instance using the security groups, allowing only port 22 from my own IP, and that did stop the inflow of recording uploads. Tomorrow I will add back IPs 1 by 1 to find out which server is incorrectly trying to upload, but I can already spoil that it’s multiple servers in multiple regions. Fun times.

What are the minimum viable open ports (in security group terms) to run teleport with? I’ve now got loadbalancer 443 mapped to 3080, and 3080 open on the teleport server for the loadbalancer only, 22 for my IP, and 3022-3025 for my IP only. Do I need more IPs for 3022-3025? I think I do, looking at Running Teleport on AWS.

Okay, I’ve found the (approximate) problem and the workaround. Somehow some nodes in my cluster had an outdated server_id in /var/lib/teleport, I think, and the workaround solution is just to run something like this on all nodes:

rm -rf /var/lib/teleport/*
chown -R user:user /var/lib/teleport
service teleport restart

Which completely resets the teleport configuration for the node.
My guess is that the teleport auth-server rebooting may have caused this in the first place, as - and this is just my assumption - the IP of the auth-server never changed, but somehow restarting the docker container did change the auth-server server_id.
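
Before wiping a node, it may be worth checking which ID it currently has stored on disk. A minimal sketch, assuming the node keeps its ID in a file named host_uuid inside the data directory (that file name is an assumption, not something confirmed in this thread); the function name node_id is hypothetical:

```shell
# Print the server ID a node has stored on disk.
# Assumption: the ID lives in <data_dir>/host_uuid; adjust if your layout differs.
node_id() {
  data_dir="${1:-/var/lib/teleport}"
  if [ -r "$data_dir/host_uuid" ]; then
    # Strip trailing newline/whitespace so the value compares cleanly.
    tr -d '[:space:]' < "$data_dir/host_uuid"
    echo
  else
    echo "no host_uuid found in $data_dir" >&2
    return 1
  fi
}

# Usage: node_id            (defaults to /var/lib/teleport)
#        node_id /some/dir  (explicit data directory)
```

If the printed ID shows up in the rejection warnings, either as the sender or as the rejected server ID, that node is a plausible candidate for the reset above.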