Problem:
A customer reported pods stuck in the ContainerCreating state in their 4-node cluster. Two of the nodes with the issue were cordoned.
Analysis:
After reviewing the logs, we suspected that some of the system’s resources were being exhausted, so we adjusted the fs.inotify.max_user_watches and user.max_user_namespaces kernel parameters. In addition, one of the nodes was failing to launch containers, and docker/planet could not be restarted (the restart froze), so we had to reboot the host. After the reboot, the node came back online.
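To illustrate how inotify watch usage could be inspected on a node, here is a minimal sketch. It is not the procedure used during this investigation; it assumes a Linux host where /proc/<pid>/fdinfo/<fd> lists one "inotify wd:" line per watch, and the function names are hypothetical. Note that fs.inotify.max_user_watches is a per-user limit, so the per-process counts are only a rough indicator.

```python
import os


def count_inotify_watches(fdinfo_text):
    """Count watch entries in the contents of one /proc/<pid>/fdinfo/<fd> file."""
    return sum(
        1 for line in fdinfo_text.splitlines() if line.startswith("inotify wd:")
    )


def watches_by_pid(proc_root="/proc"):
    """Return {pid: watch_count} for every process holding inotify watches."""
    totals = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, entry, "fdinfo")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or access denied
        count = 0
        for fd in fds:
            try:
                with open(os.path.join(fd_dir, fd)) as f:
                    count += count_inotify_watches(f.read())
            except OSError:
                continue  # descriptor closed while we were reading
        if count:
            totals[int(entry)] = count
    return totals


def report(limit=10):
    """Print the processes holding the most watches (hypothetical helper)."""
    totals = watches_by_pid()
    top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]
    for pid, count in top:
        print(f"pid {pid}: {count} watches")
    print("total watches seen:", sum(totals.values()))
```

Calling report() as root on a node gives the most complete picture, since unprivileged users cannot read other users' fdinfo entries.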
Solution:
The following changes were eventually made to the nodes:
1. Set the kernel parameters fs.inotify.max_user_watches = 1048576 and user.max_user_namespaces = 15000.
2. Added the same parameters to /etc/sysctl.conf so that the changes persist across reboots.
3. Ran sysctl -p to load the settings from /etc/sysctl.conf.
4. Restarted kubelet on the nodes.
5. Uncordoned the nodes.
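The /etc/sysctl.conf update in step 2 can be sketched as an idempotent merge, so that re-running it on a node does not produce duplicate entries. This is an illustration only, not the tooling used during the incident; the SETTINGS dictionary holds the values from step 1, and merge_sysctl_conf is a hypothetical helper name.

```python
# Values taken from step 1 of the solution.
SETTINGS = {
    "fs.inotify.max_user_watches": "1048576",
    "user.max_user_namespaces": "15000",
}


def merge_sysctl_conf(conf_text, settings):
    """Return conf_text with each key in settings set exactly once.

    Existing assignments are rewritten in place, later duplicates of the
    same key are dropped, and missing keys are appended at the end.
    Comments (lines starting with '#' or ';') are left untouched.
    """
    remaining = dict(settings)
    out = []
    for line in conf_text.splitlines():
        stripped = line.strip()
        is_assignment = (
            stripped
            and not stripped.startswith(("#", ";"))
            and "=" in stripped
        )
        if is_assignment:
            key = stripped.split("=", 1)[0].strip()
            if key in settings:
                if key in remaining:
                    out.append(f"{key} = {remaining.pop(key)}")
                continue  # skip the old value and any duplicate of this key
        out.append(line)
    for key, value in remaining.items():
        out.append(f"{key} = {value}")
    return "\n".join(out) + "\n"
```

The result would be written back to /etc/sysctl.conf as root, followed by sysctl -p as in step 3. Keeping the merge as a pure function on text makes it easy to preview the change before touching the file.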
Recommendation:
If your application launches a couple of pods per “project”, we recommend adding these kernel settings to your application requirements to prevent resource exhaustion when many containers are running.
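One way to enforce such a requirement is a pre-flight check that compares the live kernel values with the minimums above. The sketch below is an assumption about how that check might look, not an existing installer feature; it relies on the usual mapping of a sysctl name to a file under /proc/sys (dots become path separators), and all function names are hypothetical.

```python
import os

# Minimums taken from the values applied in this incident.
REQUIRED_MINIMUMS = {
    "fs.inotify.max_user_watches": 1048576,
    "user.max_user_namespaces": 15000,
}


def sysctl_path(name, proc_sys="/proc/sys"):
    """Map a sysctl name such as 'fs.inotify.max_user_watches' to its file."""
    return os.path.join(proc_sys, *name.split("."))


def read_sysctl(name, proc_sys="/proc/sys"):
    """Read the current integer value of a sysctl parameter."""
    with open(sysctl_path(name, proc_sys)) as f:
        return int(f.read().strip())


def find_shortfalls(required, reader=read_sysctl):
    """Return {name: (current, minimum)} for parameters below their minimum.

    A parameter that cannot be read is reported with current=None.
    """
    shortfalls = {}
    for name, minimum in required.items():
        try:
            current = reader(name)
        except (OSError, ValueError):
            current = None
        if current is None or current < minimum:
            shortfalls[name] = (current, minimum)
    return shortfalls
```

Running find_shortfalls(REQUIRED_MINIMUMS) on each node before installation returns an empty dictionary when the node meets the requirements, and otherwise lists each parameter with its current and required value.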
Prepared by: @r0mant