Issues with netfilter module

For the past few months, across multiple versions (6.1.12 and 6.1.39), we’re seeing issues with the netfilter module.

Specifically, the sequence of events is something like this:

Sequence 1:

  1. Bring up a new EC2 machine (vanilla Amazon Linux 2 AMI)
  2. Install Gravity, everything works fine, it auto-enables the module.
  3. Reboot the cluster.
  4. Cluster fails to come up, as netfilter is disabled, and need to run sysctl net.bridge.bridge-nf-call-iptables.

Sequence 2:

  1. Bring up a new EC2 machine (vanilla Amazon Linux 2 AMI)
  2. Install Gravity, everything works fine, it auto-enables the module.
  3. Uninstall Gravity, module is still enabled post-install.
  4. Re-install Gravity on the same node (same installer), module gets disabled in the middle of the install (it really is the middle, it takes a minute or two until it shows up as disabled).
  5. Install is stuck until someone enables it or eventually the install times out.

In general, it seems there is some problem in setting/retaining this value properly, especially on reboots/reinstalls.

Additionally, this instruction in the docs is wrong:

echo net.bridge.bridge-nf-call-iptables=1 >> /etc/sysctl.d/10-bridge-nf-call-iptables.conf

This directory doesn’t exist anymore on Amazon Linux 2, I believe it’s /etc/system.d or something of the sort.

For scenario 1, my guess is wherever we’re configuring the modules on boot, is not what amazon linux pays attention to by default. We’ll need to look into that.

For scenario 2, This is sort of a known issue / design issue in the module code. The module code in gravity is clever, in that it detects the loaded modules, and then only implements the minimum configuration to load modules that aren’t already loaded. This creates a problem in the install -> uninstall -> install scenario, as the second install will see required modules as already loaded, and assume the host was configured that way and not write configuration to load any modules.

We ran into a similar issue with sysctls, and went through and made sure all sysctls are always set by the installer. But haven’t done the same pass for kernel modules.

@knisbet thanks for the quick response.

For scenario 1, I feel that we’ve seen it on non-Amazon Linux 2 machines (we’re working with one prospect that is on-prem) as well.

For scenario 2, I guess to be clear, the thing that is needed to “resolve” it when it occurs is to run sysctl net.bridge.bridge-nf-call-iptables=1, not to reload the module.

It’s also odd that the value here is 1 at the beginning of the install and then somewhere in the middle something sets it to 0.