Error in installation

We have an instance where Gravity fails to install, specifically at the dns-app stage. It doesn’t seem to be related to dns-app per se, but rather that it’s the first Kubernetes workload to be installed.

Gravity: 6.1.39
Environment: AWS (though it’s installed as generic)
AMI: RHEL and then Amazon Linux 2 (failed in both)

Originally, the base AMI that was being used was RHEL7, and there were several things failing (e.g. flannel didn’t come up properly), but we traced it back to etcdctl segfaulting immediately on startup (you couldn’t even get it to print the help string). We chalked this up to RHEL being on an old kernel and got the environment to switch to Amazon Linux 2.

With Amazon Linux 2 (specifically, Linux 2ENV-GC-en41105 4.14.193-149.317.amzn2.x86_64 #1 SMP Thu Sep 3 19:04:44 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux), we’re seeing a similar issue, where runc is segfaulting. Some relevant lines from the journalctl output:

createPodSandbox for pod "dns-app-install-5f0068-5z92k_kube-system(57b0922d-3117-4027-bd87-199c40cf66d2)" failed: rpc error: code = Unknown desc = failed to start sandbox container for pod "dns-app-install-5f0068-5z92k": Error response from daemon: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/7a64b6f88d6a06acda1a2a870bfb41df22576ffd4dfedb97eece77eed972dd1c/log.json: no such file or directory): runc did not terminate sucessfully: unknown

And then the log has a bunch of these:

_CMDLINE=/usr/bin/dockerd --bridge=none --iptables=false --ip-masq=false --graph=/ext/docker --storage-driver=overlay2 --exec-opt native.cgroupdriver=cgroupfs --log-opt max-size=50m --log-opt max-file=9 --storage-opt=overlay2.override_kernel_check=1
MESSAGE=time="2020-10-02T17:51:47.028519964Z" level=warning msg="failed to retrieve runc version: signal: segmentation fault"

We have no idea why it might be segfaulting - the same binaries work fine on other Amazon Linux 2 stock machines.

The only thing we can think of is that this is a hardened/modified AMI that is being used in this environment, and it’s possible SELinux is enabled (journalctl seems to say it’s in permissive mode, and we need to check what the runtime state is).
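For anyone following along, checking the runtime SELinux state is quick. This is a hedged sketch: getenforce comes with the SELinux userspace utilities and may not be installed on a hardened image, so a tool-free fallback is included.

```shell
# getenforce is part of the SELinux utilities and may be absent:
#   getenforce          # prints Enforcing, Permissive, or Disabled
# A tool-free check that works even without the utilities installed:
cat /sys/fs/selinux/enforce 2>/dev/null || echo "SELinux disabled or not present"
```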

Any ideas? Is anything here expected? How would we debug this?

I would first check whether the runc binary itself fails to execute immediately, which would give us a way to debug this. If that’s the case (e.g. /usr/bin/docker-runc segfaults on start), you could capture a crash dump by running it inside gdb.
We don’t have gdb inside the container, but it should be possible to test this outside the container by using the path to the rootfs:

gdb /var/lib/gravity/local/packages/unpacked/<planet-version>/rootfs/usr/bin/docker-runc
(gdb) run
(gdb) gcore file.dmp

which you can then share.

@Dmitri_Shelenin thanks for the reply - sorry for the delay in getting back to you, it took us a little while to get another working session on the environment.

The good news is - we figured it out. Conclusion at the end :slight_smile:

Regarding GDB - this unfortunately was not helpful. Specifically, the process died immediately, before the executable even started running. We then ran it under strace and could see execve itself returning an error, specifically an Exec format error.
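For reference, this is roughly the check we ran; the binary path is illustrative, and the trace line below is a mocked-up example rather than real output from our hosts:

```shell
# How we observed the failure (strace must be installed):
#   strace -f -e trace=execve /usr/bin/docker-runc --version
# On the broken host, execve failed with ENOEXEC, which the shell
# reports as "Exec format error". Filtering a captured trace for that
# errno looks like this (mocked-up sample line):
echo 'execve("/usr/bin/docker-runc", ...) = -1 ENOEXEC (Exec format error)' > trace.txt
grep -F 'ENOEXEC' trace.txt
```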

However, we knew the binaries were good: we checked their sha1sums against another install we have locally (in our environment, not the customer’s) on the same kernel, and everything matched.

After a bunch of headscratching and some dead ends, we ended up diffing the loaded kernel modules between the machine that was working (in our environment) and this broken environment.
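A sketch of the comparison that found the culprit. On each host you can capture the module list, then diff the lists on one machine; the module names below are illustrative stand-ins, not the real output from our environments:

```shell
# On each host, capture just the module names:
#   lsmod | awk 'NR>1 {print $1}' | sort > modules.txt
# Then, with both lists copied to one machine, diff shows the extras
# (illustrative sample data below):
printf 'br_netfilter\noverlay\n' > modules-working.txt
printf 'br_netfilter\ncyprotect\noverlay\n' > modules-broken.txt
diff modules-working.txt modules-broken.txt || true
```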

Short story - they had a kernel module from the Cylance anti-virus product that was rejecting binaries at the kernel level (presumably by hooking execve), surfacing only as the unhelpful format error, and thus the binary got rejected. Once we knew this was the problem (we confirmed it by disabling this kernel module, after which everything immediately worked), we checked journalctl on the host for the Cylance service and could see that it was rejecting our binary.
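In case it helps others: once you suspect an AV driver, the host journal search looks roughly like this. The unit name is an assumption (find the real one with systemctl list-units), and the log line below is a mocked-up example, not real Cylance output:

```shell
# Search the host journal for the AV service's messages (unit name is
# an assumption; check `systemctl list-units | grep -i cylance`):
#   journalctl -u cylancesvc --no-pager | grep -iE 'deny|reject|block'
# The same filter works offline on an exported journal (mocked-up line):
echo 'cylancesvc: blocked execution of /usr/bin/docker-runc' > host-journal.txt
grep -iE 'deny|reject|block' host-journal.txt
```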

One thing that would have helped is if the gravity debug log tarball included journalctl output from the host as well, not just from the planet container. It’s possible it’s there and I’m just blind, but I couldn’t find it.

Thank you again for the help!

The journal logs from the host should also be part of the report tarball. The file might be named differently depending on the version, and may cover a different time frame with the defaults. The latest changes (ca. 7.0.17) renamed the files to gravity-journal<-export>.log.gz and include both a text dump and a journal export-formatted dump.
Prior to 7.0.17, there was a single file called gravity-journal.log.gz capturing all host logs, and prior to 7.0.12 it was called gravity-system.log.gz.
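To locate the host journal inside a report tarball, whichever name your Gravity version used, a listing plus grep is enough. This sketch creates a stand-in tarball so the listing step is demonstrable:

```shell
# Stand-in tarball (a real one comes from `gravity report`):
mkdir -p report && touch report/gravity-journal.log.gz
tar -czf report.tar.gz report
# Find the host journal file under any of the known names:
tar -tzf report.tar.gz | grep -E 'gravity-(journal|system)[^/]*\.log'
```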

Good to hear that you sorted this out.

I’ll chalk it up to me missing the logs in the package :slight_smile: