Degraded Gravity cluster for no reason

We deploy several Gravity clusters (test clusters, etc.), and they invariably end up in a degraded state for no discernible reason and don’t seem to correct themselves.

In this case, here is gravity status for a 6-node cluster:

$ sudo gravity status
Cluster name:		bravemestorf2902
Cluster status:		degraded (one or more of cluster nodes are not healthy)
Application:		...
Gravity version:	6.1.39 (client) / 6.1.39 (server)
Join token:		...
Last completed operation:
    * Remove node ip-10-1-10-74.us-west-2.compute.internal (10.1.10.74)
      ID:		9affc744-d94f-46b1-a85a-a2853084a07d
      Started:		Thu Oct 15 23:14 UTC (1 hour ago)
      Completed:	Thu Oct 15 23:14 UTC (1 hour ago)
Cluster endpoints:
    * Authentication gateway:
        - 10.1.10.13:32009
    * Cluster management URL:
        - https://10.1.10.13:32009
Cluster nodes:
    Masters:
        * ip-10-1-10-13.us-west-2.compute.internal / 10.1.10.13 / master
            Status:		healthy
            Remote access:	online
    Nodes:
        * ip-10-1-10-66.us-west-2.compute.internal / 10.1.10.66 / worker
            Status:		healthy
            Remote access:	online
        * ip-10-1-10-64.us-west-2.compute.internal / 10.1.10.64 / worker
            Status:		healthy
            Remote access:	online
        * ip-10-1-10-152.us-west-2.compute.internal / 10.1.10.152 / worker
            Status:		healthy
            Remote access:	online
        * ip-10-1-10-199.us-west-2.compute.internal / 10.1.10.199 / worker
            Status:		healthy
            Remote access:	online
        * ip-10-1-10-68.us-west-2.compute.internal / 10.1.10.68 / worker
            Status:		healthy
            Remote access:	online

I can’t tell from the logs or the audit log why it went into a degraded state, and all the nodes report as healthy. I end up having to do a status-reset to restore it so I can grow the cluster.
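For reference, the reset I end up running is just the built-in command with no extra flags (as far as I can tell there is no option that would surface the cause before it clears the recorded status):

$ sudo gravity status-reset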

Is there something I should be looking at to figure out what’s going on? This has happened with different versions, including 6.1.12.

Hi @itay, a couple of items you can typically check in this situation:

  • gravity-site pod logs

kubectl -nkube-system logs -f $(kubectl -nkube-system get po -l app=gravity-site -o=jsonpath='{.items[?(@.status.containerStatuses[*].ready==true)].metadata.name}')

  • Output of gravity exec serf members
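For the second item, a quick sanity check is to compare what serf thinks the cluster looks like against the actual set of nodes; the kubectl listing below is just one way to get the reference list:

$ sudo gravity exec serf members
$ kubectl get nodes -o wide

Any entry that shows up in serf members but not among the actual nodes, or that is not in the alive state, is suspect.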

What we’ve seen before is that lingering serf members can cause a degraded status (#943). This was partially resolved in 6.1.19.
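If you do spot a stale entry, one possible manual cleanup (a sketch only, assuming serf’s standard force-leave subcommand is reachable through gravity exec) is to force that member out, e.g. for the node that was removed in your last operation:

$ sudo gravity exec serf force-leave ip-10-1-10-74.us-west-2.compute.internal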

In 6.1.42 we added a serf cluster reconciler to address this issue further, as some other customers had reported still hitting the same problem.

Hope this helps!

Thanks

-Abdu

To add to Abdu’s suggestions, another reason for the cluster degrading might be the cluster test that the active gravity-site Pod runs. One cause that is not visible in the output of gravity status is a failing cluster status hook (if you have one defined).
To clear these things up, pull and share the log of the gravity-site pod using the command that Abdu recommended, but not in follow mode (see the example below).
Also verify the state of the serf cluster (serf members) against the actual set of cluster servers, as a mismatch there can cause this as well.
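To be concrete about pulling the log, something along these lines should work; the grep patterns are only a guess at what the relevant messages look like, so adjust or drop the filter as needed:

kubectl -nkube-system logs $(kubectl -nkube-system get po -l app=gravity-site -o=jsonpath='{.items[?(@.status.containerStatuses[*].ready==true)].metadata.name}') | grep -iE 'degraded|failed|hook'

It is the same selector Abdu used, just without -f, so the output can be captured and attached here.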

Thanks everyone - I’ve upgraded our base packaging to 6.1.44 and will monitor.