While Oracle Linux isn't explicitly listed as a supported OS, we were hopeful that an install would work out, since it's based on RHEL. That has turned out not to be the case, and I wanted to check whether this is a known issue and whether there are any workarounds. We've actually hit two failure scenarios: one with the Enterprise support version on bare metal and one using an AMI provided by Oracle on AWS.
Bare metal OEL 7.6
For the first scenario, on bare-metal OEL, the installation reaches the "Wait for kubernetes to become available" step but then fails repeatedly with:
2019-06-04T09:30:15-04:00 DEBU dialing leader.telekube.local:6443 install/hook.go:56
2019-06-04T09:30:15-04:00 WARN "Failed to dial with local resolver: \nERROR REPORT:\nOriginal Error: *net.OpError read udp 127.0.0.1:16976->127.0.0.2:53: read: connection refused\nStack Trace:\n\t/gopath/src/github.com/gravitational/gravity/lib/utils/dns.go:45 github.com/gravitational/gravity/lib/utils.ResolveAddr\n\t/gopath/src/github.com/gravitational/gravity/lib/httplib/client.go:248 github.com/gravitational/gravity/lib/httplib.DialWithLocalResolver\n\t/gopath/src/github.com/gravitational/gravity/lib/httplib/client.go:220 github.com/gravitational/gravity/lib/httplib.DialFromEnviron.func1\n\t/go/src/net/http/transport.go:916 net/http.(*Transport).dial\n\t/go/src/net/http/transport.go:1240 net/http.(*Transport).dialConn\n\t/go/src/net/http/transport.go:999 net/http.(*Transport).getConn.func4\n\t/go/src/runtime/asm_amd64.s:1334 runtime.goexit\nUser Message: failed to resolve leader.telekube.local:6443\n." install/hook.go:56
Until the phase eventually fails outright:
2019-06-04T09:35:14-04:00 DEBU [FSM:INSTA] Applied StateChange(Phase=/wait, State=failed, Error=context deadline exceeded). opid:360acde5-7d55-4f65-85e4-38bb8125e36c install/hook.go:56
2019-06-04T09:35:14-04:00 ERRO [INSTALLER] "Failed to execute plan: \nERROR REPORT:\nOriginal Error: context.deadlineExceededError context deadline exceeded\nStack Trace:\n\t/gopath/src/github.com/gravitational/gravity/lib/install/phases/postsystem.go:82 github.com/gravitational/gravity/lib/install/phases.(*waitExecutor).Execute\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:421 github.com/gravitational/gravity/lib/fsm.(*FSM).executeOnePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:355 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhaseLocally\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:315 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:192 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:150 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePlan\n\t/gopath/src/github.com/gravitational/gravity/lib/install/flow.go:335 github.com/gravitational/gravity/lib/install.(*Installer).startFSM\n\t/go/src/runtime/asm_amd64.s:1334 runtime.goexit\nUser Message: failed to execute phase \"/wait\"\n." install/hook.go:56
2019-06-04T09:35:14-04:00 INFO [OPS] ops.SetOperationStateRequest{State:"failed", Progress:(*ops.ProgressEntry)(0xc0008a5f00)} install/hook.go:56
2019-06-04T09:35:14-04:00 DEBU [OPS] Created: ops.ProgressEntry{ID:"", SiteDomain:"logrocket", OperationID:"360acde5-7d55-4f65-85e4-38bb8125e36c", Created:time.Time{wall:0x34976137, ext:63695252114, loc:(*time.Location)(nil)}, Completion:100, Step:9, State:"failed", Message:"Operation failure: context deadline exceeded"}. install/hook.go:56
2019-06-04T09:35:14-04:00 DEBU [FSM:INSTA] Marked operation complete. opid:360acde5-7d55-4f65-85e4-38bb8125e36c install/hook.go:56
2019-06-04T09:35:15-04:00 INFO Operation failed. install/hook.go:56
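That refused connection on 127.0.0.2:53 suggests nothing was answering DNS on the loopback alias the installer dials. For reference, this is roughly how we checked that by hand on the node (assuming ss and dig are available; dig comes from bind-utils on OEL):

# Check whether anything is listening for DNS on UDP port 53
sudo ss -ulpn 'sport = :53'

# Query the loopback resolver directly for the record the installer needs
dig @127.0.0.2 -p 53 leader.telekube.local

If something unexpected does show up on port 53 here, that would point back at a local resolver conflict.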
I initially thought this might be an issue with dnsmasq running on the node, but even after ensuring no such service was running, the same failure recurred. We also had them try a different CIDR range inside 172.16.0.0/12 for pods and services, since they use some ranges inside 10.0.0.0/8 for their internal network.
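For reference, the alternate ranges were passed to the installer roughly like this (a sketch: flag names are taken from the gravity install documentation for the version we're on, and the ranges below are placeholders rather than the exact ones they tried):

# Retry the install with pod/service CIDRs moved out of 10.0.0.0/8
sudo ./gravity install \
    --pod-network-cidr="172.28.0.0/16" \
    --service-cidr="172.29.0.0/16"

It's worth double-checking the flag names against gravity install --help for whichever version you're running.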
AWS OL 7.6
On the AWS side, using an OL 7.6 AMI, the failure scenario is slightly different. Instead of failing to resolve the leader at all, the install fails with "failed to update remote for tag" during the docker image population phase:
2019-06-05T20:15:51Z ERRO Phase execution failed: failed to update remote for tag "kubernetes-helm/tiller:v2.12.3". advertise-ip:172.31.14.26 hostname:ip-172-31-14-26.ec2.internal phase:/export/ip-172-31-14-26.ec2.internal install/hook.go:56
2019-06-05T20:15:51Z DEBU [FSM:INSTA] Applied StateChange(Phase=/export/ip-172-31-14-26.ec2.internal, State=failed, Error=failed to update remote for tag "kubernetes-helm/tiller:v2.12.3"). opid:e94fde77-fa9e-4f7f-b39e-1d2fc3d4afe7 install/hook.go:56
2019-06-05T20:15:51Z WARN "Failed to execute phase \"/export/ip-172-31-14-26.ec2.internal\": \nERROR REPORT:\nOriginal Error: *client.UnexpectedHTTPStatusError received unexpected HTTP status: 500 Internal Server Error\nStack Trace:\n\t/gopath/src/github.com/gravitational/gravity/lib/app/docker/imageservice.go:501 github.com/gravitational/gravity/lib/app/docker.(*remoteStore).updateRepo\n\t/gopath/src/github.com/gravitational/gravity/lib/app/docker/imageservice.go:206 github.com/gravitational/gravity/lib/app/docker.(*imageService).Sync\n\t/gopath/src/github.com/gravitational/gravity/lib/install/phases/export.go:154 github.com/gravitational/gravity/lib/install/phases.(*exportExecutor).exportApp\n\t/gopath/src/github.com/gravitational/gravity/lib/install/phases/export.go:106 github.com/gravitational/gravity/lib/install/phases.(*exportExecutor).Execute\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:421 github.com/gravitational/gravity/lib/fsm.(*FSM).executeOnePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:355 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhaseLocally\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:315 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:192 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:379 github.com/gravitational/gravity/lib/fsm.(*FSM).executeSubphasesConcurrently.func1\n\t/go/src/runtime/asm_amd64.s:1334 runtime.goexit\nUser Message: failed to update remote for tag \"kubernetes-helm/tiller:v2.12.3\"\n." install/hook.go:56
2019-06-05T20:15:51Z ERRO [INSTALLER] "Failed to execute plan: \nERROR REPORT:\nOriginal Error: trace.aggregate failed to update remote for tag \"kubernetes-helm/tiller:v2.12.3\", failed to execute phase \"/export/ip-172-31-14-26.ec2.internal\"\nStack Trace:\n\t/gopath/src/github.com/gravitational/gravity/lib/utils/collecterrors.go:65 github.com/gravitational/gravity/lib/utils.Collect\n\t/gopath/src/github.com/gravitational/gravity/lib/utils/collecterrors.go:27 github.com/gravitational/gravity/lib/utils.CollectErrors\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:387 github.com/gravitational/gravity/lib/fsm.(*FSM).executeSubphasesConcurrently\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:358 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhaseLocally\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:287 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:192 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:150 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePlan\n\t/gopath/src/github.com/gravitational/gravity/lib/install/flow.go:335 github.com/gravitational/gravity/lib/install.(*Installer).startFSM\n\t/go/src/runtime/asm_amd64.s:1334 runtime.goexit\nUser Message: failed to execute phase \"/export\"\n." install/hook.go:56
2019-06-05T20:15:51Z INFO [OPS] ops.SetOperationStateRequest{State:"failed", Progress:(*ops.ProgressEntry)(0xc00144eb00)} install/hook.go:56
2019-06-05T20:15:51Z DEBU [OPS] Created: ops.ProgressEntry{ID:"", SiteDomain:"logrocket", OperationID:"e94fde77-fa9e-4f7f-b39e-1d2fc3d4afe7", Created:time.Time{wall:0x11436b0e, ext:63695362551, loc:(*time.Location)(nil)}, Completion:100, Step:9, State:"failed", Message:"Operation failure: failed to update remote for tag \"kubernetes-helm/tiller:v2.12.3\", failed to execute phase \"/export/ip-172-31-14-26.ec2.internal\""}. install/hook.go:56
2019-06-05T20:15:51Z DEBU [FSM:INSTA] Marked operation complete. opid:e94fde77-fa9e-4f7f-b39e-1d2fc3d4afe7 install/hook.go:56
2019-06-05T20:15:51Z INFO Operation failed. install/hook.go:56
I've struggled to find any logs that might shed light on what that 500 Internal Server Error actually was.
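In case it helps anyone who knows where to look, this is the sort of digging I've been doing so far (gravity enter drops you into the planet environment where the registry runs; the grep pattern is just a guess):

# Enter the planet container on the failing node
sudo gravity enter

# From inside planet, scan the journal around the failure window
journalctl --no-pager --since "2019-06-05 20:00" | grep -iE 'registry|docker'

If there's a better place to find the registry's request logs, pointers would be much appreciated.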