At work, we're using Tectonic to bootstrap the Kubernetes cluster that runs some of our infrastructure.
Running a self-hosted Kubernetes cluster is a gargantuan undertaking, and you are likely to encounter hiccups while learning the tooling. This post is largely a braindump of what I had to do to recover our cluster's control plane after it crashed from a manual configuration change. Hopefully, it is useful to somebody in dire need of aid.
Before you start, you will likely want to use a terminal that can broadcast keyboard input to multiple panes at once, to avoid repeating every command on every node. iTerm 2 (macOS only) and tmux can both do this.
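With tmux, for example, broadcasting can be toggled per window with the `synchronize-panes` option (a sketch; check your tmux version's man page for details):

```shell
# Inside a tmux session: open a pane per node, then broadcast keystrokes to all of them.
tmux split-window -h                            # create a second pane
tmux set-window-option synchronize-panes on     # type once, send to every pane in the window
# ...run the recovery commands on all nodes at once...
tmux set-window-option synchronize-panes off    # return to per-pane input
```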
`ssh` to all master nodes:

```
ssh -A -i path/to/your/key core@your-ip
```
- you must have `-A` to forward your key to the other boxes that we will `ssh` to
- `-i` specifies which key to use
- `your-ip` can be found on the EC2 dashboard within AWS
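Note that `-A` forwards your local ssh agent rather than the key file itself, so the key needs to be loaded into the agent before it can be used from the next hop (a sketch, reusing the placeholder path from above):

```shell
# Load the key into the local ssh agent so -A can present it from the master node.
ssh-add path/to/your/key

# Confirm the key is loaded before connecting.
ssh-add -l
```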
Follow *Cleaning up Kubernetes resources* from the Tectonic troubleshooting guide.
`ssh` to an etcd server from the master
- Must be done from a master because the default security groups will only allow entry from another etcd node or a master node
Run `ps auxf | grep etcd` and copy the full path to the TLS certs
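The flag list on the etcd process can be long, so splitting its arguments onto separate lines makes the cert paths easier to spot. The etcd invocation below is illustrative, not the exact flags Tectonic uses — on a real node, pipe the actual `ps` output in instead of the `echo`:

```shell
# Pretend output of `ps auxf | grep etcd`, then keep only the cert/key arguments.
echo "/usr/local/bin/etcd --peer-cert-file=/opt/tectonic/tls/peer.crt --peer-key-file=/opt/tectonic/tls/peer.key" \
  | tr ' ' '\n' \
  | grep -E 'crt|key'
```

This prints just the `--peer-cert-file` and `--peer-key-file` arguments, one per line.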
Back on the master node, run:

```
scp core@your-etcd-ip:/path/to/tls.zip .
```
- These certs likely exist on the master node itself, but I wasn't sure which ones to actually use, as there were many
Unzip the TLS certs to a directory and run:

```
./bootkube recover \
  --recovery-dir=recovered \
  --etcd-servers=https://your-etcd-0:2379,https://your-etcd-1:2379,https://your-etcd-2:2379 \
  --kubeconfig=/etc/kubernetes/kubeconfig \
  --etcd-ca-path=ca.crt \
  --etcd-certificate-path=peer.crt \
  --etcd-private-key-path=peer.key
```
- This command was copied from the Tectonic troubleshooting guide, so feel free to reference it if something has changed with your version
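Before (or after) the recovery, it can be worth confirming that etcd itself is reachable with the certs you just copied. A sketch using `etcdctl` v3, with the same placeholder endpoints as above:

```shell
# Check each etcd member's health using the recovered TLS material.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://your-etcd-0:2379,https://your-etcd-1:2379,https://your-etcd-2:2379 \
  --cacert=ca.crt \
  --cert=peer.crt \
  --key=peer.key \
  endpoint health
```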
If this failure was directly caused by manually changed config, make any adjustments or reverts to the manifests in `recovered/bootstrap-manifests`. Then, run:

```
./bootkube start --asset-dir=recovered
```
If this failure was directly caused by modifying API server config, you will have a small window to correct it via the Tectonic Console or `kubectl` while the bootstrap API server is crash-looping.
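To see whether that window is open, watching the control-plane pods from a machine with a working kubeconfig is a quick check (a sketch; the namespace assumes a standard self-hosted layout):

```shell
# Watch the control-plane pods; a CrashLoopBackOff status on the API server
# pod means the bootstrap API server is still cycling and can be corrected.
kubectl get pods --namespace kube-system --watch
```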