At work, we're using Tectonic to bootstrap the Kubernetes cluster that runs some of our infrastructure.
Running self-hosted Kubernetes is a decently gargantuan beast, and you are likely to encounter hiccups as you learn the tooling. This post is largely a braindump of what I had to do to recover our cluster's control plane after it crashed from a manual configuration change. Hopefully, it is useful to somebody in dire need of aid.
Before you start, you will likely want a terminal that can broadcast keyboard input to multiple panes at once, so you aren't repeating every command on each node. iTerm2 (macOS only) and tmux can both do this, as shown below.
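In tmux, for example, pane synchronization can be toggled like this (a minimal sketch, assuming the default `Ctrl-b` prefix):

```
# Split one pane per master node, then open the command prompt (Ctrl-b :)
# and toggle typing into all panes at once:
setw synchronize-panes on

# Turn it back off before running anything node-specific:
setw synchronize-panes off
```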
- `ssh` to all master nodes: `ssh -A -i path/to/your/key core@your-ip`
  - You must have `-A` to forward your key to the other boxes that we will `ssh` to
  - Which key to use and `your-ip` can be found on the EC2 dashboard within AWS
- Follow "Cleaning up Kubernetes resources" from the Tectonic troubleshooting guide
- `ssh` to an etcd server from the master
  - This must be done from a master because the default security groups will only allow entry from another etcd node or a master node
- Run `ps auxf | grep etcd` and copy the full path to `tls.zip`
- Back on the master node, run `scp core@your-etcd-ip:/path/to/tls.zip .`
  - These certs likely exist on the master node itself, but I wasn't sure which ones to actually use, as there were many
- Unzip the TLS certs to a directory and `cd` into it
- Download `bootkube` (see the sketch after this list for both of these steps)
- Run:

  ```
  ./bootkube recover \
    --recovery-dir=recovered \
    --etcd-servers=https://your-etcd-0:2379,https://your-etcd-1:2379,https://your-etcd-2:2379 \
    --kubeconfig=/etc/kubernetes/kubeconfig \
    --etcd-ca-path=ca.crt \
    --etcd-certificate-path=peer.crt \
    --etcd-private-key-path=peer.key
  ```

  - This command was copied from the Tectonic troubleshooting guide, so feel free to reference it if something has changed with your version
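For the unzip and download steps above, here is a minimal sketch. The bootkube version, release URL, and tarball layout are all assumptions on my part; check the project's releases page for the build that matches your Tectonic version:

```
# Unpack the TLS certs copied over from the etcd node and work from there
mkdir etcd-tls && unzip tls.zip -d etcd-tls && cd etcd-tls

# Fetch a bootkube release (the version and tarball layout below are
# assumptions; confirm them against the releases page for your cluster)
BOOTKUBE_VERSION=v0.9.1
curl -LO "https://github.com/kubernetes-incubator/bootkube/releases/download/${BOOTKUBE_VERSION}/bootkube.tar.gz"
tar -xzf bootkube.tar.gz && cp bin/linux/bootkube .
```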
If this failure was the direct result of config that was changed manually, make any adjustments or reverts to the manifests in `recovered/bootstrap-manifests`. Then, run:

```
./bootkube start --asset-dir=recovered
```
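Once `bootkube start` returns, it is worth watching the control plane come back up before touching anything else. A minimal check, assuming the same kubeconfig path used in the recover command above:

```
kubectl --kubeconfig=/etc/kubernetes/kubeconfig get pods -n kube-system -w
```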
If this failure was the direct result of modifying the API server config, you will have a small window to correct it via the Tectonic Console or `kubectl` while the bootstrap API server is crash-looping.
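For example, one way to use that window is to edit the self-hosted API server object directly. This is a sketch under assumptions: in the self-hosted clusters I have seen, the API server runs as a `kube-system` daemonset named `kube-apiserver`, but the kind and name may differ in your version:

```
kubectl --kubeconfig=/etc/kubernetes/kubeconfig -n kube-system edit daemonset kube-apiserver
```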