Recovering Tectonic Using an Externally Provisioned etcd on AWS

At work, we use Tectonic to bootstrap the Kubernetes cluster that runs some of our infrastructure.

Self-hosted Kubernetes is a fairly gargantuan beast, so you are likely to hit hiccups while you are learning the tool. This post is largely a braindump of what I had to do to recover our cluster's control plane after a manual configuration change crashed it. Hopefully, it is useful to somebody in dire need of aid.

Before you start, you will likely want a terminal that can broadcast keyboard input to multiple panes at once, so you don't have to repeat every command on each node. iTerm 2 (macOS only) and tmux can both do this.
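
For tmux, the setup might look like the sketch below. It is only an illustration: the function name tmux_plan, the session name masters, and the IP addresses in the usage line are placeholders I made up, and it assumes the key path and hosts contain no spaces or quotes. The function only prints the tmux commands so you can read them before running anything.

```shell
#!/bin/sh
# Sketch: one tmux pane per master node, with keystrokes mirrored to all panes.
# tmux_plan and the session name "masters" are hypothetical names.

# Usage: tmux_plan KEY_PATH HOST...
# Prints the tmux commands, one per line; it does not execute them.
tmux_plan() {
    key=$1
    shift
    session=masters
    first=1
    for host in "$@"; do
        if [ "$first" -eq 1 ]; then
            # The first host creates the detached session.
            echo "tmux new-session -d -s $session 'ssh -A -i $key core@$host'"
            first=0
        else
            # Every further host gets its own pane in that session.
            echo "tmux split-window -t $session 'ssh -A -i $key core@$host'"
        fi
    done
    # Even out the pane sizes, then mirror input to every pane.
    echo "tmux select-layout -t $session tiled"
    echo "tmux set-window-option -t $session synchronize-panes on"
    echo "tmux attach-session -t $session"
}
```

Once the printed plan looks right, something like eval "$(tmux_plan path/to/your/key 10.0.0.1 10.0.0.2 10.0.0.3)" runs it from your interactive shell. Piping the plan into sh would not work for the final attach, since tmux needs a real terminal on stdin.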

  1. ssh to all master nodes:

    ssh -A -i path/to/your/key core@your-ip
    • -A is required: it forwards your SSH agent, so your key is available for the onward ssh hops we will make from this box
    • the key to use and your-ip can both be found on the EC2 dashboard within AWS
  2. Follow "Cleaning up Kubernetes resources" from the Tectonic troubleshooting guide

  3. ssh to an etcd server from the master

    • This must be done from a master because the default security groups only allow access from another etcd node or a master node
  4. Run ps auxf | grep etcd and copy the full path to

  5. Back on the master node, scp core@your-etcd-ip:/path/to/ .

    • These certs likely exist on the master node itself, but I wasn't sure which ones to actually use, as there were many
  6. Unzip the TLS certs to a directory and cd into it

  7. Download bootkube

  8. Run:

    ./bootkube recover \
        --recovery-dir=recovered \
        --etcd-servers=https://your-etcd-0:2379,https://your-etcd-1:2379,https://your-etcd-2:2379 \
        --kubeconfig=/etc/kubernetes/kubeconfig \
        --etcd-ca-path=ca.crt \
        --etcd-certificate-path=peer.crt \
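
The --etcd-servers value in step 8 is a comma-separated list of https URLs on port 2379. If you have more than a couple of etcd nodes, a small helper can build that string for you. This is just a sketch: etcd_servers is a name I made up, and it assumes every node listens on 2379 as in the command above.

```shell
#!/bin/sh
# Sketch: build the value for bootkube's --etcd-servers flag.
# etcd_servers is a hypothetical helper name; port 2379 is taken from step 8.

# Usage: etcd_servers HOST...
# Prints https://HOST:2379 for each host, joined with commas.
etcd_servers() {
    out=""
    for host in "$@"; do
        # ${out:+$out,} adds the separating comma only after the first entry.
        out="${out:+$out,}https://${host}:2379"
    done
    printf '%s\n' "$out"
}
```

You could then pass --etcd-servers="$(etcd_servers your-etcd-0 your-etcd-1 your-etcd-2)" instead of typing the list by hand.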

If this failure is a direct result of config that was changed manually, make any adjustments or reverts to the manifests in recovered/bootstrap-manifests. Then, run:

./bootkube start --asset-dir=recovered

If this failure was a direct result of modifying the API server config, you will have a small window to correct it via the Tectonic console or kubectl while the bootstrap API server is crash-looping.
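
Because that window is short and the API server keeps going away, it can help to retry the fix in a loop rather than by hand. The helper below is a generic sketch, not something from bootkube or Tectonic: retry is a made-up name, and your-fix.yaml in the usage line stands in for whatever change you actually need to apply.

```shell
#!/bin/sh
# Sketch: rerun a command until it succeeds or the attempts run out.
# retry is a hypothetical helper, not part of bootkube or kubectl.

# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt fails.
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Pause between attempts, but not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

For example, retry 30 2 kubectl --kubeconfig=/etc/kubernetes/kubeconfig apply -f your-fix.yaml would keep trying for roughly a minute.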

About Ryan Koval

coding. gaming. live music.