At work, we're using Tectonic to bootstrap the Kubernetes cluster that runs some of our infrastructure.

Running a self-hosted Kubernetes cluster is a fairly gargantuan beast, so you are likely to hit hiccups while you are still learning the tooling. This post is largely a braindump of what I had to do to recover our cluster's control plane after it crashed following a manual configuration change. Hopefully it is useful to somebody in dire need of aid.

Before you start, you will likely want a terminal setup that can broadcast keyboard input to multiple panes at once, so you don't have to repeat every command on every master. iTerm 2 (macOS only) and tmux can both do this.
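
If you go the tmux route, the pane-mirroring part is just a window option; how you split the panes and open the ssh sessions is up to you:

    # from inside tmux, after splitting one pane per master node:
    tmux set-window-option synchronize-panes on    # keystrokes now go to every pane

    # turn it back off once you no longer want every pane to receive input:
    tmux set-window-option synchronize-panes off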

  1. ssh to all master nodes:

    ssh -A -i path/to/your/key core@your-ip
    
    • you must pass -A so your SSH agent is forwarded, since we will ssh onward to other boxes from the master
    • the key to use and your-ip can both be found on the EC2 dashboard in AWS
  2. Follow "Cleaning up Kubernetes resources" from the Tectonic troubleshooting guide

  3. ssh to an etcd server from the master

    • Must be done from a master because the default security groups will only allow entry from another etcd node or a master node
  4. Run ps auxf | grep etcd and copy the full path to tls.zip

  5. Back on the master node, scp core@your-etcd-ip:/path/to/tls.zip .

    • These certs likely exist on the master node itself, but I wasn't sure which ones to actually use, as there were many
  6. Unzip the TLS certs to a directory and cd into it (this step and the next are sketched out just after this list)

  7. Download bootkube

  8. Run:

    ./bootkube recover \
        --recovery-dir=recovered \
        --etcd-servers=https://your-etcd-0:2379,https://your-etcd-1:2379,https://your-etcd-2:2379 \
        --kubeconfig=/etc/kubernetes/kubeconfig \
        --etcd-ca-path=ca.crt \
        --etcd-certificate-path=peer.crt \
        --etcd-private-key-path=peer.key
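
For reference, steps 6 and 7 might look roughly like this on the master node. The bootkube release version, download URL, and tarball layout below are assumptions; grab whichever release actually matches your cluster from the bootkube GitHub releases page:

    # step 6: unpack the etcd TLS certs and work from that directory, so the
    # relative cert paths in the recover command above resolve
    unzip tls.zip -d tls && cd tls

    # step 7: fetch a bootkube release binary (version and archive layout are
    # assumptions -- check the releases page for what is actually published)
    curl -LO https://github.com/kubernetes-incubator/bootkube/releases/download/v0.9.1/bootkube.tar.gz
    tar -xzf bootkube.tar.gz
    cp bin/linux/bootkube . && chmod +x bootkube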
    

If this failure was a direct result of config that was changed manually, make any adjustments or reverts to the manifests in recovered/bootstrap-manifests. Then, run:

    ./bootkube start --asset-dir=recovered

If this failure was a direct result of modifying the API server config, you will have a small window to correct it via the Tectonic Console or kubectl while the bootstrap API server is crash-looping.
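
If you go the kubectl route, a rough sketch of using that window looks something like the following, run from anywhere that has kubectl and a kubeconfig for the cluster (the /etc/kubernetes/kubeconfig path is the one the masters use, and the kube-apiserver daemonset name is an assumption; check what actually exists in kube-system on your cluster):

    # watch the control-plane pods; requests will only succeed during the
    # brief moments when the API server is actually up
    kubectl --kubeconfig=/etc/kubernetes/kubeconfig -n kube-system get pods -w

    # find the real name of the apiserver daemonset
    kubectl --kubeconfig=/etc/kubernetes/kubeconfig -n kube-system get daemonsets

    # then revert the offending flag or config change
    kubectl --kubeconfig=/etc/kubernetes/kubeconfig -n kube-system edit daemonset kube-apiserver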