Moving to etcd3 and/or adopting etcd-manager
Nonetheless, this remains a higher-risk upgrade than most other Kubernetes upgrades. You are strongly advised to plan accordingly: back up critical data, schedule the upgrade during a maintenance window, think about how you could recover onto a new cluster, and try it on non-production clusters first.
To minimize the pain of this migration, we are making some other supporting changes at the same time:
- We enable TLS for both clients & peers with etcd3
- Calico configurations move from talking to etcd directly to using CRDs (talking to etcd is considered deprecated)
This does introduce the risk that we are changing more at the same time. We provide some mitigation steps for breaking up the upgrade, though most of these involve multiple disruptive upgrades (e.g. etcd2 -> etcd3 is disruptive, and non-TLS to TLS is disruptive).
Note: Even if you are already using etcd3 and have TLS enabled, it is recommended to move to etcd-manager, and the steps in this document still apply to you. If you would like to delay using etcd-manager, there are steps at the bottom of this doc that outline how to do that.
When upgrading to Kubernetes 1.12 with kOps 1.12, by default:
- Calico and Cilium will be updated to a configuration that uses CRDs
- We will automatically start using etcd-manager
- Using etcd-manager will default to etcd3
- Using etcd3 will use TLS for all etcd communications
Calico/Cilium users
If you are using Calico, the switch to CRDs will effectively cause a network partition during the rolling update. Your application might tolerate this, but it probably won’t. We therefore recommend rolling your nodes as fast as possible:
DANGER: Using the procedure to quickly roll your masters can result in downtime for any workloads using Service LoadBalancers. (The “Hammer 🔨” Method)
Any time you kill off all three masters with --cloudonly and --master-interval=1s, you may see worker nodes go into a NotReady state when the new masters come online and reconcile the cluster state. This can lead to Kubernetes Service LoadBalancers removing nodes in a NotReady state. In some cases, larger clusters have all nodes in a NotReady state, causing a cluster-wide Service LoadBalancer disruption. See AWS ELB Mitigation below for workarounds.
kops rolling-update cluster --cloudonly --master-interval=1s --node-interval=1s
kops rolling-update cluster --cloudonly --master-interval=1s --node-interval=1s --yes
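While the rolling update runs, it can help to watch how many nodes are not yet Ready. The following is a minimal sketch, not part of kOps: count_not_ready is a hypothetical helper that reads the output of kubectl get nodes --no-headers on standard input and counts rows whose STATUS column does not start with Ready.

```shell
# Hypothetical helper (not part of kOps or kubectl).
# Reads `kubectl get nodes --no-headers` output on stdin and prints the
# number of nodes whose STATUS column (field 2) does not start with "Ready".
# "Ready,SchedulingDisabled" still counts as Ready; "NotReady" does not.
count_not_ready() {
  awk 'NF >= 2 && $2 !~ /^Ready/ { n++ } END { print n + 0 }'
}

# Example usage against a live cluster (assumes kubectl is configured):
#   kubectl get nodes --no-headers | count_not_ready
```

A count that stays above zero for a long time after the new masters come up suggests the nodes are not recovering on their own.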
If you would like to upgrade more gradually, we offer the following strategies to spread out the disruption over several steps. Note that they likely involve more disruption and are not necessarily lower risk.
If you don’t already have TLS enabled with etcd, you can adopt etcd-manager before kOps 1.12 & Kubernetes 1.12 by running:
Delay adopting etcd-manager with kOps 1.12
To delay adopting etcd-manager with kOps 1.12, specify the provider as type legacy:
To remove, kops edit your cluster and delete the provider: Legacy lines from both etcdCluster blocks.
To delay adopting etcd3 with kOps 1.12, specify the etcd version as 2.2.1.
To remove, kops edit your cluster and delete the version: 2.2.1 lines from both etcdCluster blocks.
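After editing, you may want to confirm that neither override remains in the cluster spec. The following is a minimal sketch, assuming the spec is exported as YAML (for example with kops get cluster -o yaml). find_etcd_overrides is a hypothetical helper, not a kOps command, and it only does a plain text match on the two lines named above.

```shell
# Hypothetical helper (not a kOps command).
# Reads a cluster spec in YAML on stdin and prints, with line numbers, any
# remaining "provider: Legacy" or "version: 2.2.1" lines.
# Exits 0 if at least one override is found, 1 if none are found.
find_etcd_overrides() {
  grep -nE '^[[:space:]-]*(provider:[[:space:]]*Legacy|version:[[:space:]]*2\.2\.1)[[:space:]]*$'
}

# Example usage (assumes kops is configured and NAME holds your cluster name):
#   kops get cluster --name "${NAME}" -o yaml | find_etcd_overrides \
#     || echo "no legacy etcd overrides found"
```

Because this is a text match rather than a YAML parse, it would miss values written with quotes; treat it as a quick check rather than a guarantee.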
AWS ELB Mitigation
When quickly rolling all your masters, you can hit conditions which lead to nodes entering a NotReady state. Kubernetes, by default, will remove any NotReady nodes from ELBs managed by Services. To avoid possible ELB service interruption, you can add a temporary IAM policy which blocks the masters from removing NotReady nodes from LoadBalancer type services. This policy only needs to be in place while you are performing this upgrade and can be removed once the nodes (masters and workers) are all in a Ready state. Make sure you remove the policy once the cluster is upgraded and stable, otherwise Kubernetes will not be able to effectively manage your nodes in ELBs.
# Configure your masters_role_name (Generally "masters.your.cluster.name")
masters_role_name="masters.<your.cluster.name>"
# Install a temporary IAM policy. This avoids nodes being removed from LoadBalancer type services while masters reconcile the state of the cluster.
aws iam put-role-policy \
  --role-name "${masters_role_name}" \
  --policy-name temporary-etcd-upgrade-deny-lb-changes \
  --policy-document \
    '{"Version": "2012-10-17", "Statement": [{"Action": ["elasticloadbalancing:DeregisterInstancesFromLoadBalancer", "elasticloadbalancing:DeregisterTargets"], "Resource": ["*"], "Effect": "Deny"}]}'
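Quoting a JSON document inline is easy to get wrong. As an alternative sketch, the same policy can be written to a file first and passed with the AWS CLI's file:// prefix. write_deny_policy and the file path are hypothetical names introduced for illustration; the policy content is the one shown above.

```shell
# Hypothetical helper: writes the temporary deny policy to the given path.
write_deny_policy() {
  cat > "$1" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": [
        "elasticloadbalancing:DeregisterInstancesFromLoadBalancer",
        "elasticloadbalancing:DeregisterTargets"
      ],
      "Resource": ["*"]
    }
  ]
}
EOF
}

# Example usage (assumes masters_role_name is set as above; the path is arbitrary):
#   write_deny_policy /tmp/deny-lb-changes.json
#   aws iam put-role-policy \
#     --role-name "${masters_role_name}" \
#     --policy-name temporary-etcd-upgrade-deny-lb-changes \
#     --policy-document file:///tmp/deny-lb-changes.json
```

Keeping the policy in a file also makes it easier to review before it is applied.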
Removing the Temporary Policy