Rolling Updates
Rolling updates are performed using the kops rolling-update cluster command.
Cloud instances are chosen to be updated (replaced) if at least one of the following is true:
- The instance was created with a specification that is older than that generated by the last kops update cluster.
- The instance was detached for surging by a previous (failed or interrupted) rolling update.
- The node has a kops.k8s.io/needs-update annotation.
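As an illustration of the third condition, a node carrying the annotation might look like the fragment below. This is only a sketch: the node name and the empty annotation value are assumptions, since the text above requires only that the annotation be present.

```yaml
# Hypothetical Node fragment; only the annotation key comes from the text above.
apiVersion: v1
kind: Node
metadata:
  name: example-node               # assumed name, for illustration only
  annotations:
    kops.k8s.io/needs-update: ""   # assumed value; the documented condition is the annotation's presence
```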
A rolling update will update instances from one instance group at a time. First, it will update bastion instance groups. Next, it will update master instance groups, then apiserver instance groups. Finally, it will update node instance groups. Within an instance group role it will update instance groups in alphabetical order.
A rolling update may be restricted to instance groups of particular roles (“Bastion”, “Master”, “APIServer”, and/or “Node”) with the --instance-group-roles flag. A rolling update may be restricted to particular instance groups with the --instance-group flag.
The first thing rolling update will do when updating an instance group is validate the cluster, as for kops validate cluster. If the cluster fails validation at this time then the entire rolling update will stop with an error.
Next, rolling update will apply a PreferNoSchedule (soft) taint to the instance group’s nodes that have been chosen to be updated. This will prevent new pods, including replacements for evicted pods, from being scheduled on the old nodes unless there is no other place to schedule them.
This validation and tainting will not be performed if either of the following is true:
- The instance group is of role “Bastion”.
- The --cloudonly flag was given to the kops rolling-update cluster command.
Finally, rolling update will replace the instance group’s chosen nodes, respecting the limits configured in that group’s rolling update strategy.
When being updated, a node is first cordoned to prevent any new pods from being scheduled on it. The cordoning also causes some cloud provider load balancers to remove the node from the set of available destinations. Next, the node is drained, voluntarily evicting all pods not managed by a DaemonSet. This eviction respects any pod disruption budgets.
Instances will not be cordoned or drained if at least one of the following is true:
- They are bastions.
- They were not registered as nodes.
- The --cloudonly flag was given to the kops rolling-update cluster command.
Rolling update will then terminate the instance. Unless the instance had been detached for surging, this will cause the cloud provider to create a new instance with the current specification.
Rolling update then waits for 15 seconds to allow the Kubernetes API server to notice the termination. The amount of time to wait may be changed with the --bastion-interval, --master-interval, and/or --node-interval flags.
Unless the --cloudonly flag was given, rolling update then waits until the cluster validates successfully. This is done in order to ensure the replacement instance is working before rolling update proceeds to update another instance.
Configurable rolling update strategies
The behavior of rolling update within an instance group may be configured through the rollingUpdate field of the group’s InstanceGroupSpec.
Cluster-wide defaults may be configured through the rollingUpdate field of the ClusterSpec.
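A cluster-wide default might be sketched as follows. This assumes the ClusterSpec field has the same name, rollingUpdate, as the instance group field; the percentage shown is only an example value, not a recommendation.

```yaml
# Fragment of a cluster manifest (ClusterSpec); assumed to act as the default
# for instance groups that do not set their own value.
spec:
  rollingUpdate:
    maxUnavailable: "10%"   # example value only
```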
maxUnavailable
The maxUnavailable field specifies the maximum number of nodes that can be unavailable during the rolling update. Increasing this setting allows more instances to be updated in parallel.
The value can be an absolute number (for example 5) or a percentage of the nodes in the group (for example “10%”). The absolute number is calculated from a percentage by rounding down.
For example, to permit two instances to be updated in parallel:
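A minimal sketch of such a setting, assuming it is placed under the rollingUpdate field of the instance group's spec as described earlier in this section:

```yaml
# Fragment of an instance group manifest (InstanceGroupSpec).
spec:
  rollingUpdate:
    maxUnavailable: 2   # up to two nodes may be unavailable at once
```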
If there are no instances that have been created with the current specification, then a rolling update will start with updating a single instance. It does this to limit the damage in case the new specification results in non-working nodes.
maxSurge
Surging is temporarily increasing the number of instances in an instance group during a rolling update. Instead of first draining and terminating an instance and then creating a new one, it effectively first creates a new instance and then drains and terminates the old one.
Surging is implemented by “detaching” instances, making them not count toward the desired number of instances in the instance group. This causes the cloud provider to create new instances in order to satisfy the group’s desired number. The detached instances are drained and terminated last; when they are terminated the cloud provider does not replace them.
The maxSurge field specifies the maximum number of extra instances that can be created during the update. Increasing this setting allows more instances to be updated in parallel. Rolling update will not create more new instances than the number of instances selected for update.
The value can be an absolute number (for example 5) or a percentage of the nodes in the group (for example “10%”). The absolute number is calculated from a percentage by rounding up.
Masters are unable to surge. Any cluster-wide default setting will be ignored for instance groups of role “Master”. Setting this value on the InstanceGroupSpec for an instance group of role “Master” will result in an API validation error.
For example, to add a maximum of two additional instances to the group during a rolling update, allowing two to be updated in parallel:
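A minimal sketch of such a setting, assuming it is placed under the rollingUpdate field of the instance group's spec and that the group is not of role “Master”:

```yaml
# Fragment of an instance group manifest (InstanceGroupSpec).
spec:
  rollingUpdate:
    maxSurge: 2   # up to two extra instances may be created during the update
```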
If there are no instances that have been created with the current specification, then rolling update will start with creating a single new instance. It does this to limit the damage in case the new specification results in non-working nodes. Once the new instance validates successfully, it then creates any remaining surge instances.
Disabling rolling updates
Rolling updates may be partially disabled for an instance group by setting the drainAndTerminate field to false.
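A sketch of that setting is shown below. It assumes that drainAndTerminate sits under rollingUpdate alongside the other strategy settings and that the disabling value is false.

```yaml
# Fragment of an instance group manifest (InstanceGroupSpec).
spec:
  rollingUpdate:
    drainAndTerminate: false   # assumed placement and value; see the note above
```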