WorkloadSpread
WorkloadSpread can distribute the Pods of a workload to different types of Nodes according to some policies, which gives a single workload the ability to do multi-domain deployment and elastic deployment.
Some common policies include:
- fault-tolerant spread (for example, spread evenly among hosts, AZs, etc.)
- spread according to a specified ratio (for example, deploy Pods to several specified AZs in proportion)
- subset management with priority, such as
  - deploy Pods to ECS first, and then deploy to ECI when its resources are insufficient.
  - deploy a fixed number of Pods to ECS first, and deploy the rest of the Pods to ECI.
- subset management with customization, such as
  - control how many Pods of a workload are deployed on different CPU architectures
  - enable Pods on different CPU architectures to have different resource requirements
The feature of WorkloadSpread is similar to UnitedDeployment in the OpenKruise community. Each WorkloadSpread defines multiple domains, each called a subset. Each subset may limit the number of Pod replicas it runs through maxReplicas. WorkloadSpread injects the domain configuration into the Pod through a webhook, and it also controls the order of scale in and scale out.
In Kruise versions lower than 1.3.0, WorkloadSpread supports CloneSet, Deployment, and ReplicaSet.
Since Kruise 1.3.0, WorkloadSpread also supports StatefulSet.
In particular, for StatefulSet, WorkloadSpread manages its subsets only when scaling up. The order of scale down is still controlled by the StatefulSet controller. The subset management of StatefulSet is based on the ordinals of Pods, and more details can be found here.
targetRef: specifies the target workload. It cannot be mutated, and one workload can correspond to only one WorkloadSpread.
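As a minimal sketch of this field, the fragment below points a WorkloadSpread at a native Deployment instead of a CloneSet. The workload name workload-xxx is a placeholder, and the rest of the spec is omitted.

```yaml
# Fragment of a WorkloadSpread spec; only targetRef is shown.
spec:
  targetRef:
    apiVersion: apps/v1      # API group/version of the target workload
    kind: Deployment         # CloneSet, Deployment, ReplicaSet, or StatefulSet
    name: workload-xxx       # placeholder workload name
```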
subsets
subsets consists of multiple domains, each called a subset, and each subset has its own configuration.
- name: the name of the subset. It must be unique within a WorkloadSpread and represents a topology.
- maxReplicas: the replicas limit of the subset. It must be an integer >= 0. There is no replicas limit when maxReplicas is nil.
- requiredNodeSelectorTerm: a hard match on the zone.
- preferredNodeSelectorTerms: a soft match on the zone.
Caution: requiredNodeSelectorTerm corresponds to requiredDuringSchedulingIgnoredDuringExecution of nodeAffinity, and preferredNodeSelectorTerms corresponds to preferredDuringSchedulingIgnoredDuringExecution of nodeAffinity.
- tolerations: the tolerations of the Pods in the subset.
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
- patch: customizes the Pod configuration of the subset, such as annotations, labels, and env.
Example:
# patch pod with a topology label:
patch:
  metadata:
    labels:
      topology.application.deploy/zone: "zone-a"

# patch pod container env with a zone name:
patch:
  spec:
    containers:
    - name: main
      env:
      - name: K8S_AZ_NAME
        value: zone-a
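To show how the sub-fields above fit together, here is a sketch of a single subset entry that uses all of them. The zone value, the preferred node label key and value, the weight, and the maxReplicas number are illustrative assumptions, not values from a real cluster.

```yaml
# One entry of spec.subsets combining every sub-field described above.
subsets:
- name: subset-a
  maxReplicas: 3                      # at most 3 Pods in this subset
  requiredNodeSelectorTerm:           # hard match (required node affinity)
    matchExpressions:
    - key: topology.kubernetes.io/zone
      operator: In
      values:
      - zone-a
  preferredNodeSelectorTerms:         # soft match (preferred node affinity)
  - weight: 50
    preference:
      matchExpressions:
      - key: another-node-label-key   # hypothetical node label
        operator: In
        values:
        - another-node-label-value
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"
  patch:                              # merged into Pods placed in this subset
    metadata:
      labels:
        topology.application.deploy/zone: "zone-a"
```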
WorkloadSpread provides two kinds of schedule strategies; the default strategy is Fixed.
scheduleStrategy:
  type: Adaptive | Fixed
  adaptive:
    rescheduleCriticalSeconds: 30
- Fixed: The workload is strictly spread according to the definition of the subsets.
- Adaptive: Reschedule: Kruise checks the unschedulable Pods of each subset. If a Pod stays unschedulable for longer than the defined duration (rescheduleCriticalSeconds), it is rescheduled to another subset.
Requirements
WorkloadSpread is disabled by default. You have to enable the WorkloadSpread feature-gate when installing or upgrading Kruise (see feature-gates below).
Pod Webhook
If the PodWebhook feature-gate is set to false, WorkloadSpread will also be disabled.
Deletion cost
CloneSet has supported the deletion-cost feature in the latest versions.
Other native workloads need Kubernetes version >= 1.21. (In 1.21, users need to enable the PodDeletionCost feature-gate; since 1.22 it is enabled by default.)
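For context, Kubernetes expresses deletion cost through the controller.kubernetes.io/pod-deletion-cost annotation on a Pod, and Pods with a lower cost are preferred for deletion when a ReplicaSet scales down. The fragment below only sketches what such an annotation looks like. The Pod name, image, and cost value are made up for illustration, and the values WorkloadSpread actually writes are not specified in this document.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: workload-xxx-abcde            # hypothetical Pod name
  annotations:
    # Lower cost means the Pod is deleted earlier on scale-in.
    controller.kubernetes.io/pod-deletion-cost: "100"   # illustrative value
spec:
  containers:
  - name: main
    image: nginx:alpine               # placeholder image
```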
The workload managed by WorkloadSpread scales according to the order defined in spec.subsets.
The order of the subsets in spec.subsets can be changed, which adjusts the scale order of the workload.
Scale out
- Pods are scheduled in the subset order defined in spec.subsets. Once the number of replicas in a subset reaches its maxReplicas, new Pods are scheduled to the next subset.
Scale in
- When the number of replicas in a subset is greater than its maxReplicas, the extra Pods are removed with high priority.
- According to the subset order in spec.subsets, the Pods of subsets at the back are deleted before the Pods of subsets at the front.
#               subset-a   subset-b   subset-c
# maxReplicas   10         10         nil
# pods number   10         10         10
# deletion order: c -> b -> a

#               subset-a   subset-b   subset-c
# maxReplicas   10         10         nil
# pods number   20         20         20
# deletion order: b -> a -> c
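A spec.subsets list matching the two cases above might look like the following sketch. Node selectors are left out for brevity, which is an assumption made only to keep the fragment short; leaving maxReplicas unset on subset-c is what makes it nil (no limit).

```yaml
# Order matters: scale-out fills a, then b, then c;
# scale-in removes from the back of the list first.
subsets:
- name: subset-a
  maxReplicas: 10
- name: subset-b
  maxReplicas: 10
- name: subset-c      # no maxReplicas: takes all remaining Pods
```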
feature-gates
The WorkloadSpread feature is turned off by default. To turn it on, set the WorkloadSpread feature-gate.
$ helm install kruise https://... --set featureGates="WorkloadSpread=true"
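The same setting can also be kept in a Helm values file instead of a --set flag. A minimal sketch, assuming a file named values.yaml (a hypothetical name) that is passed to helm with the -f option:

```yaml
# values.yaml (hypothetical file name)
# Equivalent to: --set featureGates="WorkloadSpread=true"
featureGates: "WorkloadSpread=true"
```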
Elastic deployment
zone-a (ACK) holds 100 Pods, and zone-b (ECI), as an elastic zone, holds the additional Pods.
- Create a WorkloadSpread instance.
- Create a corresponding workload; the number of replicas can be adjusted freely.
Effect
- When the number of replicas <= 100, the Pods are scheduled in the ACK zone.
- When the number of replicas > 100, 100 Pods are in the ACK zone, and the extra Pods are scheduled in the ECI zone.
- The Pods in the ECI elastic zone are removed first when scaling in.
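The elastic deployment described above could be expressed with a WorkloadSpread along the following lines. This is a sketch under assumptions: the resource name, namespace, target workload name, subset names, and zone label values are placeholders, and the node label used to tell the ACK and ECI zones apart depends on the cluster.

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
  name: ws-elastic-demo        # placeholder name
  namespace: deploy
spec:
  targetRef:
    apiVersion: apps.kruise.io/v1alpha1
    kind: CloneSet
    name: workload-xxx         # placeholder workload
  subsets:
  - name: ack                  # fixed zone, filled first
    requiredNodeSelectorTerm:
      matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - zone-a
    maxReplicas: 100           # the first 100 Pods land here
  - name: eci                  # elastic zone, listed last so it scales in first
    requiredNodeSelectorTerm:
      matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - zone-b
    # maxReplicas omitted: no limit on the extra Pods
```

Because the elastic subset comes last in spec.subsets, it receives Pods only after the first subset is full and loses them first on scale-in.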
Deploy 100 Pods to each of two zones (zone-a, zone-b).
apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
  name: ws-demo
  namespace: deploy
spec:
  targetRef:
    apiVersion: apps.kruise.io/v1alpha1
    kind: CloneSet
    name: workload-xxx
  subsets:
  - name: subset-a
    requiredNodeSelectorTerm:
      matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - zone-a
    maxReplicas: 100
    patch:
      metadata:
        labels:
          deploy/zone: zone-a
  - name: subset-b
    requiredNodeSelectorTerm:
      matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - zone-b
    maxReplicas: 100
    patch:
      metadata:
        labels:
          deploy/zone: zone-b
If the spread across zones needs to be changed, first adjust the maxReplicas of the subsets, and then change the replicas of the workload.