Pod Scheduling Readiness

    Pods used to be considered ready for scheduling as soon as they were created. The Kubernetes scheduler does its due diligence to find nodes for all pending Pods. In practice, however, some Pods may remain in a “missing essential resources” state for a long time. Such Pods churn the scheduler (and downstream integrators such as Cluster Autoscaler) unnecessarily.

    By specifying or removing a Pod’s .spec.schedulingGates, you can control when the Pod is ready to be considered for scheduling.

    The schedulingGates field contains a list of entries, each identifying a gate by a name string. Each gate is treated as a criterion that must be satisfied before the Pod is considered schedulable. This field can be initialized only when a Pod is created (either by the client, or by a mutation during admission). After creation, each scheduling gate can be removed in arbitrary order, but adding a new scheduling gate is disallowed.
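
    The lifecycle rules above can be modeled in a few lines. The following is a minimal Python sketch, not Kubernetes code: the function names gate_update_allowed and is_ready_for_scheduling are hypothetical, gates are represented as plain name strings, and example.com/baz is an invented gate name. It assumes that "addition is disallowed" means every gate in the updated list must already be present in the old list.

```python
def gate_update_allowed(old_gates, new_gates, creating=False):
    """Return True if changing old_gates to new_gates follows the rules above.

    Hypothetical helper for illustration; gates are plain name strings.
    """
    if creating:
        # Gates may be initialized freely while the Pod is being created.
        return True
    # After creation, gates may only be removed (in any order), never added.
    return set(new_gates) <= set(old_gates)


def is_ready_for_scheduling(gates):
    """A Pod is considered for scheduling only once no gates remain."""
    return len(gates) == 0


gates = ["example.com/foo", "example.com/bar"]
print(gate_update_allowed(gates, ["example.com/bar"]))          # removal: True
print(gate_update_allowed(gates, gates + ["example.com/baz"]))  # addition: False
print(is_ready_for_scheduling([]))                               # True
```

    Comparing sets rather than lists reflects that removal order does not matter.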

    Figure. Pod SchedulingGates

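    To mark a Pod as not ready for scheduling, create it with one or more scheduling gates. The example here uses a Pod named test-pod whose .spec.schedulingGates lists two gates, example.com/foo and example.com/bar.

    After the Pod is created, check its state:

        kubectl get pod test-pod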

    The output shows that the Pod is in the SchedulingGated state:

        NAME READY STATUS RESTARTS AGE

    You can also check its schedulingGates field by running:

        kubectl get pod test-pod -o jsonpath='{.spec.schedulingGates}'

    The output is:

        [{"name":"example.com/foo"},{"name":"example.com/bar"}]

    To inform the scheduler that this Pod is ready for scheduling, remove its schedulingGates entirely by re-applying a modified manifest:

    pods/pod-without-scheduling-gates.yaml

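        apiVersion: v1
        kind: Pod
        metadata:
          name: test-pod
        spec:
          # No schedulingGates field: the Pod is ready to be scheduled.
          containers:
          - name: pause   # a container name is required; "pause" is assumed here
            image: registry.k8s.io/pause:3.6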

    You can check whether the schedulingGates field has been cleared by running:

        kubectl get pod test-pod -o jsonpath='{.spec.schedulingGates}'

    The output is expected to be empty. You can then check the Pod's latest status by running:

        kubectl get pod test-pod -o wide

        NAME       READY   STATUS    RESTARTS   AGE   IP         NODE
        test-pod   1/1     Running   0          15s   10.0.0.4   node-2

    The metric scheduler_pending_pods has a queue label whose value "gated" distinguishes Pods that are explicitly marked as not ready for scheduling from Pods that the scheduler tried to place but found unschedulable. You can use scheduler_pending_pods{queue="gated"} to check the metric result.
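
    To see how the gated count can be separated from the other pending Pods, here is a small Python sketch that reads metrics in the Prometheus text exposition format. The sample text and its values are invented for illustration, and the helper pending_pods_by_queue is hypothetical. It assumes the metric carries a queue label with gated as one of its values.

```python
import re

# Invented sample in Prometheus text format; the numbers are not real output.
SAMPLE_METRICS = """\
# TYPE scheduler_pending_pods gauge
scheduler_pending_pods{queue="active"} 0
scheduler_pending_pods{queue="backoff"} 0
scheduler_pending_pods{queue="gated"} 1
scheduler_pending_pods{queue="unschedulable"} 2
"""

# Matches one sample line and captures the queue label value and the number.
LINE = re.compile(r'^scheduler_pending_pods\{queue="([^"]+)"\}\s+(\S+)$')


def pending_pods_by_queue(text):
    """Map each queue label value to its pending Pod count."""
    counts = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            counts[match.group(1)] = int(float(match.group(2)))
    return counts


counts = pending_pods_by_queue(SAMPLE_METRICS)
print(counts["gated"])          # Pods explicitly held back by scheduling gates
print(counts["unschedulable"])  # Pods the scheduler tried but could not place
```

    In a real cluster the text would come from the scheduler's metrics endpoint; how to reach that endpoint depends on the deployment and is outside this sketch.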

    FEATURE STATE: Kubernetes v1.27 [beta]

    You can mutate the scheduling directives of Pods while they have scheduling gates, with certain constraints. At a high level, you can only tighten the scheduling directives of a Pod. In other words, the updated directives must restrict the Pod to a subset of the nodes that it previously matched. More concretely, the rules for updating a Pod’s scheduling directives are as follows:

    1. For .spec.nodeSelector, only additions are allowed. If it is absent, it may be set.

    2. For .spec.affinity.nodeAffinity, if it is nil, then setting anything is allowed.

    3. If NodeSelectorTerms was empty, it may be set. If it was not empty, then only additions of NodeSelectorRequirements to matchExpressions or matchFields are allowed, and existing matchExpressions and matchFields entries must not change. This is because the terms in .requiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms are ORed, while the expressions in nodeSelectorTerms[].matchExpressions and nodeSelectorTerms[].matchFields are ANDed: adding a term could widen the set of matching nodes, whereas adding an expression can only narrow it.
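
    The rules listed so far can be sketched as two checks. This is an illustrative Python model, not the API server's validation code: the function names are hypothetical, Pods are reduced to plain dictionaries and lists, and matchFields is used as the API field that holds the field expressions. The sketch also assumes that the number of terms must stay the same once terms exist, and that existing expressions must remain in place as an unchanged prefix. The label keys zone and disk are invented.

```python
def node_selector_update_allowed(old, new):
    """.spec.nodeSelector: existing pairs must stay, new keys may be added."""
    old = old or {}
    new = new or {}
    return all(new.get(key) == value for key, value in old.items())


def _extends(old_exprs, new_exprs):
    """True if new_exprs keeps every old expression unchanged and only appends."""
    old_exprs = old_exprs or []
    new_exprs = new_exprs or []
    return new_exprs[:len(old_exprs)] == old_exprs


def required_terms_update_allowed(old_terms, new_terms):
    """nodeSelectorTerms under requiredDuringSchedulingIgnoredDuringExecution."""
    if not old_terms:
        # Empty or unset terms may be set to anything.
        return True
    if len(new_terms or []) != len(old_terms):
        # Terms are ORed, so an added term could widen the match. Rejecting
        # removed terms as well is an assumption of this sketch.
        return False
    return all(
        _extends(old.get("matchExpressions"), new.get("matchExpressions"))
        and _extends(old.get("matchFields"), new.get("matchFields"))
        for old, new in zip(old_terms, new_terms)
    )


zone_a = {"key": "zone", "operator": "In", "values": ["a"]}
zone_b = {"key": "zone", "operator": "In", "values": ["b"]}
ssd = {"key": "disk", "operator": "In", "values": ["ssd"]}

old_terms = [{"matchExpressions": [zone_a]}]
tighter = [{"matchExpressions": [zone_a, ssd]}]             # extra ANDed expression
wider = [{"matchExpressions": [zone_a]}, {"matchExpressions": [zone_b]}]  # extra ORed term

print(node_selector_update_allowed({"zone": "a"}, {"zone": "a", "disk": "ssd"}))  # True
print(node_selector_update_allowed({"zone": "a"}, {"zone": "b"}))                 # False
print(required_terms_update_allowed(old_terms, tighter))                          # True
print(required_terms_update_allowed(old_terms, wider))                            # False
```

    Both checks express the same idea: an allowed update can only add conditions that are ANDed with the existing ones, so the set of matching nodes can shrink but never grow.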