Using sysctls in a Kubernetes Cluster

    This document describes how to configure and use kernel parameters within a Kubernetes cluster using the sysctl interface.

    Note: Starting with Kubernetes 1.23, the kubelet accepts either / or . as the separator in sysctl names. Starting with Kubernetes 1.25, sysctls set for a Pod can also be written with slashes. For example, the same sysctl can be written as kernel.shm_rmid_forced, with a period as the separator, or as kernel/shm_rmid_forced, with a slash. For details on how sysctl names are converted, refer to the page from the Linux man-pages project.

    You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one or use a Kubernetes playground.

    For some steps, you also need to be able to reconfigure the command line options for the kubelets running on your cluster.

    In Linux, the sysctl interface allows an administrator to modify kernel parameters at runtime. Parameters are available via the /proc/sys/ virtual process file system. The parameters cover various subsystems such as:

    • kernel (common prefix: kernel.)
    • virtual memory (common prefix: vm.)
    • MDADM (common prefix: )
    • More subsystems are described in .

    To get a list of all parameters, you can run sysctl -a as root.
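
    The sketch below lists parameters two ways, assuming a Linux host. The first uses the sysctl utility, if it is installed. The second walks /proc/sys/ directly and turns each file path into a dotted name. The path_to_name helper is hypothetical and does not handle path components that themselves contain dots.

```shell
# Option 1: ask the sysctl utility for every parameter as "name = value".
# Run as root (for example with sudo) to include restricted parameters.
# Errors for unreadable entries, or a missing utility, are ignored here.
sysctl -a 2>/dev/null || true

# Option 2: walk /proc/sys directly. Each file is one parameter, and the
# path below /proc/sys/ becomes the name once slashes are turned into dots.
path_to_name() {
  printf '%s\n' "$1" | sed -e 's|^/proc/sys/||' -e 's|/|.|g'
}

find /proc/sys -type f 2>/dev/null | while read -r path; do
  path_to_name "$path"
done
```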

    Sysctls are grouped into safe and unsafe sysctls. In addition to proper namespacing, a safe sysctl must be properly isolated between pods on the same node. This means that setting a safe sysctl for one pod

    • must not have any influence on any other pod on the node
    • must not allow harm to the node’s health
    • must not allow the pod to gain CPU or memory resources outside of its resource limits.

    The following sysctls are in the safe set:

    • kernel.shm_rmid_forced,
    • net.ipv4.tcp_syncookies,
    • net.ipv4.ping_group_range (since Kubernetes 1.18),
    • net.ipv4.ip_unprivileged_port_start (since Kubernetes 1.22).

    Note: The example net.ipv4.tcp_syncookies is not namespaced on Linux kernel version 4.4 or lower.

    This list will be extended in future Kubernetes versions when the kubelet supports better isolation mechanisms.

    All safe sysctls are enabled by default.
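
    As an illustration, a sysctl name can be checked against the safe set with a small shell helper before it goes into a manifest. The is_safe_sysctl function is hypothetical, and its list is copied from the safe set above, so it would need updating as Kubernetes versions change.

```shell
# Hypothetical helper: succeed if the given sysctl is in the safe set.
# The names are copied from this document, not queried from a cluster.
is_safe_sysctl() {
  case "$1" in
    kernel.shm_rmid_forced|net.ipv4.tcp_syncookies|net.ipv4.ping_group_range|net.ipv4.ip_unprivileged_port_start)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

for name in kernel.shm_rmid_forced net.core.somaxconn; do
  if is_safe_sysctl "$name"; then
    echo "$name: in the safe set"
  else
    echo "$name: not in the safe set"
  fi
done
```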

    All unsafe sysctls are disabled by default and must be allowed manually by the cluster admin on a per-node basis. Pods with disabled unsafe sysctls will be scheduled, but will fail to launch.

    With the risks of unsafe sysctls in mind, the cluster admin can allow certain unsafe sysctls for very special situations such as high-performance or real-time application tuning. Unsafe sysctls are enabled on a node-by-node basis with a flag of the kubelet.
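
    The kubelet flag in question is --allowed-unsafe-sysctls, which takes a comma-separated list of sysctl names or patterns ending in *. The sketch below uses kernel.msg* and net.core.somaxconn as example values. It prints the command rather than running it, since a real kubelet invocation carries other options that are not shown here.

```shell
# Comma-separated sysctl names or patterns to allow on this node (example values).
# Single quotes keep the shell from expanding the * wildcard.
ALLOWED_UNSAFE='kernel.msg*,net.core.somaxconn'

# Print the kubelet invocation instead of running it. On a real node this flag
# is passed alongside the node's other kubelet options, elided here as "...".
echo "kubelet --allowed-unsafe-sysctls '${ALLOWED_UNSAFE}' ..."
```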

    For , this can be done via the extra-config flag:

    Only namespaced sysctls can be enabled this way.

    The following sysctls are known to be namespaced. This list could change in future versions of the Linux kernel.

    • kernel.shm*,
    • ,
    • kernel.sem,
    • fs.mqueue.*,

    Sysctls with no namespace are called node-level sysctls. If you need to set them, you must manually configure them on each node’s operating system, or by using a DaemonSet with privileged containers.
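
    For the DaemonSet route, a rough sketch follows. Everything specific in it is an assumption made for illustration: the object and file names, the busybox and pause images, and the choice of vm.max_map_count with the value 262144. A privileged init container writes the node-level sysctl once on each node, and a pause container keeps the Pod running afterwards.

```shell
# Write an illustrative DaemonSet manifest; all names and values are examples.
cat > node-sysctl-daemonset.yaml <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-sysctl
spec:
  selector:
    matchLabels:
      app: node-sysctl
  template:
    metadata:
      labels:
        app: node-sysctl
    spec:
      initContainers:
      - name: set-sysctl
        image: busybox
        securityContext:
          privileged: true   # needed to write node-level sysctls
        command: ["sysctl", "-w", "vm.max_map_count=262144"]
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
EOF

# Then apply it to the cluster:
# kubectl apply -f node-sysctl-daemonset.yaml
```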

    Use the pod securityContext to configure namespaced sysctls. The securityContext applies to all containers in the same pod.

    This example uses the pod securityContext to set a safe sysctl kernel.shm_rmid_forced and two unsafe sysctls net.core.somaxconn and kernel.msgmax. There is no distinction between safe and unsafe sysctls in the specification.
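
    A minimal sketch of such a Pod spec, written to a file from the shell. The Pod name, container, image, file name, and sysctl values are illustrative assumptions. The two unsafe sysctls only take effect on a node where they have been allowed.

```shell
# Write an example Pod manifest; names and values below are illustrative only.
# Sysctl values are strings, so they are quoted.
cat > sysctl-example.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example
spec:
  securityContext:
    sysctls:
    - name: kernel.shm_rmid_forced   # safe
      value: "0"
    - name: net.core.somaxconn       # unsafe
      value: "1024"
    - name: kernel.msgmax            # unsafe
      value: "65536"
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
EOF

# Then create the Pod on a cluster whose target node allows both unsafe sysctls:
# kubectl apply -f sysctl-example.yaml
```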

    Warning: Only modify sysctl parameters after you understand their effects, to avoid destabilizing your operating system.

    Warning: Because they are unsafe, unsafe sysctls are used at your own risk and can lead to severe problems such as misbehaving containers, resource shortages, or complete breakage of a node.

    It is good practice to treat nodes with special sysctl settings as tainted within a cluster, and to schedule onto them only the pods that need those settings. Kubernetes taints and tolerations are the suggested way to implement this.
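
    One way to sketch this is with a node taint and a matching toleration. The node name node-1, the taint key special-sysctls, the Pod name, the image, and the file name are all hypothetical.

```shell
# Taint the node so that only Pods tolerating the taint are scheduled there
# (hypothetical node name and taint key; run against a real cluster):
# kubectl taint nodes node-1 special-sysctls=true:NoSchedule

# A Pod that needs the node's sysctl settings declares a matching toleration.
cat > tolerating-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-tolerating-pod
spec:
  tolerations:
  - key: "special-sysctls"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
EOF

# kubectl apply -f tolerating-pod.yaml
```

    A toleration only permits scheduling onto the tainted node. If the Pod must land there, pair it with a node selector or node affinity.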