Auto-scaling

    OpenFaaS ships with a single auto-scaling rule defined in the mounted configuration file for AlertManager. AlertManager reads usage (requests per second) metrics from Prometheus in order to know when to fire an alert to the API Gateway.

    The API Gateway handles AlertManager alerts through its route.

    The auto-scaling provided by this method can be disabled by either deleting the AlertManager deployment or by scaling the deployment to zero replicas.

    The AlertManager rules for Swarm can be viewed and altered as a configuration map.

    All calls made through the gateway, whether to a synchronous function /function/ route or via the asynchronous /async-function route, count towards this method of auto-scaling.

    The minimum (initial) and maximum replica count can be set at deployment time by adding a label to the function.

    • com.openfaas.scale.min - by default this is set to 1, which is also the lowest value and unrelated to scale-to-zero

    • com.openfaas.scale.factor - by default this is set to 20% and has to be a value between 0 and 100 (inclusive)
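
    As a rough illustration of the defaults and bounds listed above, the following Python sketch reads the minimum replica count and the scaling factor from a function's label map. The helper name read_scaling_labels and the choice to raise ValueError for out-of-range values are assumptions made for this example; the gateway may treat invalid labels differently.

```python
MIN_LABEL = "com.openfaas.scale.min"
FACTOR_LABEL = "com.openfaas.scale.factor"


def read_scaling_labels(labels):
    """Return (min_replicas, scale_factor) from a dict of function labels.

    Hypothetical helper: the defaults and bounds follow the list above,
    the error handling is an assumption made for this sketch.
    """
    # Labels are strings, so fall back to the documented defaults as strings.
    min_replicas = int(labels.get(MIN_LABEL, "1"))
    scale_factor = int(labels.get(FACTOR_LABEL, "20"))

    if min_replicas < 1:
        raise ValueError(f"{MIN_LABEL} must be at least 1")
    if not 0 <= scale_factor <= 100:
        raise ValueError(f"{FACTOR_LABEL} must be between 0 and 100")

    return min_replicas, scale_factor
```

    Calling read_scaling_labels({}) returns the defaults (1, 20), while a label map with com.openfaas.scale.factor set to "101" is rejected.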

    For each alert fired, the auto-scaler adds a number of replicas equal to a defined percentage of the max replicas. This percentage can be set using com.openfaas.scale.factor. For example, setting com.openfaas.scale.factor=100 will instantly scale to max replicas. This label lets you define the overall scaling behavior of the function.
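
    The arithmetic behind this rule can be sketched in a few lines of Python. The function name replicas_after_alert is hypothetical, and two details are assumptions rather than documented behavior: the step is rounded up to a whole replica, and the result is capped at the max replica count.

```python
import math


def replicas_after_alert(current, max_replicas, scale_factor):
    """Replica count after one fired alert.

    Assumptions for this sketch: the step is rounded up to a whole
    replica and the result never exceeds max_replicas.
    """
    # Step size is scale_factor percent of the maximum replica count.
    step = math.ceil(max_replicas * scale_factor / 100)
    return min(current + step, max_replicas)
```

    Under these assumptions, a function with 20 max replicas and a factor of 20 grows by 4 replicas per alert (1, 5, 9, 13, 17, 20), while a factor of 100 jumps from 1 straight to 20.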

    When using Kubernetes, the built-in Horizontal Pod Autoscaler (HPA) can be used instead of AlertManager.

    Scaling from zero is turned on by default for any function or endpoint, and this setting can be toggled on or off. Scaling to zero to recover idle resources is available in OpenFaaS, but is not turned on by default. There are two parts that make up scaling to zero (or zero-scale) in the project.

    For a technical overview see the blog post: .

    The latency between accepting a request for an unavailable function and serving the request is sometimes called a "Cold Start".

    • What exactly happens in a "cold start"?

    The "Cold Start" consists of the following: creating a request to schedule a container on a node, finding a suitable node, pulling the Docker image, and running the initial checks once the container is up and running. This "running" or "ready" state also has to be synchronised between all nodes in the cluster. The total cold-start time can be reduced by pre-pulling images on each node and by setting the Kubernetes Liveness and Readiness Probes to run at a faster cadence.

    Instructions for optimizing for a low cold-start are provided in the helm chart for Kubernetes.

    When scale_from_zero is enabled, a cache is maintained in memory indicating the readiness of each function. If a function is not ready when a request is received, the HTTP connection is blocked, the function is scaled to min replicas, and as soon as a replica is available the request is proxied through as normal. You will see this process taking place in the logs of the gateway component.
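
    The request path described above can be sketched as follows. This is a simplified model, not the gateway's implementation: ReadinessCache, handle_request, and the three callables (scale_to_min, replica_available, proxy) are hypothetical names, and the timeout and polling interval are assumptions added so the example terminates.

```python
import time


class ReadinessCache:
    """In-memory record of which functions are ready (hypothetical)."""

    def __init__(self):
        self._ready = {}

    def is_ready(self, name):
        return self._ready.get(name, False)

    def mark_ready(self, name):
        self._ready[name] = True


def handle_request(name, cache, scale_to_min, replica_available, proxy,
                   timeout=30.0, poll_interval=0.1):
    """Serve a request, scaling the function up from zero if needed.

    scale_to_min, replica_available and proxy are caller-supplied
    callables standing in for the real scaling, health and proxy logic.
    """
    if not cache.is_ready(name):
        scale_to_min(name)  # request min replicas for the function
        deadline = time.monotonic() + timeout
        # The caller's connection stays blocked while we wait here.
        while not replica_available(name):
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{name} did not become ready in time")
            time.sleep(poll_interval)
        cache.mark_ready(name)
    # A replica is available, so forward the request as normal.
    return proxy(name)
```

    In this sketch only the first request to a cold function pays for the scale-up; later requests find the function marked ready in the cache and are proxied immediately.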

    Scaling down to zero replicas is also called "idling".

    There are two approaches available for idling functions:

    1) faas-idler

    You can use the faas-idler project, which is currently available from the openfaas-incubator organisation. faas-idler allows some basic presets to be configured and then monitors the built-in Prometheus metrics on a regular basis to determine if a function should be scaled to zero. Only functions with a label of com.openfaas.scale.zero=true are scaled to zero; all others are ignored. Functions are scaled to zero through the OpenFaaS REST API.

    The faas-idler is deployed by default with Kubernetes and Swarm, but runs in a "dryRun" mode. To have it make actual changes to the functions, set "dryRun" to "false".
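
    The selection logic described for faas-idler can be modelled roughly as below. This is a sketch and not the project's code: the function names are hypothetical, the idle criterion (no invocations in the observed window) is an assumption about how the Prometheus metrics are interpreted, and scale_to_zero stands in for the call to the OpenFaaS REST API.

```python
SCALE_ZERO_LABEL = "com.openfaas.scale.zero"


def select_idle_functions(functions, recent_invocations):
    """Return names of opted-in functions with no recent invocations.

    functions: dict of function name -> dict of labels
    recent_invocations: dict of function name -> invocation count
    """
    idle = []
    for name, labels in functions.items():
        if labels.get(SCALE_ZERO_LABEL) != "true":
            continue  # not opted in, so the function is ignored
        if recent_invocations.get(name, 0) == 0:
            idle.append(name)
    return idle


def run_idler_pass(functions, recent_invocations, scale_to_zero, dry_run=True):
    """Run one pass; scale_to_zero is only called when dry_run is False."""
    idle = select_idle_functions(functions, recent_invocations)
    for name in idle:
        if dry_run:
            print(f"dry run: would scale {name} to zero")
        else:
            scale_to_zero(name)
    return idle
```

    With dry_run left at its default of True, a pass only reports which functions it would idle, which mirrors the default "dryRun" deployment described above.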

    2) OpenFaaS REST API