Operator Best Practices
Considerations for Operator developers:
An Operator should manage a single type of application, essentially following the UNIX principle: do one thing and do it well.
If an application consists of multiple tiers or components, multiple Operators should be written, one for each of them. For example, if the application consists of Redis, AMQ and MySQL, there should be three Operators, not one.
If there is significant orchestration and sequencing involved, an Operator should be written that represents the entire stack, in turn delegating to other Operators for orchestrating their part of it.
Operators should own a CRD, and only one Operator should control a CRD on a cluster. Having two Operators manage the same CRD is not recommended. Where an API exists with multiple implementations, the shared API is typically defined by a no-op Operator: one that has no deployment or reconciliation loop and exists only to define the API. Other Operators depend on this Operator and each provides one implementation of the API, similar to PVCs or Ingress.
Inside an Operator, multiple controllers should be used if multiple CRDs are managed. This helps in separation of concerns and code readability. Note that this doesn’t necessarily mean that we need to have one container image per controller, but rather one reconciliation loop (which could be running as part of the same Operator binary) per CRD.
An Operator shouldn’t deploy or manage other Operators (such patterns are known as meta or super Operators), nor should it include CRDs in its Operands. It’s the Operator Lifecycle Manager’s job to manage the deployment and lifecycle of Operators. For further information check Dependency Resolution.
If multiple Operators are packaged and shipped as a single entity (by the same CSV, for example), it is recommended to add all owned and required CRDs, as well as all deployments for Operators that manage the owned CRDs, to that CSV.
Writing an Operator involves using the Kubernetes API, which in most scenarios means writing the same boilerplate code. Use a framework like the Operator SDK to save yourself time and to get a suite of tooling that eases development and testing.
Operators shouldn’t hard-code the namespaces they are watching. This should be configurable; supplying no namespace is interpreted as watching all namespaces.
Semantic versioning (aka semver) should be used to version an Operator. Operators are long-running workloads on the cluster, and their APIs may need support over a long period of time. Use the semver guidelines to help determine when and how to bump versions when there are breaking or non-breaking changes.
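As a minimal sketch of a configurable watch scope, the Go snippet below reads a comma-separated list of namespaces from an environment variable and treats an empty value as "watch all namespaces". The variable name `WATCH_NAMESPACE` and the function `parseWatchNamespaces` are assumptions made for this illustration, not something this document prescribes.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// watchNamespaceEnv is an assumed variable name chosen for this sketch.
const watchNamespaceEnv = "WATCH_NAMESPACE"

// parseWatchNamespaces splits a comma-separated list into namespace names.
// An empty or whitespace-only value yields an empty slice, which the caller
// interprets as "watch all namespaces".
func parseWatchNamespaces(raw string) []string {
	namespaces := []string{}
	for _, ns := range strings.Split(raw, ",") {
		if ns = strings.TrimSpace(ns); ns != "" {
			namespaces = append(namespaces, ns)
		}
	}
	return namespaces
}

func main() {
	namespaces := parseWatchNamespaces(os.Getenv(watchNamespaceEnv))
	if len(namespaces) == 0 {
		fmt.Println("watching all namespaces")
		return
	}
	fmt.Printf("watching namespaces: %v\n", namespaces)
}
```

Keeping the parsing in one small function means the watch scope is decided at deployment time rather than compiled into the Operator.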
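The bump rule can be sketched in Go as follows. The helper `bumpVersion` and the `changeKind` type are hypothetical names for this example. It handles only plain `MAJOR.MINOR.PATCH` strings and ignores pre-release and build metadata, so treat it as an illustration of the rule rather than a complete semver implementation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// changeKind classifies a release by its most significant change.
type changeKind int

const (
	patchChange changeKind = iota // backward-compatible bug fix
	minorChange                   // backward-compatible new functionality
	majorChange                   // breaking change
)

// bumpVersion returns the next version for the given kind of change.
// It accepts an optional leading "v" and returns the version without it.
func bumpVersion(current string, kind changeKind) (string, error) {
	parts := strings.Split(strings.TrimPrefix(current, "v"), ".")
	if len(parts) != 3 {
		return "", fmt.Errorf("expected MAJOR.MINOR.PATCH, got %q", current)
	}
	nums := make([]int, 3)
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return "", fmt.Errorf("invalid component %q in %q", p, current)
		}
		nums[i] = n
	}
	switch kind {
	case majorChange:
		nums[0], nums[1], nums[2] = nums[0]+1, 0, 0
	case minorChange:
		nums[1], nums[2] = nums[1]+1, 0
	default:
		nums[2]++
	}
	return fmt.Sprintf("%d.%d.%d", nums[0], nums[1], nums[2]), nil
}

func main() {
	for _, kind := range []changeKind{patchChange, minorChange, majorChange} {
		next, err := bumpVersion("1.4.2", kind)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(next)
	}
}
```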
Kubernetes API versioning guidelines should be used to version Operator CRDs. Use the Kubernetes sig-architecture guidelines to get best practices on when to bump versions and when breaking changes are acceptable.
Operators should be instrumented to provide useful, actionable metrics to external systems (e.g. monitoring/alerting platforms). Minimally, metrics should represent the software’s health and key performance indicators, as well as support the creation of service level indicators such as throughput, latency, availability, errors, capacity, etc.
Operators may create objects as part of their operational duty. Object accumulation can consume unnecessary resources, slow down the API and clutter the user interface. As such it is important for operators to keep good hygiene and to clean up resources when they are not needed. Here are instructions on how to handle cleanup on deletion.
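One common way to clean up on deletion in Kubernetes is a finalizer: the Operator marks the resource, performs cleanup when deletion is requested, and only then releases it. The sketch below models that flow with plain Go structs instead of real API types. The `resource` type, the `Deleting` flag (standing in for a set deletion timestamp), and the finalizer name `example.com/cleanup` are all assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// cleanupFinalizer is a hypothetical finalizer name.
const cleanupFinalizer = "example.com/cleanup"

// errUnavailable simulates a failing cleanup call in the demo below.
var errUnavailable = errors.New("backend unavailable")

// resource is a simplified stand-in for a custom resource's metadata.
type resource struct {
	Name       string
	Deleting   bool // stands in for a set deletion timestamp
	Finalizers []string
}

func hasFinalizer(r *resource, name string) bool {
	for _, f := range r.Finalizers {
		if f == name {
			return true
		}
	}
	return false
}

func removeFinalizer(r *resource, name string) {
	kept := r.Finalizers[:0]
	for _, f := range r.Finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	r.Finalizers = kept
}

// reconcile registers the finalizer on live resources and, once deletion is
// requested, runs cleanup before releasing the resource. If cleanup fails,
// the finalizer stays in place so a later reconcile can retry.
func reconcile(r *resource, cleanup func(owner string) error) error {
	if !r.Deleting {
		if !hasFinalizer(r, cleanupFinalizer) {
			r.Finalizers = append(r.Finalizers, cleanupFinalizer)
		}
		return nil
	}
	if !hasFinalizer(r, cleanupFinalizer) {
		return nil // cleanup already done
	}
	if err := cleanup(r.Name); err != nil {
		return fmt.Errorf("cleanup for %s failed, finalizer kept: %w", r.Name, err)
	}
	removeFinalizer(r, cleanupFinalizer)
	return nil
}

func main() {
	r := &resource{Name: "cache-1"}
	_ = reconcile(r, nil) // live resource: only registers the finalizer
	r.Deleting = true
	fmt.Println(reconcile(r, func(string) error { return errUnavailable }))
	fmt.Println(reconcile(r, func(string) error { return nil }), r.Finalizers)
}
```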
Summary:
- One Operator per managed application
- Multiple operators should be used for complex, multi-tier application stacks
- A CRD can only be owned by a single Operator; shared CRDs should be owned by a separate Operator
- One controller per custom resource definition
- Use a tool like Operator SDK
- Do not hard-code namespaces or resource names
- Make watch namespace configurable
- Use semver / observe Kubernetes guidelines on versioning APIs
- Use OpenAPI spec with structural schema on CRDs
- Operators expose metrics to external systems
- Operators clean up resources on deletion
Running On-Cluster
Considerations for on-cluster behavior
Like all containers on Kubernetes, Operators should not run as root unless absolutely necessary. Operators should come with their own ServiceAccount and not rely on the default ServiceAccount.
Operators should not self-register their CRDs. CRDs are global resources, and careful consideration needs to be taken when setting them up. Self-registration also requires the Operator to have global privileges, which is potentially dangerous and not justified by the small extra convenience.
Operators need to support updating managed applications (Operands) that were set up by an older version of the Operator. There are multiple models for this:
An Operator should not deploy another Operator - an additional component on cluster should take care of this (OLM).
When Operators change their APIs, CRD conversion (webhooks) should be used to deal with potentially older instances of them using the previous API version.
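The core of a conversion is a pair of lossless mapping functions between API versions. The sketch below shows only that mapping, using two hypothetical versions of a spec in which a field was renamed. The type and field names are invented for illustration, and the webhook plumbing that would call these functions is omitted.

```go
package main

import "fmt"

// cacheSpecV1alpha1 is a hypothetical older version of a custom resource spec.
type cacheSpecV1alpha1 struct {
	Nodes int // renamed to Replicas in v1
}

// cacheSpecV1 is the hypothetical current version of the same spec.
type cacheSpecV1 struct {
	Replicas int
}

// toV1 upgrades an older spec to the current version.
func toV1(old cacheSpecV1alpha1) cacheSpecV1 {
	return cacheSpecV1{Replicas: old.Nodes}
}

// toV1alpha1 downgrades a current spec so that clients still using the older
// API version keep working.
func toV1alpha1(cur cacheSpecV1) cacheSpecV1alpha1 {
	return cacheSpecV1alpha1{Nodes: cur.Replicas}
}

func main() {
	old := cacheSpecV1alpha1{Nodes: 3}
	upgraded := toV1(old)
	// A round trip through the newer version must not lose information.
	fmt.Println(upgraded.Replicas, toV1alpha1(upgraded) == old)
}
```

Checking the round trip in both directions is a cheap way to catch conversions that silently drop data.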
Operators should make it easy for users to use their APIs - validating and rejecting malformed requests via extensive Open API validation schema on CRDs or via an admission webhook is good practice.
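The kind of check meant here can be sketched in Go as a function that collects every problem with a spec, so that a request is rejected with a complete, readable message. The `cacheSpec` type, its fields, and the rules are hypothetical. In practice similar constraints could be expressed declaratively in the CRD's OpenAPI schema or enforced in an admission webhook.

```go
package main

import (
	"fmt"
	"regexp"
)

// cacheSpec is a hypothetical custom resource spec used only in this sketch.
type cacheSpec struct {
	Replicas int
	Version  string
}

// versionPattern accepts plain MAJOR.MINOR.PATCH strings.
var versionPattern = regexp.MustCompile(`^\d+\.\d+\.\d+$`)

// validate returns one message per problem found; an empty result means the
// spec is acceptable. Reporting all problems at once spares users from fixing
// them one rejection at a time.
func validate(spec cacheSpec) []string {
	var problems []string
	if spec.Replicas < 1 {
		problems = append(problems,
			fmt.Sprintf("spec.replicas must be at least 1, got %d", spec.Replicas))
	}
	if !versionPattern.MatchString(spec.Version) {
		problems = append(problems,
			fmt.Sprintf("spec.version must look like MAJOR.MINOR.PATCH, got %q", spec.Version))
	}
	return problems
}

func main() {
	for _, spec := range []cacheSpec{
		{Replicas: 3, Version: "6.2.1"},
		{Replicas: 0, Version: "latest"},
	} {
		if problems := validate(spec); len(problems) > 0 {
			fmt.Printf("rejected %+v: %v\n", spec, problems)
			continue
		}
		fmt.Printf("accepted %+v\n", spec)
	}
}
```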
The Operator itself should be modest in its requirements: it should always be able to deploy by deploying its controllers, and no user input should be required to start it up.
If user input is required to change the configuration of the Operator itself, a Configuration CRD should be used. Init-containers as part of the Operator deployments can be used to create a default instance of those CRs and then the Operator manages their lifecycle.
Summary:
On the cluster, an Operator…
- Does not run as root
- Does not self-register CRDs
- Does not install other Operators
- Does rely on dependencies via package manager (OLM)
- Writes meaningful status information on Custom Resource objects, unless they are pure data structures
- Should be capable of updating from a previous version of the Operator
- Should be capable of managing an Operand from an older Operator version
- Uses CRD conversion (webhooks) if API/CRDs change
- Uses OpenAPI validation / Admission Webhooks to reject invalid CRs
- Should always be able to deploy and come up without user input