Kubernetes Operator Design Patterns: how to support the upgrade and downgrade of the operand
If you develop a Kubernetes Operator and plan to maintain it over the long term, you need a strategy for managing the evolution of your project across versions. To make sure your operator can transition smoothly from one version to another, it needs to support both upgrade and downgrade. Upgrade keeps you in step with the latest development, while downgrade typically serves as a fallback when an upgrade fails or the latest version falls short of user requirements. In this article, we are going to talk about how to design your Kubernetes Operator to support upgrade and downgrade.
When you develop a Kubernetes Operator, the first thing to do is identify the operand, the actual workload that the operator manages. Upgrade and downgrade apply to both the operand and the operator itself. Let’s discuss them one by one.
1. For the upgrade/downgrade of the operator, you need to consider the APIs and the other resources. A Kubernetes Operator is a Kubernetes application. Its released artifacts consist of multiple Kubernetes resources, such as CRDs, deployments, services, configmaps, roles, role bindings, etc. They are usually grouped and shipped in one or more YAML files. Most of the time, they are accompanied by images published to an image repository for resources such as deployments and statefulsets to use.
For Kubernetes Operators/Applications, we use CRDs to define new APIs as extensions. The CRD is the only resource whose versioning and conversion we need to think about. One solution is to define a conversion webhook for the CRD to handle conversion between different CRD versions. For the other types of resources, additions and updates can be done by applying the resources to the Kubernetes cluster. However, be careful with resource deletion: you will probably need to remove obsolete resources yourself, for example with cleanup scripts.
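As a rough illustration of what a conversion webhook does, the sketch below handles a ConversionReview request in plain Python. The API group `example.dev`, the versions `v1alpha1` and `v1beta1`, and the renamed field (`spec.replicas` to `spec.replicaCount`) are hypothetical and exist only for this example. A real webhook would also have to be served over HTTPS and registered in the CRD's conversion settings.

```python
import copy

GROUP = "example.dev"  # hypothetical API group, for illustration only


def convert_object(obj, desired_api_version):
    """Convert one CR (as a dict) to desired_api_version without mutating it."""
    converted = copy.deepcopy(obj)
    spec = converted.setdefault("spec", {})
    if desired_api_version == f"{GROUP}/v1beta1":
        # Assumed schema change: v1beta1 renames spec.replicas to spec.replicaCount.
        if "replicas" in spec:
            spec["replicaCount"] = spec.pop("replicas")
    elif desired_api_version == f"{GROUP}/v1alpha1":
        # Downgrade path: map the new field name back to the old one.
        if "replicaCount" in spec:
            spec["replicas"] = spec.pop("replicaCount")
    else:
        raise ValueError(f"unsupported version: {desired_api_version}")
    converted["apiVersion"] = desired_api_version
    return converted


def handle_conversion_review(review):
    """Build the ConversionReview response for a ConversionReview request."""
    request = review["request"]
    response = {"uid": request["uid"]}  # the response must echo the request uid
    try:
        response["convertedObjects"] = [
            convert_object(o, request["desiredAPIVersion"])
            for o in request["objects"]
        ]
        response["result"] = {"status": "Success"}
    except ValueError as err:
        response["result"] = {"status": "Failed", "message": str(err)}
    return {
        "apiVersion": "apiextensions.k8s.io/v1",
        "kind": "ConversionReview",
        "response": response,
    }
```

Handling both directions in the same function is what makes downgrade possible: the API server can ask for either version, so every rename needs an inverse.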
How do we deal with the existing CRs in the older version? Even if the CRD is updated to the later version, the existing CRs, the instances of the CRD, are still in the older version. To convert them to the later version, you can write the CR conversion in your favorite language (for instance, Knative Operator uses a migration job to migrate the existing CRs to the later version), or follow the guidance here to upgrade them separately.
How can we version the CRD? To be honest, there is no standard answer. During the development of Knative Operator, my pet project, I applied the following rules, which may serve as a reference for you:
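A migration job for existing CRs can be quite small. The sketch below is not the Knative Operator implementation. It assumes a hypothetical client object with `list()` and `update()` methods, and it relies on the API server re-encoding an object in the current storage version whenever the object is written back. An in-memory stand-in plays the role of the cluster so the example runs on its own; its `storedVersion` key is invented for the illustration and is not a real Kubernetes field.

```python
class FakeStore:
    """In-memory stand-in for the API server (hypothetical, for illustration)."""

    def __init__(self, objects, storage_version):
        self.objects = {o["metadata"]["name"]: o for o in objects}
        self.storage_version = storage_version

    def list(self):
        # Return a snapshot so callers can update while iterating.
        return list(self.objects.values())

    def update(self, obj):
        # Writing an object back stores it in the current storage version.
        stored = dict(obj, storedVersion=self.storage_version)
        self.objects[obj["metadata"]["name"]] = stored
        return stored


def migrate_stored_versions(client, target_version):
    """Rewrite every CR not yet stored at target_version; return their names."""
    migrated = []
    for obj in client.list():
        if obj.get("storedVersion") == target_version:
            continue  # already migrated, which keeps the job idempotent
        client.update(obj)
        migrated.append(obj["metadata"]["name"])
    return migrated
```

In a real cluster the stored version is not visible on the object itself; the CRD's `status.storedVersions` lists the versions that may still be present in storage, so a job like this would typically just rewrite every CR.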
- For CRDs in the same schema version, ONLY add new fields, and only when necessary.
- For CRDs in the same schema version, if some fields need to be deleted, keep them in the schema but mark them as deprecated.
- For CRDs in the same schema version, if some fields need to be renamed, keep the old fields in the schema but mark them as deprecated, and add new fields with the new names.
- In the new schema version, remove the fields deprecated in the older versions and keep the fields that are not deprecated. Add new fields if necessary.
The principle is to keep the CRDs compatible within the same schema version and to reduce the conversion work across different schema versions. We cannot change the CRD schema version as often as we release. Once the schema version reaches beta, try to stabilize it and avoid further major changes.
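These versioning rules can be checked mechanically. The following sketch assumes each schema has been reduced to a mapping of field name to a `deprecated` flag; that shape and both function names are hypothetical. It reports fields that were dropped within one schema version and derives the field set for the next schema version.

```python
def same_version_violations(old_fields, new_fields):
    """Return problems that break compatibility within one schema version.

    Both arguments map a field name to {"deprecated": bool} (assumed shape).
    Adding fields is allowed; removing a field is not.
    """
    return [
        f"{name}: removed; keep it and mark it deprecated"
        for name in old_fields
        if name not in new_fields
    ]


def next_version_fields(fields, added=()):
    """Drop deprecated fields and add new ones for the next schema version."""
    kept = {name: meta for name, meta in fields.items() if not meta["deprecated"]}
    for name in added:
        kept[name] = {"deprecated": False}
    return kept


# A rename inside the same schema version: the old field stays, flagged
# as deprecated, and the new name is added next to it.
old = {"replicas": {"deprecated": False}, "image": {"deprecated": False}}
renamed = {
    "replicas": {"deprecated": True},
    "replicaCount": {"deprecated": False},
    "image": {"deprecated": False},
}
print(same_version_violations(old, renamed))
print(sorted(next_version_fields(renamed)))
```

A check like this could run in CI against the previous release's schema, so an accidental field removal is caught before the CRD ships.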
2. For the upgrade/downgrade of the operand, the most important thing is that the operand itself supports upgrade/downgrade. If it does not, there is no way for the operator to support the upgrade/downgrade of the operand. The operand can take any form, but it mainly falls into two categories:
One is the bundled application/service/image, where the artifact is an archive, for example MySQL or MongoDB. There is no magic to upgrading/downgrading: just install the specific version.
The other is a set of Kubernetes resources. For example, Knative Serving and Eventing release a set of Kubernetes resources along with the images, and Knative Operator needs to manage them. In this situation, we need to apply the same rules as discussed for the upgrade/downgrade of the operator.
If there are multiple schema versions of the CRDs, use the conversion webhook to transition the newly created CRs. For the existing CRs in the old version, create a migration job to migrate the existing CRs to the later version, or follow the guidance here to upgrade them separately.
Other top-level Kubernetes resources, like deployments, service accounts, services, roles, cluster roles, role bindings, cluster role bindings, etc., can be treated as follows:
- Added resources: do nothing. Simply apply the resource.
- Updated resources: do nothing. Simply apply the resource.
- Deleted resources: if the obsolete resource is harmless, do nothing; cleanup is nice to have. If the obsolete resource causes conflicts, cleanup is necessary. For example, in Knative Operator, we implement a mechanism to compare the resources between two versions. If a resource is marked for deletion, we run a cleanup function to remove it.
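The comparison between two versions can be sketched as a set difference over resource identities. This is a simplified illustration, not the Knative Operator code: manifests are assumed to be lists of already-parsed resource dicts, and `resource_key` and `obsolete_resources` are hypothetical helper names.

```python
def resource_key(resource):
    """Identify a resource by API group, kind, namespace, and name.

    The API group is used instead of the full apiVersion so that a resource
    moving from, say, apps/v1beta1 to apps/v1 is not mistaken for a deletion.
    """
    api_version = resource.get("apiVersion", "")
    group = api_version.split("/")[0] if "/" in api_version else ""  # "" is the core group
    meta = resource.get("metadata", {})
    return (group, resource.get("kind", ""), meta.get("namespace", ""), meta.get("name", ""))


def obsolete_resources(installed_manifest, target_manifest):
    """Return resources present in the installed version but not in the target."""
    target_keys = {resource_key(r) for r in target_manifest}
    return [r for r in installed_manifest if resource_key(r) not in target_keys]
```

The same function covers downgrade: pass the currently installed (newer) manifest first and the older target manifest second, and the result is whatever the newer version introduced and the older one does not know about.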
I cannot claim to have covered everything about making your Kubernetes Operator support upgrade and downgrade, but these are the lessons we learned while developing Knative Operator. In summary, the operator should support upgrade and downgrade both for itself and for the operand. Different types of operands require different upgrade and downgrade paths. To end users, install, upgrade and downgrade are no different from running a single installation command, but to make this a reality, developers need to implement everything underneath.
Don’t want to derail? Follow Vincent!