Kubernetes as a project is growing at a rapid pace, with features being added, deprecated, and removed in every release. With the announcement of Kubernetes v1.25, there are a total of 40 enhancements and 2 features being either deprecated or removed.
This blog will look at the significant changes coming with the release of Kubernetes v1.25 and how this will impact future use.
What are the enhancements in the new version?
A range of enhancements arrives with v1.25, graduating to stable, beta, or alpha. This section covers some of the most important ones, such as timezone support in CronJob, capacity isolation for local ephemeral storage, and liveness probe grace periods.
Capacity isolation for local ephemeral storage
As part of the LocalStorageCapacityIsolation feature gate, v1.25 adds support for capacity isolation of shared partitions for pods and containers in the form of local ephemeral storage resource management. First introduced in alpha with v1.17, it is now graduating to stable. Pods use ephemeral storage, and resource management helps limit that usage by configuring limits.ephemeral-storage and requests.ephemeral-storage.
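As a rough sketch, a Pod that sets these fields could look like the following (the names, image, and sizes here are illustrative, not taken from the release notes):

apiVersion: v1
kind: Pod
metadata:
  name: ephemeral-demo                 # illustrative name
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.8   # placeholder image for illustration
    resources:
      requests:
        ephemeral-storage: "1Gi"       # used by the scheduler to find a node with enough local storage
      limits:
        ephemeral-storage: "2Gi"       # exceeding this limit makes the pod eligible for eviction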
The changes made to the capacity isolation for local ephemeral storage will have the following results:
- Kubernetes will now be able to evict pods whose containers exceed their configured ephemeral storage limits.
- As with memory and CPU, the scheduler will now check a pod's ephemeral storage requests against the local storage available on a node before scheduling it there.
- With this enhancement, administrators can configure ResourceQuotas to set constraints on the total limits and requests on a namespace. They will also be able to configure LimitRanges to set the default limit for a Pod.
To read more about this change, check out the official k8s documentation here and KEP-361 here.
Retriable and non-retriable pod failures for Jobs
Another important enhancement is that v1.25 brings an API to influence retries based on exit codes or reasons related to pod deletions. If you manage a sizable computational workload with many pods running on thousands of nodes, you need a restart policy for infrastructure or Job failures. Currently, the Kubernetes Job API offers a policy in which, in case of an infrastructure failure, you set backoffLimit greater than 0, instructing the job controller to restart the process regardless of the cause of the failure. This leads to many unwanted restarts, wasting both time and computational resources.
The new enhancement makes Kubernetes more efficient by determining whether a Job failure was caused by an infrastructure error or by a bug, which helps terminate the Job early without incrementing the counter towards the backoffLimit. This saves time and resources, as it avoids unnecessary retries. In addition, Jobs become more resilient to pod evictions, reducing the number of spurious Job failures.
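To illustrate, here is a sketch of a Job using the new pod failure policy API described in KEP-3329; the rule values (exit code 42, container and Job names) are invented for the example:

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job            # illustrative name
spec:
  backoffLimit: 6
  podFailurePolicy:            # alpha API in v1.25
    rules:
    - action: FailJob          # fail the Job immediately when the app exits with code 42 (a "bug")
      onExitCodes:
        containerName: main
        operator: In
        values: [42]
    - action: Ignore           # do not count pod disruptions (e.g. evictions) towards backoffLimit
      onPodConditions:
      - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: registry.k8s.io/pause:3.8   # placeholder image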
You can read more about this change in the official k8s documentation here.
CRD validation expression language
From v1.25, an expression language can be used to validate custom resources. This enhancement is part of the CustomResourceValidationExpressions feature gate. After being introduced as alpha in v1.23, it is now graduating to beta with v1.25. While there is already a validation mechanism for Custom Resource Definitions (CRDs), this enhancement adds another one that complements the existing mechanism based on webhooks.
This enhancement simplifies development, since many validations no longer require webhooks and all the information about a CRD can be self-contained in one place.
The validation rules use the Common Expression Language (CEL) and are included in CustomResourceDefinition schemas via the x-kubernetes-validations extension. The enhancement also provides two new metrics for tracking compilation and evaluation time: cel_compilation_duration_seconds and cel_evaluation_duration_seconds.
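As a brief illustration, a CEL rule is embedded directly in the CRD schema; the resource fields (minReplicas, replicas) below are invented for the example:

# Excerpt of a CustomResourceDefinition schema (illustrative fields)
openAPIV3Schema:
  type: object
  properties:
    spec:
      type: object
      properties:
        minReplicas:
          type: integer
        replicas:
          type: integer
      x-kubernetes-validations:
      - rule: "self.minReplicas <= self.replicas"   # CEL expression evaluated against spec
        message: "minReplicas must not exceed replicas"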
Read more about this change in KEP-2876.
Addition of minReadySeconds to StatefulSets
With the release of v1.25, we see another enhancement: minReadySeconds is added to StatefulSets. minReadySeconds defines the minimum time for which a new pod should be ready, with none of its containers crashing, before it is considered available. The default value of minReadySeconds is 0. The field is already available in Deployments, DaemonSets, ReplicaSets, and ReplicationControllers, and with this enhancement StatefulSets will benefit from it too.
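A minimal sketch of a StatefulSet using the new field might look like this (names and image are illustrative):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web                  # illustrative name
spec:
  serviceName: "web"
  replicas: 3
  minReadySeconds: 10        # new in StatefulSets: a pod must stay ready for 10s before it counts as available
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx:1.23    # placeholder image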
Learn more about this enhancement from KEP-2599.
Timezone support in CronJob
In v1.25, the CronJob resource gains an extension that lets the user define a time zone for the schedule. Although the Jobs created by a CronJob follow the schedule written by the author, the time zone used has always been the one in which the kube-controller-manager is running. With this enhancement, the time zone can be set explicitly by the user.
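For example, a CronJob with an explicit time zone could be sketched as follows (the schedule, zone, and names are illustrative):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report       # illustrative name
spec:
  schedule: "0 2 * * *"
  timeZone: "Europe/London"  # new field: the schedule is interpreted in this time zone
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: report
            image: registry.k8s.io/pause:3.8   # placeholder image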
KEP-3140 explores more about the enhancement of timezone support in CronJob.
Network policy to support port ranges
The NetworkPolicy object allows a developer to specify the expected traffic behavior of an application and to allow or deny undesired traffic. Currently, when you define a NetworkPolicy covering a range of ports, you have to specify every port in the range one after another.
Before this change, a NetworkPolicy allowing a range of ports would look like the example below:
spec:
  egress:
  - ports:
    - protocol: TCP
      port: 32000
    - protocol: TCP
      port: 32001
    - protocol: TCP
      port: 32002
    [...]
    - protocol: TCP
      port: 32768
Hence, if the user wants to allow a range of ports, every port must be declared as a separate item in the ports array, and the same applies when carving out an exception.
To mitigate this problem, v1.25 adds a new endPort field inside the ports array, allowing a port range to be declared and simplifying the creation of NetworkPolicy rules with multiple ports. The resulting manifest after inserting the new field looks like this:
spec:
  egress:
  - ports:
    - protocol: TCP
      port: 32000
      endPort: 32768
Reserve service IP ranges for Dynamic and Static IP allocation
With Kubernetes Services, you can expose an application running on a set of pods behind a single abstraction. Services have a virtual ClusterIP that load balances traffic across the different pods. The ClusterIP can be assigned dynamically, where the cluster picks an IP from the configured Service IP range, or statically, where the user sets an IP within that range.
Currently, there is no way to know in advance which IP address will be dynamically assigned to a Service. You can set a static IP address instead, which can be helpful but carries the risk of conflicts.
In v1.25, the ClusterIP range can be logically subdivided for dynamic and static IP allocation, avoiding the risk of conflicts between Services that use static and dynamic allocation.
This feature is graduating from alpha to beta. The implementation is based on segmenting the Service network passed as a string to the --service-cluster-ip-range flag into two bands. The upper band is used for dynamically allocated IP addresses, while the lower band is preferred for static IP addresses. However, if the upper band is exhausted, dynamic allocation will continue using the lower band.
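As a sketch, assuming a cluster started with --service-cluster-ip-range=10.96.0.0/24, a Service that reserves a static ClusterIP from the lower band of that range might look like this (the address and names are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: my-dns               # illustrative name
spec:
  clusterIP: 10.96.0.10      # static ClusterIP chosen from the lower band of the range
  selector:
    app: my-dns
  ports:
  - port: 53
    protocol: UDP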
Support for Cgroup V2
Cgroup is a Linux kernel feature that provides a unified control system with enhanced resource management capabilities, such as limiting, accounting for, and isolating the resource usage (CPU, kernel memory, network memory, disk I/O, and so on) of a collection of processes. The V2 API of Cgroup has been stable in the Linux kernel for over two years. This enhancement brings Cgroup V2 support to Kubernetes.
Cgroup V2 comes with a lot of improvements over Cgroup V1:
- It brings a single, unified hierarchy design in the API.
- In Cgroup V2, containers will get a safer sub-tree delegation.
- Cgroup V2 comes with enhanced resource allocation management and isolation across multiple resources compared to V1. It has unified accounting for different memory allocations, such as network memory and kernel memory, and it also accounts for non-immediate resource changes like page cache write-backs.
Liveness probe grace periods
Liveness probes are periodic checks that help Kubernetes know whether a process is running correctly. If a process is frozen and does not respond to its liveness probe, it should be killed immediately. Currently, the same terminationGracePeriodSeconds is used both during a normal shutdown and when a probe fails.

The terminationGracePeriodSeconds field of a Pod spec instructs Kubernetes to wait for a given number of seconds before forcefully killing a container once termination starts. Hence, if a long termination period is set and a liveness probe fails, the workload will not be restarted immediately, because Kubernetes waits for the entire termination period.
The new version adds a configurable terminationGracePeriodSeconds field to the liveness probe itself. This probe-level value overrides the Pod-level terminationGracePeriodSeconds for liveness or startup probe failures, so a container failing its probe can be killed promptly.
It maintains the current default behavior while providing configuration to address the unintended delay:
spec:
  terminationGracePeriodSeconds: 3600
  containers:
  - name: test
    image: ...
    ports:
    - name: liveness-port
      containerPort: 8080
      hostPort: 8080
    livenessProbe:
      httpGet:
        path: /healthz
        port: liveness-port
      failureThreshold: 1
      periodSeconds: 60
      # New field #
      terminationGracePeriodSeconds: 60
If you are interested in discovering more about this enhancement, check out KEP-2238.
Seccomp by default
Seccomp adds a layer of security to your containers. When enabled by default, it helps mitigate CVEs (Common Vulnerabilities and Exposures) and zero-day vulnerabilities. Currently, Kubernetes provides a native way to specify and execute a seccomp profile for your containers, but it is disabled by default.
Kubernetes v1.25 can now enable a seccomp profile for all workloads by default, implicitly making Kubernetes more secure. The SeccompDefault feature gate enables this behavior, turning the existing RuntimeDefault profile into the default for any container.
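With SeccompDefault enabled, every container effectively behaves as if the securityContext below had been set explicitly; this sketch simply shows the equivalent manifest (the Pod name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: seccomp-demo         # illustrative name
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault   # the profile SeccompDefault applies implicitly to all containers
  containers:
  - name: app
    image: registry.k8s.io/pause:3.8   # placeholder image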
Addition of CPUManager policy option to align CPUs by Socket
With the release of Kubernetes v1.22, a new CPUManager flag was introduced that enables the use of CPUManager policy options. With CPUManager policy options, users can customise the CPUManager's behavior based on workload requirements without introducing an entirely new policy. Currently, two policy options exist: full-pcpus-only and distribute-cpus-across-numa. These policy options work together to ensure an optimized CPU set for the workloads running on a cluster.
As CPU architectures evolve, the number of NUMA nodes per socket increases. However, the devices managed by DeviceManager are not distributed uniformly across all NUMA nodes, so there are scenarios where there is no perfect alignment between devices and CPUs. By default, the CPUManager prefers allocating CPUs from the minimum number of NUMA nodes, but performance is not optimal if the NUMA nodes selected for allocation are spread across sockets. If NUMA alignment is not possible, latency-sensitive applications want resources aligned within the same socket for optimal performance.
With v1.25, a CPUManager policy option is added to align CPUs by socket instead of by NUMA node. The CPUManager will therefore send a broader set of preferred hints to the TopologyManager, increasing the likelihood of the best hint being socket aligned with respect to the CPUs and the devices managed by DeviceManager. If the NUMA nodes selected for allocation are socket aligned, predictable, optimized performance can be achieved.
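A sketch of a KubeletConfiguration enabling this behavior is shown below; the align-by-socket option name follows KEP-3327, and the example assumes the static CPUManager policy is in use:

# Kubelet configuration sketch (assumes the static CPUManager policy)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  align-by-socket: "true"    # prefer CPU allocations aligned to the same socket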
Auto-refreshing official CVE feed
Kubernetes, as a technology, is growing at a rapid pace, and as a result there has been an increase in the number of Common Vulnerabilities and Exposures (CVEs). The CVEs that affect Kubernetes directly or indirectly are regularly fixed, but currently there is no way to pull data about those fixed CVEs programmatically, nor is it possible to filter issues or PRs related to the CVEs that Kubernetes announces.
With the new version, the official CVE feed will auto-refresh. In addition, issues and PRs related to CVEs will be labeled official-cve-feed through automation, which helps you filter them and, if you are a Kubernetes provider, announce them to your customers. You do not enable this enhancement in your cluster; rather, you consume it as a web resource.
Respect podTopologySpread after rolling updates
The podTopologySpread feature allows users to define, using a LabelSelector, the group of pods over which spreading is applied. When defining a pod spec, this requires the user to know the exact label keys and values. With podTopologySpread, you get better control over the even distribution of pods that are related to each other.
However, during rolling updates, when a new set of pods is rolled out, the existing pods, including those that will soon disappear, are also included in the calculations. This might lead to an uneven distribution of the future pods.
This enhancement adds a new field called matchLabelKeys to TopologySpreadConstraint. It represents a set of label keys. The scheduler uses those keys to look up the label values of the incoming pod, and those key-value pairs are merged with the LabelSelector to identify only the group of existing pods over which the spreading calculation takes place. Hence, only pods belonging to the same group are part of the spreading in podTopologySpread, and in this way podTopologySpread is respected after rolling updates. For example:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  matchLabelKeys:
  - app
  - pod-template-hash
What do we know about the Kubernetes API removal and deprecation process?
The deprecation policy for Kubernetes states that a stable Kubernetes API can only be deprecated if a newer and stable version of the same API is available.
A deprecated API is one that has been marked for removal in a future release. You can use a deprecated API until its removal, but you will get a warning message upon usage. Once an API is removed in a new version, it can no longer be used, and you have to migrate to its replacement.
The release of v1.25 will bring both deprecations and removals alongside a suite of enhancements, which will be summarised below. For more information on the deprecation policy, you can look here.
Saying goodbye to PodSecurityPolicy and welcoming PodSecurity Admission
PodSecurityPolicy had already been deprecated in v1.21, and v1.25 sees its removal. Developers have used PodSecurityPolicy in the past, but its complex and confusing usage often led to misconfigurations and errors.
As a result, it is being removed and replaced by PodSecurity Admission, which graduates from beta to stable in this release. For migrating to PodSecurity Admission, have a look at this tutorial from the official Kubernetes documentation.
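As a brief illustration, Pod Security Admission is driven by namespace labels rather than cluster-wide policy objects; a namespace enforcing the baseline profile might look like this (the namespace name is illustrative):

apiVersion: v1
kind: Namespace
metadata:
  name: my-app                                    # illustrative namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline  # reject pods that violate the baseline profile
    pod-security.kubernetes.io/warn: restricted   # warn about pods that violate the restricted profile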
CSI Migration with deprecations and removals for storage drivers
The core CSI Migration feature is going generally available (GA) in v1.25. This is part of the effort to move volume plugins out of the Kubernetes core (in-tree) into separate, out-of-tree projects, which will ultimately result in the removal of the in-tree volume plugins.
In v1.25, we will see the deprecation of the GlusterFS in-tree plugin, which did have a CSI driver, although that driver has not remained maintained. The Portworx in-tree volume plugin is also being deprecated. The new version will also remove several in-tree volume plugins as part of CSI Migration, such as Flocker, Quobyte, and StorageOS.
IPTables chain ownership cleanup
As part of its networking setup, Kubernetes creates IPTables chains to ensure network packets reach container pods. The Kubelet has historically created IPTables chains on behalf of components such as Dockershim and Kube-proxy. With Dockershim now removed, this chain ownership is getting a cleanup: the Kubelet will stop creating unnecessary resources, and Kube-proxy will create the IPTables chains it needs by itself.
From v1.25, the Kubelet will not create certain IPTables chains in the NAT table, such as KUBE-MARK-DROP, KUBE-MARK-MASQ, and KUBE-POSTROUTING. Keep in mind that this is not a formal deprecation; still, some end users have come to depend on specific internal behavior of Kube-proxy, which is not supported, and future implementations will change that behavior. This change is implemented through the IPTablesCleanup feature. There will not be any functional change, but it leaves you with a cleaner cluster.
Conclusion
The arrival of v1.25 brings some major changes in the form of enhancements, deprecations, and removals. You can learn more about the changes, minor bug fixes, and enhancements from the GitHub changelog. Furthermore, you can get more information from the k8s release notes.
You can sign up for Civo and connect with the latest industry news in our monthly newsletter. Also, spin up your first cluster using our Kubernetes platform with a $250 free credit for your first month.