Kubernetes News

-
Kubernetes v1.33: In-Place Pod Resize Graduated to Beta
On behalf of the Kubernetes project, I am excited to announce that the in-place Pod resize feature (also known as In-Place Pod Vertical Scaling), first introduced as alpha in Kubernetes v1.27, has graduated to Beta and will be enabled by default in the Kubernetes v1.33 release! This marks a significant milestone in making resource management for Kubernetes workloads more flexible and less disruptive.
What is in-place Pod resize?
Traditionally, changing the CPU or memory resources allocated to a container required restarting the Pod. While acceptable for many stateless applications, this could be disruptive for stateful services, batch jobs, or any workloads sensitive to restarts.
In-place Pod resizing allows you to change the CPU and memory requests and limits assigned to containers within a running Pod, often without requiring a container restart.
Here's the core idea:
- The `spec.containers[*].resources` field in a Pod specification now represents the desired resources and is mutable for CPU and memory (see the example Pod spec below).
- The `status.containerStatuses[*].resources` field reflects the actual resources currently configured on a running container.
- You can trigger a resize by updating the desired resources in the Pod spec via the new `resize` subresource.
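For illustration, here is a minimal Pod spec you could create and later resize; the name, image, and resource values are arbitrary, and the optional `resizePolicy` stanza is included only to show how restart behaviour can be controlled per resource:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resize-demo        # arbitrary name
spec:
  containers:
    - name: app
      image: nginx:latest  # arbitrary image
      resources:
        requests:
          cpu: "500m"
          memory: "256Mi"
        limits:
          cpu: "500m"
          memory: "256Mi"
      resizePolicy:
        - resourceName: cpu
          restartPolicy: NotRequired      # apply CPU changes without restarting the container
        - resourceName: memory
          restartPolicy: RestartContainer # restart the container when memory changes
```

Changing the `resources` values through the `resize` subresource updates the desired state in `spec`, and the Kubelet reports what is actually in effect under `status.containerStatuses[*].resources`.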
You can try it out on a v1.33 Kubernetes cluster by using kubectl to edit a Pod (requires `kubectl` v1.32+):

```shell
kubectl edit pod <pod-name> --subresource resize
```
For detailed usage instructions and examples, please refer to the official Kubernetes documentation: Resize CPU and Memory Resources assigned to Containers.
Why does in-place Pod resize matter?
Kubernetes still excels at scaling workloads horizontally (adding or removing replicas), but in-place Pod resizing unlocks several key benefits for vertical scaling:
- Reduced Disruption: Stateful applications, long-running batch jobs, and sensitive workloads can have their resources adjusted without suffering the downtime or state loss associated with a Pod restart.
- Improved Resource Utilization: Scale down over-provisioned Pods without disruption, freeing up resources in the cluster. Conversely, provide more resources to Pods under heavy load without needing a restart.
- Faster Scaling: Address transient resource needs more quickly. For example, Java applications often need more CPU during startup than during steady-state operation; you can start with higher CPU and resize down later.
What's changed between Alpha and Beta?
Since the alpha release in v1.27, significant work has gone into maturing the feature, improving its stability, and refining the user experience based on feedback and further development. Here are the key changes:
Notable user-facing changes
- `resize` Subresource: Modifying Pod resources must now be done via the Pod's `resize` subresource (`kubectl patch pod <name> --subresource resize ...`). `kubectl` versions v1.32+ support this argument.
- Resize Status via Conditions: The old `status.resize` field is deprecated. The status of a resize operation is now exposed via two Pod conditions (illustrated in the example after this list):
  - `PodResizePending`: Indicates the Kubelet cannot grant the resize immediately (e.g., `reason: Deferred` if temporarily unable, `reason: Infeasible` if impossible on the node).
  - `PodResizeInProgress`: Indicates the resize is accepted and being applied. Errors encountered during this phase are now reported in this condition's message with `reason: Error`.
- Sidecar Support: Resizing sidecar containers in-place is now supported.
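For illustration, here is a hedged sketch of what the status of a Pod with a pending resize might look like; the condition types and reasons come from the list above, while the message text and values are made up:

```yaml
status:
  conditions:
    - type: PodResizePending
      status: "True"
      reason: Infeasible
      message: "Node didn't have enough capacity for the requested cpu" # illustrative message
    # Once a resize is accepted and being applied, you would instead see:
    # - type: PodResizeInProgress
    #   status: "True"
```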
Stability and reliability enhancements
- Refined Allocated Resources Management: The allocation management logic within the Kubelet was significantly reworked, making it more consistent and robust. The changes eliminated whole classes of bugs and greatly improved the reliability of in-place Pod resize.
- Improved Checkpointing & State Tracking: A more robust system for tracking "allocated" and "actuated" resources was implemented, using new checkpoint files (`allocated_pods_state`, `actuated_pods_state`) to reliably manage resize state across Kubelet restarts and handle edge cases where runtime-reported resources differ from requested ones. Several bugs related to checkpointing and state restoration were fixed, and checkpointing efficiency was also improved.
- Faster Resize Detection: Enhancements to the Kubelet's Pod Lifecycle Event Generator (PLEG) allow the Kubelet to respond to and complete resizes much more quickly.
- Enhanced CRI Integration: A new `UpdatePodSandboxResources` CRI call was added to better inform runtimes and plugins (like NRI) about Pod-level resource changes.
- Numerous Bug Fixes: Addressed issues related to systemd cgroup drivers, handling of containers without limits, CPU minimum share calculations, container restart backoffs, error propagation, test stability, and more.
What's next?
Graduating to Beta means the feature is ready for broader adoption, but development doesn't stop here! Here's what the community is focusing on next:
- Stability and Productionization: Continued focus on hardening the feature, improving performance, and ensuring it is robust for production environments.
- Addressing Limitations: Working towards relaxing some of the current limitations noted in the documentation, such as allowing memory limit decreases.
- VerticalPodAutoscaler (VPA) Integration: Work to enable VPA to leverage in-place Pod resize is already underway. A new `InPlaceOrRecreate` update mode will allow it to attempt non-disruptive resizes first, or fall back to recreation if needed. This will allow users to benefit from VPA's recommendations with significantly less disruption.
- User Feedback: Gathering feedback from users adopting the beta feature is crucial for prioritizing further enhancements and addressing any uncovered issues or bugs.
Getting started and providing feedback
With the `InPlacePodVerticalScaling` feature gate enabled by default in v1.33, you can start experimenting with in-place Pod resizing right away! Refer to the documentation for detailed guides and examples.
As this feature moves through Beta, your feedback is invaluable. Please report any issues or share your experiences via the standard Kubernetes communication channels (GitHub issues, mailing lists, Slack). You can also review the KEP-1287: In-place Update of Pod Resources for the full in-depth design details.
We look forward to seeing how the community leverages in-place Pod resize to build more efficient and resilient applications on Kubernetes!
-
Announcing etcd v3.6.0
This announcement originally appeared on the etcd blog.
Today, we are releasing etcd v3.6.0, the first minor release since etcd v3.5.0 on June 15, 2021. This release introduces several new features, makes significant progress on long-standing efforts like downgrade support and migration to v3store, and addresses numerous critical & major issues. It also includes major optimizations in memory usage, improving efficiency and performance.
In addition to the features of v3.6.0, etcd has joined Kubernetes as a SIG (sig-etcd), enabling us to improve project sustainability. We've introduced systematic robustness testing to ensure correctness and reliability. Through the etcd-operator Working Group, we plan to improve usability as well.
What follows are the most significant changes introduced in etcd v3.6.0, along with a discussion of the roadmap for future development. For a detailed list of changes, please refer to the CHANGELOG-3.6.
A heartfelt thank you to all the contributors who made this release possible!
Security
etcd takes security seriously. To enhance software security in v3.6.0, we have improved our workflow checks by integrating `govulncheck` to scan the source code and `trivy` to scan container images. These improvements have also been backported to supported stable releases.

etcd continues to follow the Security Release Process to ensure vulnerabilities are properly managed and addressed.
Features
Migration to v3store
The v2store has been deprecated since etcd v3.4 but could still be enabled via `--enable-v2`. It remained the source of truth for membership data. In etcd v3.6.0, v2store can no longer be enabled as the `--enable-v2` flag has been removed, and v3store has become the sole source of truth for membership data.

While v2store still exists in v3.6.0, etcd will fail to start if it contains any data other than membership information. To assist with migration, etcd v3.5.18+ provides the `etcdutl check v2store` command, which verifies that v2store contains only membership data (see PR 19113).

Compared to v2store, v3store offers better performance and transactional support. It is also the actively maintained storage engine moving forward.
The removal of v2store is still ongoing and is tracked in issues/12913.
Downgrade
etcd v3.6.0 is the first version to fully support downgrade. The effort for this downgrade task spans both versions 3.5 and 3.6, and all related work is tracked in issues/11716.
At a high level, the process involves migrating the data schema to the target version (e.g., v3.5), followed by a rolling downgrade.
Ensure the cluster is healthy, and take a snapshot backup. Validate whether the downgrade is valid:

```shell
$ etcdctl downgrade validate 3.5
Downgrade validate success, cluster version 3.6
```

If the downgrade is valid, enable downgrade mode:

```shell
$ etcdctl downgrade enable 3.5
Downgrade enable success, cluster version 3.6
```

etcd will then migrate the data schema in the background. Once complete, proceed with the rolling downgrade.
For details, refer to the Downgrade-3.6 guide.
Feature gates
In etcd v3.6.0, we introduced Kubernetes-style feature gates for managing new features. Previously, we indicated unstable features through the `--experimental` prefix in feature flag names. The prefix was removed once the feature was stable, causing a breaking change. Now, features will start in Alpha, progress to Beta, then GA, or get deprecated. This ensures a much smoother upgrade and downgrade experience for users. See feature-gates for details.
livez / readyz checks
etcd now supports `/livez` and `/readyz` endpoints, aligning with Kubernetes' Liveness and Readiness probes. `/livez` indicates whether the etcd instance is alive, while `/readyz` indicates when it is ready to serve requests. This feature has also been backported to release-3.5 (starting from v3.5.11) and release-3.4 (starting from v3.4.29). See livez/readyz for details.

The existing `/health` endpoint remains functional. `/livez` is similar to `/health?serializable=true`, while `/readyz` is similar to `/health` or `/health?serializable=false`. The `/livez` and `/readyz` endpoints provide clearer semantics and are easier to understand.
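If you run etcd as a container on Kubernetes, these endpoints map naturally onto container probes. Below is a minimal sketch, assuming the health endpoints are served over plain HTTP on port 2381 (an assumption standing in for whatever you set in `--listen-metrics-urls`); adjust the port, scheme, and timings to your deployment:

```yaml
# Excerpt from an etcd container spec: wire /livez and /readyz into probes.
livenessProbe:
  httpGet:
    path: /livez
    port: 2381
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz
    port: 2381
  periodSeconds: 5
```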
v3discovery
In etcd v3.6.0, the new discovery protocol v3discovery was introduced, based on clientv3. It facilitates the discovery of all cluster members during the bootstrap phase.
The previous v2discovery protocol, based on clientv2, has been deprecated. Additionally, the public discovery service at https://discovery.etcd.io/, which relied on v2discovery, is no longer maintained.
Performance
Memory
In this release, we reduced average memory consumption by at least 50% (see Figure 1). This improvement is primarily due to two changes:
- The default value of `--snapshot-count` has been reduced from 100,000 in v3.5 to 10,000 in v3.6 (see the config sketch after this list). As a result, etcd v3.6 now retains only about 10% of the history records compared to v3.5.
- Raft history is compacted more frequently, as introduced in PR/18825.
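If you want to set this explicitly rather than rely on the default, a minimal sketch of an etcd YAML configuration file (the `name` and `data-dir` values are placeholders) would look like this; the same setting is available as the `--snapshot-count` flag:

```yaml
# etcd config file excerpt (passed via --config-file).
name: etcd-1              # placeholder member name
data-dir: /var/lib/etcd   # placeholder data directory
# Number of committed transactions that triggers a snapshot to disk.
# etcd v3.6 defaults to 10,000 (v3.5 defaulted to 100,000).
snapshot-count: 10000
```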
Figure 1: Memory usage comparison between etcd v3.5.20 and v3.6.0-rc.2 under different read/write ratios. Each subplot shows the memory usage over time with a specific read/write ratio. The red line represents etcd v3.5.20, while the teal line represents v3.6.0-rc.2. Across all tested ratios, v3.6.0-rc.2 exhibits lower and more stable memory usage.
Throughput
Compared to v3.5, etcd v3.6 delivers an average performance improvement of approximately 10% in both read and write throughput (see Figure 2, 3, 4 and 5). This improvement is not attributed to any single major change, but rather the cumulative effect of multiple minor enhancements. One such example is the optimization of the free page queries introduced in PR/419.
Figure 2: Read throughput comparison between etcd v3.5.20 and v3.6.0-rc.2 under a high write ratio. The read/write ratio is 0.0078, meaning 1 read per 128 writes. The right bar shows the percentage improvement in read throughput of v3.6.0-rc.2 over v3.5.20, ranging from 3.21% to 25.59%.
Figure 3: Read throughput comparison between etcd v3.5.20 and v3.6.0-rc.2 under a high read ratio. The read/write ratio is 8, meaning 8 reads per write. The right bar shows the percentage improvement in read throughput of v3.6.0-rc.2 over v3.5.20, ranging from 4.38% to 27.20%.
Figure 4: Write throughput comparison between etcd v3.5.20 and v3.6.0-rc.2 under a high write ratio. The read/write ratio is 0.0078, meaning 1 read per 128 writes. The right bar shows the percentage improvement in write throughput of v3.6.0-rc.2 over v3.5.20, ranging from 2.95% to 24.24%.
Figure 5: Write throughput comparison between etcd v3.5.20 and v3.6.0-rc.2 under a high read ratio. The read/write ratio is 8, meaning 8 reads per write. The right bar shows the percentage improvement in write throughput of v3.6.0-rc.2 over v3.5.20, ranging from 3.86% to 28.37%.
Breaking changes
This section highlights a few notable breaking changes. For a complete list, please refer to the Upgrade etcd from v3.5 to v3.6 and the CHANGELOG-3.6.
Old binaries are incompatible with new schema versions
Old etcd binaries are not compatible with newer data schema versions. For example, etcd 3.5 cannot start with data created by etcd 3.6, and etcd 3.4 cannot start with data created by either 3.5 or 3.6.
When downgrading etcd, it's important to follow the documented downgrade procedure. Simply replacing the binary or image will result in compatibility issues.
Peer endpoints no longer serve client requests
Client endpoints (`--advertise-client-urls`) are intended to serve client requests only, while peer endpoints (`--initial-advertise-peer-urls`) are intended solely for peer communication. However, due to an implementation oversight, the peer endpoints were also able to handle client requests in etcd 3.4 and 3.5. This behavior was misleading and encouraged incorrect usage patterns. In etcd 3.6, this misleading behavior was corrected via PR/13565; peer endpoints no longer serve client requests.
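To make the distinction concrete, here is a hedged configuration excerpt (the address is a placeholder) showing the two kinds of endpoints; as of v3.6, only the client URLs serve client requests:

```yaml
# etcd config file excerpt; 10.0.0.10 is a placeholder address.
listen-client-urls: http://10.0.0.10:2379           # client traffic (KV, watch, lease, maintenance)
advertise-client-urls: http://10.0.0.10:2379
listen-peer-urls: http://10.0.0.10:2380              # raft/peer traffic only
initial-advertise-peer-urls: http://10.0.0.10:2380
```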
Clear boundary between etcdctl and etcdutl
Both `etcdctl` and `etcdutl` are command line tools. `etcdutl` is an offline utility designed to operate directly on etcd data files, while `etcdctl` is an online tool that interacts with etcd over a network. Previously, there were some overlapping functionalities between the two, but these overlaps were removed in 3.6.0.

- Removed `etcdctl defrag --data-dir`

  The `etcdctl defrag` command now only supports online defragmentation and no longer supports offline defragmentation. To perform offline defragmentation, use the `etcdutl defrag --data-dir` command instead.

- Removed `etcdctl snapshot status`

  `etcdctl` no longer supports retrieving the status of a snapshot. Use the `etcdutl snapshot status` command instead.

- Removed `etcdctl snapshot restore`

  `etcdctl` no longer supports restoring from a snapshot. Use the `etcdutl snapshot restore` command instead.
Critical bug fixes
Correctness has always been a top priority for the etcd project. In the process of developing 3.6.0, we found and fixed a few notable bugs that could lead to data inconsistency in specific cases. These fixes have been backported to previous releases, but we believe they deserve special mention here.
- Data Inconsistency when Crashing Under Load
Previously, when etcd was applying data, it would update the consistent-index first, followed by committing the data. However, these operations were not atomic. If etcd crashed in between, it could lead to data inconsistency (see issue/13766). The issue was introduced in v3.5.0, and fixed in v3.5.3 with PR/13854.
- Durability API guarantee broken in single node cluster
When a client writes data and receives a success response, the data is expected to be persisted. However, the data might be lost if etcd crashes immediately after sending the success response to the client. This was a legacy issue (see issue/14370) affecting all previous releases. It was addressed in v3.4.21 and v3.5.5 with PR/14400, and fixed on the raft side in the main branch (now release-3.6) with PR/14413.
- Revision Inconsistency when Crashing During Defragmentation
If etcd crashed during the defragmentation operation, then upon restart it might reapply some entries which had already been applied, leading to a revision inconsistency issue (see the discussions in PR/14685). The issue was introduced in v3.5.0, and fixed in v3.5.6 with PR/14730.
Upgrade issue
This section highlights a common issue (issues/19557) in the etcd v3.5 to v3.6 upgrade that may cause the upgrade process to fail. For a complete upgrade guide, refer to Upgrade etcd from v3.5 to v3.6.
The issue was introduced in etcd v3.5.1, and resolved in v3.5.20.
Key takeaway: users are required to first upgrade to etcd v3.5.20 (or a higher patch version) before upgrading to etcd v3.6.0; otherwise, the upgrade may fail.
For more background and technical context, see upgrade_from_3.5_to_3.6_issue.
Testing
We introduced robustness testing to verify correctness, which has always been our top priority. It plays traffic of various types and volumes against an etcd cluster, concurrently injects a random failpoint, records all operations (including both requests and responses), and finally performs a linearizability check. It also verifies that the Watch API guarantees have not been violated. The robustness tests increase our confidence in ensuring the quality of each etcd release.
We have migrated most of the etcd workflow tests to Kubernetes' Prow testing infrastructure to take advantage of its benefits, such as nice dashboards for viewing test results and the ability for contributors to rerun failed tests themselves.
Platforms
While retaining all existing supported platforms, we have promoted Linux/ARM64 to Tier 1 support. For more details, please refer to issues/15951. For the complete list of supported platforms, see supported-platform.
Dependencies
Dependency bumping guide
We have published an official guide on how to bump dependencies for etcd's main branch and stable releases. It also covers how to update the Go version. For more details, please refer to dependency_management. With this guide available, any contributor can now help with dependency upgrades.
Core Dependency Updates
bbolt and raft are two core dependencies of etcd.
Both etcd v3.4 and v3.5 depend on bbolt v1.3, while etcd v3.6 depends on bbolt v1.4.
For the release-3.4 and release-3.5 branches, raft is included in the etcd repository itself, so etcd v3.4 and v3.5 do not depend on an external raft module. Starting from etcd v3.6, raft was moved to a separate repository (raft), and the first standalone raft release is v3.6.0. As a result, etcd v3.6.0 depends on raft v3.6.0.
Please see the table below for a summary:
| etcd versions | bbolt versions | raft versions |
|---------------|----------------|---------------|
| 3.4.x         | v1.3.x         | N/A           |
| 3.5.x         | v1.3.x         | N/A           |
| 3.6.x         | v1.4.x         | v3.6.x        |

grpc-gateway@v2
We upgraded grpc-gateway from v1 to v2 via PR/16595 in etcd v3.6.0. This is a major step toward migrating to protobuf-go, the second major version of the Go protocol buffer API implementation.
grpc-gateway@v2 is designed to work with protobuf-go. However, etcd v3.6 still depends on the deprecated gogo/protobuf, which is actually a protocol buffer v1 implementation. To resolve this incompatibility, we applied a patch to the generated `*.pb.gw.go` files to convert v1 messages to v2 messages.
grpc-ecosystem/go-grpc-middleware/providers/prometheus
We switched from the deprecated (and archived) grpc-ecosystem/go-grpc-prometheus to grpc-ecosystem/go-grpc-middleware/providers/prometheus via PR/19195. This change ensures continued support and access to the latest features and improvements in the gRPC Prometheus integration.
Community
There are exciting developments in the etcd community that reflect our ongoing commitment to strengthening collaboration, improving maintainability, and evolving the project’s governance.
etcd Becomes a Kubernetes SIG
etcd has officially become a Kubernetes Special Interest Group: SIG-etcd. This change reflects etcd’s critical role as the primary datastore for Kubernetes and establishes a more structured and transparent home for long-term stewardship and cross-project collaboration. The new SIG designation will help streamline decision-making, align roadmaps with Kubernetes needs, and attract broader community involvement.
New contributors, maintainers, and reviewers
We’ve seen increasing engagement from contributors, which has resulted in the addition of three new maintainers:
Their continued contributions have been instrumental in driving the project forward.
We also welcome two new reviewers to the project:
We appreciate their dedication to code quality and their willingness to take on broader review responsibilities within the community.
New release team
We've formed a new release team led by ivanvc and jmhbnz, streamlining the release process by automating many previously manual steps. Inspired by Kubernetes SIG Release, we've adopted several best practices, including clearly defined release team roles and the introduction of release shadows to support knowledge sharing and team sustainability. These changes have made our releases smoother and more reliable, allowing us to approach each release with greater confidence and consistency.
Introducing the etcd Operator Working Group
To further advance etcd’s operational excellence, we have formed a new working group: WG-etcd-operator. The working group is dedicated to enabling the automatic and efficient operation of etcd clusters that run in the Kubernetes environment using an etcd-operator.
Future Development
The legacy v2store has been deprecated since etcd v3.4, and the flag `--enable-v2` was removed entirely in v3.6. This means that starting from v3.6, there is no longer a way to enable or use the v2store. However, etcd still bootstraps internally from the legacy v2 snapshots. To address this inconsistency, we plan to change etcd to bootstrap from the v3store and replay the WAL entries based on the `consistent-index`. The work is being tracked in issues/12913.

One of the most persistent challenges remains large range queries from the kube-apiserver, which can lead to process crashes due to their unpredictable nature. The range stream feature, originally outlined in the v3.5 release blog (Future roadmaps), remains an idea worth revisiting to address the challenges of large range queries.
For more details and upcoming plans, please refer to the etcd roadmap.
-
Kubernetes 1.33: Job's SuccessPolicy Goes GA
On behalf of the Kubernetes project, I'm pleased to announce that Job success policy has graduated to General Availability (GA) as part of the v1.33 release.
About Job's Success Policy
In batch workloads, you might want to use leader-follower patterns like MPI, in which the leader controls the execution, including the followers' lifecycle.
In this case, you might want to mark the Job as succeeded even if some of the indexes failed. Unfortunately, a leader-follower Kubernetes Job that didn't use a success policy would, in most cases, require all Pods to finish successfully for that Job to reach an overall succeeded state.
For Kubernetes Jobs, the API allows you to specify early exit criteria using the `.spec.successPolicy` field (you can only use the `.spec.successPolicy` field for an Indexed Job). This field describes a set of rules, either using a list of succeeded indexes for a job, or defining a minimal required number of succeeded indexes.

This newly stable field is especially valuable for scientific simulation, AI/ML and High-Performance Computing (HPC) batch workloads. Users in these areas often run numerous experiments and may only need a specific number to complete successfully, rather than requiring all of them to succeed. In this case, the leader index failure is the only relevant Job exit criterion, and the outcomes for individual follower Pods are handled only indirectly via the status of the leader index. Moreover, followers do not know when they can terminate themselves.
After a Job meets any success policy rule, the Job is marked as succeeded, and all Pods are terminated, including the running ones.
How it works
The following excerpt from a Job manifest, using `.successPolicy.rules[0].succeededCount`, shows an example of using a custom success policy:

```yaml
parallelism: 10
completions: 10
completionMode: Indexed
successPolicy:
  rules:
    - succeededCount: 1
```
Here, the Job is marked as succeeded when one index succeeds, regardless of its number. Additionally, you can constrain which index numbers count against `succeededCount` by using `.successPolicy.rules[0].succeededIndexes`, as shown below:

```yaml
parallelism: 10
completions: 10
completionMode: Indexed
successPolicy:
  rules:
    - succeededIndexes: 0 # index of the leader Pod
      succeededCount: 1
```
This example shows that the Job will be marked as succeeded once a Pod with a specific index (Pod index 0) has succeeded.
Once the Job either satisfies one of the `successPolicy` rules, or achieves its `Complete` criteria based on `.spec.completions`, the Job controller within kube-controller-manager adds the `SuccessCriteriaMet` condition to the Job status. After that, the job-controller initiates cleanup and termination of Pods for Jobs with the `SuccessCriteriaMet` condition. Eventually, Jobs obtain the `Complete` condition when the job-controller has finished cleanup and termination.
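As an illustration, the Job status then carries conditions along these lines (abridged; only the condition types are taken from the text above, the rest is schematic):

```yaml
status:
  conditions:
    - type: SuccessCriteriaMet   # added once a success policy rule is satisfied
      status: "True"
    - type: Complete             # added after Pod cleanup and termination finish
      status: "True"
```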
Learn more
- Read the documentation for success policy.
- Read the KEP for the Job success/completion policy
Get involved
This work was led by the Kubernetes batch working group in close collaboration with the SIG Apps community.
If you are interested in working on new features in the space, I recommend subscribing to our Slack channel and attending the regular community meetings.
-
Kubernetes v1.33: Updates to Container Lifecycle
Kubernetes v1.33 introduces a few updates to the lifecycle of containers. The Sleep action for container lifecycle hooks now supports a zero sleep duration (feature enabled by default). There is also alpha support for customizing the stop signal sent to containers when they are being terminated.
This blog post goes into the details of these new aspects of the container lifecycle, and how you can use them.
Zero value for Sleep action
Kubernetes v1.29 introduced the `Sleep` action for container PreStop and PostStart lifecycle hooks. The Sleep action lets your containers pause for a specified duration after the container is started or before it is terminated. This was needed to provide a straightforward way to manage graceful shutdowns. Before the Sleep action, folks used to run the `sleep` command using the exec action in their container lifecycle hooks. If you wanted to do this you'd need to have the binary for the `sleep` command in your container image. This is difficult if you're using third party images.

The Sleep action initially didn't support a sleep duration of zero seconds. The `time.Sleep` call which the Sleep action uses under the hood supports a duration of zero seconds: using a negative or a zero value for the sleep returns immediately, resulting in a no-op. We wanted the same behaviour with the Sleep action. Support for the zero duration was later added in v1.32, behind the `PodLifecycleSleepActionAllowZero` feature gate.

The `PodLifecycleSleepActionAllowZero` feature gate has graduated to beta in v1.33 and is now enabled by default. The original Sleep action for `preStop` and `postStart` hooks has been enabled by default since Kubernetes v1.30. With a cluster running Kubernetes v1.33, you are able to set a zero duration for sleep lifecycle hooks. For a cluster with default configuration, you don't need to enable any feature gate to make that possible.
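As a small illustration (the container name and image are arbitrary), a zero-second sleep is handy when a shared template always defines a PreStop hook but a particular container should effectively skip it:

```yaml
containers:
  - name: app            # arbitrary name
    image: nginx:latest  # arbitrary image
    lifecycle:
      preStop:
        sleep:
          seconds: 0     # no-op sleep; allowed in v1.33 (PodLifecycleSleepActionAllowZero is beta and on by default)
```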
Container stop signals
Container runtimes such as containerd and CRI-O honor a `StopSignal` instruction in the container image definition. This can be used to specify a custom stop signal that the runtime will use to terminate containers based on that image. Stop signal configuration was not originally part of the Pod API in Kubernetes. Until Kubernetes v1.33, the only way to override the stop signal for containers was by rebuilding your container image with the new custom stop signal (for example, specifying `STOPSIGNAL` in a `Containerfile` or `Dockerfile`).

The `ContainerStopSignals` feature gate, which is newly added in Kubernetes v1.33, adds stop signals to the Kubernetes API. This allows users to specify a custom stop signal in the container spec. Stop signals are added to the API as a new lifecycle along with the existing PreStop and PostStart lifecycle handlers. In order to use this feature, we expect the Pod to have the operating system specified with `spec.os.name`. This is enforced so that we can cross-validate the stop signal against the operating system and make sure that the containers in the Pod are created with a valid stop signal for the operating system the Pod is being scheduled to. For Pods scheduled on Windows nodes, only `SIGTERM` and `SIGKILL` are allowed as valid stop signals. Find the full list of signals supported in Linux nodes here.

Default behaviour
If a container has a custom stop signal defined in its lifecycle, the container runtime would use the signal defined in the lifecycle to kill the container, given that the container runtime also supports custom stop signals. If there is no custom stop signal defined in the container lifecycle, the runtime would fall back to the stop signal defined in the container image. If there is no stop signal defined in the container image, the default stop signal of the runtime would be used. The default signal is `SIGTERM` for both containerd and CRI-O.

Version skew
For the feature to work as intended, both the version of Kubernetes and the container runtime should support container stop signals. The changes to the Kubernetes API and kubelet are available in alpha stage from v1.33, and can be enabled with the `ContainerStopSignals` feature gate. The container runtime implementations for containerd and CRI-O are still a work in progress and will be rolled out soon.

Using container stop signals
To enable this feature, you need to turn on the `ContainerStopSignals` feature gate in both the kube-apiserver and the kubelet. Once you have nodes where the feature gate is turned on, you can create Pods with a StopSignal lifecycle and a valid OS name like so:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  os:
    name: linux
  containers:
    - name: nginx
      image: nginx:latest
      lifecycle:
        stopSignal: SIGUSR1
```
Do note that the `SIGUSR1` signal in this example can only be used if the container's Pod is scheduled to a Linux node. Hence we need to specify `spec.os.name` as `linux` to be able to use the signal. You will only be able to configure `SIGTERM` and `SIGKILL` signals if the Pod is being scheduled to a Windows node. You also cannot specify `containers[*].lifecycle.stopSignal` if the `spec.os.name` field is nil or unset.

How do I get involved?
This feature is driven by the SIG Node. If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Node projects, please reach out to us!
You can reach SIG Node by several means:
You can also contact me directly:
- GitHub: @sreeram-venkitesh
- Slack: @sreeram.venkitesh
-
Kubernetes v1.33: Job's Backoff Limit Per Index Goes GA
In Kubernetes v1.33, the Backoff Limit Per Index feature reaches general availability (GA). This blog describes the Backoff Limit Per Index feature and its benefits.
About backoff limit per index
When you run workloads on Kubernetes, you must consider scenarios where Pod failures can affect the completion of your workloads. Ideally, your workload should tolerate transient failures and continue running.
To achieve failure tolerance in a Kubernetes Job, you can set the `spec.backoffLimit` field. This field specifies the total number of tolerated failures.

However, for workloads where every index is considered independent, like embarrassingly parallel workloads, the `spec.backoffLimit` field is often not flexible enough. For example, you may choose to run multiple suites of integration tests by representing each suite as an index within an Indexed Job. In that setup, a fast-failing index (test suite) is likely to consume your entire budget for tolerating Pod failures, and you might not be able to run the other indexes.
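As a sketch of that limitation (the values are arbitrary), a plain Indexed Job exposes only a single failure budget shared by all indexes:

```yaml
completionMode: Indexed
completions: 10
parallelism: 10
# Two quick failures in one flaky index exhaust this shared budget
# and fail the entire Job, even if the other indexes would succeed.
backoffLimit: 2
```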
How backoff limit per index works
To use Backoff Limit Per Index for Indexed Jobs, specify the number of tolerated Pod failures per index with the `spec.backoffLimitPerIndex` field. When you set this field, the Job executes all indexes by default.

Additionally, to fine-tune the error handling:

- Specify the cap on the total number of failed indexes by setting the `spec.maxFailedIndexes` field. When the limit is exceeded, the entire Job is terminated.
- Define a short-circuit to detect a failed index by using the `FailIndex` action in the Pod Failure Policy mechanism.

When the number of tolerated failures is exceeded, the Job marks that index as failed and lists it in the Job's `status.failedIndexes` field.
The following Job spec snippet is an example of how to combine backoff limit per index with the Pod Failure Policy feature:
```yaml
completions: 10
parallelism: 10
completionMode: Indexed
backoffLimitPerIndex: 1
maxFailedIndexes: 5
podFailurePolicy:
  rules:
    - action: Ignore
      onPodConditions:
        - type: DisruptionTarget
    - action: FailIndex
      onExitCodes:
        operator: In
        values: [42]
```
In this example, the Job handles Pod failures as follows:
- Ignores any failed Pods that have the built-in disruption condition, called `DisruptionTarget`. These Pods don't count towards Job backoff limits.
- Fails the index corresponding to the failed Pod if any of the failed Pod's containers finished with the exit code 42, based on the matching `FailIndex` rule.
- Retries the first failure of any index, unless the index failed due to the matching `FailIndex` rule.
- Fails the entire Job if the number of failed indexes exceeded 5 (set by the `spec.maxFailedIndexes` field).
Learn more
- Read the blog post on the closely related feature of Pod Failure Policy Kubernetes 1.31: Pod Failure Policy for Jobs Goes GA
- For a hands-on guide to using Pod failure policy, including the use of FailIndex, see Handling retriable and non-retriable pod failures with Pod failure policy
- Read the documentation for Backoff limit per index and Pod failure policy
- Read the KEP for the Backoff Limits Per Index For Indexed Jobs
Get involved
This work was sponsored by the Kubernetes batch working group in close collaboration with the SIG Apps community.
If you are interested in working on new features in the space, we recommend subscribing to our Slack channel and attending the regular community meetings.