Kubernetes News

The Kubernetes blog is used by the project to communicate new features, community reports, and any news that might be relevant to the Kubernetes community.
  • Volume group snapshots were introduced as an Alpha feature with the Kubernetes 1.27 release. The recent release of Kubernetes v1.32 moved that support to beta. The support for volume group snapshots relies on a set of extension APIs for group snapshots. These APIs allow users to take crash consistent snapshots for a set of volumes. Behind the scenes, Kubernetes uses a label selector to group multiple PersistentVolumeClaims for snapshotting. A key aim is to allow you to restore that set of snapshots to new volumes and recover your workload based on a crash consistent recovery point.

    This new feature is only supported for CSI volume drivers.

    An overview of volume group snapshots

    Some storage systems provide the ability to create a crash consistent snapshot of multiple volumes. A group snapshot represents copies made from multiple volumes that are taken at the same point-in-time. A group snapshot can be used either to rehydrate new volumes (pre-populated with the snapshot data) or to restore existing volumes to a previous state (represented by the snapshots).

    Why add volume group snapshots to Kubernetes?

    The Kubernetes volume plugin system already provides a powerful abstraction that automates the provisioning, attaching, mounting, resizing, and snapshotting of block and file storage.

    Underpinning all these features is the Kubernetes goal of workload portability: Kubernetes aims to create an abstraction layer between distributed applications and underlying clusters so that applications can be agnostic to the specifics of the cluster they run on and application deployment requires no cluster specific knowledge.

    There was already a VolumeSnapshot API that provides the ability to take a snapshot of a persistent volume to protect against data loss or data corruption. However, there are other snapshotting functionalities not covered by the VolumeSnapshot API.

    Some storage systems support consistent group snapshots that allow a snapshot to be taken from multiple volumes at the same point-in-time to achieve write order consistency. This can be useful for applications that contain multiple volumes. For example, an application may have data stored in one volume and logs stored in another volume. If snapshots for the data volume and the logs volume are taken at different times, the application will not be consistent and will not function properly if it is restored from those snapshots when a disaster strikes.

    It is true that you can quiesce the application first, take an individual snapshot from each volume that is part of the application one after the other, and then unquiesce the application after all the individual snapshots are taken. This way, you would get application consistent snapshots.

    However, sometimes the application quiesce can be so time consuming that you want to do it less frequently, or it may not be possible to quiesce an application at all. For example, a user may want to run weekly backups with application quiesce and nightly backups without application quiesce but with consistent group support which provides crash consistency across all volumes in the group.

    Kubernetes APIs for volume group snapshots

    Kubernetes' support for volume group snapshots relies on three API kinds that are used for managing snapshots:

    VolumeGroupSnapshot
    Created by a Kubernetes user (or perhaps by your own automation) to request creation of a volume group snapshot for multiple persistent volume claims. It contains information about the volume group snapshot operation such as the timestamp when the volume group snapshot was taken and whether it is ready to use. The creation and deletion of this object represents a desire to create or delete a cluster resource (a group snapshot).
    VolumeGroupSnapshotContent
    Created by the snapshot controller for a dynamically created VolumeGroupSnapshot. It contains information about the volume group snapshot including the volume group snapshot ID. This object represents a provisioned resource on the cluster (a group snapshot). The VolumeGroupSnapshotContent object binds to the VolumeGroupSnapshot for which it was created with a one-to-one mapping.
    VolumeGroupSnapshotClass
    Created by cluster administrators to describe how volume group snapshots should be created, including the driver information, the deletion policy, etc.

    These three API kinds are defined as CustomResourceDefinitions (CRDs). These CRDs must be installed in a Kubernetes cluster for a CSI Driver to support volume group snapshots.

    What components are needed to support volume group snapshots

    Volume group snapshots are implemented in the external-snapshotter repository. Implementing volume group snapshots meant adding or changing several components:

    • New CustomResourceDefinitions for VolumeGroupSnapshot and two supporting APIs.
    • Volume group snapshot controller logic added to the common snapshot controller.
    • Logic to make CSI calls added to the snapshotter sidecar controller.

    The volume snapshot controller and CRDs are deployed once per cluster, while the sidecar is bundled with each CSI driver.

    Therefore, it makes sense to deploy the volume snapshot controller and CRDs as a cluster addon.

    The Kubernetes project recommends that Kubernetes distributors bundle and deploy the volume snapshot controller and CRDs as part of their Kubernetes cluster management process (independent of any CSI Driver).
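
    If you are deploying these components yourself, a rough sketch of the installation (the kustomize paths are those used in the external-snapshotter repository at the time of writing; verify them against the release you use) looks like this, run from a checkout of external-snapshotter:

    # Install the snapshot and group snapshot CRDs
    kubectl kustomize client/config/crd | kubectl create -f -

    # Deploy the common snapshot-controller into the kube-system namespace
    kubectl -n kube-system kustomize deploy/kubernetes/snapshot-controller | kubectl create -f -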

    What's new in Beta?

    • The VolumeGroupSnapshot feature moved to GA in the v1.11.0 release of the CSI spec.

    • The snapshot validation webhook was deprecated in external-snapshotter v8.0.0 and has now been removed. Most of the validation webhook logic was moved into validation rules on the CRDs; these rules require a minimum Kubernetes version of 1.25. The one piece of the validation webhook not moved to the CRDs is the prevention of creating multiple default volume snapshot classes and multiple default volume group snapshot classes for the same CSI driver. Even with the validation webhook removed, an error will still be raised when dynamically provisioning a VolumeSnapshot or VolumeGroupSnapshot if multiple default volume snapshot classes or multiple default volume group snapshot classes exist for the same CSI driver.

    • The enable-volumegroup-snapshot flag in the snapshot-controller and the CSI snapshotter sidecar has been replaced by a feature gate. Because VolumeGroupSnapshot is a new API, the feature moves to beta with the feature gate disabled by default. To use this feature, enable the feature gate by adding the flag --feature-gates=CSIVolumeGroupSnapshot=true when starting the snapshot-controller and the CSI snapshotter sidecar.

    • The logic to dynamically create the VolumeGroupSnapshot and its corresponding individual VolumeSnapshot and VolumeSnapshotContent objects has been moved from the CSI snapshotter sidecar to the common snapshot-controller. New RBAC rules are added to the common snapshot-controller and some RBAC rules are removed from the CSI snapshotter sidecar accordingly.

    How do I use Kubernetes volume group snapshots

    Creating a new group snapshot with Kubernetes

    Once a VolumeGroupSnapshotClass object is defined and you have volumes you want to snapshot together, you may request a new group snapshot by creating a VolumeGroupSnapshot object.

    The source of the group snapshot specifies whether the underlying group snapshot should be dynamically created or if a pre-existing VolumeGroupSnapshotContent should be used.

    A pre-existing VolumeGroupSnapshotContent is created by a cluster administrator. It contains the details of the real volume group snapshot on the storage system which is available for use by cluster users.

    One of the following members in the source of the group snapshot must be set.

    • selector - a label query over PersistentVolumeClaims that are to be grouped together for snapshotting. This selector will be used to match the label added to a PVC.
    • volumeGroupSnapshotContentName - specifies the name of a pre-existing VolumeGroupSnapshotContent object representing an existing volume group snapshot.

    Dynamically provision a group snapshot

    In the following example, there are two PVCs.

    NAME    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      VOLUMEATTRIBUTESCLASS   AGE
    pvc-0   Bound    pvc-6e1f7d34-a5c5-4548-b104-01e72c72b9f2   100Mi      RWO            csi-hostpath-sc   <unset>                 2m15s
    pvc-1   Bound    pvc-abc640b3-2cc1-4c56-ad0c-4f0f0e636efa   100Mi      RWO            csi-hostpath-sc   <unset>                 2m7s
    

    Label the PVCs.

    % kubectl label pvc pvc-0 group=myGroup
    persistentvolumeclaim/pvc-0 labeled
    
    % kubectl label pvc pvc-1 group=myGroup
    persistentvolumeclaim/pvc-1 labeled
    

    For dynamic provisioning, a selector must be set so that the snapshot controller can find PVCs with the matching labels to be snapshotted together.

    apiVersion: groupsnapshot.storage.k8s.io/v1beta1
    kind: VolumeGroupSnapshot
    metadata:
      name: snapshot-daily-20241217
      namespace: demo-namespace
    spec:
      volumeGroupSnapshotClassName: csi-groupSnapclass
      source:
        selector:
          matchLabels:
            group: myGroup
    

    In the VolumeGroupSnapshot spec, a user can specify the VolumeGroupSnapshotClass which has the information about which CSI driver should be used for creating the group snapshot. A VolumeGroupSnapshotClass is required for dynamic provisioning.

    apiVersion: groupsnapshot.storage.k8s.io/v1beta1
    kind: VolumeGroupSnapshotClass
    metadata:
      name: csi-groupSnapclass
      annotations:
        kubernetes.io/description: "Example group snapshot class"
    driver: example.csi.k8s.io
    deletionPolicy: Delete
    

    As a result of the volume group snapshot creation, a corresponding VolumeGroupSnapshotContent object will be created with a volumeGroupSnapshotHandle pointing to a resource on the storage system.
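
    For example, you could verify the result with commands along these lines (object names will differ in your cluster):

    # List the group snapshot and its dynamically created content object
    kubectl get volumegroupsnapshot -n demo-namespace
    kubectl get volumegroupsnapshotcontent

    # Check whether the group snapshot is ready to use
    kubectl get volumegroupsnapshot snapshot-daily-20241217 -n demo-namespace \
      -o jsonpath='{.status.readyToUse}'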

    Two individual volume snapshots will be created as part of the volume group snapshot creation.

    NAME                                                                        READYTOUSE   SOURCEPVC   RESTORESIZE   SNAPSHOTCONTENT                                                                   AGE
    snapshot-0962a745b2bf930bb385b7b50c9b08af471f1a16780726de19429dd9c94eaca0   true         pvc-0       100Mi         snapcontent-0962a745b2bf930bb385b7b50c9b08af471f1a16780726de19429dd9c94eaca0     16m
    snapshot-da577d76bd2106c410616b346b2e72440f6ec7b12a75156263b989192b78caff   true         pvc-1       100Mi         snapcontent-da577d76bd2106c410616b346b2e72440f6ec7b12a75156263b989192b78caff     16m
    

    Importing an existing group snapshot with Kubernetes

    To import a pre-existing volume group snapshot into Kubernetes, you must also import the corresponding individual volume snapshots.

    Identify the individual volume snapshot handles, manually construct a VolumeSnapshotContent object first, then create a VolumeSnapshot object pointing to the VolumeSnapshotContent object. Repeat this for every individual volume snapshot.
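
    As a sketch of that step (the object names static-content-0 and static-snapshot-0 are hypothetical, and the handle reuses a value from the group snapshot example below; adjust everything for your storage system), a pre-provisioned pair for one individual snapshot could look like this:

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshotContent
    metadata:
      name: static-content-0
    spec:
      deletionPolicy: Delete
      driver: hostpath.csi.k8s.io
      source:
        snapshotHandle: e8779147-a93e-11ef-9549-66940726f2fd # an individual snapshot on the storage system
      volumeSnapshotRef:
        name: static-snapshot-0
        namespace: demo-namespace
    ---
    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshot
    metadata:
      name: static-snapshot-0
      namespace: demo-namespace
    spec:
      source:
        volumeSnapshotContentName: static-content-0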

    Then manually create a VolumeGroupSnapshotContent object, specifying the volumeGroupSnapshotHandle and individual volumeSnapshotHandles already existing on the storage system.

    apiVersion: groupsnapshot.storage.k8s.io/v1beta1
    kind: VolumeGroupSnapshotContent
    metadata:
      name: static-group-content
    spec:
      deletionPolicy: Delete
      driver: hostpath.csi.k8s.io
      source:
        groupSnapshotHandles:
          volumeGroupSnapshotHandle: e8779136-a93e-11ef-9549-66940726f2fd
          volumeSnapshotHandles:
          - e8779147-a93e-11ef-9549-66940726f2fd
          - e8783cd0-a93e-11ef-9549-66940726f2fd
      volumeGroupSnapshotRef:
        name: static-group-snapshot
        namespace: demo-namespace
    

    After that, create a VolumeGroupSnapshot object pointing to the VolumeGroupSnapshotContent object.

    apiVersion: groupsnapshot.storage.k8s.io/v1beta1
    kind: VolumeGroupSnapshot
    metadata:
      name: static-group-snapshot
      namespace: demo-namespace
    spec:
      source:
        volumeGroupSnapshotContentName: static-group-content
    

    How to use group snapshot for restore in Kubernetes

    At restore time, the user can request a new PersistentVolumeClaim to be created from a VolumeSnapshot object that is part of a VolumeGroupSnapshot. This will trigger provisioning of a new volume that is pre-populated with data from the specified snapshot. The user should repeat this until all volumes are created from all the snapshots that are part of a group snapshot.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: examplepvc-restored-2024-12-17
      namespace: demo-namespace
    spec:
      storageClassName: example-foo-nearline
      dataSource:
        name: snapshot-0962a745b2bf930bb385b7b50c9b08af471f1a16780726de19429dd9c94eaca0
        kind: VolumeSnapshot
        apiGroup: snapshot.storage.k8s.io
      accessModes:
        - ReadWriteOncePod
      resources:
        requests:
          storage: 100Mi # must be enough storage to fit the existing snapshot
    

    As a storage vendor, how do I add support for group snapshots to my CSI driver?

    To implement the volume group snapshot feature, a CSI driver must:

    • Implement a new group controller service.
    • Implement group controller RPCs: CreateVolumeGroupSnapshot, DeleteVolumeGroupSnapshot, and GetVolumeGroupSnapshot.
    • Add group controller capability CREATE_DELETE_GET_VOLUME_GROUP_SNAPSHOT.

    See the CSI spec and the Kubernetes-CSI Driver Developer Guide for more details.

    As mentioned earlier, it is strongly recommended that Kubernetes distributors bundle and deploy the volume snapshot controller and CRDs as part of their Kubernetes cluster management process (independent of any CSI Driver).

    As part of this recommended deployment process, the Kubernetes team provides a number of sidecar (helper) containers, including the external-snapshotter sidecar container which has been updated to support volume group snapshot.

    The external-snapshotter watches the Kubernetes API server for VolumeGroupSnapshotContent objects, and triggers CreateVolumeGroupSnapshot and DeleteVolumeGroupSnapshot operations against a CSI endpoint.

    What are the limitations?

    The beta implementation of volume group snapshots for Kubernetes has the following limitations:

    • Does not support reverting an existing PVC to an earlier state represented by a snapshot (only supports provisioning a new volume from a snapshot).
    • No application consistency guarantees beyond any guarantees provided by the storage system (e.g. crash consistency). See this doc for more discussions on application consistency.

    What’s next?

    Depending on feedback and adoption, the Kubernetes project plans to push the volume group snapshot implementation to general availability (GA) in a future release.

    How can I learn more?

    How do I get involved?

    This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. On behalf of SIG Storage, I would like to offer a huge thank you to the contributors who stepped up these last few quarters to help the project reach beta.

    For those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We always welcome new contributors.

    We also hold regular Data Protection Working Group meetings. New attendees are welcome to join our discussions.

  • Managing Kubernetes clusters efficiently is critical, especially as they grow in size. A significant challenge with large clusters is the memory overhead caused by list requests.

    In the existing implementation, the kube-apiserver processes list requests by assembling the entire response in-memory before transmitting any data to the client. But what if the response body is substantial, say hundreds of megabytes? Additionally, imagine a scenario where multiple list requests flood in simultaneously, perhaps after a brief network outage. While API Priority and Fairness has proven to reasonably protect kube-apiserver from CPU overload, its impact is visibly smaller for memory protection. This can be explained by the differing nature of resource consumption by a single API request - the CPU usage at any given time is capped by a constant, whereas memory, being uncompressible, can grow proportionally with the number of processed objects and is unbounded. This situation poses a genuine risk, potentially overwhelming and crashing any kube-apiserver within seconds due to out-of-memory (OOM) conditions. To better visualize the issue, let's consider the below graph.

    The graph shows the memory usage of a kube-apiserver during a synthetic test (see the synthetic test section for more details). The results clearly show that increasing the number of informers significantly boosts the server's memory consumption. Notably, at approximately 16:40, the server crashed when serving only 16 informers.

    Why does kube-apiserver allocate so much memory for list requests?

    Our investigation revealed that this substantial memory allocation occurs because, before sending the first byte to the client, the server must:

    • fetch data from the database,
    • deserialize the data from its stored format,
    • and finally construct the final response by converting and serializing the data into the client-requested format.

    This sequence results in significant temporary memory consumption. The actual usage depends on many factors like the page size, applied filters (e.g. label selectors), query parameters, and sizes of individual objects.

    Unfortunately, neither API Priority and Fairness, nor Golang's garbage collection, nor Golang memory limits can prevent the system from exhausting memory under these conditions. The memory is allocated suddenly and rapidly, and just a few requests can quickly deplete the available memory, leading to resource exhaustion.

    Depending on how the API server is run on the node, it might either be OOM-killed by the kernel when exceeding the configured memory limits during these uncontrolled spikes, or, if limits are not configured, it might have an even worse impact on the control plane node. And worse, after the first API server failure, the same requests will likely hit another control plane node in an HA setup, probably with the same impact. This is potentially a situation that is hard to diagnose and hard to recover from.

    Streaming list requests

    Today, we're excited to announce a major improvement. With the graduation of the watch list feature to beta in Kubernetes 1.32, client-go users can opt-in (after explicitly enabling WatchListClient feature gate) to streaming lists by switching from list to (a special kind of) watch requests.

    Watch requests are served from the watch cache, an in-memory cache designed to improve the scalability of read operations. By streaming each item individually instead of returning the entire collection, the new method maintains constant memory overhead. The API server is bound by the maximum allowed size of an object in etcd plus a few additional allocations. This approach drastically reduces the temporary memory usage compared to traditional list requests, ensuring a more efficient and stable system, especially in clusters with a large number of objects of a given type or with large average object sizes, where memory consumption used to be high despite paging.
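
    Under the hood, client-go issues this as a watch request that also sends the initial state. Purely as an illustration of the request shape (you would normally never construct this by hand, and the exact parameters are an assumption based on the streaming list API semantics), it looks roughly like:

    # Illustrative only: a streaming list is a watch with initial events enabled
    kubectl get --raw "/api/v1/namespaces/default/secrets?watch=true&sendInitialEvents=true&resourceVersionMatch=NotOlderThan&allowWatchBookmarks=true"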

    Building on the insight gained from the synthetic test (see the synthetic test section), we developed an automated performance test to systematically evaluate the impact of the watch list feature. This test replicates the same scenario, generating a large number of Secrets with a large payload, and scaling the number of informers to simulate heavy list request patterns. The automated test is executed periodically to monitor memory usage of the server with the feature enabled and disabled.

    The results showed significant improvements with the watch list feature enabled. With the feature turned on, the kube-apiserver’s memory consumption stabilized at approximately 2 GB. By contrast, with the feature disabled, memory usage increased to approximately 20GB, a 10x increase! These results confirm the effectiveness of the new streaming API, which reduces the temporary memory footprint.

    Enabling API Streaming for your component

    Upgrade to Kubernetes 1.32. Make sure your cluster uses etcd in version 3.4.31+ or 3.5.13+. Change your client software to use watch lists. If your client code is written in Golang, you'll want to enable WatchListClient for client-go. For details on enabling that feature, read Introducing Feature Gates to Client-Go: Enhancing Flexibility and Control.
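
    For example, if your controller is built with client-go, one way to opt in (described in that blog post; the exact mechanism depends on your client-go version, so treat this as an assumption to verify) is to set the corresponding feature-gate environment variable before starting the process:

    # Assumed mechanism: client-go feature gates can be overridden via KUBE_FEATURE_<name> environment variables
    export KUBE_FEATURE_WatchListClient=true
    ./my-controller   # hypothetical binary that uses client-go informers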

    What's next?

    In Kubernetes 1.32, the feature is enabled in kube-controller-manager by default despite its beta state. This will eventually be expanded to other core components such as kube-scheduler and the kubelet once the feature becomes generally available, if not earlier. Other third-party components are encouraged to opt in to the feature during the beta phase, especially when they are at risk of accessing a large number of resources or kinds with potentially large object sizes.

    For the time being, API Priority and Fairness assigns a reasonably small cost to list requests. This is necessary to allow enough parallelism for the average case where list requests are cheap enough, but it does not match the spiky, exceptional situation of many large objects. Once the majority of the Kubernetes ecosystem has switched to watch list, the list cost estimation can be changed to larger values without risking degraded performance in the average case, thereby increasing the protection against this kind of request, which can still hit the API server in the future.

    The synthetic test

    In order to reproduce the issue, we conducted a manual test to understand the impact of list requests on kube-apiserver memory usage. In the test, we created 400 Secrets, each containing 1 MB of data, and used informers to retrieve all Secrets.

    The results were alarming: only 16 informers were needed to cause the test server to run out of memory and crash, demonstrating how quickly memory consumption can grow under such conditions.
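
    A rough sketch of that setup (not the project's actual test harness; names and sizes are illustrative) might look like:

    # Create 400 Secrets, each carrying a large payload, then start informers that list Secrets
    # and watch the kube-apiserver memory usage.
    head -c 700000 /dev/urandom > /tmp/payload.bin   # keeps each object under the etcd size limit after encoding
    for i in $(seq 0 399); do
      kubectl create secret generic "perf-test-secret-$i" --from-file=data=/tmp/payload.bin
    done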

    Special shout out to @deads2k for his help in shaping this feature.

  • In Kubernetes v1.32, after years of community discussion, we are excited to introduce a strict-cpu-reservation option for the CPU Manager static policy. This feature is currently in alpha, with the associated policy hidden by default. You can only use the policy if you explicitly enable the alpha behavior in your cluster.

    Understanding the feature

    The CPU Manager static policy is used to reduce latency or improve performance. The reservedSystemCPUs option defines an explicit CPU set for OS system daemons and Kubernetes system daemons. This option is designed for Telco/NFV type use cases where uncontrolled interrupts/timers may impact workload performance. You can use this option to define an explicit cpuset for the system and Kubernetes daemons as well as for interrupts/timers, so that the remaining CPUs on the system can be used exclusively for workloads, with less impact from uncontrolled interrupts/timers. More details of this parameter can be found on the Explicitly Reserved CPU List page.

    If you want to protect your system daemons and interrupt processing, the obvious way is to use the reservedSystemCPUs option.

    However, until the Kubernetes v1.32 release, this isolation was only implemented for guaranteed pods that requested a whole number of CPUs. At pod admission time, the kubelet only compares the CPU requests against the allocatable CPUs. In Kubernetes, limits can be higher than the requests; the previous implementation allowed burstable and best-effort pods to use up the capacity of reservedSystemCPUs, which could then starve host OS services of CPU - and we know that people saw this in real life deployments. The existing behavior also made benchmarking results (for both infrastructure and workloads) inaccurate.

    When this new strict-cpu-reservation policy option is enabled, the CPU Manager static policy will not allow any workload to use the reserved system CPU cores.

    Enabling the feature

    To enable this feature, you need to turn on both the CPUManagerPolicyAlphaOptions feature gate and the strict-cpu-reservation policy option. You also need to remove the /var/lib/kubelet/cpu_manager_state file if it exists, and restart the kubelet, for example as shown below.
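
    On a typical systemd-managed node, that step might look like the following (exact paths and commands depend on how your kubelet is installed and managed):

    # Remove the stale CPU Manager checkpoint and restart the kubelet
    rm -f /var/lib/kubelet/cpu_manager_state
    systemctl restart kubelet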

    With the following kubelet configuration:

    kind: KubeletConfiguration
    apiVersion: kubelet.config.k8s.io/v1beta1
    featureGates:
      ...
      CPUManagerPolicyOptions: true
      CPUManagerPolicyAlphaOptions: true
    cpuManagerPolicy: static
    cpuManagerPolicyOptions:
      strict-cpu-reservation: "true"
    reservedSystemCPUs: "0,32,1,33,16,48"
    ...
    

    When strict-cpu-reservation is not set or set to false:

    # cat /var/lib/kubelet/cpu_manager_state
    {"policyName":"static","defaultCpuSet":"0-63","checksum":1058907510}
    

    When strict-cpu-reservation is set to true:

    # cat /var/lib/kubelet/cpu_manager_state
    {"policyName":"static","defaultCpuSet":"2-15,17-31,34-47,49-63","checksum":4141502832}
    

    Monitoring the feature

    You can monitor the feature impact by checking the following CPU Manager counters:

    • cpu_manager_shared_pool_size_millicores: reports the shared pool size, in millicores (e.g. 13500m)
    • cpu_manager_exclusive_cpu_allocation_count: reports exclusively allocated cores, counting full cores (e.g. 16)
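
    For example, you could check these metrics through the kubelet metrics endpoint via the API server proxy (the node name is a placeholder):

    # Scrape the kubelet's /metrics endpoint and filter for CPU Manager metrics
    kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics" | grep cpu_manager_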

    Your best-effort workloads may starve if the cpu_manager_shared_pool_size_millicores value stays at zero for a prolonged time.

    We believe any pod that is required for operational purposes, such as a log forwarder, should not run as best-effort; you can review and adjust the number of reserved CPU cores as needed.

    Conclusion

    Strict CPU reservation is critical for Telco/NFV use cases. It is also a prerequisite for enabling the all-in-one type of deployments where workloads are placed on nodes serving combined control+worker+storage roles.

    We encourage you to start using this feature and look forward to your feedback.

    Further reading

    Please check out the Control CPU Management Policies on the Node task page to learn more about the CPU Manager, and how it fits in relation to the other node-level resource managers.

    Getting involved

    This feature is driven by the SIG Node. If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Node projects, please attend the SIG Node meeting for more details.

  • With Kubernetes 1.32, the memory manager has officially graduated to General Availability (GA), marking a significant milestone in the journey toward efficient and predictable memory allocation for containerized applications. Since Kubernetes v1.22, where it graduated to beta, the memory manager has proved itself reliable, stable and a good complementary feature for the CPU Manager.

    As part of kubelet's workload admission process, the memory manager provides topology hints to optimize memory allocation and alignment. This enables users to allocate exclusive memory for Pods in the Guaranteed QoS class. More details about the process can be found in the memory manager goes to beta blog.
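
    To make that concrete, here is a minimal illustration (image and resource sizes are arbitrary) of a Guaranteed QoS Pod, i.e. one whose requests equal its limits, which is the kind of Pod that can receive exclusive, topology-aligned memory when the kubelet runs the Static memory manager policy:

    apiVersion: v1
    kind: Pod
    metadata:
      name: guaranteed-example
    spec:
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "2"
            memory: 2Gi
          limits:
            cpu: "2"
            memory: 2Gi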

    Most of the changes introduced since the Beta are bug fixes, internal refactoring and observability improvements, such as metrics and better logging.

    Observability improvements

    As part of the effort to increase the observability of memory manager, new metrics have been added to provide some statistics on memory allocation patterns.

    • memory_manager_pinning_requests_total - tracks the number of times the pod spec required the memory manager to pin memory pages.

    • memory_manager_pinning_errors_total - tracks the number of times the pod spec required the memory manager to pin memory pages, but the allocation failed.

    Improving memory manager reliability and consistency

    The kubelet does not guarantee pod ordering when admitting pods after a restart or reboot.

    In certain edge cases, this behavior could cause the memory manager to reject some pods, and in more extreme cases, it may cause kubelet to fail upon restart.

    Previously, the beta implementation lacked certain checks and logic to prevent these issues.

    To stabilize the memory manager for general availability (GA) readiness, small but critical refinements have been made to the algorithm, improving its robustness and handling of edge cases.

    Future development

    There is more to come for the future of Topology Manager in general, and memory manager in particular. Notably, ongoing efforts are underway to extend memory manager support to Windows, enabling CPU and memory affinity on a Windows operating system.

    Getting involved

    This feature is driven by the SIG Node community. Please join us to connect with the community and share your ideas and feedback around the above feature and beyond. We look forward to hearing from you!

  • The Kubernetes scheduler is the core component that selects the nodes on which new Pods run. The scheduler processes these new Pods one by one. Therefore, the larger your clusters, the more important the throughput of the scheduler becomes.

    Over the years, Kubernetes SIG Scheduling has improved the throughput of the scheduler in multiple enhancements. This blog post describes a major improvement to the scheduler in Kubernetes v1.32: a scheduling context element named QueueingHint. This page provides background knowledge of the scheduler and explains how QueueingHint improves scheduling throughput.

    Scheduling queue

    The scheduler stores all unscheduled Pods in an internal component called the scheduling queue.

    The scheduling queue consists of the following data structures:

    • ActiveQ: holds newly created Pods or Pods that are ready to be retried for scheduling.
    • BackoffQ: holds Pods that are ready to be retried but are waiting for a backoff period to end. The backoff period depends on the number of unsuccessful scheduling attempts performed by the scheduler on that Pod.
    • Unschedulable Pod Pool: holds Pods that the scheduler won't attempt to schedule for one of the following reasons:
      • The scheduler previously attempted and was unable to schedule the Pods. Since that attempt, the cluster hasn't changed in a way that could make those Pods schedulable.
      • The Pods are blocked from entering the scheduling cycle by a PreEnqueue plugin; for example, they have a scheduling gate and are blocked by the scheduling gate plugin (see the example manifest after this list).
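
    For illustration, a Pod held back by the scheduling gate plugin (the gate name here is an arbitrary example) looks like this; it stays in the unschedulable pod pool until all of its scheduling gates are removed:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gated-pod
    spec:
      schedulingGates:
      - name: example.com/wait-for-quota   # hypothetical gate, removed later by a controller
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9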

    Scheduling framework and plugins

    The Kubernetes scheduler is implemented following the Kubernetes scheduling framework.

    All scheduling features are implemented as plugins (for example, Pod affinity is implemented in the InterPodAffinity plugin).

    The scheduler processes pending Pods in phases called cycles as follows:

    1. Scheduling cycle: the scheduler takes pending Pods from the activeQ component of the scheduling queue one by one. For each Pod, the scheduler runs the filtering/scoring logic from every scheduling plugin. The scheduler then decides on the best node for the Pod, or decides that the Pod can't be scheduled at that time.

      If the scheduler decides that a Pod can't be scheduled, that Pod enters the Unschedulable Pod Pool component of the scheduling queue. However, if the scheduler decides to place the Pod on a node, the Pod goes to the binding cycle.

    2. Binding cycle: the scheduler communicates the node placement decision to the Kubernetes API server. This operation binds the Pod to the selected node.

    Aside from some exceptions, most unscheduled Pods enter the unschedulable pod pool after each scheduling cycle. The Unschedulable Pod Pool component is crucial because of how the scheduling cycle processes Pods one by one. If the scheduler had to constantly retry placing unschedulable Pods, instead of offloading those Pods to the Unschedulable Pod Pool, multiple scheduling cycles would be wasted on those Pods.

    Improvements to retrying Pod scheduling with QueueingHint

    Unschedulable Pods only move back into the ActiveQ or BackoffQ components of the scheduling queue if changes in the cluster might allow the scheduler to place those Pods on nodes.

    Prior to v1.32, each plugin used EnqueueExtensions (EventsToRegister) to register which cluster changes could resolve its failures: an object creation, update, or deletion in the cluster (called cluster events). The scheduling queue then retried a Pod when one of the events registered by a plugin that had rejected that Pod in a previous scheduling cycle occurred.

    Additionally, we had an internal feature called preCheck, which helped filter events further for efficiency, based on Kubernetes core scheduling constraints. For example, preCheck could filter out node-related events when the node status is NotReady.

    However, those approaches had two issues:

    • Requeueing with events was too broad and could lead to scheduling retries for no reason.
      • A newly scheduled Pod might resolve an InterPodAffinity failure, but not every new Pod does. For example, if a new Pod is created without a label matching the InterPodAffinity constraint of the unschedulable Pod, that Pod would still not be schedulable.
    • preCheck relied on the logic of in-tree plugins and was not extensible to custom plugins, like in issue #110175.

    Here QueueingHints come into play; a QueueingHint subscribes to a particular kind of cluster event, and makes a decision about whether each incoming event could make the Pod schedulable.

    For example, consider a Pod named pod-a that has a required Pod affinity. pod-a was rejected in the scheduling cycle by the InterPodAffinity plugin because no node had an existing Pod that matched the Pod affinity specification for pod-a.
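
    For reference, pod-a's requirement might look something like this (the label selector, topology key, and image are assumptions for the example):

    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-a
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: backend        # no existing Pod carries this label yet
            topologyKey: kubernetes.io/hostname
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9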

    A diagram showing the scheduling queue and pod-a rejected by InterPodAffinity plugin

    pod-a moves into the Unschedulable Pod Pool. The scheduling queue records which plugin caused the scheduling failure for the Pod. For pod-a, the scheduling queue records that the InterPodAffinity plugin rejected the Pod.

    pod-a will never be schedulable until the InterPodAffinity failure is resolved. There are some scenarios in which the failure could be resolved; one example is an existing running Pod getting a label update that makes it match the Pod affinity. For this scenario, the InterPodAffinity plugin's QueueingHint callback function checks every Pod label update that occurs in the cluster. Then, if a Pod gets a label update that matches the Pod affinity requirement of pod-a, the InterPodAffinity plugin's QueueingHint prompts the scheduling queue to move pod-a back into the ActiveQ or the BackoffQ component.

    A diagram showing the scheduling queue and pod-a being moved by InterPodAffinity QueueingHint

    QueueingHint's history and what's new in v1.32

    At SIG Scheduling, we have been working on the development of QueueingHint since Kubernetes v1.28.

    While QueueingHint isn't user-facing, we implemented the SchedulerQueueingHints feature gate as a safety measure when we originally added this feature. In v1.28, we implemented QueueingHints with a few in-tree plugins experimentally, and enabled the feature gate by default.

    However, users reported a memory leak, and consequently we disabled the feature gate in a patch release of v1.28. From v1.28 until v1.31, we kept working on the QueueingHint implementation within the rest of the in-tree plugins and fixing bugs.

    In v1.32, we enabled this feature by default again. We finished implementing QueueingHints in all plugins and also identified the cause of the memory leak!

    We thank all the contributors who participated in the development of this feature and those who reported and investigated the earlier issues.

    Getting involved

    These features are managed by Kubernetes SIG Scheduling.

    Please join us and share your feedback.

    How can I learn more?