Kubernetes News

The Kubernetes blog is used by the project to communicate new features, community reports, and any news that might be relevant to the Kubernetes community.
  • Since as early as Kubernetes v1.7, the Kubernetes project has pursued the ambitious goal of removing built-in cloud provider integrations (KEP-2395). While these integrations were instrumental in Kubernetes' early development and growth, their removal was driven by two key factors: the growing complexity of maintaining native support for every cloud provider across millions of lines of Go code, and the desire to establish Kubernetes as a truly vendor-neutral platform.

    After many releases, we're thrilled to announce that all cloud provider integrations have been successfully migrated from the core Kubernetes repository to external plugins. In addition to achieving our initial objectives, we've also significantly streamlined Kubernetes by removing roughly 1.5 million lines of code and reducing the binary sizes of core components by approximately 40%.

    This migration was a complex and long-running effort due to the numerous impacted components and the critical code paths that relied on the built-in integrations for the five initial cloud providers: Google Cloud, AWS, Azure, OpenStack, and vSphere. To successfully complete this migration, we had to build four new subsystems from the ground up:

    1. Cloud controller manager (KEP-2392)
    2. API server network proxy (KEP-1281)
    3. kubelet credential provider plugins (KEP-2133)
    4. Storage migration to use CSI (KEP-625)

    Each subsystem was critical to achieving full feature parity with the built-in capabilities, and each required several releases to reach GA-level maturity with a safe and reliable migration path. More on each subsystem below.

    Cloud controller manager

    The cloud controller manager was the first external component introduced in this effort, replacing functionality within the kube-controller-manager and kubelet that directly interacted with cloud APIs. This essential component is responsible for initializing nodes by applying metadata labels that indicate the cloud region and zone a Node is running on, as well as IP addresses that are only known to the cloud provider. Additionally, it runs the service controller, which is responsible for provisioning cloud load balancers for Services of type LoadBalancer.

    Kubernetes components

    To learn more, read Cloud Controller Manager in the Kubernetes documentation.

    API server network proxy

    The API Server Network Proxy project, initiated in 2018 in collaboration with SIG API Machinery, aimed to replace the SSH tunneler functionality within the kube-apiserver. This tunneler had been used to securely proxy traffic between the Kubernetes control plane and nodes, but it heavily relied on provider-specific implementation details embedded in the kube-apiserver to establish these SSH tunnels.

    Now, the API Server Network Proxy is a GA-level extension point within the kube-apiserver. It offers a generic proxying mechanism that can route traffic from the API server to nodes through a secure proxy, eliminating the need for the API server to have any knowledge of the specific cloud provider it is running on. This project also introduced the Konnectivity project, which has seen growing adoption in production environments.

    You can learn more about the API Server Network Proxy from its README.

    Credential provider plugins for the kubelet

    The Kubelet credential provider plugin was developed to replace the kubelet's built-in functionality for dynamically fetching credentials for image registries hosted on Google Cloud, AWS, or Azure. The legacy capability was convenient as it allowed the kubelet to seamlessly retrieve short-lived tokens for pulling images from GCR, ECR, or ACR. However, like other areas of Kubernetes, supporting this required the kubelet to have specific knowledge of different cloud environments and APIs.

    Introduced in 2019, the credential provider plugin mechanism offers a generic extension point for the kubelet to execute plugin binaries that dynamically provide credentials for images hosted on various clouds. This extensibility expands the kubelet's capabilities to fetch short-lived tokens beyond the initial three cloud providers.
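
    As a rough illustration of that extension point, the sketch below shows the core of a plugin binary in Go: it turns a request for an image into a response carrying credentials for that image's registry. The struct fields reflect my understanding of the CredentialProviderRequest and CredentialProviderResponse JSON shapes and should be checked against the kubelet documentation; fetchToken, the token values, and the registry name are hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// request and response mirror, as far as I understand them, the JSON shapes
// exchanged between the kubelet and a plugin. Verify the field names against
// the kubelet documentation before relying on them.
type request struct {
	APIVersion string `json:"apiVersion"`
	Kind       string `json:"kind"`
	Image      string `json:"image"`
}

type authConfig struct {
	Username string `json:"username"`
	Password string `json:"password"`
}

type response struct {
	APIVersion    string                `json:"apiVersion"`
	Kind          string                `json:"kind"`
	CacheKeyType  string                `json:"cacheKeyType"`
	CacheDuration string                `json:"cacheDuration"`
	Auth          map[string]authConfig `json:"auth"`
}

// fetchToken is hypothetical: a real plugin would call its cloud's token API.
func fetchToken(registry string) (user, token string) {
	return "token-user", "short-lived-token-for-" + registry
}

// buildResponse maps one image pull request to credentials for its registry.
func buildResponse(req request) (response, error) {
	if req.Image == "" {
		return response{}, errors.New("request contains no image")
	}
	// Everything before the first slash is treated as the registry host.
	registry := strings.SplitN(req.Image, "/", 2)[0]
	user, token := fetchToken(registry)
	return response{
		APIVersion:    req.APIVersion,
		Kind:          "CredentialProviderResponse",
		CacheKeyType:  "Registry",
		CacheDuration: "10m",
		Auth: map[string]authConfig{
			registry: {Username: user, Password: token},
		},
	}, nil
}

func main() {
	// A real plugin would decode the request from os.Stdin instead.
	in := []byte(`{"apiVersion":"credentialprovider.kubelet.k8s.io/v1",` +
		`"kind":"CredentialProviderRequest","image":"registry.example.com/app:1.0"}`)
	var req request
	if err := json.Unmarshal(in, &req); err != nil {
		panic(err)
	}
	resp, err := buildResponse(req)
	if err != nil {
		panic(err)
	}
	out, err := json.Marshal(resp)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

    Keeping the cloud-specific call behind a single function is what lets the kubelet stay unaware of any particular provider.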

    To learn more, read kubelet credential provider for authenticated image pulls.

    Storage plugin migration from in-tree to CSI

    The Container Storage Interface (CSI) is a control plane standard for managing block and file storage systems in Kubernetes and other container orchestrators; it went GA in Kubernetes 1.13. It was designed to replace the in-tree volume plugins built directly into Kubernetes with drivers that can run as Pods within the Kubernetes cluster. These drivers communicate with kube-controller-manager storage controllers via the Kubernetes API, and with kubelet through a local gRPC endpoint. Now there are over 100 CSI drivers available across all major cloud and storage vendors, making stateful workloads in Kubernetes a reality.

    However, a major challenge remained on how to handle all the existing users of in-tree volume APIs. To retain API backwards compatibility, we built an API translation layer into our controllers that will convert the in-tree volume API into the equivalent CSI API. This allowed us to redirect all storage operations to the CSI driver, paving the way for us to remove the code for the built-in volume plugins without removing the API.
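
    The idea behind such a translation layer can be sketched in a few lines of Go. This is not the Kubernetes implementation: the types, the translate function, and the plugin-to-driver table are simplified assumptions made for illustration, and the real translation handles many more fields per plugin.

```go
package main

import "fmt"

// inTreeVolume is a simplified, hypothetical view of an in-tree volume source.
type inTreeVolume struct {
	Plugin   string // in-tree plugin name, e.g. "kubernetes.io/aws-ebs"
	VolumeID string
	FSType   string
}

// csiVolume is a simplified, hypothetical view of the equivalent CSI source.
type csiVolume struct {
	Driver       string
	VolumeHandle string
	FSType       string
}

// driverFor maps in-tree plugin names to CSI driver names (illustrative subset).
var driverFor = map[string]string{
	"kubernetes.io/aws-ebs": "ebs.csi.aws.com",
	"kubernetes.io/gce-pd":  "pd.csi.storage.gke.io",
}

// translate converts an in-tree volume into its CSI form so that the storage
// operation can be redirected to the CSI driver while the user-facing API
// object stays unchanged.
func translate(v inTreeVolume) (csiVolume, error) {
	driver, ok := driverFor[v.Plugin]
	if !ok {
		return csiVolume{}, fmt.Errorf("no CSI migration known for plugin %q", v.Plugin)
	}
	return csiVolume{Driver: driver, VolumeHandle: v.VolumeID, FSType: v.FSType}, nil
}

func main() {
	out, err := translate(inTreeVolume{Plugin: "kubernetes.io/aws-ebs", VolumeID: "vol-0abc", FSType: "ext4"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", out)
}
```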

    You can learn more about In-tree Storage migration in Kubernetes In-Tree to CSI Volume Migration Moves to Beta.

    What's next?

    This migration has been the primary focus for SIG Cloud Provider over the past few years. With this significant milestone achieved, we will be shifting our efforts towards exploring new and innovative ways for Kubernetes to better integrate with cloud providers, leveraging the external subsystems we've built over the years. This includes making Kubernetes smarter in hybrid environments where nodes in the cluster can run on both public and private clouds, as well as providing better tools and frameworks for developers of external providers to simplify and streamline their integration efforts.

    With all the new features, tools, and frameworks being planned, SIG Cloud Provider is not forgetting about the other side of the equation: testing. Another area of focus for the SIG's future activities is the improvement of cloud controller testing to include more providers. The ultimate goal of this effort is to create a testing framework that includes as many providers as possible, so that we give the Kubernetes community the highest level of confidence in their Kubernetes environments.

    If you're using a version of Kubernetes older than v1.29 and haven't migrated to an external cloud provider yet, we recommend checking out our previous blog post Kubernetes 1.29: Cloud Provider Integrations Are Now Separate Components. It provides detailed information on the changes we've made and offers guidance on how to migrate to an external provider. Starting in v1.31, in-tree cloud providers will be permanently disabled and removed from core Kubernetes components.

    If you’re interested in contributing, come join our bi-weekly SIG meetings!

  • Following the GA release of Gateway API last October, Kubernetes SIG Network is pleased to announce the v1.1 release of Gateway API. In this release, several features are graduating to Standard Channel (GA), notably including support for service mesh and GRPCRoute. We're also introducing some new experimental features, including session persistence and client certificate verification.

    What's new

    Graduation to Standard

    This release includes the graduation to Standard of four eagerly awaited features. This means they are no longer experimental concepts; inclusion in the Standard release channel denotes a high level of confidence in the API surface and provides guarantees of backward compatibility. Of course, as with any other Kubernetes API, Standard Channel features can continue to evolve with backward-compatible additions over time, and we certainly expect further refinements and improvements to these new features in the future. For more information on how all of this works, refer to the Gateway API Versioning Policy.

    Service Mesh Support

    Service mesh support in Gateway API allows service mesh users to use the same API to manage ingress traffic and mesh traffic, reusing the same policy and routing interfaces. In Gateway API v1.1, routes (such as HTTPRoute) can now have a Service as a parentRef, to control how traffic to specific services behaves. For more information, read the Gateway API service mesh documentation or see the list of Gateway API implementations.

    As an example, one could do a canary deployment of a workload deep in an application's call graph with an HTTPRoute as follows:

    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: color-canary
      namespace: faces
    spec:
      parentRefs:
      - name: color
        kind: Service
        group: ""
        port: 80
      rules:
      - backendRefs:
        - name: color
          port: 80
          weight: 50
        - name: color2
          port: 80
          weight: 50
    

    This would split traffic sent to the color Service in the faces namespace 50/50 between the original color Service and the color2 Service, using a portable configuration that's easy to move from one mesh to another.
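
    To make the effect of the weight fields concrete, here is a minimal Go sketch of weighted backend selection. It is an assumption about how a data plane might honor the weights, not the algorithm of any particular mesh; the backend type and the pick function are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
)

// backend is a hypothetical stand-in for one backendRef and its weight.
type backend struct {
	Name   string
	Weight int
}

// totalWeight sums the weights of all backends.
func totalWeight(bs []backend) int {
	sum := 0
	for _, b := range bs {
		sum += b.Weight
	}
	return sum
}

// pick maps a number n in [0, totalWeight) to a backend, so that each backend
// is chosen in proportion to its weight. It returns "" if n is out of range.
func pick(bs []backend, n int) string {
	if n < 0 {
		return ""
	}
	for _, b := range bs {
		if n < b.Weight {
			return b.Name
		}
		n -= b.Weight
	}
	return ""
}

func main() {
	bs := []backend{{Name: "color", Weight: 50}, {Name: "color2", Weight: 50}}
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[pick(bs, rand.Intn(totalWeight(bs)))]++
	}
	// With equal weights, both counts should land close to 5000.
	fmt.Println(counts)
}
```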

    GRPCRoute

    If you are already using the experimental version of GRPCRoute, we recommend holding off on upgrading to the standard channel version of GRPCRoute until the controllers you're using have been updated to support GRPCRoute v1. Until then, it is safe to upgrade to the experimental channel version of GRPCRoute in v1.1 that includes both v1alpha2 and v1 API versions.

    ParentReference Port

    The port field was added to ParentReference, allowing you to attach resources to Gateway Listeners, Services, or other parent resources (depending on the implementation). Binding to a port also allows you to attach to multiple Listeners at once.

    For example, you can attach an HTTPRoute to one or more specific Listeners of a Gateway as specified by the Listener port, instead of the Listener name field.

    For more information, see Attaching to Gateways.

    Conformance Profiles and Reports

    The conformance report API has been expanded with the mode field (intended to specify the working mode of the implementation) and the gatewayAPIChannel field (standard or experimental). The gatewayAPIVersion and gatewayAPIChannel fields are now filled in automatically by the suite machinery, along with a brief description of the testing outcome. The reports have been reorganized in a more structured way, and implementations can now add information on how the tests were run and provide reproduction steps.

    New additions to Experimental channel

    Gateway Client Certificate Verification

    Gateways can now configure client certificate verification for each Gateway Listener through a new frontendValidation field within tls. This field supports configuring a list of CA certificates that can be used as trust anchors to validate the certificates presented by the client.

    The following example shows how the CACertificate stored in the foo-example-com-ca-cert ConfigMap can be used to validate the certificates presented by clients connecting to the foo-https Gateway Listener.

    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: client-validation-basic
    spec:
      gatewayClassName: acme-lb
      listeners:
      - name: foo-https
        protocol: HTTPS
        port: 443
        hostname: foo.example.com
        tls:
          certificateRefs:
          - kind: Secret
            group: ""
            name: foo-example-com-cert
          frontendValidation:
            caCertificateRefs:
            - kind: ConfigMap
              group: ""
              name: foo-example-com-ca-cert
    

    Session Persistence and BackendLBPolicy

    Session Persistence is being introduced to Gateway API via a new policy (BackendLBPolicy) for Service-level configuration and as fields within HTTPRoute and GRPCRoute for route-level configuration. The BackendLBPolicy and route-level APIs provide the same session persistence configuration, including session timeouts, session name, session type, and cookie lifetime type.

    Below is an example configuration of BackendLBPolicy that enables cookie-based session persistence for the foo service. It sets the session name to foo-session, defines absolute and idle timeouts, and configures the cookie to be a session cookie:

    apiVersion: gateway.networking.k8s.io/v1alpha2
    kind: BackendLBPolicy
    metadata:
      name: lb-policy
      namespace: foo-ns
    spec:
      targetRefs:
      - group: core
        kind: service
        name: foo
      sessionPersistence:
        sessionName: foo-session
        absoluteTimeout: 1h
        idleTimeout: 30m
        type: Cookie
        cookieConfig:
          lifetimeType: Session
    

    Everything else

    TLS Terminology Clarifications

    As part of a broader goal of making our TLS terminology more consistent throughout the API, we've introduced some breaking changes to BackendTLSPolicy. This has resulted in a new API version (v1alpha3) and will require any existing implementations of this policy to properly handle the version upgrade, e.g. by backing up data and uninstalling the v1alpha2 version before installing this newer version.

    Any references to v1alpha2 BackendTLSPolicy fields will need to be updated to v1alpha3. Specific changes to fields include:

    • targetRef becomes targetRefs to allow a BackendTLSPolicy to attach to multiple targets
    • tls becomes validation
    • tls.caCertRefs becomes validation.caCertificateRefs
    • tls.wellKnownCACerts becomes validation.wellKnownCACertificates

    For a full list of the changes included in this release, please refer to the v1.1.0 release notes.

    Gateway API background

    The idea of Gateway API was initially proposed at the 2019 KubeCon San Diego as the next generation of Ingress API. Since then, an incredible community has formed to develop what has likely become the most collaborative API in Kubernetes history. Over 200 people have contributed to this API so far, and that number continues to grow.

    The maintainers would like to thank everyone who's contributed to Gateway API, whether in the form of commits to the repo, discussion, ideas, or general support. We literally couldn't have gotten this far without the support of this dedicated and active community.

    Try it out

    Unlike other Kubernetes APIs, you don't need to upgrade to the latest version of Kubernetes to get the latest version of Gateway API. As long as you're running Kubernetes 1.26 or later, you'll be able to get up and running with this version of Gateway API.

    To try out the API, follow our Getting Started Guide.

    Get involved

    There are lots of opportunities to get involved and help define the future of Kubernetes routing APIs for both ingress and service mesh.

  • The Kubernetes Container Runtime Interface (CRI) acts as the main connection between the kubelet and the container runtime. Runtimes have to provide a gRPC server that fulfills a Protocol Buffer interface defined by Kubernetes. This API definition evolves over time, for example when contributors add new features or when fields become deprecated.

    In this blog post, I'd like to dive into the functionality and history of three extraordinary Remote Procedure Calls (RPCs), which are truly outstanding in terms of how they work: Exec, Attach and PortForward.

    Exec can be used to run dedicated commands within the container and stream the output to a client like kubectl or crictl. It also allows interaction with that process using standard input (stdin), for example if users want to run a new shell instance within an existing workload.

    Attach streams the output of the currently running process via standard I/O from the container to the client and also allows interaction with it. This is particularly useful if users want to see what is going on in the container and be able to interact with the process.

    PortForward can be utilized to forward a port from the host to the container to be able to interact with it using third-party network tools. This allows the client to bypass Kubernetes services for a certain workload and interact with its network interface directly.

    What is so special about them?

    All RPCs of the CRI use either gRPC unary calls or the server-side streaming feature (only GetContainerEvents right now). This means that nearly all RPCs receive a single client request and have to return a single server response. The same applies to Exec, Attach, and PortForward, whose protocol definitions look like this:

    // Exec prepares a streaming endpoint to execute a command in the container.
    rpc Exec(ExecRequest) returns (ExecResponse) {}
    
    // Attach prepares a streaming endpoint to attach to a running container.
    rpc Attach(AttachRequest) returns (AttachResponse) {}
    
    // PortForward prepares a streaming endpoint to forward ports from a PodSandbox.
    rpc PortForward(PortForwardRequest) returns (PortForwardResponse) {}
    

    The requests carry everything required to allow the server to do the work, for example, the ContainerId or command (Cmd) to be run in case of Exec. More interestingly, all of their responses only contain a url:

    message ExecResponse {
        // Fully qualified URL of the exec streaming server.
        string url = 1;
    }
    
    message AttachResponse {
        // Fully qualified URL of the attach streaming server.
        string url = 1;
    }
    
    message PortForwardResponse {
        // Fully qualified URL of the port-forward streaming server.
        string url = 1;
    }
    

    Why is it implemented like that? Well, the original design document for those RPCs even predates Kubernetes Enhancements Proposals (KEPs) and was originally outlined back in 2016. The kubelet had a native implementation for Exec, Attach, and PortForward before the initiative to bring the functionality to the CRI started. Before that, everything was bound to Docker or the later abandoned container runtime rkt.

    The CRI-related design document also elaborates on the option to use native RPC streaming for exec, attach, and port forward. The downsides of that approach outweighed its benefits: the kubelet would still create a network bottleneck, and future runtimes would not be free to choose the server implementation details. Another option, in which the kubelet implements a portable, runtime-agnostic solution, was also dropped in favor of the final one, because it would have meant another project to maintain that would nevertheless be runtime dependent.

    This means that the basic flow for Exec, Attach and PortForward was proposed to look like this:

    Clients like crictl or the kubelet (via kubectl) request a new exec, attach or port forward session from the runtime using the gRPC interface. The runtime implements a streaming server that also manages the active sessions. This streaming server provides an HTTP endpoint for the client to connect to. The client upgrades the connection to use the SPDY streaming protocol or (in the future) to a WebSocket connection and starts to stream the data back and forth.

    This implementation allows runtimes to have the flexibility to implement Exec, Attach and PortForward the way they want, and also allows a simple test path. Runtimes can change the underlying implementation to support any kind of feature without having a need to modify the CRI at all.

    Many smaller enhancements to this overall approach have been merged into Kubernetes in the past years, but the general pattern has always stayed the same. The kubelet source code was transformed into a reusable library, which container runtimes can nowadays use to implement the basic streaming capability.

    How does the streaming actually work?

    At a first glance, it looks like all three RPCs work the same way, but that's not the case. It's possible to group the functionality of Exec and Attach, while PortForward follows a distinct internal protocol definition.

    Exec and Attach

    Kubernetes defines Exec and Attach as remote commands, whose protocol definition exists in five different versions:

    #  Version            Note
    1  channel.k8s.io     Initial (unversioned) SPDY sub protocol (#13394, #13395)
    2  v2.channel.k8s.io  Resolves the issues present in the first version (#15961)
    3  v3.channel.k8s.io  Adds support for resizing container terminals (#25273)
    4  v4.channel.k8s.io  Adds support for exit codes using JSON errors (#26541)
    5  v5.channel.k8s.io  Adds support for a CLOSE signal (#119157)

    On top of that, there is an overall effort to replace the SPDY transport protocol with WebSockets as part of KEP #4006. Runtimes have to satisfy those protocols over their life cycle to stay up to date with the Kubernetes implementation.
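
    Supporting several versions side by side implies some form of negotiation between client and server. The Go sketch below shows one plausible strategy, where the server walks its own preference list (newest first) and takes the first version the client also offers. The negotiate function and the ordering are assumptions for illustration, not the logic of the kubelet library.

```go
package main

import "fmt"

// serverProtocols lists the remote command sub protocols from the table
// above, ordered from newest to oldest.
var serverProtocols = []string{
	"v5.channel.k8s.io",
	"v4.channel.k8s.io",
	"v3.channel.k8s.io",
	"v2.channel.k8s.io",
	"channel.k8s.io",
}

// negotiate returns the newest protocol supported by both sides, or false if
// the client offered nothing the server understands.
func negotiate(clientOffers []string) (string, bool) {
	offered := make(map[string]bool, len(clientOffers))
	for _, p := range clientOffers {
		offered[p] = true
	}
	for _, p := range serverProtocols {
		if offered[p] {
			return p, true
		}
	}
	return "", false
}

func main() {
	// An older client that only knows versions up to v4.
	proto, ok := negotiate([]string{"channel.k8s.io", "v4.channel.k8s.io", "v3.channel.k8s.io"})
	fmt.Println(proto, ok)
}
```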

    Let's assume that a client uses the latest (v5) version of the protocol as well as communicating over WebSockets. In that case, the general flow would be:

    1. The client requests a URL endpoint for Exec or Attach using the CRI.

      • The server (runtime) validates the request, inserts it into a connection tracking cache, and provides the HTTP endpoint URL for that request.
    2. The client connects to that URL, upgrades the connection to establish a WebSocket, and starts to stream data.

      • In the case of Attach, the server has to stream the main container process data to the client.
      • In the case of Exec, the server has to create the subprocess command within the container and then stream the output to the client.

      If stdin is required, then the server needs to listen for that as well and redirect it to the corresponding process.
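
    The first step of this flow, caching the request and handing back a URL, can be sketched in Go as follows. The types, the /exec/ path layout, and the single-use token scheme are assumptions chosen for illustration; the kubelet library's actual cache differs in its details (for example, entries there also expire).

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"sync"
)

// execRequest is a hypothetical, trimmed-down version of the CRI ExecRequest.
type execRequest struct {
	ContainerID string
	Cmd         []string
}

// requestCache tracks requests that were validated but not yet connected to.
type requestCache struct {
	mu      sync.Mutex
	pending map[string]execRequest
}

func newRequestCache() *requestCache {
	return &requestCache{pending: map[string]execRequest{}}
}

// insert stores the request under a fresh random token.
func (c *requestCache) insert(req execRequest) (string, error) {
	buf := make([]byte, 8)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	token := hex.EncodeToString(buf)
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pending[token] = req
	return token, nil
}

// consume returns the request for a token and removes it, so that each URL
// can be used for exactly one streaming session.
func (c *requestCache) consume(token string) (execRequest, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	req, ok := c.pending[token]
	delete(c.pending, token)
	return req, ok
}

// execURL is what the unary Exec RPC would return in its response.
func execURL(base string, c *requestCache, req execRequest) (string, error) {
	token, err := c.insert(req)
	if err != nil {
		return "", err
	}
	return base + "/exec/" + token, nil
}

func main() {
	cache := newRequestCache()
	url, err := execURL("http://127.0.0.1:10010", cache, execRequest{ContainerID: "abc", Cmd: []string{"sh"}})
	if err != nil {
		panic(err)
	}
	fmt.Println(url)
}
```

    The HTTP handler behind that URL would then call consume with the token from the path before upgrading the connection and starting to stream.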

    Interpreting data for the defined protocol is fairly simple: The first byte of every input and output packet defines the actual stream:

    First Byte  Type             Description
    0           standard input   Data streamed from stdin
    1           standard output  Data streamed to stdout
    2           standard error   Data streamed to stderr
    3           stream error     A streaming error occurred
    4           stream resize    A terminal resize event
    255         stream close     Stream should be closed (for WebSockets)
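
    Based on the table above, splitting a packet into its stream type and payload could look like the following Go sketch. The constant and function names are made up for this example; only the byte values come from the table.

```go
package main

import (
	"errors"
	"fmt"
)

// Stream identifiers as listed in the table above.
const (
	streamStdin  byte = 0
	streamStdout byte = 1
	streamStderr byte = 2
	streamError  byte = 3
	streamResize byte = 4
	streamClose  byte = 255
)

// streamName returns a readable name for a stream identifier, or "" if the
// identifier is not part of the protocol.
func streamName(id byte) string {
	switch id {
	case streamStdin:
		return "stdin"
	case streamStdout:
		return "stdout"
	case streamStderr:
		return "stderr"
	case streamError:
		return "error"
	case streamResize:
		return "resize"
	case streamClose:
		return "close"
	}
	return ""
}

// decode splits a packet into the stream it belongs to and its payload.
func decode(packet []byte) (string, []byte, error) {
	if len(packet) == 0 {
		return "", nil, errors.New("empty packet")
	}
	name := streamName(packet[0])
	if name == "" {
		return "", nil, fmt.Errorf("unknown stream identifier %d", packet[0])
	}
	return name, packet[1:], nil
}

func main() {
	name, payload, err := decode(append([]byte{streamStdout}, "hello"...))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s: %q\n", name, payload)
}
```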

    How should runtimes now implement the streaming server methods for Exec and Attach by using the provided kubelet library? The key is that the streaming server implementation in the kubelet outlines an interface called Runtime which has to be fulfilled by the actual container runtime if it wants to use that library:

    // Runtime is the interface to execute the commands and provide the streams.
    type Runtime interface {
        Exec(ctx context.Context, containerID string, cmd []string, in io.Reader, out, err io.WriteCloser, tty bool, resize <-chan remotecommand.TerminalSize) error
        Attach(ctx context.Context, containerID string, in io.Reader, out, err io.WriteCloser, tty bool, resize <-chan remotecommand.TerminalSize) error
        PortForward(ctx context.Context, podSandboxID string, port int32, stream io.ReadWriteCloser) error
    }
    

    Everything related to the protocol interpretation is already in place and runtimes only have to implement the actual Exec and Attach logic. For example, the container runtime CRI-O implements Exec roughly as in the following pseudocode:

    func (s StreamService) Exec(
        ctx context.Context,
        containerID string,
        cmd []string,
        stdin io.Reader, stdout, stderr io.WriteCloser,
        tty bool,
        resizeChan <-chan remotecommand.TerminalSize,
    ) error {
        // Retrieve the container by the provided containerID
        // …
    
        // Update the container status and verify that the workload is running
        // …
    
        // Execute the command and stream the data
        return s.runtimeServer.Runtime().ExecContainer(
            s.ctx, c, cmd, stdin, stdout, stderr, tty, resizeChan,
        )
    }
    

    PortForward

    Forwarding ports to a container works a bit differently when comparing it to streaming IO data from a workload. The server still has to provide a URL endpoint for the client to connect to, but then the container runtime has to enter the network namespace of the container, allocate the port as well as stream the data back and forth. There is no simple protocol definition available like for Exec or Attach. This means that the client will stream the plain SPDY frames (with or without an additional WebSocket connection) which can be interpreted using libraries like moby/spdystream.

    Luckily, the kubelet library already provides the PortForward interface method which has to be implemented by the runtime. CRI-O does that by (simplified):

    func (s StreamService) PortForward(
        ctx context.Context,
        podSandboxID string,
        port int32,
        stream io.ReadWriteCloser,
    ) error {
        // Retrieve the pod sandbox by the provided podSandboxID
        sandboxID, err := s.runtimeServer.PodIDIndex().Get(podSandboxID)
        if err != nil {
            return err
        }
        sb := s.runtimeServer.GetSandbox(sandboxID)
        // …
    
        // Get the network namespace path on disk for that sandbox
        netNsPath := sb.NetNsPath()
        // …
    
        // Enter the network namespace and stream the data
        return s.runtimeServer.Runtime().PortForwardContainer(
            ctx, sb.InfraContainer(), netNsPath, port, stream,
        )
    }
    

    Future work

    The flexibility Kubernetes provides for the RPCs Exec, Attach and PortForward is truly outstanding compared to other methods. Nevertheless, container runtimes have to keep up with the latest and greatest implementations to support those features in a meaningful way. The general effort to support WebSockets is not only a plain Kubernetes thing, it also has to be supported by container runtimes as well as clients like crictl.

    For example, crictl v1.30 features a new --transport flag for the subcommands exec, attach and port-forward (#1383, #1385) to allow choosing between websocket and spdy.

    CRI-O is taking an experimental path by moving the streaming server implementation into conmon-rs (a substitute for the container monitor conmon). conmon-rs is a Rust implementation of the original container monitor and allows streaming WebSockets directly using supported libraries (#2070). The major benefit of this approach is that CRI-O does not even have to be running while conmon-rs keeps active Exec, Attach and PortForward sessions open. The simplified flow when using crictl directly will then look like this:

    sequenceDiagram
        autonumber
        participant crictl
        participant runtime as Container Runtime
        participant conmon-rs
        Note over crictl,runtime: Container Runtime Interface (CRI)
        crictl->>runtime: Exec, Attach, PortForward
        Note over runtime,conmon-rs: Cap’n Proto
        runtime->>conmon-rs: Serve Exec, Attach, PortForward
        conmon-rs->>runtime: HTTP endpoint (URL)
        runtime->>crictl: Response URL
        crictl-->>conmon-rs: Connection upgrade to WebSocket
        conmon-rs-)crictl: Stream data

    All of those enhancements require iterative design decisions, while the original well-conceived implementation acts as the foundation for those. I really hope you've enjoyed this compact journey through the history of CRI RPCs. Feel free to reach out to me anytime for suggestions or feedback using the official Kubernetes Slack.

  • With the release of Kubernetes 1.30, the feature that prevents modification of the volume mode of a PersistentVolumeClaim created from an existing VolumeSnapshot has moved to GA!

    The problem

    The Volume Mode of a PersistentVolumeClaim refers to whether the underlying volume on the storage device is formatted into a filesystem or presented as a raw block device to the Pod that uses it.

    Users can leverage the VolumeSnapshot feature, which has been stable since Kubernetes v1.20, to create a PersistentVolumeClaim (shortened as PVC) from an existing VolumeSnapshot in the Kubernetes cluster. The PVC spec includes a dataSource field, which can point to an existing VolumeSnapshot instance. Visit Create a PersistentVolumeClaim from a Volume Snapshot for more details on how to create a PVC from an existing VolumeSnapshot in a Kubernetes cluster.

    When leveraging the above capability, there is no logic that validates whether the mode of the original volume, whose snapshot was taken, matches the mode of the newly created volume.

    This presents a security gap that allows malicious users to potentially exploit an as-yet-unknown vulnerability in the host operating system.

    There is a valid use case to allow some users to perform such conversions. Typically, storage backup vendors convert the volume mode during the course of a backup operation, to retrieve changed blocks for greater efficiency of operations. This prevents Kubernetes from blocking the operation completely and presents a challenge in distinguishing trusted users from malicious ones.

    Preventing unauthorized users from converting the volume mode

    In this context, an authorized user is one who has access rights to perform update or patch operations on VolumeSnapshotContents, which is a cluster-level resource.
    It is up to the cluster administrator to provide these rights only to trusted users or applications, like backup vendors. Users apart from such authorized ones will never be allowed to modify the volume mode of a PVC when it is being created from a VolumeSnapshot.

    To convert the volume mode, an authorized user must do the following:

    1. Identify the VolumeSnapshot that is to be used as the data source for a newly created PVC in the given namespace.
    2. Identify the VolumeSnapshotContent bound to the above VolumeSnapshot.
    kubectl describe volumesnapshot -n <namespace> <name>
    
    3. Add the annotation snapshot.storage.kubernetes.io/allow-volume-mode-change: "true" to the above VolumeSnapshotContent. The VolumeSnapshotContent annotations must include one similar to the following manifest fragment:
    kind: VolumeSnapshotContent
    metadata:
      annotations:
        snapshot.storage.kubernetes.io/allow-volume-mode-change: "true"
    ...
    

    Note: For pre-provisioned VolumeSnapshotContents, you must take an extra step of setting spec.sourceVolumeMode field to either Filesystem or Block, depending on the mode of the volume from which this snapshot was taken.

    An example is shown below:

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshotContent
    metadata:
      annotations:
        snapshot.storage.kubernetes.io/allow-volume-mode-change: "true"
      name: <volume-snapshot-content-name>
    spec:
      deletionPolicy: Delete
      driver: hostpath.csi.k8s.io
      source:
        snapshotHandle: <snapshot-handle>
      sourceVolumeMode: Filesystem
      volumeSnapshotRef:
        name: <volume-snapshot-name>
        namespace: <namespace>
    

    Repeat steps 1 to 3 for all VolumeSnapshotContents whose volume mode needs to be converted during a backup or restore operation. This can be done either via software with credentials of an authorized user or manually by the authorized user(s).

    If the annotation shown above is present on a VolumeSnapshotContent object, Kubernetes will not prevent the volume mode from being converted. Users should keep this in mind before they attempt to add the annotation to any VolumeSnapshotContent.
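
    The resulting rule can be summarized in a short Go sketch. It assumes that both volume modes are known when the check runs; the conversionAllowed function is a hypothetical condensation of the behavior described in this post, not code from the external-provisioner.

```go
package main

import "fmt"

// allowAnnotation is the annotation described in the steps above.
const allowAnnotation = "snapshot.storage.kubernetes.io/allow-volume-mode-change"

// conversionAllowed reports whether a PVC with requestedMode may be created
// from a snapshot whose source volume had sourceMode. It assumes both modes
// are known ("Filesystem" or "Block").
func conversionAllowed(sourceMode, requestedMode string, annotations map[string]string) bool {
	if sourceMode == requestedMode {
		return true // no conversion is taking place
	}
	// A conversion is only permitted if an authorized user annotated the
	// VolumeSnapshotContent accordingly.
	return annotations[allowAnnotation] == "true"
}

func main() {
	annotated := map[string]string{allowAnnotation: "true"}
	fmt.Println(conversionAllowed("Filesystem", "Filesystem", nil))
	fmt.Println(conversionAllowed("Filesystem", "Block", nil))
	fmt.Println(conversionAllowed("Filesystem", "Block", annotated))
}
```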

    Action required

    The prevent-volume-mode-conversion feature flag is enabled by default in the external-provisioner v4.0.0 and external-snapshotter v7.0.0. Volume mode change will be rejected when creating a PVC from a VolumeSnapshot unless the steps described above have been performed.

    What's next

    To determine which CSI external sidecar versions support this feature, please head over to the CSI docs page. For any queries or issues, join Kubernetes on Slack and create a thread in the #csi or #sig-storage channel. Alternately, create an issue in the CSI external-snapshotter repository.

  • With Kubernetes 1.30, we (SIG Auth) are moving Structured Authorization Configuration to beta.

    Today's article is about authorization: deciding what someone can and cannot access. See yesterday's article to find out what's new in Kubernetes v1.30 around authentication (finding out who's performing a task, and checking that they are who they say they are).

    Introduction

    Kubernetes continues to evolve to meet the intricate requirements of system administrators and developers alike. A critical aspect of Kubernetes that ensures the security and integrity of the cluster is the API server authorization. Until recently, the configuration of the authorization chain in kube-apiserver was somewhat rigid, limited to a set of command-line flags and allowing only a single webhook in the authorization chain. This approach, while functional, restricted the flexibility needed by cluster administrators to define complex, fine-grained authorization policies. The latest Structured Authorization Configuration feature (KEP-3221) aims to revolutionize this aspect by introducing a more structured and versatile way to configure the authorization chain, focusing on enabling multiple webhooks and providing explicit control mechanisms.

    The Need for Improvement

    Cluster administrators have long sought the ability to specify multiple authorization webhooks within the API Server handler chain and have control over detailed behavior like timeout and failure policy for each webhook. This need arises from the desire to create layered security policies, where requests can be validated against multiple criteria or sets of rules in a specific order. The previous limitations also made it difficult to dynamically configure the authorizer chain, leaving no room to manage complex authorization scenarios efficiently.

    The Structured Authorization Configuration feature addresses these limitations by introducing a configuration file format to configure the Kubernetes API Server Authorization chain. This format allows specifying multiple webhooks in the authorization chain (all other authorization types are specified no more than once). Each webhook authorizer has well-defined parameters, including timeout settings, failure policies, and conditions for invocation with CEL rules to pre-filter requests before they are dispatched to webhooks, helping you prevent unnecessary invocations. The configuration also supports automatic reloading, ensuring changes can be applied dynamically without restarting the kube-apiserver. This feature addresses current limitations and opens up new possibilities for securing and managing Kubernetes clusters more effectively.
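    To make the idea of an ordered chain concrete, here is a minimal sketch of how a chain of authorizers can be evaluated: the first authorizer that allows or denies a request ends the evaluation, and a request that nobody has an opinion on is denied. The names (Decision, authorize) are hypothetical, and the sketch models a webhook failure as a Python exception; it is not the kube-apiserver implementation.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "Allow"
    DENY = "Deny"
    NO_OPINION = "NoOpinion"


def authorize(chain, request):
    """Evaluate authorizers in order and return (decision, authorizer name).

    chain is a list of (name, authorizer, failure_policy) tuples, where
    authorizer is a callable returning a Decision and failure_policy is
    "Deny" or "NoOpinion", applied when the call fails.
    """
    for name, authorizer, failure_policy in chain:
        try:
            decision = authorizer(request)
        except Exception:
            # A failed call is mapped to a decision by the failure policy.
            decision = Decision.DENY if failure_policy == "Deny" else Decision.NO_OPINION
        if decision is not Decision.NO_OPINION:
            return decision, name
    # No authorizer had an opinion, so the request is denied.
    return Decision.DENY, None
```

    Because each entry carries its own failure policy, an unavailable webhook can either block the request or be skipped so that later authorizers such as RBAC still get a say.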

    Sample Configurations

    Here is a sample structured authorization configuration along with descriptions for all fields, their defaults, and possible values.

    apiVersion: apiserver.config.k8s.io/v1beta1
    kind: AuthorizationConfiguration
    authorizers:
      - type: Webhook
        # Name used to describe the authorizer
        # This is explicitly used in monitoring machinery for metrics
        # Note:
        #   - Validation for this field is similar to how K8s labels are validated today.
        # Required, with no default
        name: webhook
        webhook:
          # The duration to cache 'authorized' responses from the webhook
          # authorizer.
          # Same as setting `--authorization-webhook-cache-authorized-ttl` flag
          # Default: 5m0s
          authorizedTTL: 30s
          # The duration to cache 'unauthorized' responses from the webhook
          # authorizer.
          # Same as setting `--authorization-webhook-cache-unauthorized-ttl` flag
          # Default: 30s
          unauthorizedTTL: 30s
          # Timeout for the webhook request
          # Maximum allowed is 30s.
          # Required, with no default.
          timeout: 3s
          # The API version of the authorization.k8s.io SubjectAccessReview to
          # send to and expect from the webhook.
          # Same as setting `--authorization-webhook-version` flag
          # Required, with no default
          # Valid values: v1beta1, v1
          subjectAccessReviewVersion: v1
          # MatchConditionSubjectAccessReviewVersion specifies the SubjectAccessReview
          # version the CEL expressions are evaluated against
          # Valid values: v1
          # Required, no default value
          matchConditionSubjectAccessReviewVersion: v1
          # Controls the authorization decision when a webhook request fails to
          # complete or returns a malformed response or errors evaluating
          # matchConditions.
          # Valid values:
          #   - NoOpinion: continue to subsequent authorizers to see if one of
          #     them allows the request
          #   - Deny: reject the request without consulting subsequent authorizers
          # Required, with no default.
          failurePolicy: Deny
          connectionInfo:
            # Controls how the webhook should communicate with the server.
            # Valid values:
            # - KubeConfig: use the file specified in kubeConfigFile to locate the
            #   server.
            # - InClusterConfig: use the in-cluster configuration to call the
            #   SubjectAccessReview API hosted by kube-apiserver. This mode is not
            #   allowed for kube-apiserver.
            type: KubeConfig
            # Path to KubeConfigFile for connection info
            # Required, if connectionInfo.Type is KubeConfig
            kubeConfigFile: /kube-system-authz-webhook.yaml
          # matchConditions is a list of conditions that must be met for a request to be sent to this
          # webhook. An empty list of matchConditions matches all requests.
          # There are a maximum of 64 match conditions allowed.
          #
          # The exact matching logic is (in order):
          #   1. If at least one matchCondition evaluates to FALSE, then the webhook is skipped.
          #   2. If ALL matchConditions evaluate to TRUE, then the webhook is called.
          #   3. If at least one matchCondition evaluates to an error (but none are FALSE):
          #      - If failurePolicy=Deny, then the webhook rejects the request
          #      - If failurePolicy=NoOpinion, then the error is ignored and the webhook is skipped
          matchConditions:
            # expression represents the expression which will be evaluated by CEL. Must evaluate to bool.
            # CEL expressions have access to the contents of the SubjectAccessReview in v1 version.
            # If version specified by subjectAccessReviewVersion in the request variable is v1beta1,
            # the contents would be converted to the v1 version before evaluating the CEL expression.
            #
            # Documentation on CEL: https://kubernetes.io/docs/reference/using-api/cel/
            #
            # only send resource requests to the webhook
            - expression: has(request.resourceAttributes)
            # only intercept requests to kube-system
            - expression: request.resourceAttributes.namespace == 'kube-system'
            # don't intercept requests from kube-system service accounts
            - expression: "!('system:serviceaccounts:kube-system' in request.groups)"
      - type: Node
        name: node
      - type: RBAC
        name: rbac
      - type: Webhook
        name: in-cluster-authorizer
        webhook:
          authorizedTTL: 5m
          unauthorizedTTL: 30s
          timeout: 3s
          subjectAccessReviewVersion: v1
          failurePolicy: NoOpinion
          connectionInfo:
            type: InClusterConfig
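
    The matching logic spelled out in the matchConditions comments can be sketched in a few lines. This is an illustration only: real match conditions are CEL expressions evaluated by kube-apiserver, whereas this sketch stands in plain Python callables for them, and the function name and return values are hypothetical.

```python
def evaluate_match_conditions(conditions, request, failure_policy):
    """Return "call", "skip", or "deny" for one webhook authorizer.

    conditions is a list of callables that take the request and return a
    bool, or raise an exception to model an evaluation error.
    failure_policy is "Deny" or "NoOpinion".
    """
    saw_error = False
    for condition in conditions:
        try:
            matched = condition(request)
        except Exception:
            # Keep going: a later FALSE still takes precedence over an error.
            saw_error = True
            continue
        if not matched:
            return "skip"  # rule 1: any FALSE skips the webhook
    if saw_error:
        # rule 3: errors (with no FALSE) are resolved by the failure policy
        return "deny" if failure_policy == "Deny" else "skip"
    return "call"  # rule 2: everything evaluated to TRUE (or the list is empty)


# Python stand-ins for the three sample expressions above.
sample_conditions = [
    lambda r: "resourceAttributes" in r,
    lambda r: r["resourceAttributes"]["namespace"] == "kube-system",
    lambda r: "system:serviceaccounts:kube-system" not in r.get("groups", []),
]
```

    Note that a FALSE result wins over an error regardless of the order in which the conditions appear, which is why the loop continues after an error instead of returning immediately.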
    

    The following configuration examples illustrate real-world scenarios that need the ability to specify multiple webhooks with distinct settings, precedence order, and failure modes.

    Protecting Installed CRDs

    Ensuring that Custom Resource Definitions (CRDs) are available at cluster startup has been a key demand. One blocker to having a controller reconcile those CRDs is the lack of a protection mechanism for them, which can be provided through multiple authorization webhooks. Previously, the Kubernetes API server authorization chain could not include more than one authorization webhook. Now, with the Structured Authorization Configuration feature, administrators can specify multiple webhooks, offering a solution where RBAC falls short, especially when denying permissions to 'non-system' users for certain CRDs.

    Assuming the following for this scenario:

    • The "protected" CRDs are installed.
    • They can only be modified by users in the group admin.

    apiVersion: apiserver.config.k8s.io/v1beta1
    kind: AuthorizationConfiguration
    authorizers:
      - type: Webhook
        name: system-crd-protector
        webhook:
          unauthorizedTTL: 30s
          timeout: 3s
          subjectAccessReviewVersion: v1
          matchConditionSubjectAccessReviewVersion: v1
          failurePolicy: Deny
          connectionInfo:
            type: KubeConfig
            kubeConfigFile: /files/kube-system-authz-webhook.yaml
          matchConditions:
            # only send resource requests to the webhook
            - expression: has(request.resourceAttributes)
            # only intercept requests for CRDs
            - expression: request.resourceAttributes.resource == "customresourcedefinitions"
            - expression: request.resourceAttributes.group == "apiextensions.k8s.io"
            # only intercept update, patch, delete, or deletecollection requests
            - expression: request.resourceAttributes.verb in ['update', 'patch', 'delete', 'deletecollection']
      - type: Node
        name: node
      - type: RBAC
        name: rbac
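
    The configuration only decides which requests reach the webhook; the decision itself is made by the webhook service. Below is a minimal sketch of the decision such a service might make for this scenario. It is an assumption-laden illustration, not a reference implementation: the function name is hypothetical, every CRD request that reaches the webhook is treated as protected, and members of the admin group get "no opinion" (rather than an outright allow) so that RBAC still has to grant them access.

```python
def review_crd_request(subject_access_review):
    """Build the SubjectAccessReview response for a request that modifies a CRD.

    subject_access_review is the decoded JSON body sent by kube-apiserver.
    """
    groups = subject_access_review.get("spec", {}).get("groups") or []
    if "admin" in groups:
        # No opinion: allowed=false without denied=true lets later
        # authorizers in the chain (Node, RBAC) make the final decision.
        status = {"allowed": False}
    else:
        # denied=true rejects the request without consulting later authorizers.
        status = {
            "allowed": False,
            "denied": True,
            "reason": "only members of the admin group may modify protected CRDs",
        }
    return {
        "apiVersion": "authorization.k8s.io/v1",
        "kind": "SubjectAccessReview",
        "status": status,
    }
```

    Returning "no opinion" for admins keeps the webhook purely restrictive: it can take permissions away from non-admin users, but it never grants access that RBAC would not.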
    

    Preventing unnecessarily nested webhooks

    A system administrator wants to apply specific validations to requests before handing them off to webhooks using frameworks like Open Policy Agent. In the past, this would require running nested webhooks within the one added to the authorization chain to achieve the desired result. The Structured Authorization Configuration feature simplifies this process, offering a structured API to selectively trigger additional webhooks when needed. It also enables administrators to set distinct failure policies for each webhook, ensuring more consistent and predictable responses.

    apiVersion: apiserver.config.k8s.io/v1beta1
    kind: AuthorizationConfiguration
    authorizers:
      - type: Webhook
        name: system-crd-protector
        webhook:
          unauthorizedTTL: 30s
          timeout: 3s
          subjectAccessReviewVersion: v1
          matchConditionSubjectAccessReviewVersion: v1
          failurePolicy: Deny
          connectionInfo:
            type: KubeConfig
            kubeConfigFile: /files/kube-system-authz-webhook.yaml
          matchConditions:
            # only send resource requests to the webhook
            - expression: has(request.resourceAttributes)
            # only intercept requests for CRDs
            - expression: request.resourceAttributes.resource == "customresourcedefinitions"
            - expression: request.resourceAttributes.group == "apiextensions.k8s.io"
            # only intercept update, patch, delete, or deletecollection requests
            - expression: request.resourceAttributes.verb in ['update', 'patch', 'delete', 'deletecollection']
      - type: Node
        name: node
      - type: RBAC
        name: rbac
      - type: Webhook
        name: opa
        webhook:
          unauthorizedTTL: 30s
          timeout: 3s
          subjectAccessReviewVersion: v1
          matchConditionSubjectAccessReviewVersion: v1
          failurePolicy: Deny
          connectionInfo:
            type: KubeConfig
            kubeConfigFile: /files/opa-default-authz-webhook.yaml
          matchConditions:
            # only send resource requests to the webhook
            - expression: has(request.resourceAttributes)
            # only intercept requests to default namespace
            - expression: request.resourceAttributes.namespace == 'default'
            # don't intercept requests from default service accounts
            - expression: "!('system:serviceaccounts:default' in request.groups)"
    

    What's next?

    From Kubernetes 1.30, the feature is in beta and enabled by default. For Kubernetes v1.31, we expect the feature to stay in beta while we get more feedback from users. Once it is ready for GA, the feature flag will be removed, and the configuration file version will be promoted to v1.

    Learn more about this feature on the structured authorization configuration Kubernetes doc website. You can also follow along with KEP-3221 to track progress in coming Kubernetes releases.

    Call to action

    In this post, we have covered the benefits of the Structured Authorization Configuration feature in Kubernetes v1.30 and a few sample configurations for real-world scenarios. To use this feature, you must specify the path to the authorization configuration using the --authorization-config command line argument. From Kubernetes 1.30, the feature is in beta and enabled by default. If you want to keep using command line flags instead of a configuration file, those will continue to work as-is. However, you cannot combine --authorization-config with --authorization-mode or any of the --authorization-webhook-* flags: to use the configuration file, drop the older flags from your kube-apiserver command.
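
    Before switching to the configuration file, it can help to check an existing kube-apiserver command line for the older flags. The following sketch is a hypothetical helper, not part of any Kubernetes tooling: given a list of arguments, it reports the legacy authorization flags (--authorization-mode and anything starting with --authorization-webhook-) that would conflict with --authorization-config.

```python
def conflicting_flags(args):
    """Return the legacy authorization flags that clash with --authorization-config.

    args is the kube-apiserver argument list, in either "--flag=value" or
    "--flag", "value" form. An empty list means there is no conflict.
    """
    names = [arg.split("=", 1)[0] for arg in args if arg.startswith("--")]
    if "--authorization-config" not in names:
        return []  # the older flags keep working on their own
    return [
        name
        for name in names
        if name == "--authorization-mode"
        or name.startswith("--authorization-webhook-")
    ]
```

    For example, an argument list that contains both --authorization-config=/files/authorization_config.yaml and --authorization-mode=Node,RBAC yields ["--authorization-mode"], the flag to remove.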

    The following kind Cluster configuration sets that command line argument on the API server to load an AuthorizationConfiguration from a file (authorization_config.yaml) in the files folder. Any needed kubeconfig and certificate files can also be put in the files directory.

    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    featureGates:
      StructuredAuthorizationConfiguration: true  # enabled by default in v1.30
    kubeadmConfigPatches:
      - |
        kind: ClusterConfiguration
        metadata:
          name: config
        apiServer:
          extraArgs:
            authorization-config: "/files/authorization_config.yaml"
          extraVolumes:
          - name: files
            hostPath: "/files"
            mountPath: "/files"
            readOnly: true
    nodes:
      - role: control-plane
        extraMounts:
          - hostPath: files
            containerPath: /files
    

    We would love to hear your feedback on this feature. In particular, we would like feedback from Kubernetes cluster administrators and authorization webhook implementors as they build their integrations with this new API. Please reach out to us on the #sig-auth-authorizers-dev channel on Kubernetes Slack.

    How to get involved

    If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Auth projects, please reach out on the #sig-auth channel on Kubernetes Slack.

    You are also welcome to join the bi-weekly SIG Auth meetings held every other Wednesday.

    Acknowledgments

    This feature was driven by contributors from several different companies. We would like to extend a huge thank you to everyone who contributed their time and effort to make this possible.