Kubernetes News

The Kubernetes blog is used by the project to communicate new features, community reports, and any news that might be relevant to the Kubernetes community.
  • In industries where systems must run very reliably and meet strict performance criteria, such as telecommunications, high-performance computing, or AI, containerized applications often need a specific operating system configuration or the presence of particular hardware. It is common practice to require the use of specific versions of the kernel, its configuration, device drivers, or system components. Despite the existence of the Open Container Initiative (OCI), a governing community that defines standards and specifications for container images, there has been a gap in expressing such compatibility requirements. The need to address this issue has led to different proposals and, ultimately, an implementation in Kubernetes' Node Feature Discovery (NFD).

    NFD is an open source Kubernetes project that automatically detects and reports hardware and system features of cluster nodes. This information helps users to schedule workloads on nodes that meet specific system requirements, which is especially useful for applications with strict hardware or operating system dependencies.

    The need for image compatibility specification

    Dependencies between containers and host OS

    A container image is built on a base image, which provides a minimal runtime environment - often a stripped-down Linux userland, a completely empty image, or a distroless one. When an application requires certain features from the host OS, compatibility issues arise. These dependencies can manifest in several ways:

    • Drivers: Host driver versions must match the supported range of a library version inside the container to avoid compatibility problems. Examples include GPUs and network drivers.
    • Libraries or Software: The container must come with a specific version or range of versions for a library or software to run optimally in the environment. Examples from high performance computing are MPI, EFA, or Infiniband.
    • Kernel Modules or Features: Specific kernel features or modules must be present. Examples include support for write-protected huge page faults, or the presence of VFIO.
    • And more…

    While containers in Kubernetes are the most likely unit of abstraction for these needs, the definition of compatibility can extend further to include other container technologies such as Singularity and other OCI artifacts such as binaries from a spack binary cache.

    Multi-cloud and hybrid cloud challenges

    Containerized applications are deployed across various Kubernetes distributions and cloud providers, where different host operating systems introduce compatibility challenges. These operating systems often have to be pre-configured before workload deployment, or they are immutable. For instance, different cloud providers will include different operating systems like:

    • RHCOS/RHEL
    • Photon OS
    • Amazon Linux 2
    • Container-Optimized OS
    • Azure Linux OS
    • And more...

    Each OS comes with unique kernel versions, configurations, and drivers, making compatibility a non-trivial issue for applications requiring specific features. It must be possible to quickly assess a container for its suitability to run on any specific environment.

    Image compatibility initiative

    An effort was made within the Open Container Initiative Image Compatibility working group to introduce a standard for image compatibility metadata. A specification for compatibility would allow container authors to declare required host OS features, making compatibility requirements discoverable and programmable. The specification implemented in Kubernetes Node Feature Discovery is one of the discussed proposals. It aims to:

    • Define a structured way to express compatibility in OCI image manifests.
    • Support a compatibility specification alongside container images in image registries.
    • Allow automated validation of compatibility before scheduling containers.

    The concept has since been implemented in the Kubernetes Node Feature Discovery project.

    Implementation in Node Feature Discovery

    The solution integrates compatibility metadata into Kubernetes via NFD features and the NodeFeatureGroup API. This interface enables the user to match containers to nodes based on the hardware and software features those nodes expose, allowing for intelligent scheduling and workload optimization.

    Compatibility specification

    The compatibility specification is a structured list of compatibility objects containing Node Feature Groups. These objects define image requirements and facilitate validation against host nodes. The feature requirements are described by using the list of available features from the NFD project. The schema has the following structure:

    • version (string) - Specifies the API version.
    • compatibilities (array of objects) - List of compatibility sets.
      • rules (object) - Specifies NodeFeatureGroup to define image requirements.
      • weight (int, optional) - Node affinity weight.
      • tag (string, optional) - Categorization tag.
      • description (string, optional) - Short description.

    An example might look like the following:

    version: v1alpha1
    compatibilities:
    - description: "My image requirements"
      rules:
      - name: "kernel and cpu"
        matchFeatures:
        - feature: kernel.loadedmodule
          matchExpressions:
            vfio-pci: {op: Exists}
        - feature: cpu.model
          matchExpressions:
            vendor_id: {op: In, value: ["Intel", "AMD"]}
      - name: "one of available nics"
        matchAny:
        - matchFeatures:
          - feature: pci.device
            matchExpressions:
              vendor: {op: In, value: ["0eee"]}
              class: {op: In, value: ["0200"]}
        - matchFeatures:
          - feature: pci.device
            matchExpressions:
              vendor: {op: In, value: ["0fff"]}
              class: {op: In, value: ["0200"]}
    

    Client implementation for node validation

    To streamline compatibility validation, we implemented a client tool that allows for node validation based on an image's compatibility artifact. In this workflow, the image author would generate a compatibility artifact that points to the image it describes in a registry via the referrers API. When a need arises to assess the fit of an image to a host, the tool can discover the artifact and verify compatibility of an image to a node before deployment. The client can validate nodes both inside and outside a Kubernetes cluster, extending the utility of the tool beyond the single Kubernetes use case. In the future, image compatibility could play a crucial role in creating specific workload profiles based on image compatibility requirements, aiding in more efficient scheduling. Additionally, it could potentially enable automatic node configuration to some extent, further optimizing resource allocation and ensuring seamless deployment of specialized workloads.

    Examples of usage

    1. Define image compatibility metadata
      A container image can have metadata that describes its requirements based on features discovered from nodes, like kernel modules or CPU models. The previous compatibility specification example in this article exemplified this use case.

    2. Attach the artifact to the image
      The image compatibility specification is stored as an OCI artifact. You can attach this metadata to your container image using the oras tool. The registry only needs to support OCI artifacts; support for arbitrary artifact types is not required. Keep in mind that the container image and the artifact must be stored in the same registry. Use the following command to attach the artifact to the image:

    oras attach \
    --artifact-type application/vnd.nfd.image-compatibility.v1alpha1 <image-url> \
    <path-to-spec>.yaml:application/vnd.nfd.image-compatibility.spec.v1alpha1+yaml
    
    3. Validate image compatibility
      After attaching the compatibility specification, you can validate whether a node meets the image's requirements. This validation can be done using the nfd client:

    nfd compat validate-node --image <image-url>

    4. Read the output from the client
      Finally, you can read the report generated by the tool, or use your own tools to act on the generated JSON report.

    validate-node command output

    Conclusion

    The addition of image compatibility to Kubernetes through Node Feature Discovery underscores the growing importance of addressing compatibility in cloud native environments. It is only a start, as further work is needed to integrate compatibility into scheduling of workloads within and outside of Kubernetes. However, by integrating this feature into Kubernetes, mission-critical workloads can now define and validate host OS requirements more efficiently. Moving forward, the adoption of compatibility metadata within Kubernetes ecosystems will significantly enhance the reliability and performance of specialized containerized applications, ensuring they meet the stringent requirements of industries like telecommunications, high-performance computing or any environment that requires special hardware or host OS configuration.

    Get involved

    Join the Kubernetes Node Feature Discovery project if you're interested in getting involved with the design and development of Image Compatibility API and tools. We always welcome new contributors.

  • UPDATE: We’ve received notice from Salesforce that our Slack workspace WILL NOT BE DOWNGRADED on June 20th. Stand by for more details, but for now, there is no urgency to back up private channels or direct messages.

    Kubernetes Slack will lose its special status and will be changing into a standard free Slack on June 20, 2025. Sometime later this year, our community may move to a new platform. If you are responsible for a channel or private channel, or a member of a User Group, you will need to take some actions as soon as you can.

    For the last decade, Slack has supported our project with a free customized enterprise account. They have let us know that they can no longer do so, particularly since our Slack is one of the largest and most active ones on the platform. As such, they will be downgrading it to a standard free Slack while we decide on, and implement, other options.

    On Friday, June 20, we will be subject to the feature limitations of free Slack. The primary limitations that will affect us are retaining only 90 days of history and having to disable several apps and workflows that we currently use. The Slack Admin team will do their best to manage these limitations.

    Responsible channel owners, members of private channels, and members of User Groups should take some actions to prepare for the downgrade and preserve information as soon as possible.

    The CNCF Projects Staff have proposed that our community look at migrating to Discord. Because of existing issues where we have been pushing the limits of Slack, they have already explored what a Kubernetes Discord would look like. Discord would allow us to implement new tools and integrations which would help the community, such as GitHub group membership synchronization. The Steering Committee will discuss and decide on our future platform.

    Please see our FAQ, and check the kubernetes-dev mailing list and the #announcements channel for further news. If you have specific feedback on our Slack status join the discussion on GitHub.

  • Kubernetes Events provide crucial insights into cluster operations, but as clusters grow, managing and analyzing these events becomes increasingly challenging. This blog post explores how to build custom event aggregation systems that help engineering teams better understand cluster behavior and troubleshoot issues more effectively.

    The challenge with Kubernetes events

    In a Kubernetes cluster, events are generated for various operations - from pod scheduling and container starts to volume mounts and network configurations. While these events are invaluable for debugging and monitoring, several challenges emerge in production environments:

    1. Volume: Large clusters can generate thousands of events per minute
    2. Retention: Default event retention is limited to one hour
    3. Correlation: Related events from different components are not automatically linked
    4. Classification: Events lack standardized severity or category classifications
    5. Aggregation: Similar events are not automatically grouped

    To learn more about Events in Kubernetes, read the Event API reference.

    Real-World value

    Consider a production environment with tens of microservices where users report intermittent transaction failures:

    Traditional event aggregation process: Engineers spend hours sifting through thousands of standalone events spread across namespaces. By the time they look into it, the older events have long since been purged, and correlating pod restarts to node-level issues is practically impossible.

    With custom event aggregation: The system groups related events across resources, instantly surfacing correlation patterns such as volume mount timeouts preceding pod restarts. History shows the same pattern occurred during past traffic spikes, pointing to a storage scalability issue in minutes rather than hours.

    The benefit of this approach is that organizations implementing it commonly cut their troubleshooting time significantly while increasing system reliability by detecting patterns early.

    Building an Event aggregation system

    This post explores how to build a custom event aggregation system that addresses these challenges, aligned to Kubernetes best practices. I've picked the Go programming language for my example.

    Architecture overview

    This event aggregation system consists of three main components:

    1. Event Watcher: Monitors the Kubernetes API for new events
    2. Event Processor: Processes, categorizes, and correlates events
    3. Storage Backend: Stores processed events for longer retention

    Here's a sketch for how to implement the event watcher:

    package main
    
    import (
        "context"
    
        eventsv1 "k8s.io/api/events/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )
    
    // EventWatcher streams events.k8s.io/v1 Events from the API server.
    type EventWatcher struct {
        clientset *kubernetes.Clientset
    }
    
    func NewEventWatcher(config *rest.Config) (*EventWatcher, error) {
        clientset, err := kubernetes.NewForConfig(config)
        if err != nil {
            return nil, err
        }
        return &EventWatcher{clientset: clientset}, nil
    }
    
    // Watch opens a watch on Events in all namespaces and forwards them on a
    // channel until the context is cancelled or the watch is closed.
    func (w *EventWatcher) Watch(ctx context.Context) (<-chan *eventsv1.Event, error) {
        events := make(chan *eventsv1.Event)
    
        watcher, err := w.clientset.EventsV1().Events("").Watch(ctx, metav1.ListOptions{})
        if err != nil {
            return nil, err
        }
    
        go func() {
            defer close(events)
            for {
                select {
                case event, ok := <-watcher.ResultChan():
                    if !ok {
                        // The API server closed the watch; a production watcher
                        // would re-list and re-watch here.
                        return
                    }
                    if e, ok := event.Object.(*eventsv1.Event); ok {
                        events <- e
                    }
                case <-ctx.Done():
                    watcher.Stop()
                    return
                }
            }
        }()
    
        return events, nil
    }
    
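    Here's one way you might wire the watcher up; this is a minimal sketch that assumes the program runs in-cluster. The main function below is illustrative rather than part of the original design, and a production watcher would typically use an informer or re-establish expired watches.

    package main
    
    import (
        "context"
        "log"
    
        "k8s.io/client-go/rest"
    )
    
    func main() {
        // In-cluster configuration; use clientcmd to load a kubeconfig when running locally.
        config, err := rest.InClusterConfig()
        if err != nil {
            log.Fatalf("failed to load cluster config: %v", err)
        }
    
        watcher, err := NewEventWatcher(config)
        if err != nil {
            log.Fatalf("failed to create event watcher: %v", err)
        }
    
        ctx, cancel := context.WithCancel(context.Background())
        defer cancel()
    
        events, err := watcher.Watch(ctx)
        if err != nil {
            log.Fatalf("failed to start watching events: %v", err)
        }
    
        // Consume events until the watch channel is closed.
        for event := range events {
            log.Printf("event: reason=%s regarding=%s/%s", event.Reason, event.Regarding.Kind, event.Regarding.Name)
        }
    }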

    Event processing and classification

    The event processor enriches events with additional context and classification:

    type EventProcessor struct {
        categoryRules    []CategoryRule
        correlationRules []CorrelationRule
    }
    
    type ProcessedEvent struct {
        Event         *eventsv1.Event
        Category      string
        Severity      string
        CorrelationID string
        Metadata      map[string]string
    }
    
    func (p *EventProcessor) Process(event *eventsv1.Event) *ProcessedEvent {
        processed := &ProcessedEvent{
            Event:    event,
            Metadata: make(map[string]string),
        }
    
        // Apply classification rules
        processed.Category = p.classifyEvent(event)
        processed.Severity = p.determineSeverity(event)
    
        // Generate correlation ID for related events
        processed.CorrelationID = p.correlateEvent(event)
    
        // Add useful metadata
        processed.Metadata = p.extractMetadata(event)
    
        return processed
    }
    
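    The classification helpers (classifyEvent and determineSeverity) are referenced but not spelled out. A minimal rule-driven sketch could look like the following; the CategoryRule fields and the default category/severity strings are my own assumptions, and corev1 refers to the k8s.io/api/core/v1 import.

    // CategoryRule maps an event reason (optionally scoped to a resource kind)
    // to a category and severity. The exact fields are an illustrative choice.
    type CategoryRule struct {
        Reasons  []string // e.g. "FailedScheduling", "BackOff", "FailedMount"
        Kind     string   // optional: restrict the rule to a kind such as "Pod"
        Category string   // e.g. "scheduling", "storage", "networking"
        Severity string   // e.g. "info", "warning", "critical"
    }
    
    func (p *EventProcessor) classifyEvent(event *eventsv1.Event) string {
        for _, rule := range p.categoryRules {
            if rule.Kind != "" && rule.Kind != event.Regarding.Kind {
                continue
            }
            for _, reason := range rule.Reasons {
                if reason == event.Reason {
                    return rule.Category
                }
            }
        }
        return "uncategorized"
    }
    
    func (p *EventProcessor) determineSeverity(event *eventsv1.Event) string {
        // Kubernetes itself only distinguishes "Normal" and "Warning" event types,
        // so use that as a baseline and let matching rules override it.
        severity := "info"
        if event.Type == corev1.EventTypeWarning {
            severity = "warning"
        }
        for _, rule := range p.categoryRules {
            for _, reason := range rule.Reasons {
                if reason == event.Reason && rule.Severity != "" {
                    return rule.Severity
                }
            }
        }
        return severity
    }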

    Implementing Event correlation

    One of the key features you could implement is a way of correlating related Events. Here's an example correlation strategy:

    func (p *EventProcessor) correlateEvent(event *eventsv1.Event) string {
        // Correlation strategies:
        // 1. Time-based: Events within a time window
        // 2. Resource-based: Events affecting the same resource
        // 3. Causation-based: Events with cause-effect relationships
    
        correlationKey := generateCorrelationKey(event)
        return correlationKey
    }
    
    // generateCorrelationKey builds a resource-based key (requires the "fmt" import).
    func generateCorrelationKey(event *eventsv1.Event) string {
        // Example: Combine namespace, resource type, and name.
        // events.k8s.io/v1 exposes the object an event is about as Regarding.
        return fmt.Sprintf("%s/%s/%s",
            event.Regarding.Namespace,
            event.Regarding.Kind,
            event.Regarding.Name,
        )
    }
    
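    The helper above implements the resource-based strategy. For the time-based strategy mentioned in the comments, one option is to fold a time bucket into the same key. This is only a sketch: the window size is arbitrary, the function name is mine, and it needs the "fmt" and "time" imports.

    // generateTimeWindowKey groups events that affect the same object and fall
    // into the same fixed time bucket, so bursts of related events share an ID.
    func generateTimeWindowKey(event *eventsv1.Event, window time.Duration) string {
        ts := event.EventTime.Time
        if ts.IsZero() {
            // Events created through older code paths still populate the
            // deprecated timestamp fields instead of EventTime.
            ts = event.DeprecatedLastTimestamp.Time
        }
        bucket := ts.Truncate(window).Unix()
        return fmt.Sprintf("%s/%s/%s/%d",
            event.Regarding.Namespace,
            event.Regarding.Kind,
            event.Regarding.Name,
            bucket,
        )
    }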

    Event storage and retention

    For long-term storage and analysis, you'll probably want a backend that supports:

    • Efficient querying of large event volumes
    • Flexible retention policies
    • Support for aggregation queries

    Here's a sample storage interface:

    type EventStorage interface {
        Store(context.Context, *ProcessedEvent) error
        Query(context.Context, EventQuery) ([]ProcessedEvent, error)
        Aggregate(context.Context, AggregationParams) ([]EventAggregate, error)
    }
    
    type EventQuery struct {
        TimeRange     TimeRange
        Categories    []string
        Severity      []string
        CorrelationID string
        Limit         int
    }
    
    type AggregationParams struct {
        GroupBy    []string
        TimeWindow string
        Metrics    []string
    }
    
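    The TimeRange and EventAggregate types referenced by this interface aren't defined in the post; a minimal version, with field names that are purely illustrative (and using the standard time package), might be:

    // TimeRange bounds a query to events observed between Start and End.
    type TimeRange struct {
        Start time.Time
        End   time.Time
    }
    
    // EventAggregate is a single row of an aggregation result, for example the
    // number of "FailedScheduling" events per namespace within a time window.
    type EventAggregate struct {
        GroupValues map[string]string // values of the requested GroupBy dimensions
        WindowStart time.Time
        Count       int
    }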

    Good practices for Event management

    1. Resource Efficiency

      • Implement rate limiting for event processing (see the sketch after this list)
      • Use efficient filtering at the API server level
      • Batch events for storage operations
    2. Scalability

      • Distribute event processing across multiple workers
      • Use leader election for coordination
      • Implement backoff strategies for API rate limits
    3. Reliability

      • Handle API server disconnections gracefully
      • Buffer events during storage backend unavailability
      • Implement retry mechanisms with exponential backoff
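
    To make the resource-efficiency ideas concrete, here is a rough sketch of draining processed events through a rate limiter and batching writes to the storage backend. It assumes the golang.org/x/time/rate package and the EventStorage interface from earlier; the batch size, flush interval, and limits are arbitrary choices.

    import (
        "context"
        "time"
    
        "golang.org/x/time/rate"
    )
    
    // batchAndStore reads processed events, applies a global rate limit, and
    // writes them to storage in batches to reduce backend round-trips.
    func batchAndStore(ctx context.Context, in <-chan *ProcessedEvent, storage EventStorage) error {
        limiter := rate.NewLimiter(rate.Limit(100), 200) // ~100 events/s, burst of 200
        batch := make([]*ProcessedEvent, 0, 50)
        ticker := time.NewTicker(5 * time.Second)
        defer ticker.Stop()
    
        flush := func() {
            for _, e := range batch {
                // A real implementation would batch at the storage API level
                // and retry with exponential backoff on failure.
                _ = storage.Store(ctx, e)
            }
            batch = batch[:0]
        }
    
        for {
            select {
            case e, ok := <-in:
                if !ok {
                    flush()
                    return nil
                }
                if err := limiter.Wait(ctx); err != nil {
                    return err
                }
                batch = append(batch, e)
                if len(batch) >= 50 {
                    flush()
                }
            case <-ticker.C:
                flush()
            case <-ctx.Done():
                flush()
                return ctx.Err()
            }
        }
    }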

    Advanced features

    Pattern detection

    Implement pattern detection to identify recurring issues:

    // Requires the "fmt" and "sort" imports in addition to eventsv1.
    type PatternDetector struct {
        patterns  map[string]*Pattern
        threshold int
    }
    
    func (d *PatternDetector) Detect(events []ProcessedEvent) []Pattern {
        // Group similar events
        groups := groupSimilarEvents(events)
    
        // Analyze frequency and timing
        patterns := identifyPatterns(groups)
    
        return patterns
    }
    
    func groupSimilarEvents(events []ProcessedEvent) map[string][]ProcessedEvent {
        groups := make(map[string][]ProcessedEvent)
    
        for _, event := range events {
            // Create a similarity key based on event characteristics
            // (events.k8s.io/v1 exposes the affected object as Regarding)
            similarityKey := fmt.Sprintf("%s:%s:%s",
                event.Event.Reason,
                event.Event.Regarding.Kind,
                event.Event.Regarding.Namespace,
            )
    
            // Group events with the same key
            groups[similarityKey] = append(groups[similarityKey], event)
        }
    
        return groups
    }
    
    func identifyPatterns(groups map[string][]ProcessedEvent) []Pattern {
        var patterns []Pattern
    
        for key, events := range groups {
            // Only consider groups with enough events to form a pattern
            if len(events) < 3 {
                continue
            }
    
            // Sort events by time. The deprecated timestamp fields are still
            // populated for compatibility; EventTime/Series could be used instead.
            sort.Slice(events, func(i, j int) bool {
                return events[i].Event.DeprecatedLastTimestamp.Time.Before(events[j].Event.DeprecatedLastTimestamp.Time)
            })
    
            // Calculate time range and frequency
            firstSeen := events[0].Event.DeprecatedFirstTimestamp.Time
            lastSeen := events[len(events)-1].Event.DeprecatedLastTimestamp.Time
            duration := lastSeen.Sub(firstSeen).Minutes()
    
            var frequency float64
            if duration > 0 {
                frequency = float64(len(events)) / duration
            }
    
            // Create a pattern if it meets threshold criteria
            if frequency > 0.5 { // More than 1 event per 2 minutes
                pattern := Pattern{
                    Type:         key,
                    Count:        len(events),
                    FirstSeen:    firstSeen,
                    LastSeen:     lastSeen,
                    Frequency:    frequency,
                    EventSamples: events[:min(3, len(events))], // Keep up to 3 samples
                }
                patterns = append(patterns, pattern)
            }
        }
    
        return patterns
    }
    
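    The Pattern type isn't defined in the article; based on how it's populated above, it could be declared roughly like this (field types are inferred, not authoritative):

    // Pattern summarizes a group of similar events that recur often enough to be
    // worth surfacing, together with a few sample events for context.
    type Pattern struct {
        Type         string           // similarity key: reason/kind/namespace
        Count        int              // number of events in the group
        FirstSeen    time.Time
        LastSeen     time.Time
        Frequency    float64          // events per minute over the observed window
        EventSamples []ProcessedEvent // up to three representative events
    }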

    With this implementation, the system can identify recurring patterns such as node pressure events, pod scheduling failures, or networking issues that occur with a specific frequency.

    Real-time alerts

    The following example provides a starting point for building an alerting system based on event patterns. It is not a complete solution but a conceptual sketch to illustrate the approach.

    type AlertManager struct {
        rules     []AlertRule
        notifiers []Notifier
    }
    
    func (a *AlertManager) EvaluateEvents(events []ProcessedEvent) {
        for _, rule := range a.rules {
            if rule.Matches(events) {
                alert := rule.GenerateAlert(events)
                a.notify(alert)
            }
        }
    }
    
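    The AlertRule and Notifier types are deliberately left abstract. One possible shape for them, purely as an illustration (the Alert fields are mine, and the notify method assumes the standard context package):

    // Alert is what a matched rule produces and what notifiers deliver.
    type Alert struct {
        Title    string
        Severity string
        Events   []ProcessedEvent
    }
    
    // AlertRule decides whether a set of processed events warrants an alert.
    type AlertRule interface {
        Matches(events []ProcessedEvent) bool
        GenerateAlert(events []ProcessedEvent) Alert
    }
    
    // Notifier delivers alerts to a destination such as Slack, PagerDuty, or email.
    type Notifier interface {
        Notify(ctx context.Context, alert Alert) error
    }
    
    func (a *AlertManager) notify(alert Alert) {
        for _, n := range a.notifiers {
            // Errors would normally be logged and retried; ignored here for brevity.
            _ = n.Notify(context.Background(), alert)
        }
    }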

    Conclusion

    A well-designed event aggregation system can significantly improve cluster observability and troubleshooting capabilities. By implementing custom event processing, correlation, and storage, operators can better understand cluster behavior and respond to issues more effectively.

    The solutions presented here can be extended and customized based on specific requirements while maintaining compatibility with the Kubernetes API and following best practices for scalability and reliability.

    Next steps

    Future enhancements could include:

    • Machine learning for anomaly detection
    • Integration with popular observability platforms
    • Custom event APIs for application-specific events
    • Enhanced visualization and reporting capabilities

    For more information on Kubernetes events and custom controllers, refer to the official Kubernetes documentation.

  • Modern generative AI and large language model (LLM) services create unique traffic-routing challenges on Kubernetes. Unlike typical short-lived, stateless web requests, LLM inference sessions are often long-running, resource-intensive, and partially stateful. For example, a single GPU-backed model server may keep multiple inference sessions active and maintain in-memory token caches.

    Traditional load balancers focused on HTTP path or round-robin lack the specialized capabilities needed for these workloads. They also don’t account for model identity or request criticality (e.g., interactive chat vs. batch jobs). Organizations often patch together ad-hoc solutions, but a standardized approach is missing.

    Gateway API Inference Extension

    Gateway API Inference Extension was created to address this gap by building on the existing Gateway API, adding inference-specific routing capabilities while retaining the familiar model of Gateways and HTTPRoutes. By adding an inference extension to your existing gateway, you effectively transform it into an Inference Gateway, enabling you to self-host GenAI/LLMs with a “model-as-a-service” mindset.

    The project’s goal is to improve and standardize routing to inference workloads across the ecosystem. Key objectives include enabling model-aware routing, supporting per-request criticalities, facilitating safe model roll-outs, and optimizing load balancing based on real-time model metrics. By achieving these, the project aims to reduce latency and improve accelerator (GPU) utilization for AI workloads.

    How it works

    The design introduces two new Custom Resources (CRDs) with distinct responsibilities, each aligning with a specific user persona in the AI/ML serving workflow:

    1. InferencePool Defines a pool of pods (model servers) running on shared compute (e.g., GPU nodes). The platform admin can configure how these pods are deployed, scaled, and balanced. An InferencePool ensures consistent resource usage and enforces platform-wide policies. An InferencePool is similar to a Service but specialized for AI/ML serving needs and aware of the model-serving protocol.

    2. InferenceModel A user-facing model endpoint managed by AI/ML owners. It maps a public name (e.g., "gpt-4-chat") to the actual model within an InferencePool. This lets workload owners specify which models (and optional fine-tuning) they want served, plus a traffic-splitting or prioritization policy.

    In summary, the InferenceModel API lets AI/ML owners manage what is served, while the InferencePool lets platform operators manage where and how it’s served.

    Request flow

    The flow of a request builds on the Gateway API model (Gateways and HTTPRoutes) with one or more extra inference-aware steps (extensions) in the middle. Here’s a high-level example of the request flow with the Endpoint Selection Extension (ESE):

    1. Gateway Routing
      A client sends a request (e.g., an HTTP POST to /completions). The Gateway (like Envoy) examines the HTTPRoute and identifies the matching InferencePool backend.

    2. Endpoint Selection
      Instead of simply forwarding to any available pod, the Gateway consults an inference-specific routing extension— the Endpoint Selection Extension—to pick the best of the available pods. This extension examines live pod metrics (queue lengths, memory usage, loaded adapters) to choose the ideal pod for the request.

    3. Inference-Aware Scheduling
      The chosen pod is the one that can handle the request with the lowest latency or highest efficiency, given the user’s criticality or resource needs. The Gateway then forwards traffic to that specific pod.

    This extra step provides a smarter, model-aware routing mechanism that still feels like a normal single request to the client. Additionally, the design is extensible—any Inference Gateway can be enhanced with additional inference-specific extensions to handle new routing strategies, advanced scheduling logic, or specialized hardware needs. As the project continues to grow, contributors are encouraged to develop new extensions that are fully compatible with the same underlying Gateway API model, further expanding the possibilities for efficient and intelligent GenAI/LLM routing.

    Benchmarks

    We evaluated this extension against a standard Kubernetes Service for a vLLM-based model serving deployment. The test environment consisted of multiple H100 (80 GB) GPU pods running vLLM (version 1) on a Kubernetes cluster, with 10 Llama2 model replicas. The Latency Profile Generator (LPG) tool was used to generate traffic and measure throughput, latency, and other metrics. The ShareGPT dataset served as the workload, and traffic was ramped from 100 Queries per Second (QPS) up to 1000 QPS.

    Key results

    • Comparable Throughput: Throughout the tested QPS range, the ESE delivered throughput roughly on par with a standard Kubernetes Service.

    • Lower Latency:

      • Per-Output-Token Latency: The ESE showed significantly lower p90 latency at higher QPS (500+), indicating that its model-aware routing decisions reduce queueing and resource contention as GPU memory approaches saturation.
      • Overall p90 Latency: Similar trends emerged, with the ESE reducing end-to-end tail latencies compared to the baseline, particularly as traffic increased beyond 400–500 QPS.

    These results suggest that this extension's model‐aware routing significantly reduced latency for GPU‐backed LLM workloads. By dynamically selecting the least‐loaded or best‐performing model server, it avoids hotspots that can appear when using traditional load balancing methods for large, long‐running inference requests.

    Roadmap

    As the Gateway API Inference Extension heads toward GA, planned features include:

    1. Prefix-cache aware load balancing for remote caches
    2. LoRA adapter pipelines for automated rollout
    3. Fairness and priority between workloads in the same criticality band
    4. HPA support for scaling based on aggregate, per-model metrics
    5. Support for large multi-modal inputs/outputs
    6. Additional model types (e.g., diffusion models)
    7. Heterogeneous accelerators (serving on multiple accelerator types with latency- and cost-aware load balancing)
    8. Disaggregated serving for independently scaling pools

    Summary

    By aligning model serving with Kubernetes-native tooling, Gateway API Inference Extension aims to simplify and standardize how AI/ML traffic is routed. With model-aware routing, criticality-based prioritization, and more, it helps ops teams deliver the right LLM services to the right users—smoothly and efficiently.

    Ready to learn more? Visit the project docs to dive deeper, give an Inference Gateway extension a try with a few simple steps, and get involved if you’re interested in contributing to the project!

  • From the Kubernetes Multicontainer Pods: An Overview blog post, you know what multi-container pods are for, what the main architectural patterns are, and how they are implemented in Kubernetes. The main thing I'll cover in this article is how to ensure that your sidecar containers start before the main app. It's more complicated than you might think!

    A gentle refresher

    I'd just like to remind readers that the v1.29.0 release of Kubernetes added native support for sidecar containers, which can now be defined within the .spec.initContainers field, but with restartPolicy: Always. You can see that illustrated in the following example Pod manifest snippet:

    initContainers:
    - name: logshipper
      image: alpine:latest
      restartPolicy: Always # this is what makes it a sidecar container
      command: ['sh', '-c', 'tail -F /opt/logs.txt']
      volumeMounts:
      - name: data
        mountPath: /opt
    

    What are the specifics of defining sidecars with a .spec.initContainers block, rather than as a legacy multi-container pod with multiple .spec.containers? Well, all .spec.initContainers are always launched before the main application. If you define Kubernetes-native sidecars, those are terminated after the main application. Furthermore, when used with Jobs, a sidecar container can still be alive (and could potentially even restart) while the owning Job's main container finishes: Kubernetes-native sidecar containers do not block pod completion.

    To learn more, you can also read the official Pod sidecar containers tutorial.

    The problem

    Now you know that defining a sidecar with this native approach will always start it before the main application. From the kubelet source code, it's visible that this often means being started almost in parallel, and this is not always what an engineer wants to achieve. What I'm really interested in is whether I can delay the start of the main application until the sidecar is not just started, but fully running and ready to serve.

    It might be a bit tricky, because the problem with sidecars is that there's no obvious success signal, contrary to init containers, which are designed to run to completion and then exit: with an init container, exit status 0 unambiguously means "I succeeded". With a sidecar, there are lots of points at which you can say "a thing is running".

    Starting one container only after the previous one is ready is part of a graceful deployment strategy, ensuring proper sequencing and stability during startup. It's also how I'd expect sidecar containers to work, to cover the scenario where the main application is dependent on the sidecar. For example, it may happen that an app errors out if the sidecar isn't available to serve requests (e.g., logging with DataDog). Sure, one could change the application code (and that would actually be the "best practice" solution), but sometimes they can't - and this post focuses on this use case.

    I'll explain some ways that you might try, and show you what approaches will really work.

    Readiness probe

    To check whether Kubernetes native sidecar delays the start of the main application until the sidecar is ready, let’s simulate a short investigation. Firstly, I’ll simulate a sidecar container which will never be ready by implementing a readiness probe which will never succeed. As a reminder, a readiness probe checks if the container is ready to start accepting traffic and therefore, if the pod can be used as a backend for services.

    (Unlike standard init containers, sidecar containers can have probes so that the kubelet can supervise the sidecar and intervene if there are problems. For example, restarting a sidecar container if it fails a health check.)

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp
      labels:
        app: myapp
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: myapp
      template:
        metadata:
          labels:
            app: myapp
        spec:
          containers:
          - name: myapp
            image: alpine:latest
            command: ["sh", "-c", "sleep 3600"]
          initContainers:
          - name: nginx
            image: nginx:latest
            restartPolicy: Always
            ports:
            - containerPort: 80
              protocol: TCP
            readinessProbe:
              exec:
                command:
                - /bin/sh
                - -c
                - exit 1 # this command always fails, keeping the container "Not Ready"
              periodSeconds: 5
          volumes:
          - name: data
            emptyDir: {}
    

    The result is:

    controlplane $ kubectl get pods -w
    NAME READY STATUS RESTARTS AGE
    myapp-db5474f45-htgw5 1/2 Running 0 9m28s
    
    controlplane $ kubectl describe pod myapp-db5474f45-htgw5
    Name: myapp-db5474f45-htgw5
    Namespace: default
    (...)
    Events:
     Type Reason Age From Message
     ---- ------ ---- ---- -------
     Normal Scheduled 17s default-scheduler Successfully assigned default/myapp-db5474f45-htgw5 to node01
     Normal Pulling 16s kubelet Pulling image "nginx:latest"
     Normal Pulled 16s kubelet Successfully pulled image "nginx:latest" in 163ms (163ms including waiting). Image size: 72080558 bytes.
     Normal Created 16s kubelet Created container nginx
     Normal Started 16s kubelet Started container nginx
     Normal Pulling 15s kubelet Pulling image "alpine:latest"
     Normal Pulled 15s kubelet Successfully pulled image "alpine:latest" in 159ms (160ms including waiting). Image size: 3652536 bytes.
     Normal Created 15s kubelet Created container myapp
     Normal Started 15s kubelet Started container myapp
     Warning Unhealthy 1s (x6 over 15s) kubelet Readiness probe failed:
    

    From these logs it’s evident that only one container is ready - and I know it can’t be the sidecar, because I’ve defined it so it’ll never be ready (you can also check container statuses in kubectl get pod -o json). I also saw that myapp has been started before the sidecar is ready. That was not the result I wanted to achieve; in this case, the main app container has a hard dependency on its sidecar.

    Maybe a startup probe?

    To ensure that the sidecar is ready before the main app container starts, I can define a startupProbe. It will delay the start of the main container until the command is successfully executed (returns a 0 exit status). If you're wondering why I've added it to my initContainer, let's analyse what would happen if I'd added it to the myapp container: I wouldn't have any guarantee that the probe would run before the main application code - which can potentially error out without the sidecar being up and running.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp
      labels:
        app: myapp
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: myapp
      template:
        metadata:
          labels:
            app: myapp
        spec:
          containers:
          - name: myapp
            image: alpine:latest
            command: ["sh", "-c", "sleep 3600"]
          initContainers:
          - name: nginx
            image: nginx:latest
            ports:
            - containerPort: 80
              protocol: TCP
            restartPolicy: Always
            startupProbe:
              httpGet:
                path: /
                port: 80
              initialDelaySeconds: 5
              periodSeconds: 30
              failureThreshold: 10
              timeoutSeconds: 20
          volumes:
          - name: data
            emptyDir: {}
    

    This results in 2/2 containers being ready and running, and from events, it can be inferred that the main application started only after nginx had already been started. But to confirm whether it waited for the sidecar readiness, let’s change the startupProbe to the exec type of command:

    startupProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - sleep 15
    

    and run kubectl get pods -w to watch in real time whether the readiness of both containers only changes after a 15 second delay. Again, events confirm the main application starts after the sidecar. That means that using the startupProbe with a correct startupProbe.httpGet request helps to delay the main application start until the sidecar is ready. It’s not optimal, but it works.

    What about the postStart lifecycle hook?

    Fun fact: using the postStart lifecycle hook block will also do the job, but I’d have to write my own mini-shell script, which is even less efficient.

    initContainers:
    - name: nginx
      image: nginx:latest
      restartPolicy: Always
      ports:
      - containerPort: 80
        protocol: TCP
      lifecycle:
        postStart:
          exec:
            command:
            - /bin/sh
            - -c
            - |
              echo "Waiting for readiness at http://localhost:80"
              until curl -sf http://localhost:80; do
                echo "Still waiting for http://localhost:80..."
                sleep 5
              done
              echo "Service is ready at http://localhost:80"
    

    Liveness probe

    An interesting exercise would be to check the sidecar container behavior with a liveness probe. A liveness probe behaves and is configured similarly to a readiness probe - only with the difference that it doesn’t affect the readiness of the container but restarts it in case the probe fails.

    livenessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - exit 1 # this command always fails, causing the kubelet to restart the container
      periodSeconds: 5
    

    After adding the liveness probe configured just as the previous readiness probe and checking events of the pod by kubectl describe pod it’s visible that the sidecar has a restart count above 0. Nevertheless, the main application is not restarted nor influenced at all, even though I'm aware that (in our imaginary worst-case scenario) it can error out when the sidecar is not there serving requests. What if I’d used a livenessProbe without lifecycle postStart? Both containers will be immediately ready: at the beginning, this behavior will not be different from the one without any additional probes since the liveness probe doesn’t affect readiness at all. After a while, the sidecar will begin to restart itself, but it won’t influence the main container.

    Findings summary

    I’ll summarize the startup behavior in the table below:

    • readinessProbe - Sidecar starts before the main app? Yes, but almost in parallel (effectively no). Main app waits for the sidecar to be ready? No. If the check doesn't pass: the sidecar is not ready; the main app continues running.
    • livenessProbe - Sidecar starts before the main app? Yes, but almost in parallel (effectively no). Main app waits for the sidecar to be ready? No. If the check doesn't pass: the sidecar is restarted; the main app continues running.
    • startupProbe - Sidecar starts before the main app? Yes. Main app waits for the sidecar to be ready? Yes. If the check doesn't pass: the main app is not started.
    • postStart - Sidecar starts before the main app? Yes, the main app container starts after postStart completes. Main app waits for the sidecar to be ready? Yes, but you have to provide custom logic for that. If the check doesn't pass: the main app is not started.

    To summarize: with sidecars often being a dependency of the main application, you may want to delay the start of the latter until the sidecar is healthy. The ideal pattern is to start both containers simultaneously and have the app container logic handle waiting and retrying at every level, but that's not always possible. If delaying the main app is what you need, you have to use the right kind of customization in the Pod definition. Thankfully, it's nice and quick, and you have the recipe ready above.

    Happy deploying!