Kubernetes News

The Kubernetes blog is used by the project to communicate new features, community reports, and any news that might be relevant to the Kubernetes community.
  • Today, the ingress-nginx maintainers have released patches for a batch of critical vulnerabilities that could make it easy for attackers to take over your Kubernetes cluster. If you are among the over 40% of Kubernetes administrators using ingress-nginx, you should take action immediately to protect your users and data.

    Background

    Ingress is the traditional Kubernetes feature for exposing your workload Pods to the world so that they can be useful. In an implementation-agnostic way, Kubernetes users can define how their applications should be made available on the network. Then, an ingress controller uses that definition to set up local or cloud resources as required for the user’s particular situation and needs.

    Many different ingress controllers are available, to suit users of different cloud providers or brands of load balancers. Ingress-nginx is a software-only ingress controller provided by the Kubernetes project. Because of its versatility and ease of use, ingress-nginx is quite popular: it is deployed in over 40% of Kubernetes clusters!

    Ingress-nginx translates the requirements from Ingress objects into configuration for nginx, a powerful open source webserver daemon. Then, nginx uses that configuration to accept and route requests to the various applications running within a Kubernetes cluster. Proper handling of these nginx configuration parameters is crucial, because ingress-nginx needs to allow users significant flexibility while preventing them from accidentally or intentionally tricking nginx into doing things it shouldn’t.

    Vulnerabilities Patched Today

    Four of today’s ingress-nginx patches improve how ingress-nginx handles particular bits of nginx config. Without these fixes, a specially-crafted Ingress object can cause nginx to misbehave in various ways, including revealing the values of Secrets that are accessible to ingress-nginx. By default, ingress-nginx has access to all Secrets cluster-wide, so this can often lead to complete cluster takeover by any user or entity that has permission to create an Ingress.

    The most serious of today’s vulnerabilities, CVE-2025-1974, rated 9.8 CVSS, allows anything on the Pod network to exploit configuration injection vulnerabilities via the Validating Admission Controller feature of ingress-nginx. This makes such vulnerabilities far more dangerous: ordinarily one would need to be able to create an Ingress object in the cluster, which is a fairly privileged action. When combined with today’s other vulnerabilities, CVE-2025-1974 means that anything on the Pod network has a good chance of taking over your Kubernetes cluster, with no credentials or administrative access required. In many common scenarios, the Pod network is accessible to all workloads in your cloud VPC, or even anyone connected to your corporate network! This is a very serious situation.

    Today, we have released ingress-nginx v1.12.1 and v1.11.5, which have fixes for all five of these vulnerabilities.

    Your next steps

    First, determine if your clusters are using ingress-nginx. In most cases, you can check this by running kubectl get pods --all-namespaces --selector app.kubernetes.io/name=ingress-nginx with cluster administrator permissions.

    If you are using ingress-nginx, make a plan to remediate these vulnerabilities immediately.

    The best and easiest remedy is to upgrade to the new patch release of ingress-nginx. All five of today’s vulnerabilities are fixed by installing today’s patches.

    If you can’t upgrade right away, you can significantly reduce your risk by turning off the Validating Admission Controller feature of ingress-nginx.

    • If you have installed ingress-nginx using Helm
      • Reinstall, setting the Helm value controller.admissionWebhooks.enabled=false (see the values sketch after this list)
    • If you have installed ingress-nginx manually
      • Delete the ValidatingWebhookConfiguration called ingress-nginx-admission
      • Edit the ingress-nginx-controller Deployment or DaemonSet, removing --validating-webhook from the controller container’s argument list
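
    For the Helm path, the same setting can also be expressed in a values file rather than on the command line. This is a minimal sketch using the controller.admissionWebhooks.enabled value mentioned above (the file name is a placeholder; apply it with your usual helm upgrade workflow, and remember this is a temporary mitigation only):

    # values-disable-webhook.yaml (hypothetical file name)
    controller:
      admissionWebhooks:
        # Temporarily disable the validating webhook as a mitigation for CVE-2025-1974.
        enabled: false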

    If you turn off the Validating Admission Controller feature as a mitigation for CVE-2025-1974, remember to turn it back on after you upgrade. This feature provides important quality of life improvements for your users, warning them about incorrect Ingress configurations before they can take effect.

    Conclusion, thanks, and further reading

    The ingress-nginx vulnerabilities announced today, including CVE-2025-1974, present a serious risk to many Kubernetes users and their data. If you use ingress-nginx, you should take action immediately to keep yourself safe.

    Thanks go out to Nir Ohfeld, Sagi Tzadik, Ronen Shustin, and Hillai Ben-Sasson from Wiz for responsibly disclosing these vulnerabilities, and for working with the members of the Kubernetes Security Response Committee (SRC) and the ingress-nginx maintainers (Marco Ebert and James Strong) to ensure we fixed them effectively.

    For further information about the maintenance and future of ingress-nginx, please see this GitHub issue and/or attend James and Marco’s KubeCon/CloudNativeCon EU 2025 presentation.

    For further information about the specific vulnerabilities discussed in this article, please see the appropriate GitHub issue: CVE-2025-24513, CVE-2025-24514, CVE-2025-1097, CVE-2025-1098, or CVE-2025-1974

  • Authors: Daniel Vega-Myhre (Google), Abdullah Gharaibeh (Google), Kevin Hannon (Red Hat)

    In this article, we introduce JobSet, an open source API for representing distributed jobs. The goal of JobSet is to provide a unified API for distributed ML training and HPC workloads on Kubernetes.

    Why JobSet?

    The Kubernetes community’s recent enhancements to the batch ecosystem have attracted ML engineers, who have found Kubernetes to be a natural fit for the requirements of running distributed training workloads.

    Large ML models (particularly LLMs) which cannot fit into the memory of the GPU or TPU chips on a single host are often distributed across tens of thousands of accelerator chips, which in turn may span thousands of hosts.

    As such, the model training code is often containerized and executed simultaneously on all these hosts, sharding the model parameters and/or the training dataset across the target accelerator chips and using collective communication primitives like all-gather and all-reduce to perform distributed computations and synchronize gradients between hosts.

    These characteristics make Kubernetes a great fit for this type of workload, as efficiently scheduling and managing the lifecycle of containerized applications across a cluster of compute resources is an area where it shines.

    It is also very extensible, allowing developers to define their own Kubernetes APIs, objects, and controllers that manage the behavior and life cycle of these objects, enabling engineers to develop custom distributed training orchestration solutions to fit their needs.

    However, as distributed ML training techniques continue to evolve, the existing Kubernetes primitives alone no longer model them adequately.

    Furthermore, the landscape of Kubernetes distributed training orchestration APIs has become fragmented, and each of the existing solutions in this fragmented landscape has certain limitations that make it non-optimal for distributed ML training.

    For example, the KubeFlow training operator defines custom APIs for different frameworks (e.g. PyTorchJob, TFJob, MPIJob, etc.); however, each of these job types is in fact a solution tailored specifically to the target framework, each with different semantics and behavior.

    On the other hand, the Job API fixed many gaps for running batch workloads, including Indexed completion mode, higher scalability, Pod failure policies and Pod backoff policy to mention a few of the most recent enhancements. However, running ML training and HPC workloads using the upstream Job API requires extra orchestration to fill the following gaps:

    Multi-template Pods : Most HPC or ML training jobs include more than one type of Pod. The different Pods are part of the same workload, but they need to run a different container, request different resources, or have different failure policies. A common example is the driver-worker pattern.

    Job groups : Large scale training workloads span multiple network topologies, running across multiple racks for example. Such workloads are network latency sensitive, and aim to localize communication and minimize traffic crossing the higher-latency network links. To facilitate this, the workload needs to be split into groups of Pods each assigned to a network topology.

    Inter-Pod communication : Create and manage the resources (e.g. headless Services) necessary to establish communication between the Pods of a job.

    Startup sequencing : Some jobs require a specific start sequence of pods; sometimes the driver is expected to start first (like Ray or Spark), in other cases the workers are expected to be ready before starting the driver (like MPI).

    JobSet aims to address those gaps using the Job API as a building block to build a richer API for large-scale distributed HPC and ML use cases.

    How JobSet Works

    JobSet models a distributed batch workload as a group of Kubernetes Jobs. This allows a user to easily specify different pod templates for different distinct groups of pods (e.g. a leader, workers, parameter servers, etc.).

    It uses the abstraction of a ReplicatedJob to manage child Jobs, where a ReplicatedJob is essentially a Job Template with some desired number of Job replicas specified. This provides a declarative way to easily create identical child-jobs to run on different islands of accelerators, without resorting to scripting or Helm charts to generate many versions of the same job but with different names.

    Some other key JobSet features which address the problems described above include:

    Replicated Jobs : In modern data centers, hardware accelerators like GPUs and TPUs are allocated in islands of homogeneous accelerators connected via specialized, high-bandwidth network links. For example, a user might provision nodes containing a group of hosts co-located on a rack, each with H100 GPUs, where the GPU chips within each host are connected via NVLink, with an NVLink Switch connecting the multiple NVLinks. TPU Pods are another example of this: TPU ViperLitePods consist of 64 hosts, each with 4 TPU v5e chips attached, all connected via an ICI mesh. When running a distributed training job across multiple of these islands, we often want to partition the workload into a group of smaller identical jobs, one per island, where each pod primarily communicates with the pods within the same island to do segments of distributed computation, keeping gradient synchronization over DCN (the data center network, which is lower bandwidth than ICI) to a bare minimum.

    Automatic headless service creation, configuration, and lifecycle management : Pod-to-pod communication via pod hostname is enabled by default, with automatic configuration and lifecycle management of the headless service enabling this.

    Configurable success policies : JobSet has configurable success policies which target specific ReplicatedJobs, with operators to target “Any” or “All” of their child jobs. For example, you can configure the JobSet to be marked complete if and only if all pods that are part of the “worker” ReplicatedJob are completed.
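
    As a rough sketch, the success policy described above would be expressed in the JobSet spec along these lines (field names follow the v1alpha2 API; consult the JobSet documentation for the authoritative schema):

    spec:
      # Mark the JobSet complete once all child Jobs of the "workers"
      # ReplicatedJob have completed successfully.
      successPolicy:
        operator: All
        targetReplicatedJobs:
        - workers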

    Configurable failure policies : JobSet has configurable failure policies which allow the user to specify a maximum number of times the JobSet should be restarted in the event of a failure. If any job is marked failed, the entire JobSet will be recreated, allowing the workload to resume from the last checkpoint. When no failure policy is specified, if any job fails, the JobSet simply fails.

    Exclusive placement per topology domain : JobSet allows users to express that child jobs have 1:1 exclusive assignment to a topology domain, typically an accelerator island like a rack. For example, if the JobSet creates two child jobs, then this feature will enforce that the pods of each child job will be co-located on the same island, and that only one child job is allowed to schedule per island. This is useful for scenarios where we want to use a distributed data parallel (DDP) training strategy to train a model using multiple islands of compute resources (GPU racks or TPU slices), running one model replica in each accelerator island, ensuring that the forward and backward passes within a single model replica occur over the high-bandwidth interconnect linking the accelerator chips within the island, and that only the gradient synchronization between model replicas crosses accelerator islands over the lower-bandwidth data center network.

    Integration with Kueue : Users can submit JobSets via Kueue to oversubscribe their clusters, queue workloads to run as capacity becomes available, prevent partial scheduling and deadlocks, enable multi-tenancy, and more.
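
    In practice, submitting a JobSet through Kueue is typically just a matter of labelling it with the target queue. The following fragment is a sketch that assumes a LocalQueue named user-queue exists and that the JobSet integration is enabled in your Kueue installation:

    metadata:
      name: multislice
      labels:
        # Tell Kueue which LocalQueue should admit this JobSet.
        kueue.x-k8s.io/queue-name: user-queue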

    Example use case

    Distributed ML training on multiple TPU slices with Jax

    The following example is a JobSet spec for running a TPU Multislice workload on 4 TPU v5e slices. To learn more about TPU concepts and terminology, please refer to these docs.

    This example uses Jax, an ML framework with native support for Just-In-Time (JIT) compilation targeting TPU chips via OpenXLA. However, you can also use PyTorch/XLA to do ML training on TPUs.

    This example makes use of several JobSet features (both explicitly and implicitly) to support the unique scheduling requirements of TPU multislice training out-of-the-box with very little configuration required by the user.

    # Run a simple Jax workload on
    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: multislice
      annotations:
        # Give each child Job exclusive usage of a TPU slice
        alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
    spec:
      failurePolicy:
        maxRestarts: 3
      replicatedJobs:
      - name: workers
        replicas: 4 # Set to number of TPU slices
        template:
          spec:
            parallelism: 2 # Set to number of VMs per TPU slice
            completions: 2 # Set to number of VMs per TPU slice
            backoffLimit: 0
            template:
              spec:
                hostNetwork: true
                dnsPolicy: ClusterFirstWithHostNet
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                  cloud.google.com/gke-tpu-topology: 2x4
                containers:
                - name: jax-tpu
                  image: python:3.8
                  ports:
                  - containerPort: 8471
                  - containerPort: 8080
                  securityContext:
                    privileged: true
                  command:
                  - bash
                  - -c
                  - |
                    pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                    python -c 'import jax; print("Global device count:", jax.device_count())'
                    sleep 60
                  resources:
                    limits:
                      google.com/tpu: 4

    Future work and getting involved

    We have a number of features planned for development this year, which can be found in the JobSet roadmap.

    Please feel free to reach out with feedback of any kind. We’re also open to additional contributors, whether that is fixing or reporting bugs, adding new features, or writing documentation.

    You can get in touch with us via our repo, mailing list or on Slack.

    Last but not least, thanks to all our contributors who made this project possible!

  • In our ongoing SIG Spotlight series, we dive into the heart of the Kubernetes project by talking to the leaders of its various Special Interest Groups (SIGs). This time, we focus on SIG Apps, the group responsible for everything related to developing, deploying, and operating applications on Kubernetes. Sandipan Panda (DevZero) had the opportunity to interview Maciej Szulik (Defense Unicorns) and Janet Kuo (Google), the chairs and tech leads of SIG Apps. They shared their experiences, challenges, and visions for the future of application management within the Kubernetes ecosystem.

    Introductions

    Sandipan: Hello, could you start by telling us a bit about yourself, your role, and your journey within the Kubernetes community that led to your current roles in SIG Apps?

    Maciej: Hey, my name is Maciej, and I’m one of the leads for SIG Apps. Aside from this role, you can also find me helping SIG CLI and also being one of the Steering Committee members. I’ve been contributing to Kubernetes since late 2014 in various areas, including controllers, apiserver, and kubectl.

    Janet: Certainly! I'm Janet, a Staff Software Engineer at Google, and I've been deeply involved with the Kubernetes project since its early days, even before the 1.0 launch in 2015. It's been an amazing journey!

    My current role within the Kubernetes community is one of the chairs and tech leads of SIG Apps. My journey with SIG Apps started organically. I started with building the Deployment API and adding rolling update functionalities. I naturally gravitated towards SIG Apps and became increasingly involved. Over time, I took on more responsibilities, culminating in my current leadership roles.

    About SIG Apps

    All following answers were jointly provided by Maciej and Janet.

    Sandipan: For those unfamiliar, could you provide an overview of SIG Apps' mission and objectives? What key problems does it aim to solve within the Kubernetes ecosystem?

    As described in our charter, we cover a broad area related to developing, deploying, and operating applications on Kubernetes. That, in short, means we’re open to each and everyone showing up at our bi-weekly meetings and discussing the ups and downs of writing and deploying various applications on Kubernetes.

    Sandipan: What are some of the most significant projects or initiatives currently being undertaken by SIG Apps?

    At this point in time, the main factors driving the development of our controllers are the challenges coming from running various AI-related workloads. It’s worth giving credit here to two working groups we’ve sponsored over the past years:

    1. The Batch Working Group, which is looking at running HPC, AI/ML, and data analytics jobs on top of Kubernetes.
    2. The Serving Working Group, which is focusing on hardware-accelerated AI/ML inference.

    Best practices and challenges

    Sandipan: SIG Apps plays a crucial role in developing application management best practices for Kubernetes. Can you share some of these best practices and how they help improve application lifecycle management?

    1. Implementing health checks and readiness probes ensures that your applications are healthy and ready to serve traffic, leading to improved reliability and uptime (see the probe sketch after this list). Combined with comprehensive logging, monitoring, and tracing solutions, this will provide insight into your application's behavior, enabling you to identify and resolve issues quickly.

    2. Auto-scale your application based on resource utilization or custom metrics, optimizing resource usage and ensuring your application can handle varying loads.

    3. Use Deployment for stateless applications, StatefulSet for stateful applications, Job and CronJob for batch workloads, and DaemonSet for running a daemon on each node. Use Operators and CRDs to extend the Kubernetes API to automate the deployment, management, and lifecycle of complex applications, making them easier to operate and reducing manual intervention.
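
    As a minimal sketch of the first practice, a Deployment with liveness and readiness probes might look like the following (the image, port, and probe paths are placeholders to adapt to your application):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web-app                    # hypothetical application name
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: web-app
      template:
        metadata:
          labels:
            app: web-app
        spec:
          containers:
          - name: web-app
            image: registry.example.com/web-app:1.0.0   # placeholder image
            ports:
            - containerPort: 8080
            # Readiness probe: only route traffic to Pods that report ready.
            readinessProbe:
              httpGet:
                path: /readyz
                port: 8080
              initialDelaySeconds: 5
              periodSeconds: 10
            # Liveness probe: restart the container if it stops responding.
            livenessProbe:
              httpGet:
                path: /healthz
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 20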

    Sandipan: What are some of the common challenges SIG Apps faces, and how do you address them?

    The biggest challenge we’re facing all the time is the need to reject a lot of features, ideas, and improvements. This requires a lot of discipline and patience to be able to explain the reasons behind those decisions.

    Sandipan: How has the evolution of Kubernetes influenced the work of SIG Apps? Are there any recent changes or upcoming features in Kubernetes that you find particularly relevant or beneficial for SIG Apps?

    The main benefit, both for us and for the whole community around SIG Apps, is the ability to extend Kubernetes with Custom Resource Definitions, and the fact that users can build their own custom controllers that leverage the built-in ones to achieve whatever sophisticated use cases they might have that we, as the core maintainers, haven’t considered or weren’t able to efficiently resolve inside Kubernetes.

    Contributing to SIG Apps

    Sandipan: What opportunities are available for new contributors who want to get involved with SIG Apps, and what advice would you give them?

    We get the question, "What good first issue might you recommend we start with?" a lot :-) But unfortunately, there’s no easy answer to it. We always tell everyone that the best option to start contributing to core controllers is to find one you are willing to spend some time with. Read through the code, then try running unit tests and integration tests focusing on that controller. Once you grasp the general idea, try breaking it and running the tests again to verify the breakage. Once you start feeling confident you understand that particular controller, you may want to search through open issues affecting that controller and either provide suggestions, explaining the problem users have, or maybe attempt your first fix.

    Like we said, there are no shortcuts on that road; you need to spend the time with the codebase to understand all the edge cases we’ve slowly built up to get to the point where we are. Once you’re successful with one controller, you’ll need to repeat that same process with others all over again.

    Sandipan: How does SIG Apps gather feedback from the community, and how is this feedback integrated into your work?

    We always encourage everyone to show up and present their problems and solutions during our bi-weekly meetings. As long as you’re solving an interesting problem on top of Kubernetes and you can provide valuable feedback about any of the core controllers, we’re always happy to hear from everyone.

    Looking ahead

    Sandipan: Looking ahead, what are the key focus areas or upcoming trends in application management within Kubernetes that SIG Apps is excited about? How is the SIG adapting to these trends?

    Definitely the current AI hype is the major driving factor; as mentioned above, we have two working groups, each covering a different aspect of it.

    Sandipan: What are some of your favorite things about this SIG?

    Without a doubt, the people who participate in our meetings and on Slack, who tirelessly help triage issues and pull requests and invest a lot of their time (very frequently their private time) into making Kubernetes great!


    SIG Apps is an essential part of the Kubernetes community, helping to shape how applications are deployed and managed at scale. From its work on improving Kubernetes' workload APIs to driving innovation in AI/ML application management, SIG Apps is continually adapting to meet the needs of modern application developers and operators. Whether you’re a new contributor or an experienced developer, there’s always an opportunity to get involved and make an impact.

    If you’re interested in learning more or contributing to SIG Apps, be sure to check out their SIG README and join their bi-weekly meetings.

  • In this SIG etcd spotlight we talked with James Blair, Marek Siarkowicz, Wenjia Zhang, and Benjamin Wang to learn a bit more about this Kubernetes Special Interest Group.

    Introducing SIG etcd

    Frederico: Hello, thank you for the time! Let’s start with some introductions, could you tell us a bit about yourself, your role and how you got involved in Kubernetes.

    Benjamin: Hello, I am Benjamin. I am a SIG etcd Tech Lead and one of the etcd maintainers. I work for VMware, which is part of the Broadcom group. I got involved in Kubernetes & etcd & CSI (Container Storage Interface) because of work and also a big passion for open source. I have been working on Kubernetes & etcd (and also CSI) since 2020.

    James: Hey team, I’m James, a co-chair for SIG etcd and etcd maintainer. I work at Red Hat as a Specialist Architect helping people adopt cloud native technology. I got involved with the Kubernetes ecosystem in 2019. Around the end of 2022 I noticed how the etcd community and project needed help so started contributing as often as I could. There is a saying in our community that "you come for the technology, and stay for the people": for me this is absolutely real, it’s been a wonderful journey so far and I’m excited to support our community moving forward.

    Marek: Hey everyone, I'm Marek, the SIG etcd lead. At Google, I lead the GKE etcd team, ensuring a stable and reliable experience for all GKE users. My Kubernetes journey began with SIG Instrumentation, where I created and led the Kubernetes Structured Logging effort.
    I'm still the main project lead for Kubernetes Metrics Server, providing crucial signals for autoscaling in Kubernetes. I started working on etcd 3 years ago, right around the 3.5 release. We faced some challenges, but I'm thrilled to see etcd now the most scalable and reliable it's ever been, with the highest contribution numbers in the project's history. I'm passionate about distributed systems, extreme programming, and testing.

    Wenjia: Hi there, my name is Wenjia, I am the co-chair of SIG etcd and one of the etcd maintainers. I work at Google as an Engineering Manager, working on GKE (Google Kubernetes Engine) and GDC (Google Distributed Cloud). I have been working in the area of open source Kubernetes and etcd since the Kubernetes v1.10 and etcd v3.1 releases. I got involved in Kubernetes because of my job, but what keeps me in the space is the charm of the container orchestration technology, and more importantly, the awesome open source community.

    Becoming a Kubernetes Special Interest Group (SIG)

    Frederico: Excellent, thank you. I'd like to start with the origin of the SIG itself: SIG etcd is a very recent SIG, could you quickly go through the history and reasons behind its creation?

    Marek: Absolutely! SIG etcd was formed because etcd is a critical component of Kubernetes, serving as its data store. However, etcd was facing challenges like maintainer turnover and reliability issues. Creating a dedicated SIG allowed us to focus on addressing these problems, improving development and maintenance processes, and ensuring etcd evolves in sync with the cloud-native landscape.

    Frederico: And has becoming a SIG worked out as expected? Better yet, are the motivations you just described being addressed, and to what extent?

    Marek: It's been a positive change overall. Becoming a SIG has brought more structure and transparency to etcd's development. We've adopted Kubernetes processes like KEPs (Kubernetes Enhancement Proposals) and PRRs (Production Readiness Reviews), which has improved our feature development and release cycle.

    Frederico: On top of those, what would you single out as the major benefit that has resulted from becoming a SIG?

    Marek: The biggest benefit for me was adopting the Kubernetes testing infrastructure, tools like Prow and TestGrid. For large projects like etcd there is just no comparison to the default GitHub tooling. Having known, easy-to-use, clear tools is a major boost for etcd, as it makes it much easier for Kubernetes contributors to also help etcd.

    Wenjia: Totally agree, while challenges remain, the SIG structure provides a solid foundation for addressing them and ensuring etcd's continued success as a critical component of the Kubernetes ecosystem.

    The positive impact on the community is another crucial aspect of SIG etcd's success that I’d like to highlight. The Kubernetes SIG structure has created a welcoming environment for etcd contributors, leading to increased participation from the broader Kubernetes community. We have had greater collaboration with other SIGs like SIG API Machinery, SIG Scalability, SIG Testing, SIG Cluster Lifecycle, etc.

    This collaboration helps ensure etcd's development aligns with the needs of the wider Kubernetes ecosystem. The formation of the etcd Operator Working Group under the joint effort between SIG etcd and SIG Cluster Lifecycle exemplifies this successful collaboration, demonstrating a shared commitment to improving etcd's operational aspects within Kubernetes.

    Frederico: Since you mentioned collaboration, have you seen changes in terms of contributors and community involvement in recent months?

    James: Yes -- as shown in our unique PR author data, we recently hit an all-time high in March and are trending in a positive direction:

    Additionally, looking at our overall contributions across all etcd project repositories we are also observing a positive trend showing a resurgence in etcd project activity:

    The road ahead

    Frederico: That's quite telling, thank you. In terms of the near future, what are the current priorities for SIG etcd?

    Marek: Reliability is always top of mind – we need to make sure etcd is rock-solid. We're also working on making etcd easier to use and manage for operators. And we have our sights set on making etcd a viable standalone solution for infrastructure management, not just for Kubernetes. Oh, and of course, scaling – we need to ensure etcd can handle the growing demands of the cloud-native world.

    Benjamin: I agree that reliability should always be our top guiding principle. We need to ensure not only correctness but also compatibility. Additionally, we should continuously strive to improve the understandability and maintainability of etcd. Our focus should be on addressing the pain points that the community cares about the most.

    Frederico: Are there any specific SIGs that you work closely with?

    Marek: SIG API Machinery, for sure – they own the structure of the data etcd stores, so we're constantly working together. And SIG Cluster Lifecycle – etcd is a key part of Kubernetes clusters, so we collaborate on the newly created etcd Operator Working Group.

    Wenjia: Other than SIG API Machinery and SIG Cluster Lifecycle, which Marek mentioned above, SIG Scalability and SIG Testing are two other groups that we work closely with.

    Frederico: In a more general sense, how would you list the key challenges for SIG etcd in the evolving cloud native landscape?

    Marek: Well, reliability is always a challenge when you're dealing with critical data. The cloud-native world is evolving so fast that scaling to meet those demands is a constant effort.

    Getting involved

    Frederico: We're almost at the end of our conversation, but for those interested in etcd, how can they get involved?

    Marek: We'd love to have them! The best way to start is to join our SIG etcd meetings, follow discussions on the etcd-dev mailing list, and check out our GitHub issues. We're always looking for people to review proposals, test code, and contribute to documentation.

    Wenjia: I love this question 😀 . There are numerous ways for people interested in contributing to SIG etcd to get involved and make a difference. Here are some key areas where you can help:

    Code Contributions:

    • Bug Fixes: Tackle existing issues in the etcd codebase. Start with issues labeled "good first issue" or "help wanted" to find tasks that are suitable for newcomers.
    • Feature Development: Contribute to the development of new features and enhancements. Check the etcd roadmap and discussions to see what's being planned and where your skills might fit in.
    • Testing and Code Reviews: Help ensure the quality of etcd by writing tests, reviewing code changes, and providing feedback.
    • Documentation: Improve etcd's documentation by adding new content, clarifying existing information, or fixing errors. Clear and comprehensive documentation is essential for users and contributors.
    • Community Support: Answer questions on forums, mailing lists, or Slack channels. Helping others understand and use etcd is a valuable contribution.

    Getting Started:

    • Join the community: Start by joining the etcd community on Slack, attending SIG meetings, and following the mailing lists. This will help you get familiar with the project, its processes, and the people involved.
    • Find a mentor: If you're new to open source or etcd, consider finding a mentor who can guide you and provide support. Stay tuned! The first cohort of our mentorship program was very successful, and a new round is coming up.
    • Start small: Don't be afraid to start with small contributions. Even fixing a typo in the documentation or submitting a simple bug fix can be a great way to get involved.

    By contributing to etcd, you'll not only be helping to improve a critical piece of the cloud-native ecosystem but also gaining valuable experience and skills. So, jump in and start contributing!

    Frederico: Excellent, thank you. Lastly, one piece of advice that you'd like to give to other newly formed SIGs?

    Marek: Absolutely! My advice would be to embrace the established processes of the larger community, prioritize collaboration with other SIGs, and focus on building a strong community.

    Wenjia: Here are some tips I myself found very helpful in my OSS journey:

    • Be patient: Open source development can take time. Don't get discouraged if your contributions aren't accepted immediately or if you encounter challenges.
    • Be respectful: The etcd community values collaboration and respect. Be mindful of others' opinions and work together to achieve common goals.
    • Have fun: Contributing to open source should be enjoyable. Find areas that interest you and contribute in ways that you find fulfilling.

    Frederico: A great way to end this spotlight, thank you all!


    For more information and resources, please take a look at:

    1. etcd website: https://etcd.io/
    2. etcd GitHub repository: https://github.com/etcd-io/etcd
    3. etcd community: https://etcd.io/community/
  • A new nftables mode for kube-proxy was introduced as an alpha feature in Kubernetes 1.29. Currently in beta, it is expected to be GA as of 1.33. The new mode fixes long-standing performance problems with the iptables mode and all users running on systems with reasonably-recent kernels are encouraged to try it out. (For compatibility reasons, even once nftables becomes GA, iptables will still be the default.)

    Why nftables? Part 1: data plane latency

    The iptables API was designed for implementing simple firewalls, and has problems scaling up to support Service proxying in a large Kubernetes cluster with tens of thousands of Services.

    In general, the ruleset generated by kube-proxy in iptables mode has a number of iptables rules proportional to the sum of the number of Services and the total number of endpoints. In particular, at the top level of the ruleset, there is one rule to test each possible Service IP (and port) that a packet might be addressed to:

    # If the packet is addressed to 172.30.0.41:80, then jump to the chain
    # KUBE-SVC-XPGD46QRK7WJZT7O for further processing
    -A KUBE-SERVICES -m comment --comment "namespace1/service1:p80 cluster IP" -m tcp -p tcp -d 172.30.0.41 --dport 80 -j KUBE-SVC-XPGD46QRK7WJZT7O
    # If the packet is addressed to 172.30.0.42:443, then...
    -A KUBE-SERVICES -m comment --comment "namespace2/service2:p443 cluster IP" -m tcp -p tcp -d 172.30.0.42 --dport 443 -j KUBE-SVC-GNZBNJ2PO5MGZ6GT
    # etc...
    -A KUBE-SERVICES -m comment --comment "namespace3/service3:p80 cluster IP" -m tcp -p tcp -d 172.30.0.43 --dport 80 -j KUBE-SVC-X27LE4BHSL4DOUIK
    

    This means that when a packet comes in, the time it takes the kernel to check it against all of the Service rules is O(n) in the number of Services. As the number of Services increases, both the average and the worst-case latency for the first packet of a new connection increases (with the difference between best-case, average, and worst-case being mostly determined by whether a given Service IP address appears earlier or later in the KUBE-SERVICES chain).

    By contrast, with nftables, the normal way to write a ruleset like this is to have a single rule, using a "verdict map" to do the dispatch:

    table ip kube-proxy {
        # The service-ips verdict map indicates the action to take for each matching packet.
        map service-ips {
            type ipv4_addr . inet_proto . inet_service : verdict
            comment "ClusterIP, ExternalIP and LoadBalancer IP traffic"
            elements = { 172.30.0.41 . tcp . 80 : goto service-ULMVA6XW-namespace1/service1/tcp/p80,
                         172.30.0.42 . tcp . 443 : goto service-42NFTM6N-namespace2/service2/tcp/p443,
                         172.30.0.43 . tcp . 80 : goto service-4AT6LBPK-namespace3/service3/tcp/p80,
                         ... }
        }

        # Now we just need a single rule to process all packets matching an
        # element in the map. (This rule says, "construct a tuple from the
        # destination IP address, layer 4 protocol, and destination port; look
        # that tuple up in "service-ips"; and if there's a match, execute the
        # associated verdict".)
        chain services {
            ip daddr . meta l4proto . th dport vmap @service-ips
        }

        ...
    }

    Since there's only a single rule, with a roughly O(1) map lookup, packet processing time is more or less constant regardless of cluster size, and the best/average/worst cases are very similar:

    But note the huge difference in the vertical scale between the iptables and nftables graphs! In the clusters with 5000 and 10,000 Services, the p50 (average) latency for nftables is about the same as the p01 (approximately best-case) latency for iptables. In the 30,000 Service cluster, the p99 (approximately worst-case) latency for nftables manages to beat out the p01 latency for iptables by a few microseconds! Here are both sets of data together, but you may have to squint to see the nftables results!

    Why nftables? Part 2: control plane latency

    While the improvements to data plane latency in large clusters are great, there's another problem with iptables kube-proxy that often keeps users from even being able to grow their clusters to that size: the time it takes kube-proxy to program new iptables rules when Services and their endpoints change.

    With both iptables and nftables, the total size of the ruleset as a whole (actual rules, plus associated data) is O(n) in the combined number of Services and their endpoints. Originally, the iptables backend would rewrite every rule on every update, and with tens of thousands of Services, this could grow to be hundreds of thousands of iptables rules. Starting in Kubernetes 1.26, we began improving kube-proxy so that it could skip updating most of the unchanged rules in each update, but the limitations of iptables-restore as an API meant that it was still always necessary to send an update that's O(n) in the number of Services (though with a noticeably smaller constant than it used to be). Even with those optimizations, it can still be necessary to make use of kube-proxy's minSyncPeriod config option to ensure that it doesn't spend every waking second trying to push iptables updates.

    The nftables APIs allow for doing much more incremental updates, and when kube-proxy in nftables mode does an update, the size of the update is only O(n) in the number of Services and endpoints that have changed since the last sync, regardless of the total number of Services and endpoints. The fact that the nftables API allows each nftables-using component to have its own private table also means that there is no global lock contention between components like with iptables. As a result, kube-proxy's nftables updates can be done much more efficiently than with iptables.

    (Unfortunately I don't have cool graphs for this part.)

    Why not nftables?

    All that said, there are a few reasons why you might not want to jump right into using the nftables backend for now.

    First, the code is still fairly new. While it has plenty of unit tests, performs correctly in our CI system, and has now been used in the real world by multiple users, it has not seen anything close to as much real-world usage as the iptables backend has, so we can't promise that it is as stable and bug-free.

    Second, the nftables mode will not work on older Linux distributions; currently it requires a 5.13 or newer kernel. Additionally, because of bugs in early versions of the nft command line tool, you should not run kube-proxy in nftables mode on nodes that have an old (earlier than 1.0.0) version of nft in the host filesystem (or else kube-proxy's use of nftables may interfere with other uses of nftables on the system).

    Third, you may have other networking components in your cluster, such as the pod network or NetworkPolicy implementation, that do not yet support kube-proxy in nftables mode. You should consult the documentation (or forums, bug tracker, etc.) for any such components to see if they have problems with nftables mode. (In many cases they will not; as long as they don't try to directly interact with or override kube-proxy's iptables rules, they shouldn't care whether kube-proxy is using iptables or nftables.) Additionally, observability and monitoring tools that have not been updated may report less data for kube-proxy in nftables mode than they do for kube-proxy in iptables mode.

    Finally, kube-proxy in nftables mode is intentionally not 100% compatible with kube-proxy in iptables mode. There are a few old kube-proxy features whose default behaviors are less secure, less performant, or less intuitive than we'd like, but where we felt that changing the default would be a compatibility break. Since the nftables mode is opt-in, this gave us a chance to fix those bad defaults without breaking users who weren't expecting changes. (In particular, with nftables mode, NodePort Services are now only reachable on their nodes' default IPs, as opposed to being reachable on all IPs, including 127.0.0.1, with iptables mode.) The kube-proxy documentation has more information about this, including information about metrics you can look at to determine if you are relying on any of the changed functionality, and what configuration options are available to get more backward-compatible behavior.

    Trying out nftables mode

    Ready to try it out? In Kubernetes 1.31 and later, you just need to pass --proxy-mode nftables to kube-proxy (or set mode: nftables in your kube-proxy config file).
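
    If you prefer the config-file route, the only change needed is the mode field. Here is a minimal sketch of a KubeProxyConfiguration with just that field set (all other settings left at their defaults; how the file is delivered to kube-proxy depends on how your cluster is deployed):

    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    kind: KubeProxyConfiguration
    # Select the nftables backend instead of the default iptables mode.
    mode: "nftables"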

    If you are using kubeadm to set up your cluster, the kubeadm documentation explains how to pass a KubeProxyConfiguration to kubeadm init. You can also deploy nftables-based clusters with kind.
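
    For kind, the proxy mode can be set directly in the cluster configuration. This is a sketch assuming a kind release new enough to accept the nftables value for networking.kubeProxyMode; pass it to kind create cluster with the --config flag:

    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    networking:
      # Run kube-proxy in nftables mode in this kind cluster.
      kubeProxyMode: "nftables"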

    You can also convert existing clusters from iptables (or ipvs) mode to nftables by updating the kube-proxy configuration and restarting the kube-proxy pods. (You do not need to reboot the nodes: when restarting in nftables mode, kube-proxy will delete any existing iptables or ipvs rules, and likewise, if you later revert back to iptables or ipvs mode, it will delete any existing nftables rules.)

    Future plans

    As mentioned above, while nftables is now the best kube-proxy mode, it is not the default, and we do not yet have a plan for changing that. We will continue to support the iptables mode for a long time.

    The future of the IPVS mode of kube-proxy is less certain: its main advantage over iptables was that it was faster, but certain aspects of the IPVS architecture and APIs were awkward for kube-proxy's purposes (for example, the fact that the kube-ipvs0 device needs to have every Service IP address assigned to it), and some parts of Kubernetes Service proxying semantics were difficult to implement using IPVS (particularly the fact that some Services had to have different endpoints depending on whether you connected to them from a local or remote client). And now, the nftables mode has the same performance as IPVS mode (actually, slightly better), without any of the downsides:

    (In theory the IPVS mode also has the advantage of being able to use various other IPVS functionality, like alternative "schedulers" for balancing endpoints. In practice, this ended up not being very useful, because kube-proxy runs independently on every node, and the IPVS schedulers on each node had no way of sharing their state with the proxies on other nodes, thus thwarting the effort to balance traffic more cleverly.)

    While the Kubernetes project does not have an immediate plan to drop the IPVS backend, it is probably doomed in the long run, and people who are currently using IPVS mode should try out the nftables mode instead (and file bugs if you think there is missing functionality in nftables mode that you can't work around).

    Learn more