Kubernetes News

-
Spotlight on SIG Apps
In our ongoing SIG Spotlight series, we dive into the heart of the Kubernetes project by talking to the leaders of its various Special Interest Groups (SIGs). This time, we focus on SIG Apps, the group responsible for everything related to developing, deploying, and operating applications on Kubernetes. Sandipan Panda (DevZero) had the opportunity to interview Maciej Szulik (Defense Unicorns) and Janet Kuo (Google), the chairs and tech leads of SIG Apps. They shared their experiences, challenges, and visions for the future of application management within the Kubernetes ecosystem.
Introductions
Sandipan: Hello, could you start by telling us a bit about yourself, your role, and your journey within the Kubernetes community that led to your current roles in SIG Apps?
Maciej: Hey, my name is Maciej, and I’m one of the leads for SIG Apps. Aside from this role, you can also find me helping SIG CLI and also being one of the Steering Committee members. I’ve been contributing to Kubernetes since late 2014 in various areas, including controllers, apiserver, and kubectl.
Janet: Certainly! I'm Janet, a Staff Software Engineer at Google, and I've been deeply involved with the Kubernetes project since its early days, even before the 1.0 launch in 2015. It's been an amazing journey!
My current role within the Kubernetes community is one of the chairs and tech leads of SIG Apps. My journey with SIG Apps started organically. I started with building the Deployment API and adding rolling update functionalities. I naturally gravitated towards SIG Apps and became increasingly involved. Over time, I took on more responsibilities, culminating in my current leadership roles.
About SIG Apps
All following answers were jointly provided by Maciej and Janet.
Sandipan: For those unfamiliar, could you provide an overview of SIG Apps' mission and objectives? What key problems does it aim to solve within the Kubernetes ecosystem?
As described in our charter, we cover a broad area related to developing, deploying, and operating applications on Kubernetes. That, in short, means we’re open to each and everyone showing up at our bi-weekly meetings and discussing the ups and downs of writing and deploying various applications on Kubernetes.
Sandipan: What are some of the most significant projects or initiatives currently being undertaken by SIG Apps?
At this point in time, the main factors driving the development of our controllers are the challenges coming from running various AI-related workloads. It’s worth giving credit here to two working groups we’ve sponsored over the past years:
- The Batch Working Group, which is looking at running HPC, AI/ML, and data analytics jobs on top of Kubernetes.
- The Serving Working Group, which is focusing on hardware-accelerated AI/ML inference.
Best practices and challenges
Sandipan: SIG Apps plays a crucial role in developing application management best practices for Kubernetes. Can you share some of these best practices and how they help improve application lifecycle management?
- Implementing health checks and readiness probes ensures that your applications are healthy and ready to serve traffic, leading to improved reliability and uptime. The above, combined with comprehensive logging, monitoring, and tracing solutions, will provide insights into your application's behavior, enabling you to identify and resolve issues quickly.
- Auto-scale your application based on resource utilization or custom metrics, optimizing resource usage and ensuring your application can handle varying loads (a minimal probe-and-autoscaler sketch follows this list).
- Use Deployment for stateless applications, StatefulSet for stateful applications, Job and CronJob for batch workloads, and DaemonSet for running a daemon on each node. Use Operators and CRDs to extend the Kubernetes API to automate the deployment, management, and lifecycle of complex applications, making them easier to operate and reducing manual intervention.
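To make the probe and autoscaling advice concrete, here is a minimal, hypothetical sketch (the names, image, port, and health endpoints are placeholders, not something discussed in the interview): a Deployment whose container declares readiness and liveness probes, paired with a HorizontalPodAutoscaler that scales it on CPU utilization.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: registry.example.com/web:1.0   # hypothetical image
        ports:
        - containerPort: 8080
        readinessProbe:            # only receive traffic once the app reports ready
          httpGet:
            path: /healthz/ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:             # restart the container if it stops responding
          httpGet:
            path: /healthz/live
            port: 8080
          periodSeconds: 20
        resources:
          requests:                # requests are needed for CPU-based autoscaling
            cpu: 100m
            memory: 128Mi
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Applied together (for example with kubectl apply -f), the probes gate traffic and restarts while the autoscaler keeps the replica count between 2 and 10 based on observed CPU.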
Sandipan: What are some of the common challenges SIG Apps faces, and how do you address them?
The biggest challenge we’re facing all the time is the need to reject a lot of features, ideas, and improvements. This requires a lot of discipline and patience to be able to explain the reasons behind those decisions.
Sandipan: How has the evolution of Kubernetes influenced the work of SIG Apps? Are there any recent changes or upcoming features in Kubernetes that you find particularly relevant or beneficial for SIG Apps?
The main benefit for both us and the whole community around SIG Apps is the ability to extend Kubernetes with Custom Resource Definitions, and the fact that users can build their own custom controllers, leveraging the built-in ones, to achieve whatever sophisticated use cases they might have that we, as the core maintainers, haven't considered or weren't able to efficiently resolve inside Kubernetes.
Contributing to SIG Apps
Sandipan: What opportunities are available for new contributors who want to get involved with SIG Apps, and what advice would you give them?
We get the question, "What good first issue might you recommend we start with?" a lot :-) But unfortunately, there’s no easy answer to it. We always tell everyone that the best option to start contributing to core controllers is to find one you are willing to spend some time with. Read through the code, then try running unit tests and integration tests focusing on that controller. Once you grasp the general idea, try breaking it and running the tests again to verify your breakage. Once you start feeling confident you understand that particular controller, you may want to search through open issues affecting that controller and either provide suggestions, explaining the problem users have, or maybe attempt your first fix.
Like we said, there are no shortcuts on that road; you need to spend the time with the codebase to understand all the edge cases we’ve slowly built up to get to the point where we are. Once you’re successful with one controller, you’ll need to repeat that same process with others all over again.
Sandipan: How does SIG Apps gather feedback from the community, and how is this feedback integrated into your work?
We always encourage everyone to show up and present their problems and solutions during our bi-weekly meetings. As long as you’re solving an interesting problem on top of Kubernetes and you can provide valuable feedback about any of the core controllers, we’re always happy to hear from everyone.
Looking ahead
Sandipan: Looking ahead, what are the key focus areas or upcoming trends in application management within Kubernetes that SIG Apps is excited about? How is the SIG adapting to these trends?
Definitely the current AI hype is the major driving factor; as mentioned above, we have two working groups, each covering a different aspect of it.
Sandipan: What are some of your favorite things about this SIG?
Without a doubt, the people who participate in our meetings and on Slack, who tirelessly help triage issues and pull requests, and who invest a lot of their time (very frequently their private time) into making Kubernetes great!
SIG Apps is an essential part of the Kubernetes community, helping to shape how applications are deployed and managed at scale. From its work on improving Kubernetes' workload APIs to driving innovation in AI/ML application management, SIG Apps is continually adapting to meet the needs of modern application developers and operators. Whether you’re a new contributor or an experienced developer, there’s always an opportunity to get involved and make an impact.
If you’re interested in learning more or contributing to SIG Apps, be sure to check out their SIG README and join their bi-weekly meetings.
-
Spotlight on SIG etcd
In this SIG etcd spotlight we talked with James Blair, Marek Siarkowicz, Wenjia Zhang, and Benjamin Wang to learn a bit more about this Kubernetes Special Interest Group.
Introducing SIG etcd
Frederico: Hello, thank you for the time! Let’s start with some introductions, could you tell us a bit about yourself, your role and how you got involved in Kubernetes.
Benjamin: Hello, I am Benjamin. I am a SIG etcd Tech Lead and one of the etcd maintainers. I work for VMware, which is part of the Broadcom group. I got involved in Kubernetes & etcd & CSI (Container Storage Interface) because of work and also a big passion for open source. I have been working on Kubernetes & etcd (and also CSI) since 2020.
James: Hey team, I’m James, a co-chair for SIG etcd and etcd maintainer. I work at Red Hat as a Specialist Architect helping people adopt cloud native technology. I got involved with the Kubernetes ecosystem in 2019. Around the end of 2022 I noticed how the etcd community and project needed help so started contributing as often as I could. There is a saying in our community that "you come for the technology, and stay for the people": for me this is absolutely real, it’s been a wonderful journey so far and I’m excited to support our community moving forward.
Marek: Hey everyone, I'm Marek, the SIG etcd lead. At Google, I lead the GKE etcd team, ensuring a stable and reliable experience for all GKE users. My Kubernetes journey began with SIG Instrumentation, where I created and led the Kubernetes Structured Logging effort.
I'm still the main project lead for Kubernetes Metrics Server, providing crucial signals for autoscaling in Kubernetes. I started working on etcd 3 years ago, right around the 3.5 release. We faced some challenges, but I'm thrilled to see etcd now the most scalable and reliable it's ever been, with the highest contribution numbers in the project's history. I'm passionate about distributed systems, extreme programming, and testing.
Wenjia: Hi there, my name is Wenjia, I am the co-chair of SIG etcd and one of the etcd maintainers. I work at Google as an Engineering Manager, working on GKE (Google Kubernetes Engine) and GDC (Google Distributed Cloud). I have been working in the area of open source Kubernetes and etcd since the Kubernetes v1.10 and etcd v3.1 releases. I got involved in Kubernetes because of my job, but what keeps me in the space is the charm of the container orchestration technology, and more importantly, the awesome open source community.
Becoming a Kubernetes Special Interest Group (SIG)
Frederico: Excellent, thank you. I'd like to start with the origin of the SIG itself: SIG etcd is a very recent SIG, could you quickly go through the history and reasons behind its creation?
Marek: Absolutely! SIG etcd was formed because etcd is a critical component of Kubernetes, serving as its data store. However, etcd was facing challenges like maintainer turnover and reliability issues. Creating a dedicated SIG allowed us to focus on addressing these problems, improving development and maintenance processes, and ensuring etcd evolves in sync with the cloud-native landscape.
Frederico: And has becoming a SIG worked out as expected? Better yet, are the motivations you just described being addressed, and to what extent?
Marek: It's been a positive change overall. Becoming a SIG has brought more structure and transparency to etcd's development. We've adopted Kubernetes processes like KEPs (Kubernetes Enhancement Proposals) and PRRs (Production Readiness Reviews), which has improved our feature development and release cycle.
Frederico: On top of those, what would you single out as the major benefit that has resulted from becoming a SIG?
Marek: The biggest benefit for me was adopting the Kubernetes testing infrastructure, tools like Prow and TestGrid. For large projects like etcd there is just no comparison to the default GitHub tooling. Having known, easy-to-use, clear tools is a major boost to etcd, as it makes it much easier for Kubernetes contributors to also help etcd.
Wenjia: Totally agree, while challenges remain, the SIG structure provides a solid foundation for addressing them and ensuring etcd's continued success as a critical component of the Kubernetes ecosystem.
The positive impact on the community is another crucial aspect of SIG etcd's success that I’d like to highlight. The Kubernetes SIG structure has created a welcoming environment for etcd contributors, leading to increased participation from the broader Kubernetes community. We have had greater collaboration with other SIGs like SIG API Machinery, SIG Scalability, SIG Testing, SIG Cluster Lifecycle, etc.
This collaboration helps ensure etcd's development aligns with the needs of the wider Kubernetes ecosystem. The formation of the etcd Operator Working Group under the joint effort between SIG etcd and SIG Cluster Lifecycle exemplifies this successful collaboration, demonstrating a shared commitment to improving etcd's operational aspects within Kubernetes.
Frederico: Since you mentioned collaboration, have you seen changes in terms of contributors and community involvement in recent months?
James: Yes -- as shown in our unique PR author data, we recently hit an all-time high in March and are trending in a positive direction:
Additionally, looking at our overall contributions across all etcd project repositories we are also observing a positive trend showing a resurgence in etcd project activity:
The road ahead
Frederico: That's quite telling, thank you. In terms of the near future, what are the current priorities for SIG etcd?
Marek: Reliability is always top of mind -– we need to make sure etcd is rock-solid. We're also working on making etcd easier to use and manage for operators. And we have our sights set on making etcd a viable standalone solution for infrastructure management, not just for Kubernetes. Oh, and of course, scaling -– we need to ensure etcd can handle the growing demands of the cloud-native world.
Benjamin: I agree that reliability should always be our top guiding principle. We need to ensure not only correctness but also compatibility. Additionally, we should continuously strive to improve the understandability and maintainability of etcd. Our focus should be on addressing the pain points that the community cares about the most.
Frederico: Are there any specific SIGs that you work closely with?
Marek: SIG API Machinery, for sure – they own the structure of the data etcd stores, so we're constantly working together. And SIG Cluster Lifecycle – etcd is a key part of Kubernetes clusters, so we collaborate on the newly created etcd operator Working group.
Wenjia: Other than SIG API Machinery and SIG Cluster Lifecycle that Marek mentioned above, SIG Scalability and SIG Testing is another group that we work closely with.
Frederico: In a more general sense, how would you list the key challenges for SIG etcd in the evolving cloud native landscape?
Marek: Well, reliability is always a challenge when you're dealing with critical data. The cloud-native world is evolving so fast that scaling to meet those demands is a constant effort.
Getting involved
Frederico: We're almost at the end of our conversation, but for those interested in etcd, how can they get involved?
Marek: We'd love to have them! The best way to start is to join our SIG etcd meetings, follow discussions on the etcd-dev mailing list, and check out our GitHub issues. We're always looking for people to review proposals, test code, and contribute to documentation.
Wenjia: I love this question 😀 . There are numerous ways for people interested in contributing to SIG etcd to get involved and make a difference. Here are some key areas where you can help:
Code Contributions:
- Bug Fixes: Tackle existing issues in the etcd codebase. Start with issues labeled "good first issue" or "help wanted" to find tasks that are suitable for newcomers.
- Feature Development: Contribute to the development of new features and enhancements. Check the etcd roadmap and discussions to see what's being planned and where your skills might fit in.
- Testing and Code Reviews: Help ensure the quality of etcd by writing tests, reviewing code changes, and providing feedback.
- Documentation: Improve etcd's documentation by adding new content, clarifying existing information, or fixing errors. Clear and comprehensive documentation is essential for users and contributors.
- Community Support: Answer questions on forums, mailing lists, or Slack channels. Helping others understand and use etcd is a valuable contribution.
Getting Started:
- Join the community: Start by joining the etcd community on Slack, attending SIG meetings, and following the mailing lists. This will help you get familiar with the project, its processes, and the people involved.
- Find a mentor: If you're new to open source or etcd, consider finding a mentor who can guide you and provide support. Stay tuned! Our first mentorship cohort was very successful, and a new round of the mentorship program is coming up.
- Start small: Don't be afraid to start with small contributions. Even fixing a typo in the documentation or submitting a simple bug fix can be a great way to get involved.
By contributing to etcd, you'll not only be helping to improve a critical piece of the cloud-native ecosystem but also gaining valuable experience and skills. So, jump in and start contributing!
Frederico: Excellent, thank you. Lastly, one piece of advice that you'd like to give to other newly formed SIGs?
Marek: Absolutely! My advice would be to embrace the established processes of the larger community, prioritize collaboration with other SIGs, and focus on building a strong community.
Wenjia: Here are some tips I myself found very helpful in my OSS journey:
- Be patient: Open source development can take time. Don't get discouraged if your contributions aren't accepted immediately or if you encounter challenges.
- Be respectful: The etcd community values collaboration and respect. Be mindful of others' opinions and work together to achieve common goals.
- Have fun: Contributing to open source should be enjoyable. Find areas that interest you and contribute in ways that you find fulfilling.
Frederico: A great way to end this spotlight, thank you all!
For more information and resources, please take a look at :
- etcd website: https://etcd.io/
- etcd GitHub repository: https://github.com/etcd-io/etcd
- etcd community: https://etcd.io/community/
-
NFTables mode for kube-proxy
A new nftables mode for kube-proxy was introduced as an alpha feature in Kubernetes 1.29. Currently in beta, it is expected to be GA as of 1.33. The new mode fixes long-standing performance problems with the iptables mode and all users running on systems with reasonably-recent kernels are encouraged to try it out. (For compatibility reasons, even once nftables becomes GA, iptables will still be the default.)
Why nftables? Part 1: data plane latency
The iptables API was designed for implementing simple firewalls, and has problems scaling up to support Service proxying in a large Kubernetes cluster with tens of thousands of Services.
In general, the ruleset generated by kube-proxy in iptables mode has a number of iptables rules proportional to the sum of the number of Services and the total number of endpoints. In particular, at the top level of the ruleset, there is one rule to test each possible Service IP (and port) that a packet might be addressed to:
# If the packet is addressed to 172.30.0.41:80, then jump to the chain
# KUBE-SVC-XPGD46QRK7WJZT7O for further processing
-A KUBE-SERVICES -m comment --comment "namespace1/service1:p80 cluster IP" -m tcp -p tcp -d 172.30.0.41 --dport 80 -j KUBE-SVC-XPGD46QRK7WJZT7O

# If the packet is addressed to 172.30.0.42:443, then...
-A KUBE-SERVICES -m comment --comment "namespace2/service2:p443 cluster IP" -m tcp -p tcp -d 172.30.0.42 --dport 443 -j KUBE-SVC-GNZBNJ2PO5MGZ6GT

# etc...
-A KUBE-SERVICES -m comment --comment "namespace3/service3:p80 cluster IP" -m tcp -p tcp -d 172.30.0.43 --dport 80 -j KUBE-SVC-X27LE4BHSL4DOUIK
This means that when a packet comes in, the time it takes the kernel to check it against all of the Service rules is O(n) in the number of Services. As the number of Services increases, both the average and the worst-case latency for the first packet of a new connection increases (with the difference between best-case, average, and worst-case being mostly determined by whether a given Service IP address appears earlier or later in the KUBE-SERVICES chain).
By contrast, with nftables, the normal way to write a ruleset like this is to have a single rule, using a "verdict map" to do the dispatch:
table ip kube-proxy {
    # The service-ips verdict map indicates the action to take for each matching packet.
    map service-ips {
        type ipv4_addr . inet_proto . inet_service : verdict
        comment "ClusterIP, ExternalIP and LoadBalancer IP traffic"
        elements = { 172.30.0.41 . tcp . 80 : goto service-ULMVA6XW-namespace1/service1/tcp/p80,
                     172.30.0.42 . tcp . 443 : goto service-42NFTM6N-namespace2/service2/tcp/p443,
                     172.30.0.43 . tcp . 80 : goto service-4AT6LBPK-namespace3/service3/tcp/p80,
                     ... }
    }

    # Now we just need a single rule to process all packets matching an
    # element in the map. (This rule says, "construct a tuple from the
    # destination IP address, layer 4 protocol, and destination port; look
    # that tuple up in "service-ips"; and if there's a match, execute the
    # associated verdict.)
    chain services {
        ip daddr . meta l4proto . th dport vmap @service-ips
    }
    ...
}
Since there's only a single rule, with a roughly O(1) map lookup, packet processing time is more or less constant regardless of cluster size, and the best/average/worst cases are very similar:
But note the huge difference in the vertical scale between the iptables and nftables graphs! In the clusters with 5,000 and 10,000 Services, the p50 (median) latency for nftables is about the same as the p01 (approximately best-case) latency for iptables. In the 30,000-Service cluster, the p99 (approximately worst-case) latency for nftables manages to beat the p01 latency for iptables by a few microseconds! Here are both sets of data together, but you may have to squint to see the nftables results!
Why nftables? Part 2: control plane latency
While the improvements to data plane latency in large clusters are great, there's another problem with iptables kube-proxy that often keeps users from even being able to grow their clusters to that size: the time it takes kube-proxy to program new iptables rules when Services and their endpoints change.
With both iptables and nftables, the total size of the ruleset as a whole (actual rules, plus associated data) is O(n) in the combined number of Services and their endpoints. Originally, the iptables backend would rewrite every rule on every update, and with tens of thousands of Services, this could grow to be hundreds of thousands of iptables rules. Starting in Kubernetes 1.26, we began improving kube-proxy so that it could skip updating most of the unchanged rules in each update, but the limitations of iptables-restore as an API meant that it was still always necessary to send an update that's O(n) in the number of Services (though with a noticeably smaller constant than it used to be). Even with those optimizations, it can still be necessary to make use of kube-proxy's minSyncPeriod config option to ensure that it doesn't spend every waking second trying to push iptables updates.
The nftables APIs allow for doing much more incremental updates, and when kube-proxy in nftables mode does an update, the size of the update is only O(n) in the number of Services and endpoints that have changed since the last sync, regardless of the total number of Services and endpoints. The fact that the nftables API allows each nftables-using component to have its own private table also means that there is no global lock contention between components like with iptables. As a result, kube-proxy's nftables updates can be done much more efficiently than with iptables.
(Unfortunately I don't have cool graphs for this part.)
Why not nftables?
All that said, there are a few reasons why you might not want to jump right into using the nftables backend for now.
First, the code is still fairly new. While it has plenty of unit tests, performs correctly in our CI system, and has now been used in the real world by multiple users, it has not seen anything close to as much real-world usage as the iptables backend has, so we can't promise that it is as stable and bug-free.
Second, the nftables mode will not work on older Linux distributions; currently it requires a 5.13 or newer kernel. Additionally, because of bugs in early versions of the nft command line tool, you should not run kube-proxy in nftables mode on nodes that have an old (earlier than 1.0.0) version of nft in the host filesystem (or else kube-proxy's use of nftables may interfere with other uses of nftables on the system).
Third, you may have other networking components in your cluster, such as the pod network or NetworkPolicy implementation, that do not yet support kube-proxy in nftables mode. You should consult the documentation (or forums, bug tracker, etc.) for any such components to see if they have problems with nftables mode. (In many cases they will not; as long as they don't try to directly interact with or override kube-proxy's iptables rules, they shouldn't care whether kube-proxy is using iptables or nftables.) Additionally, observability and monitoring tools that have not been updated may report less data for kube-proxy in nftables mode than they do for kube-proxy in iptables mode.
Finally, kube-proxy in nftables mode is intentionally not 100% compatible with kube-proxy in iptables mode. There are a few old kube-proxy features whose default behaviors are less secure, less performant, or less intuitive than we'd like, but where we felt that changing the default would be a compatibility break. Since the nftables mode is opt-in, this gave us a chance to fix those bad defaults without breaking users who weren't expecting changes. (In particular, with nftables mode, NodePort Services are now only reachable on their nodes' default IPs, as opposed to being reachable on all IPs, including 127.0.0.1, with iptables mode.) The kube-proxy documentation has more information about this, including information about metrics you can look at to determine if you are relying on any of the changed functionality, and what configuration options are available to get more backward-compatible behavior.
Trying out nftables mode
Ready to try it out? In Kubernetes 1.31 and later, you just need to pass --proxy-mode nftables to kube-proxy (or set mode: nftables in your kube-proxy config file).
If you are using kubeadm to set up your cluster, the kubeadm documentation explains how to pass a KubeProxyConfiguration to kubeadm init. You can also deploy nftables-based clusters with kind.
You can also convert existing clusters from iptables (or ipvs) mode to nftables by updating the kube-proxy configuration and restarting the kube-proxy pods. (You do not need to reboot the nodes: when restarting in nftables mode, kube-proxy will delete any existing iptables or ipvs rules, and likewise, if you later revert back to iptables or ipvs mode, it will delete any existing nftables rules.)
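For reference, a minimal KubeProxyConfiguration that selects the nftables backend might look like the sketch below; with kubeadm, such a document can be included in the YAML file passed to kubeadm init --config. Treat it as an illustrative starting point under the assumptions stated here, not a complete configuration.

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
# Select the nftables backend instead of the default iptables backend.
mode: "nftables"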
Future plans
As mentioned above, while nftables is now the best kube-proxy mode, it is not the default, and we do not yet have a plan for changing that. We will continue to support the iptables mode for a long time.
The future of the IPVS mode of kube-proxy is less certain: its main advantage over iptables was that it was faster, but certain aspects of the IPVS architecture and APIs were awkward for kube-proxy's purposes (for example, the fact that the kube-ipvs0 device needs to have every Service IP address assigned to it), and some parts of Kubernetes Service proxying semantics were difficult to implement using IPVS (particularly the fact that some Services had to have different endpoints depending on whether you connected to them from a local or remote client). And now, the nftables mode has the same performance as IPVS mode (actually, slightly better), without any of the downsides:
(In theory the IPVS mode also has the advantage of being able to use various other IPVS functionality, like alternative "schedulers" for balancing endpoints. In practice, this ended up not being very useful, because kube-proxy runs independently on every node, and the IPVS schedulers on each node had no way of sharing their state with the proxies on other nodes, thus thwarting the effort to balance traffic more cleverly.)
While the Kubernetes project does not have an immediate plan to drop the IPVS backend, it is probably doomed in the long run, and people who are currently using IPVS mode should try out the nftables mode instead (and file bugs if you think there is missing functionality in nftables mode that you can't work around).
Learn more
- "KEP-3866: Add an nftables-based kube-proxy backend" has the history of the new feature.
- "How the Tables Have Turned: Kubernetes Says Goodbye to IPTables", from KubeCon/CloudNativeCon North America 2024, talks about porting kube-proxy and Calico from iptables to nftables.
- "From Observability to Performance", from KubeCon/CloudNativeCon North America 2024. (This is where the kube-proxy latency data came from; the raw data for the charts is also available.)
-
The Cloud Controller Manager Chicken and Egg Problem
Kubernetes 1.31 completed the largest migration in Kubernetes history, removing the in-tree cloud provider. While the component migration is now done, this leaves some additional complexity for users and installer projects (for example, kOps or Cluster API). We will go over those additional steps and failure points and make recommendations for cluster owners. This migration was complex, and some logic had to be extracted from the core components, building four new subsystems.
- Cloud controller manager (KEP-2392)
- API server network proxy (KEP-1281)
- kubelet credential provider plugins (KEP-2133)
- Storage migration to use CSI (KEP-625)
The cloud controller manager is part of the control plane. It is a critical component that replaces some functionality that existed previously in the kube-controller-manager and the kubelet.
Components of Kubernetes
One of the most critical functionalities of the cloud controller manager is the node controller, which is responsible for the initialization of the nodes.
As you can see in the following diagram, when the kubelet starts, it registers the Node object with the apiserver, tainting the node so it can be processed first by the cloud-controller-manager. The initial Node is missing the cloud-provider-specific information, like the node addresses and the labels carrying cloud-provider details such as the zone, region, and instance type.
Chicken and egg problem sequence diagram
This new initialization process adds some latency to node readiness. Previously, the kubelet was able to initialize the node at the same time it created the node. Since the logic has moved to the cloud-controller-manager, this can cause a chicken-and-egg problem during cluster bootstrapping for those Kubernetes architectures that do not deploy the cloud-controller-manager in the same way as the other components of the control plane, commonly as static pods, standalone binaries, or daemonsets/deployments with tolerations to the taints and using hostNetwork (more on this below).
Examples of the dependency problem
As noted above, it is possible during bootstrapping for the cloud-controller-manager to be unschedulable and as such the cluster will not initialize properly. The following are a few concrete examples of how this problem can be expressed and the root causes for why they might occur.
These examples assume you are running your cloud-controller-manager using a Kubernetes resource (e.g. Deployment, DaemonSet, or similar) to control its lifecycle. Because these methods rely on Kubernetes to schedule the cloud-controller-manager, care must be taken to ensure it will schedule properly.
Example: Cloud controller manager not scheduling due to uninitialized taint
As noted in the Kubernetes documentation, when the kubelet is started with the command line flag --cloud-provider=external, its corresponding Node object will have a no schedule taint named node.cloudprovider.kubernetes.io/uninitialized added. Because the cloud-controller-manager is responsible for removing the no schedule taint, this can create a situation where a cloud-controller-manager that is being managed by a Kubernetes resource, such as a Deployment or DaemonSet, may not be able to schedule.
If the cloud-controller-manager is not able to be scheduled during the initialization of the control plane, then the resulting Node objects will all have the node.cloudprovider.kubernetes.io/uninitialized no schedule taint. It also means that this taint will not be removed as the cloud-controller-manager is responsible for its removal. If the no schedule taint is not removed, then critical workloads, such as the container network interface controllers, will not be able to schedule, and the cluster will be left in an unhealthy state.
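For illustration, this is roughly what a freshly registered, not-yet-initialized Node looks like before the cloud-controller-manager has acted on it; the node name is hypothetical and only the relevant taint is shown.

apiVersion: v1
kind: Node
metadata:
  name: worker-0   # hypothetical node name
spec:
  taints:
  # Added by the kubelet when it starts with --cloud-provider=external;
  # the cloud-controller-manager is the component that removes it.
  - key: node.cloudprovider.kubernetes.io/uninitialized
    value: "true"
    effect: NoSchedule

Until that taint is removed, only Pods that explicitly tolerate it can schedule onto the node.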
Example: Cloud controller manager not scheduling due to not-ready taint
The next example would be possible in situations where the container network interface (CNI) is waiting for IP address information from the cloud-controller-manager (CCM), and the CCM has not tolerated the taint which would be removed by the CNI.
The Kubernetes documentation describes the node.kubernetes.io/not-ready taint as follows:
"The Node controller detects whether a Node is ready by monitoring its health and adds or removes this taint accordingly."
One of the conditions that can lead to a Node resource having this taint is when the container network has not yet been initialized on that node. As the cloud-controller-manager is responsible for adding the IP addresses to a Node resource, and the IP addresses are needed by the container network controllers to properly configure the container network, it is possible in some circumstances for a node to become stuck as not ready and uninitialized permanently.
This situation occurs for a similar reason as the first example, although in this case, the node.kubernetes.io/not-ready taint is used with the no execute effect and thus will cause the cloud-controller-manager not to run on the node with the taint. If the cloud-controller-manager is not able to execute, then it will not initialize the node. It will cascade into the container network controllers not being able to run properly, and the node will end up carrying both the node.cloudprovider.kubernetes.io/uninitialized and node.kubernetes.io/not-ready taints, leaving the cluster in an unhealthy state.
Our Recommendations
There is no one “correct way” to run a cloud-controller-manager. The details will depend on the specific needs of the cluster administrators and users. When planning your clusters and the lifecycle of the cloud-controller-managers, please consider the following guidance:
For cloud-controller-managers running in the same cluster they are managing:
- Use host network mode, rather than the pod network: in most cases, a cloud controller manager will need to communicate with an API service endpoint associated with the infrastructure. Setting “hostNetwork” to true will ensure that the cloud controller is using the host networking instead of the container network and, as such, will have the same network access as the host operating system. It will also remove the dependency on the networking plugin. This will ensure that the cloud controller has access to the infrastructure endpoint (always check your networking configuration against your infrastructure provider’s instructions).
- Use a scalable resource type. Deployments and DaemonSets are useful for controlling the lifecycle of a cloud controller. They allow easy access to running multiple copies for redundancy as well as using the Kubernetes scheduling to ensure proper placement in the cluster. When using these primitives to control the lifecycle of your cloud controllers and running multiple replicas, you must remember to enable leader election, or else your controllers will collide with each other, which could lead to nodes not being initialized in the cluster.
- Target the controller manager containers to the control plane. There might exist other controllers which need to run outside the control plane (for example, Azure’s node manager controller). Still, the controller managers themselves should be deployed to the control plane. Use a node selector or affinity stanza to direct the scheduling of cloud controllers to the control plane to ensure that they are running in a protected space. Cloud controllers are vital to adding and removing nodes to a cluster as they form a link between Kubernetes and the physical infrastructure. Running them on the control plane will help to ensure that they run with a similar priority as other core cluster controllers and that they have some separation from non-privileged user workloads.
- It is worth noting that an anti-affinity stanza to prevent cloud controllers from running on the same host is also very useful to ensure that a single node failure will not degrade the cloud controller performance.
- Ensure that the tolerations allow operation. Use tolerations on the manifest for the cloud controller container to ensure that it will schedule to the correct nodes and that it can run in situations where a node is initializing. This means that cloud controllers should tolerate the node.cloudprovider.kubernetes.io/uninitialized taint, and they should also tolerate any taints associated with the control plane (for example, node-role.kubernetes.io/control-plane or node-role.kubernetes.io/master). It can also be useful to tolerate the node.kubernetes.io/not-ready taint to ensure that the cloud controller can run even when the node is not yet available for health monitoring.
For cloud-controller-managers that will not be running on the cluster they manage (for example, in a hosted control plane on a separate cluster), the rules are much more constrained by the dependencies of the environment of the cluster running the cloud-controller-manager. The advice for running on a self-managed cluster may not be appropriate, as the types of conflicts and network constraints will be different. Please consult the architecture and requirements of your topology for these scenarios.
Example
This is an example of a Kubernetes Deployment highlighting the guidance shown above. It is important to note that this is for demonstration purposes only, for production uses please consult your cloud provider’s documentation.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: cloud-controller-manager
  name: cloud-controller-manager
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: cloud-controller-manager
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: cloud-controller-manager
      annotations:
        kubernetes.io/description: Cloud controller manager for my infrastructure
    spec:
      containers:
      # the container details will depend on your specific cloud controller manager
      - name: cloud-controller-manager
        command:
        - /bin/my-infrastructure-cloud-controller-manager
        - --leader-elect=true
        - -v=1
        image: registry/my-infrastructure-cloud-controller-manager@latest
        resources:
          requests:
            cpu: 200m
            memory: 50Mi
      hostNetwork: true
      # these Pods are part of the control plane
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: "kubernetes.io/hostname"
            labelSelector:
              matchLabels:
                app.kubernetes.io/name: cloud-controller-manager
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 120
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 120
      - effect: NoSchedule
        key: node.cloudprovider.kubernetes.io/uninitialized
        operator: Exists
      - effect: NoSchedule
        key: node.kubernetes.io/not-ready
        operator: Exists
When deciding how to deploy your cloud controller manager it is worth noting that cluster-proportional, or resource-based, pod autoscaling is not recommended. Running multiple replicas of a cloud controller manager is good practice for ensuring high-availability and redundancy, but does not contribute to better performance. In general, only a single instance of a cloud controller manager will be reconciling a cluster at any given time.
-
Spotlight on SIG Architecture: Enhancements
This is the fourth interview in a SIG Architecture Spotlight series covering its different subprojects; this time, we focus on SIG Architecture: Enhancements.
In this SIG Architecture spotlight we talked with Kirsten Garrison, lead of the Enhancements subproject.
The Enhancements subproject
Frederico (FSM): Hi Kirsten, very happy to have the opportunity to talk about the Enhancements subproject. Let's start with some quick information about yourself and your role.
Kirsten Garrison (KG): I’m a lead of the Enhancements subproject of SIG-Architecture and currently work at Google. I first got involved by contributing to the service-catalog project with the help of Carolyn Van Slyck. With time, I joined the Release team, eventually becoming the Enhancements Lead and a Release Lead shadow. While on the release team, I worked on some ideas to make the process better for the SIGs and Enhancements team (the opt-in process) based on my team’s experiences. Eventually, I started attending Subproject meetings and contributing to the Subproject’s work.
FSM: You mentioned the Enhancements subproject: how would you describe its main goals and areas of intervention?
KG: The Enhancements Subproject primarily concerns itself with the Kubernetes Enhancement Proposal (KEP for short)—the "design" documents required for all features and significant changes to the Kubernetes project.
The KEP and its impact
FSM: The improvement of the KEP process was (and is) one in which SIG Architecture was heavily involved. Could you explain the process to those that aren’t aware of it?
KG: Every release, the SIGs let the Release Team know which features they intend to work on to be put into the release. As mentioned above, the prerequisite for these changes is a KEP - a standardized design document that all authors must fill out and approve in the first weeks of the release cycle. Most features will move through 3 phases: alpha, beta, and finally GA, so approving a feature represents a significant commitment for the SIG.
The KEP serves as the full source of truth of a feature. The KEP template has different requirements based on what stage a feature is in, but it generally requires a detailed discussion of the design and the impact as well as providing artifacts of stability and performance. The KEP takes quite a bit of iterative work between authors, SIG reviewers, the API review team, and the Production Readiness Review team [1] before it is approved. Each set of reviewers is looking to make sure that the proposal meets their standards in order to have a stable and performant Kubernetes release. Only after all approvals are secured can an author go forth and merge their feature in the Kubernetes code base.
FSM: I see, quite a bit of additional structure was added. Looking back, what were the most significant improvements of that approach?
KG: In general, I think that the improvements with the most impact had to do with focusing on the core intent of the KEP. KEPs exist not just to memorialize designs, but provide a structured way to discuss and come to an agreement about different facets of the change. At the core of the KEP process is communication and consideration.
To that end, some of the significant changes revolve around a more detailed and accessible KEP template. A significant amount of work was put in over time to get the k/enhancements repo into its current form -- a directory structure organized by SIG with the contours of the modern KEP template (with Proposal/Motivation/Design Details subsections). We might take that basic structure for granted today, but it really represents the work of many people trying to get the foundation of this process in place over time.
As Kubernetes matures, we’ve needed to think about more than just the end goal of getting a single feature merged. We need to think about things like: stability, performance, setting and meeting user expectations. And as we’ve thought about those things the template has grown more detailed. The addition of the Production Readiness Review was major as well as the enhanced testing requirements (varying at different stages of a KEP’s lifecycle).
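For readers who have not browsed the k/enhancements repo, each KEP directory pairs the design document with a small metadata file. The following is an illustrative approximation of such a kep.yaml; the values are invented for this sketch, and the authoritative schema lives in the repo itself.

title: Illustrative example feature
kep-number: 0000              # invented number, for illustration only
authors:
  - "@example-author"
owning-sig: sig-example
status: implementable
stage: beta                   # alpha, beta, or stable
latest-milestone: "v1.32"
milestone:
  alpha: "v1.30"
  beta: "v1.32"
  stable: "v1.34"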
Current areas of focus
FSM: Speaking of maturing, we’ve recently released Kubernetes v1.31, and work on v1.32 has started. Are there any areas that the Enhancements sub-project is currently addressing that might change the way things are done?
KG: We’re currently working on two things:
- Creating a Process KEP template. Sometimes people want to harness the KEP process for significant changes that are more process oriented rather than feature oriented. We want to support this because memorializing changes is important and giving people a better tool to do so will only encourage more discussion and transparency.
- KEP versioning. While our template changes aim to be as non-disruptive as possible, we believe that it will be easier to track and communicate those changes to the community better with a versioned KEP template and the policies that go alongside such versioning.
Both features will take some time to get right and fully roll out (just like a KEP feature) but we believe that they will both provide improvements that will benefit the community at large.
FSM: You mentioned improvements: I remember when project boards for Enhancement tracking were introduced in recent releases, to great effect and unanimous applause from release team members. Was this a particular area of focus for the subproject?
KG: The Subproject provided support to the Release Team’s Enhancement team in the migration away from using the spreadsheet to a project board. The collection and tracking of enhancements has always been a logistical challenge. During my time on the Release Team, I helped with the transition to an opt-in system of enhancements, whereby the SIG leads "opt-in" KEPs for release tracking. This helped to enhance communication between authors and SIGs before any significant work was undertaken on a KEP and removed toil from the Enhancements team. This change used the existing tools to avoid introducing too many changes at once to the community. Later, the Release Team approached the Subproject with an idea of leveraging GitHub Project Boards to further improve the collection process. This was to be a move away from the use of complicated spreadsheets to using repo-native labels on k/enhancement issues and project boards.
FSM: That surely adds an impact on simplifying the workflow...
KG: Removing sources of friction and promoting clear communication is very important to the Enhancements Subproject. At the same time, it’s important to give careful consideration to decisions that impact the community as a whole. We want to make sure that changes are balanced to give an upside while not causing regressions or pain in the rollout. We supported the Release Team in ideation as well as through the actual migration to the project boards. It was a great success and exciting to see the team make high-impact changes that helped everyone involved in the KEP process!
Getting involved
FSM: For those reading that might be curious and interested in helping, how would you describe the required skills for participating in the sub-project?
KG: Familiarity with KEPs either via experience or taking time to look through the kubernetes/enhancements repo is helpful. All are welcome to participate if interested - we can take it from there.
FSM: Excellent! Many thanks for your time and insight -- any final comments you would like to share with our readers?
KG: The Enhancements process is one of the most important parts of Kubernetes and requires enormous amounts of coordination and collaboration of people and teams across the project to make it successful. I’m thankful and inspired by everyone’s continued hard work and dedication to making the project great. This is truly a wonderful community.
[1] For more information, check the Production Readiness Review spotlight interview in this series.