What I Learnt Fixing 50+ Broken Kubernetes Clusters
Speaker: David Flanagan
Summary
David Flanagan shares his experiences of fixing 100 broken Kubernetes clusters in a presentation. He founded the Rawkode Academy, a platform that started as a YouTube channel and later became a full-time venture focused on cloud-native software. David hosts a show called "Clustered", in which he takes on the challenge of fixing deliberately broken Kubernetes clusters. Through the show, he emphasizes the importance of learning from failures to better understand how systems operate. David also discusses various tools and techniques related to Kubernetes and Linux, such as eBPF, and recommends them for better system operation and monitoring.
Transcription
All right, I'm going to get started. So, this is what I learned fixing 100 broken Kubernetes clusters. I'm going to give you all an opportunity to laugh at me for a bit at my pain and anguish, and then I'm going to try and share some information with you about operating Kubernetes.
So, my name is David. I go by Rawkode across the internet. I'm the founder of the Rawkode Academy. It started out as a YouTube channel, but now, as of September, it's my full-time job, where I'm producing content on cloud-native software. I focus on open-source stuff. So, if you're deploying to Kubernetes, doing cloud native, or want to get better at observability, monitoring, GitOps, all those cool buzzwords, hopefully you'll find my channel quite useful.
However, today we're going to focus on a show that I have called Clustered. Clustered is a show where I fix broken Kubernetes clusters. Has anyone seen it before? Okay, a few hands. The rest of you are hopefully in for a little bit of a treat.
I also organized and founded KubeHuddle, which is a community-driven Kubernetes conference. We started in Edinburgh last year, and we'll be doing two this year: one in Toronto in May and again in Edinburgh in September. So, check out Kubehuddle.com if you want to learn more about the Kubernetes conference.
The Clustered concept on paper is really simple. You'll see that green bit of text at the bottom: all you have to do is upgrade a Kubernetes deployment, changing the tag on the image from v1 to v2. This should take you about 15 seconds. However, the clusters are given to people in advance, typically 48 hours, and they're allowed to do anything they want to that cluster to stop us from performing the upgrade. And we've learned a few things about how cruel people can be, essentially.
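For reference, the intended fix really is a one-liner — something like this, assuming a deployment and container both named `app` (the real names vary from episode to episode):

```bash
# The 15-second fix: bump the image tag on the target deployment.
kubectl set image deployment/app app=app:v2

# Then watch the rollout — on Clustered, this is where things go sideways.
kubectl rollout status deployment/app
```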
If you ever want to join me on Clustered and you like what you see here, it's really hard to find people that are willing to be vulnerable on a live stream, fixing a cluster they have no idea what's wrong with. If that sounds fun to you, definitely come and talk to me.
All right, as I said, people are mean. I'm going to start with a video, which is a couple of minutes long, and there's one other really short video afterwards. I want to give you a taste of what people do to break these clusters. In the first one, you're going to see me getting very frustrated. There are some swear words; I do apologize. The second video is a clip from our Teams Edition, which featured Red Hat and Talos, and there's a really cool lesson in that one.
“Yeah, we still have no cluster DNS? kube-dns has an endpoint... Okay. Let's see if... Yeah, I think this is the last break before we wrap this up. Are we close? 'You've not looked at the configuration of CoreDNS. You've looked at the pods, but not how the pods are configured to run the DNS.' We have both DNS... it keeps coming. All right, did we miss something? I'm going to jump onto the other cluster, grab the config back, and see what's different. What about...? Yeah, these are the exact same. He told us to look in the CoreDNS config, and there is nothing wrong with the CoreDNS config. That's just cruel. He's saying there's a difference of a few pixels... is this a whitespace error in the CoreDNS config? I will be getting frustrated soon.
It might be a space error! I am... He's mocking us now, because he said, 'You're jumping back and forth; notice a slight difference. This number seems arbitrary.' Yeah, I'm assuming there's some weird bug, and I'm sure he's found it, because he works at Skyscanner; they've got a level of skill that most people don't have. Maybe this number going so high is causing it to load an old ConfigMap into the pods or something. I don't really know, but the number being so high worries me. But we're now at the stage where I have to go and pick up my daughter, so I'm just gonna bring Guy in to tell us what wonderful magic he's done. 'When you were jumping back and forth between the working config and the non-working config, if you look really hard at the "c" in the cluster.local config...' Are you...? The 'c'? 'No, it's a character which looks a lot like a "c".' So CoreDNS became authoritative for something that looks like cluster.local, but with a lookalike character instead of the 'c'?”
Exactly.
I really hate that guy. That was a long episode; we spent nearly two hours trying to fix this last problem. What I love about that clip is that, I promise you, I'm quite smart and I'm quite good with Kubernetes, but it had me doubting things which I knew were not at fault. The idea that a six-digit number is going to cause any sort of overflow on a 64-bit system... of course not! But debugging is hard. Sometimes you have to remove yourself from a situation and come back to it, which we didn't do there. We now have a new rule on Clustered: no more Unicode breaks, ever. Strictly prohibited; no one's allowed to do it. We have very few rules on our clusters, but this is now one of the official ones.
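If you ever suspect a lookalike-character break of your own, one way to check — rather than eyeballing pixels — is to hunt for non-ASCII bytes in the CoreDNS ConfigMap. A rough sketch, assuming GNU grep with PCRE support:

```bash
# Dump the CoreDNS config and highlight any non-ASCII characters that could
# be Unicode lookalikes (e.g. a Cyrillic letter standing in for a Latin 'c').
kubectl -n kube-system get configmap coredns -o yaml | grep -nP --color '[^\x00-\x7F]'
```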
Okay, so, "failure is a success in progress," and you only learn when things go wrong. This is why I really love doing Clustered. If you just have a cluster that just works, you're never really going to learn how to operate that beyond a certain level of scale. Clustered brings us this situation where we can have people bring their failures from their own companies, organizations, or teams. We replicate those issues on a live stream format that allows us to see how individuals debug. There's always a guest, there's always a team, and you get that insight into how they approach a problem. Sometimes that's more important. You don't necessarily need to fix the cluster; you can get a lot of valuable information by just seeing how people react, handle symptoms, and work backward.
We've been doing it for 18 months and I hopefully won't be stopping anytime soon. This talk is going to focus on this road of failures that we've come to appreciate from Clustered. I want to share some lessons with you. I'm also going to talk about some other learnings.
Next, it's just a really short video. It's the Team Edition. And then I want to talk about the problem, because it's also particularly cruel.
This was in the first 10 seconds of them picking up this cluster. They've disabled any auto-complete in the shell. They had removed the executable bit from `kubectl`, and then removed the executable bit from `chmod`. Now they're in a position where they can no longer reapply the executable bit to `kubectl` or `chmod`. And if you're familiar with the colours Linux uses in file listings, you can see all the executable bits have been removed. They don't have `chmod`, they don't have `kubectl`, they don't have `oc`. I don't know why they added that terrible alias. They don't have `chmod`. The only way I knew how to fix this was with Perl, so I thought that was a particularly cruel touch.
Does anyone know how to execute a binary without the executable permission? Why would you? But Clustered gives us the opportunity to learn this. It wouldn't have been such a good episode if it weren't for these two specific teams. This was Team Red Hat — they do Linux all day, every day — against Team Talos. If you're not familiar with Talos, they build Sidero and Talos Linux, custom Linux operating systems for running Kubernetes. So we had these two powerhouses of Linux contributors who know Linux inside and out. It was complete serendipity that it came out this way.
So the easiest way to fix this is that Perl has the ability to invoke most syscalls on a machine, and you could have done it with a Perl one-liner — but they had removed Perl.
And so what we learned from this episode is that you can actually execute the dynamic linker on Linux. We have this ld-linux.so, and you can execute any dynamically linked binary on a machine by proxying it through that linker. You can check my channel for that; it's a really cool trick.
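As a rough sketch of both escape hatches (the linker path varies by distro and architecture, so treat these as illustrative):

```bash
# chmod has lost its own executable bit, but the dynamic linker can still
# run it for us — and restore kubectl's bit while we're at it.
/lib64/ld-linux-x86-64.so.2 /usr/bin/chmod +x /usr/bin/kubectl

# The Perl route, had Perl survived: call the chmod syscall directly.
perl -e 'chmod 0755, "/usr/bin/kubectl" or die "failed: $!"'
```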
Also, on Clustered, we've seen people modify attributes on a Linux file system. Does anyone know what attributes are on a Linux file system? No, of course not — why should you? Attributes let you get really low-level with the file system, and here we're marking a file as immutable, which you can do with the chattr command. So you can take a file that you know kubectl or Kubernetes has to write to, mark it as immutable, and you've immediately broken the system. And you're not going to detect that break by running your regular ls command; you actually need to run lsattr on the file and then understand what those obscure flags mean when you list them. So again, Clustered gives us this environment where we get to extract all of this knowledge from people who have done things we haven't done before. And again, that's just awesome.
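A minimal sketch of that kind of break and how to spot it — the manifest path here is just an example of a file the kubelet cares about:

```bash
# Mark a file Kubernetes has to write to as immutable; ls -l will look
# completely normal afterwards.
chattr +i /etc/kubernetes/manifests/kube-apiserver.yaml

# lsattr is what actually reveals it: the 'i' in the attribute flags.
lsattr /etc/kubernetes/manifests/kube-apiserver.yaml

# And the fix, once you've found it.
chattr -i /etc/kubernetes/manifests/kube-apiserver.yaml
```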
So there was another very serendipitous episode with Kris Nova and Thomas Stromberg. Kris Nova is a kernel hacker who's worked in security and Kubernetes for many years. Thomas Stromberg was the maintainer of minikube at Google — also very smart, been doing this for at least 20 years, and coincidentally used to work in forensic analysis of intrusions at Google. I didn't know that, but when he came on to the episode and Kris Nova had hacked Kubernetes — not just broken it, literally rootkitted the cluster — Thomas ran this fls command. Who knows what fls is? Of course not. It's from a very old toolkit written in the late 90s called The Sleuth Kit, which does forensic analysis of a file system. By running this command, he got a time-ordered list of every modification to the Linux file system. He had the answer to every question he wanted to ask about the 48 hours Kris Nova had that cluster. If anyone else had been up against Kris Nova on that episode, they were not fixing that. So I love that we have these moments of complete serendipity to share knowledge with everyone.
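For the curious, the forensic timeline trick goes roughly like this — a sketch assuming The Sleuth Kit is installed and the root filesystem lives on /dev/vda1:

```bash
# fls walks the filesystem metadata recursively (-r) and writes a "body
# file" of every entry, with paths recorded relative to / (-m /).
fls -r -m / /dev/vda1 > body.txt

# mactime turns the body file into a time-ordered list of modifications —
# in other words, everything that changed while the breaker had the cluster.
mactime -b body.txt > timeline.txt
```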
So Linux is a pretty good attack surface, because it's a very complicated operating system if you want to get into the low-level specifics. You know what else is hard? Networking. We've had a fair number of networking breaks on Clustered too.
What people don't realize with Kubernetes is that we have the core network policies. However, we're now seeing fragmentation as CNI providers bring in their own adaptations of network policies. So it's not enough to check for network policies or cluster-wide network policies: we also have Cilium network policies, Cilium cluster-wide network policies, and even Cilium local redirect policies.
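Which means that auditing "what policies exist in this cluster" now involves several different resource kinds. A sketch, assuming a Cilium-based cluster:

```bash
# The core Kubernetes policy type:
kubectl get networkpolicies --all-namespaces

# Cilium's own policy types, which a plain NetworkPolicy audit will miss:
kubectl get ciliumnetworkpolicies --all-namespaces
kubectl get ciliumclusterwidenetworkpolicies
kubectl get ciliumlocalredirectpolicies --all-namespaces
```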
The set of things you need to understand to successfully operate a Kubernetes cluster at the networking level continues to evolve, and it gets very cumbersome and complicated — but in some ways also easier. My biggest frustration with Kubernetes is the default DNS policy. Who thinks the default DNS policy in Kubernetes is the default DNS policy?
It's not. We have a DNS policy called "Default", but it's not the default. The default is "ClusterFirst", which means Kubernetes will try to resolve DNS names within the cluster first. The "Default" policy instead falls back to the DNS configuration of the host. So it passes the eyeball test: you've got this configuration, you could be looking at your kubelet config and think, "Yeah, this is great, it's perfect." No, it's not. And I've been discussing with people like Tim Hockin and other core maintainers of Kubernetes how we remove some of these anomalies, which are essentially foot-guns for people who just haven't encountered these problems before.
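To make the foot-gun concrete, here's a hedged sketch of the difference on a pod spec (the pod name and image are made up):

```bash
# dnsPolicy "Default" inherits the node's resolv.conf, so cluster-internal
# names like my-svc.my-ns.svc.cluster.local will NOT resolve.
# The actual default is "ClusterFirst", which tries cluster DNS first.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: dns-demo
spec:
  dnsPolicy: Default        # looks like the default; is not the default
  containers:
    - name: shell
      image: busybox:1.36
      command: ["sleep", "3600"]
EOF

# Check what a running pod actually uses:
kubectl get pod dns-demo -o jsonpath='{.spec.dnsPolicy}{"\n"}'
```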
Moving outside of Kubernetes itself, when it comes to Kubernetes networking we have the host, we have iptables and nftables, and we have quality of service through something called traffic control. Who's heard of traffic control? Okay, a couple of hands, good. There are lots of ways to change how networking happens on a host. And more recently we have eXpress Data Path (XDP) and eBPF, which are changing the landscape completely as well. You can't just go into a Linux machine anymore and run iptables -L, which I'm sure we all have ingrained in our skulls from the last 20 years. So who knows how to list all the eBPF probes or traffic control policies on a host? You can, but not with the tools you're used to.
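For the low-level view, the commands look roughly like this (assuming bpftool and iproute2 are installed; output formats vary between versions):

```bash
# The old muscle memory:
iptables -L

# What may actually be steering packets today: loaded eBPF programs, and
# which of them are attached to network interfaces.
bpftool prog show
bpftool net show

# Traffic control (tc) qdiscs and filters on an interface:
tc qdisc show dev eth0
tc filter show dev eth0 ingress
```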
You need to have other eBPF tools that can understand the existing eBPF tools. And unfortunately, Cilium is leading the way here. So a good tip is to check out Hubble by Cilium. Provided you're using their CNI, of course. But you get this wonderful visual representation of all the networking policies, Kubernetes-specific and Cilium-specific. And it shows you with nice little arrows and pointers what services can communicate with each other. Very valuable tool.
So, Hubble also ships with a CLI. We did a special Clustered episode where I tried to confuse the Cilium team themselves. The Isovalent team joined me, and I thought, "I'm going to use your own tool against you," so I used a local redirect policy. A local redirect policy lets you say: every time something tries to talk to this service, don't send it there — redirect it to this other backend instead. As you can see here, they're running hubble observe and grepping for port 5432 (Postgres). I had misused a local redirect policy to point them at fluentd, so they weren't able to communicate with the database at all. However, Duffie is a very smart man; he ran the Hubble CLI and saw, "Hey, we're speaking to fluentd instead of Postgres — what's going on there?" And within 30 seconds he had removed my local redirect policy. So we have the tools to understand networking within our cluster, if you're lucky enough to be in Cilium land. If you're using other CNIs, you'll have to find other tools, but they do exist as well.
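The flow on the episode looked roughly like this — a sketch, with placeholder names for the policy and namespace:

```bash
# What the team ran: list recent flows and grep for the Postgres port.
hubble observe | grep 5432

# The giveaway was traffic to fluentd where Postgres was expected, so the
# 30-second fix was finding and deleting the sneaky redirect policy.
kubectl get ciliumlocalredirectpolicies --all-namespaces
kubectl delete ciliumlocalredirectpolicy <policy-name> -n <namespace>
```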
My favorite thing about Cilium is the editor. You can actually go to editor.cilium.io, and you can build a Kubernetes networking policy or a Cilium network policy by dragging boxes, changing labels, and changing port numbers. So, you don't actually need to learn how to navigate these esoteric YAML files anymore. Use the tooling that's perfected. What's also really good about this is you can take an existing network policy that you have in your cluster, drop it in, and it gives you that visual representation too. So, whether you're debugging existing policies or creating new policies, everything in the Cilium editor is there and available to help you.
Alright, next I'm going to talk about etcd. I'm not going to ask who's operated etcd; instead, I'll ask who enjoys operating etcd? Not a single hand — of course not. And this is the one thing where, when we do the pre-prep on Clustered, I generally ask people what they don't want to see from the other team, and every single time it's, "Please don't touch etcd or certificates." Certificates are also up there, but I think etcd just edges them. Most people swear when they realize, "Oh, it's an etcd break" — they don't normally say it as politely as that. There's that moment it sinks in: there's dread and fear, because you can't just bust out the etcdctl command line. You have to prepare it with a whole bunch of environment variables to tell it how to speak to etcd and point it at all the right certificates, and then, eventually, you might be able to run etcdctl endpoint health or endpoint status. It's not easy to even start debugging when etcd is the problem.
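To give a flavour of the ceremony involved, here's roughly what has to be in place before etcdctl will even say hello — the paths are the usual kubeadm defaults and may differ on your cluster:

```bash
# etcdctl needs the API version, the endpoint, and the client certificates
# before it will talk to you at all.
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key

# Only now can you ask the basic questions.
etcdctl endpoint health
etcdctl endpoint status --write-out=table
```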
What we've seen is that etcd has a maximum database size — 2GB by default, though it can be configured. What people like to do is saturate it: they write a whole bunch of data into etcd, it hits the quota, and you're presented with a NOSPACE alarm. Now, you may think, "I can resize the database, restart etcd, and things will work." No, of course not. etcd will not clear that alarm on its own when you've rectified the problem; you have to explicitly and intentionally disarm the alarm, and then etcd will begin to work again. So operating etcd is not easy. There are lots of things that are easy to break, and you have to have all of this knowledge — you have to have failed at a whole bunch of things — just to cover the basics of operating etcd in your cluster, which is why we all find managed Kubernetes offerings so appealing, right?
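Assuming the same etcdctl environment as above, recovering from a saturated database goes roughly like this:

```bash
# Confirm the problem: a NOSPACE alarm raised against the member.
etcdctl alarm list

# Free space: compact away old revisions, then defragment the backend.
rev=$(etcdctl endpoint status --write-out=json | sed -E 's/.*"revision":([0-9]+).*/\1/')
etcdctl compact "$rev"
etcdctl defrag

# etcd will not clear the alarm for you — it has to be disarmed explicitly.
etcdctl alarm disarm
```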
On one particularly cruel episode, a team turned on encryption at rest on my Kubernetes cluster but didn't encrypt any of the data that was already in etcd. So when we ran kubectl commands, we weren't able to get anything back. And honestly, this one stumped me: until this episode, I had never actually turned on encryption for a cluster. So what did they do? They watched me suffer for 20 minutes as I was writing a bash script to pull out keys, encrypt them, and write them back in. But you can actually enable a dual mode. The config for that is there — we don't really need to read it — but you can turn this on. It's this provider called identity, and it means the API server will support reading both unencrypted and encrypted data. Then all you need to do is get all secrets in all namespaces as JSON and replace them straight back in, and you've encrypted your cluster. But unless you've had to do this, you don't know that you need to do this. And this is the config: you just enable identity at the bottom.
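A hedged sketch of the two pieces: an EncryptionConfiguration that encrypts new writes while the trailing identity provider keeps old, unencrypted data readable (the file path and key are illustrative), plus the one-liner that rewrites every secret so it gets stored encrypted:

```bash
# Example encryption config for the kube-apiserver (passed via
# --encryption-provider-config). The identity provider at the bottom is
# what lets existing plaintext data still be read.
cat <<'EOF' > /etc/kubernetes/enc/encryption-config.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded 32-byte key>
      - identity: {}
EOF

# Once the API server is running with that config, rewrite every secret in
# place so it is re-stored encrypted.
kubectl get secrets --all-namespaces -o json | kubectl replace -f -
```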
Alright, so where to next?
There are a lot of ways to break a Kubernetes cluster. This word cloud is probably not that comprehensive, and I wish I had more data from all of the episodes. But, you know, you can attack the container runtime. We've had people recompile the kubelet and remove certain controllers so it couldn't spin up pods or containers. We've had people roll back the kubectl binary 25 versions, so it was running kubectl 0.8 or some similarly ancient version against a cluster running 1.23. Why did they roll it back 25 versions? Because rolling it back 24 versions didn't actually break anything, which is pretty cool.
And they kept rolling it back, one version at a time, until it eventually broke and no longer spoke the same API language. So we have this nice thing where the Kubernetes maintainers have been really good at ensuring a level of backwards compatibility — but there's a limit, and you can push it too far. We haven't really discussed storage, but if you're operating bare-metal Kubernetes, you've got to worry about your own CSI providers, and that is another very challenging aspect. There are also backups, and there's auto-scaling — horizontal pod autoscaling, vertical pod autoscaling — there's a whole bunch of ways to break a cluster. And then there's the security aspect, right? What about seccomp? Are we doing that in our clusters? Are we minimizing permissions? Lots and lots of these. And I would say, if you want to learn more about these things, if you find this interesting, go to the YouTube channel; there are over 50 episodes. However, what I really want to cement here is that you're not supposed to know all of this. You only learn this stuff when things go wrong. So if you've been exceptionally lucky and these things haven't gone wrong for you, then a lot of these principles may seem really foreign.
One of the things I always tell our guests at the start of every episode is that we want to normalize saying "I don't know." The one rule I give people is: please don't sit there quietly Googling off-camera to get an answer and then go, "Oh, I know how to fix this," because that doesn't benefit the audience. I actually want to normalize not knowing. I've been doing Kubernetes for six years. I've operated bare-metal clusters and managed cloud services. I think I've seen an awful lot, and Clustered has made me see even more — and I still get stumped every single episode, because there are so many different tactics to break Kubernetes. So make sure you're bringing this into your team. If you are an SRE or platform engineer helping to operate Kubernetes — especially if you're a principal or senior engineer — let's normalize saying "I don't know" for other people on the team, and try to dismantle this terrible hero culture that we've adopted over the last 20 years.
So we all find this difficult, and I encourage you all to learn in public too. I do this on my YouTube channel, and it's actually kind of selfish: I get to learn more from other people. But at the same time, I'm hopefully producing content that we can all learn from together. What I'd love to see is more people sharing their Kubernetes knowledge out there. And we have some really good resources as well — there's k8s.af, which collects all the horror stories from other people who have operated Kubernetes. We really should share as much as possible.

So, to finish, I want to share what I'm learning now and what I find really interesting, and that is eBPF. I genuinely believe that eBPF is going to change everything about the way we operate not just Kubernetes but container-based workloads and Linux in general. The reason it changes everything is that it's super performant: it runs in kernel space, not user space — am I over time? No — and it uses ring buffers and other fancy stuff, so you don't take any sort of hit when you start deploying eBPF probes to your clusters. It also enables some really cool tracing. We're now very lucky to have continuous profiling tools that don't need to understand the runtime: they don't need to understand the Go runtime, they don't need to understand PHP, they don't need to understand the JVM. We're able to instrument all of our applications at the eBPF level, which is very, very cool. It's also safe and secure, because eBPF programs have a limited subset of calls they can make; technically they should never crash or cause problems in your kernel. There are always edge cases, of course.

The tools I want to highlight, if you're interested in eBPF, are these. Cilium from Isovalent: it's an eBPF-powered CNI, Hubble is built on top of it, and you don't even need kube-proxy, which I think is also very, very cool. Falco from Sysdig gives us security monitoring of workloads: it detects, at the syscall level, whether our applications are doing things we don't expect them to do, and it requires zero instrumentation on your part. Inspektor Gadget from Kinvolk exposes all of these snoop tools — I'm going to demo one or two of them — inside your Kubernetes cluster as well. And Pixie by Pixie Labs, which is now owned by New Relic but has been kept open source and free, gives you automated instrumentation and monitoring data for any application, again based on eBPF. So you don't need to jump to OpenTelemetry and add loads of lines of code to your application to start monitoring, and you don't need to add your own Prometheus metrics or anything. You still should, but you can deploy Pixie and start to get a lot of this information for free, for the small performance overhead of an extra container in your pod.

The most interesting things for me, though, are the snoop tools. execsnoop you can run on a machine, and it will tell you any time anybody or any process executes a command on that machine. iosnoop traces disk I/O, and opensnoop tells you when files are opened, written to, closed, and so on. And then there's a whole bunch of stuff for networking too. In fact, if I find my terminal here and SSH onto a server I set up this morning, we can run a couple of commands. So on the top — I'm going to fix that in a second — I'm going to run execsnoop, and I won't reject my 1Password prompt this time.
Okay, so we can already see a whole bunch of things happening just from me logging into the system. Now, I could run echo hello, but it's not going to show up. Does anyone know why? It's a built-in in Bash; it doesn't actually execute anything on the machine. However, we can run /bin/echo, and you see it appear at the top. What's really cool is when you start filtering that data for things you don't expect to happen on your box. Here, we can see that sudo was called, we see the parameters to sudo, and we understand when privilege elevation happens on our machine. So execsnoop is just one example of hooking an eBPF probe into your system to get information out. The last one I'll show is opensnoop, which shows you every time a file is read or written. You can even see things happening as I type, because of file descriptors and the fact that everything on Linux is a file. But now we can see that someone read our /etc/passwd file.
No, they didn't. I'm just kidding. But it gives you an idea of the kind of power that eBPF can give you. You can hook these probes into any system call. You can measure any metric. You can expose that data any way you want. And it's all super safe. There's no kernel module. It's always running in the kernel, it's not going to crash, it's not going to cause you any problems. And it gives us a much richer visibility into our applications and our operating systems than we've ever had before. So definitely recommend looking into eBPF, especially if you're into operating and infrastructure, Kubernetes, anything in that space. But for everyone else, it's also very cool too. So with that, thank you very much. And if you have any questions, I'll be around later to chat.
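If you want to try the same demo yourself, the snoop tools ship with the BCC project; package names and paths differ by distro (on Ubuntu the binaries carry a -bpfcc suffix), so treat this as a sketch:

```bash
# Trace every process execution on the host, with its arguments.
sudo execsnoop-bpfcc

# In another terminal: anything that actually execs shows up in the trace.
/bin/echo hello     # appears — a real exec
echo hello          # does not — it's a bash builtin, no exec happens

# Trace every open() on the host, e.g. something reading /etc/passwd.
sudo opensnoop-bpfcc
```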