Observability at Home: A Homelab Experiment
Speaker: Tracy P Holmes
Summary
At Navigate NA 2023, Tracy Holmes presented her talk titled "Observability at Home: A Homelab Experiment." She discusses the use of Cilium, Grafana, and Prometheus for observability in a home lab setting. Tracy explains eBPF, Cilium's capabilities, and how Hubble enhances observability. She also shares her personal experiences and insights while using these tools. This informative session is perfect for those interested in observability and networking in Kubernetes environments.
Transcription
How many people in here are already using Cilium, Grafana, and/or Prometheus? Any of them, just like... Oh, alright, let me break that down then because I wasn't expecting that. Alright. To do it, did do it, move to the Epson.
Alright, there we go. Alright, and then turn on Do Not Disturb. Alright, how many people are using Cilium? I'll take it. How many people are using Grafana? Figured. How many people are using Prometheus? Is it kind of one of those if you have one, you have the other kind of... Okay, alright. That's kind of how I ended up here also.
So, I am Tracy Holmes. I am looking for my mouse, and I am going to give you all the remix of the talk that I would have given you had I been on time. That cool? Alright, sounds good. If you've never heard me speak before, I'm kind of worse than JJ, just without all the puns.
So, yeah, this is Observability at Home: A Home Lab Experiment using Cilium, Grafana, and Prometheus. I am Tracy P. Holmes. I am a Technical Community Advocate at Isovalent.
So the first thing we're going to do is, I'm going to go over, kind of give you the crib notes on eBPF, because honestly, I don't understand it. I don't have to; I just deal with Cilium. I'll give you some crib notes on Cilium, and then I'll give you basically what I did, how it ended up, and what I'll probably do later on. Cool? Alright.
So, since I don't have my clicker... Oh well, she was gonna do that. Jesus, sorry cameraman. There we go. Alright, so.
What's eBPF? So most of you probably heard eBPF is a set of superpowers. That's kind of the top thing that we kind of go with if we're not doing the whole Star Wars thing. But what is it exactly?
So it's tooling that attaches business logic to pretty much anything that's running in your kernel. That opens up stuff like tracing profiles, well, tracing and profiling, observability, which is why I'm here, security, which for us would kind of be Tetragon, networking, that kind of stuff. So there's a lot that you can do with BPF or eBPF, and if I say PBF, it's probably because I'm hungry and I'm thinking of peanut butter.
It's a system that also allows dynamic instrumentation of the OS kernel. At its core, it's just some instructions; that's basically what it is. Those instructions just happen to make it able to attach to things so that you can see things that shouldn't be there, or maybe should be there, depending on what you're looking for.
Oh, it's also been taking Linux by storm, becoming one of the top trends in Cloud Native, and there's some XDP goodness going on around there also.
So how exactly does it work? Any OS kernel is event-driven. So you basically have events coming in from different physical or virtual devices like network cards, storage devices, all that kind of good stuff, and then you have the processes that are making the system calls. eBPF provides the facility to attach that logic to those events running in the system. And that way you have a deep understanding and, like, a magnifying glass going all up in your kernel to find out what's going on. Basically, what happened, why did it happen, where did it happen, and what can we do to prevent it?
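To make that a little more concrete, here's a minimal sketch, not from the talk and definitely not Cilium's actual code, of what "attaching logic to an event" looks like in libbpf-style C. It assumes a typical libbpf build with vmlinux.h and bpf_helpers.h available, and it just hooks the XDP point on a network device and logs each packet the kernel sees there.

```c
// Minimal illustrative sketch (not Cilium's code): attach a tiny piece of
// logic to the event of a packet arriving on a network device, via XDP.
// Assumes a libbpf-style build with vmlinux.h and bpf_helpers.h available.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

SEC("xdp")
int watch_packets(struct xdp_md *ctx)
{
    // bpf_printk is a kernel helper; output shows up in
    // /sys/kernel/debug/tracing/trace_pipe.
    bpf_printk("packet seen on this device\n");

    // Let the packet continue on its way; we're only observing here.
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```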
So, context, how do you get the context? Well, you use something called helpers. I heard a camera going off earlier and thought it was you, so you're on camera with me blaming you. I'm just letting you know ahead of time. Those are stable APIs in the kernel that provide access to all sorts of information.
Now, here, once you've triggered eBPF programs on some events and interacted with the kernel, either gathering the context or running the logic, what do you do with that info? Like, do you just throw the state away? So, this is where maps come in. Maps allow you to store data in efficient data structures and share it with user-space applications or other eBPF programs. That was your crib notes on eBPF. Now, let's talk about Cilium.
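Before we do, here's a hedged sketch of those two pieces, helpers and maps, working together. Again, this is illustrative libbpf-style C rather than anything Cilium ships: a helper (bpf_get_current_pid_tgid) supplies context about which process triggered the event, and a hash map keeps that state around so user space or other eBPF programs can read it later.

```c
// Sketch only: count execve() calls per PID, using a helper for context and
// a map for state. Assumes vmlinux.h/bpf_helpers.h as in a typical libbpf build.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

// A hash map shared with user space: key = PID, value = execve count.
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);
    __type(value, u64);
} exec_counts SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_execve")
int count_exec(void *ctx)
{
    // Helper: a stable kernel API that tells us which task hit this event.
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 one = 1;

    // More helpers: look up and update map state instead of throwing it away.
    u64 *count = bpf_map_lookup_elem(&exec_counts, &pid);
    if (count)
        __sync_fetch_and_add(count, 1);
    else
        bpf_map_update_elem(&exec_counts, &pid, &one, BPF_ANY);

    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```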
So, what is Cilium? There's a whole bunch of text; feel free to read that. But it makes the kernel microservices-aware with the help of eBPF, giving you more context than you would normally have with another CNI, and it helps you leverage those superpowers. It also gives you rich context. It'll also help you answer questions like: what if you could use eBPF to control and observe all the network traffic in your cluster? You know, that magnifying glass that I was telling you about. What if you could use the context that we can gather from eBPF to understand which DNS queries are failing or failed in the last 10 or 15 minutes?
So, why is this interesting to you, or us, with K8s? It's the superpower thing, so you gotta pick a CNI.
There's a ton of use cases. I am not going down that list. But there's a whole bunch of functionality that Cilium will provide. The bottom line is that eBPF provides a way for Cilium to gather rich context to be able to connect your applications together in an efficient way. It also provides the framework to be able to understand (you saw nothing) and reason about how your applications are behaving in the cluster. This is not a comprehensive list, but it's a big old list. The ones we're pretty much going to talk about today kind of touch on Kubernetes networking, but mostly the observability section.
You heard me mention the CNI; what exactly is a CNI? So, it's the thing that allows the pods to communicate. Sometimes it also enables traffic into and out of your cluster. It also provides a network device config for your pods. It handles the IPs and routes, handles the traffic to and from the pods within the node and across the nodes. Now, these can be built in a range of ways. One of these ways is establishing a tunnel mesh between nodes where all the packets are encapsulated, or by integrating with the network to provide direct routing. Most naive implementations will implement all of this logic in iptables in the kernel, which defines chains of rules, like "if you see this IP, pass it on; if you see that IP, drop it." It's kind of like hot potato. Yeah, I'll just leave it at hot potato; I'm not going to read the rest of my notes on that one.
So, how does Cilium implement the network plugin with eBPF? With a microservices model moving logic into more components, the network really is at the heart of your application, and making it efficient allows your apps to scale better. When something goes wrong, it's even more important to know what's going on. Cilium runs as an agent on each node, with tailored eBPF programs for each pod. And this allows deep control and context to be used to implement the networking.
So, for example, services. Services are one area, and they're about providing consistent IP addressing for applications to talk to each other. App1 wants to talk to App2, so it talks to the IP address. Then it's relying on the existence of that specific pod at that specific location. And if that pod goes away, App1 needs to know about it and react to it. Think about it like this: you want to send $500 to your nephew through your grandmother. Your grandmother doesn't do anything other than cash and checks. You send the check, and she sends it to the old address. She doesn't have the updated address. She needed to know about that updated address, but she forgot to write it down in the passwords notebook, but that's a whole other story. Anyway, this model makes it easier to deal with the ephemeral and ever-changing nature of microservices. But it relies on the networking plugin delivering the traffic to the backend. Services are a crucial part of any cluster because pod IPs are ephemeral, and in many clusters today, this is implemented in iptables or IPVS.
So, we know services are dynamic; we just talked about that. How do we handle them under the hood? So, in Cilium, it's implemented using BPF maps, specifically a hash table that can perform the IP translation in constant time. Furthermore, since it's backed by a hash table, updates are also performed in constant time. Now, iptables was initially designed for low-churn environments. Changes in iptables are not atomic. Even though it's possible to change just one rule in iptables, what's really happening is that the whole rule set gets copied, changed, and replaced, and as we make changes more frequently, this problem gets a little out of hand. Now, K8s was designed for high-churn environments. Services are represented on each node in your cluster, so the problem gets worse as you scale in any dimension: pods, services, nodes, ports, all of that good stuff.
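To picture why that's constant time, here's a conceptual sketch of the kind of map involved. The names and layout here are illustrative only, not Cilium's real datapath (which lives under bpf/ in the Cilium repo and is considerably more involved); the point is simply that translating a service IP to a backend pod is one hash lookup, and adding or removing a backend is a single map update rather than a rewrite of a long rule chain.

```c
// Illustrative only; not Cilium's actual service map layout.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct svc_key {
    __be32 service_ip;    // the stable ClusterIP clients talk to
    __be16 service_port;
    __be16 pad;
};

struct svc_backend {
    __be32 backend_ip;    // the ephemeral pod actually serving the traffic
    __be16 backend_port;
    __be16 pad;
};

// One hash lookup translates service -> backend in O(1), no matter how many
// services, backends, or nodes exist; updates are single map operations.
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, struct svc_key);
    __type(value, struct svc_backend);
} service_map SEC(".maps");

// Conceptually, on the packet path the translation is just:
//   struct svc_backend *be = bpf_map_lookup_elem(&service_map, &key);
//   if (be) { /* rewrite destination to be->backend_ip : be->backend_port */ }
```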
We're going to get out of here on time. Thank you. Observability, which is why we're here. So again, like I said, context is king. Let's circle back around to that.
What is observability? So, for the purposes of this discussion, observability is about having the tools necessary to observe how your applications are behaving in a Kubernetes cluster. It's about having the camera at the doggy daycare so you can see exactly whose poop your dog is eating. That's basically what that is. You get a lot of questions around this context, but Kubernetes doesn't supply any tooling for this out of the box. And in fact, most of the CNIs out there don't either. With Cilium, we're basically applying the superpowers (there's that superpowers thing again; I'm gonna start wearing a cape to these talks) of eBPF to this problem, and we can provide both the tactical and the strategic solution.
Hubble, which is the other thing that you all probably saw in the description. It takes everything we learn about the application, whether it's at the app layer, the networking layer, whatever, and it exposes that to you in three ways. It's built into Cilium, so it can be used as a CLI, an API, or a UI. It exposes that context we gathered to platform operators. And with it, you can dig into what the application is doing on the network.
That's pretty much what it looks like; that's the UI version of it. You have this whole nice little dashboard, you can observe the flows, that kind of thing. It shows you all the traffic going in and out. And then you can drill down a little bit into it. I prefer the UI. I do not like looking at the flows in the CLI, because it starts getting a little bit "Why don't I have an ultra-wide monitor to look at all this?" kind of deal. So, this is better for me, the graphical representation. But something that gets people excited, like you just saw with me, is to see the UI that's constructed on top of Hubble. So, it builds a map of the traffic flows throughout the cluster. It uses that context that Cilium and eBPF provide, and it can visualize that for you. Now, beyond that, when you click on different pods in the UI, you can drill down, and you get shown the flows associated with that particular application. And you can easily navigate to where the problems are in your cluster and start to investigate why those particular applications are hitting issues. And there's another level of drilling down.
Now, by default, Hubble shares rich context, showing flow data that describes traffic by protocol or by an action in the cluster. You can look for all DNS traffic from, oh, I don't know, let's use the demo example, an X-wing, or follow an HTTPS request from your workload to Disney.com and back, or wherever. When it comes to -- you know what, I'm not going to read the rest of this one; I stole this slide, I'm not gonna even lie. When it comes to the Death Star, the Empire has a policy about letting Rebel Alliance ships near, and you can see the traffic getting denied by the network policy. If you've ever seen any of our tutorials, we're all in on that entire thing. It works, though, as an instructional component; I actually do admire it because you're either on one side or the other. So it's kind of easy for you to say, "I'm either with the Alliance or I'm not. I don't want you in my house or I do." You also get DNS visibility. If you look at it here, these are all the responses that came back with an rcode that does not equal null. The rcode is the part of the response from a DNS server that tells you if there's a problem with the DNS query. So, with a flow like this, you can determine if there are DNS errors in the cluster and which pods are related to the errors, and that is observability, and it's built in.
Here's a quickie on - I'll just show you the diagram of this. I'm not going to read this because we're a little short on time, but that's L7 visibility. I can get a lot of context from that. You can also get transparent SSL visibility, and you can basically use eBPF to defer the encryption until the proxy can gain visibility and actually show you what you're looking for.
How many of you are, like, with a platform? Oh yeah, platform is the thing now, isn't it? How many of you are with a platform team or adjacent? Okay, good. We won't go through that example, but basically, because of Hubble metrics, we can generate these events, complete with the context about the application that's in it and which version of the code was running at the time. Now, those were the fancy slides. Here are the Tracy slides. So, what had happened was, when I started this, I wanted to spin it up using IaC, you know, like you do: Terraform, Pulumi, Crossplane. Is Victor here? Okay, yeah, I had to say Crossplane, just... Crossplane. And that's what I started doing. And also, because it was Terraform, I started fighting it, as you tend to do, and I fought it just, you know, a wee bit too long and literally went down that route. Those are the curse words.
And I'll tell you why. I kept going to different blog posts. Let me tell you about you all and the rest of the people that are here at this event, because we all do it, and we do it for a reason: because we all like to teach, right? For the most part, to teach is to learn. There are a lot of blog posts out there when you're trying to do this stuff. I think I came across three that had different versions of the documentation, and then I went back to the provider and it said something totally different. At that point, I just started throwing stuff, so I started talking to my boss about it. She's like, "I mean, Civo says the whole one-click and you're up in so many minutes thing, why don't you work on that?" So, that's what I did. This we're going to go over in two minutes.
I got in there. If you've ever actually spun up a cluster on Civo, you all know how it is if you do it through the UI: you click on stuff, and 90 seconds to a minute and a half, and I don't mean a minute and a half, 90 seconds to, like, three minutes later, you've got a cluster. Don't say it, Jeremy. It's pretty straightforward. My friend Tamika likes to tease me, and she'll... I think she has a talk today also about using the UI. But honestly, I go to the UI. I got stung by Azure way too many times by not doing this. I go to the UI just to see what the options are before I start just drilling down in the CLI, and I'm glad I did, especially when money's involved, because, you know, money. So, I did get my cluster done, but this was after I'd done it like two or three times, because one piece of the documentation said one thing and something else said something else. So, when I had initially tried to spin it up through the CLI, there were default applications I didn't know about. They were already being spun up, which meant I had to go back and do it again. Well, then I did it the proper way, and then I couldn't get one of the applications to run. So then I just said, "Screw it." I disabled everything, set it up so I could see it, and I went to the CLI, and that's how I got through with it. A lot of the screenshots you actually saw earlier, if you saw Prometheus or the Grafana boards or anything, some of those were my screenshots, and I would have said that earlier.

But here are some of the things I observed. So, I had to fight with this, because that was one of the things I didn't want to spin up when I went through the UI, and so I ended up having to use Helm to get that done. So, if you're using the kube-prometheus stack, just do it that way. And you'll see it on the next screen: that is the Prometheus Operator, where I ended up having to go to the GitHub UI, I mean the GitHub repo, to find out exactly what everything was. I did use Cilium for the CNI. The default image is not the most up-to-date one.
So, the good thing is, because I ended up using Helm, I got to rectify that. Once I got everything updated and started getting the flows I wanted, it was a little bit easier. It's not a criticism; it's more that we also have beginners here. And if you want to put in a PR to fix that documentation or play around with it, absolutely do it, because I can guarantee there are not enough people to keep it updated where it needs to be in all of the places. But outside of that, the experience was great. If you want to follow me afterwards, I'll show you my Hubble metrics and my Grafana board (once I got the right password, after about 20 minutes) going on in the background, and, you know, it was a thing. I'm gonna end up actually going back to do a complete Terraform, full-on repo kind of breakdown once I get back home, and then I'm gonna try doing this on my bare metal with RKE2. So, pray for me. But that's it.