Managing Kubernetes Log Data at Scale
Speaker: Matt Miller
Summary
In this talk by Matt Miller from Edge Delta, we delve into the world of managing Kubernetes log data at significant scale. Matt breaks down different strategies for handling log data, including Kubernetes native tools, persistent storage, open-source log collection, and centralized log vendors. He explores the benefits and drawbacks of each approach, focusing in particular on the challenges of scalability and cost when dealing with large volumes of data. Matt also shares insights on how to optimize data management using an intelligent, edge-first approach.
If you're looking to better manage and understand your Kubernetes log data, this talk provides valuable insights to aid your journey.
Transcription
Good morning everyone. It is still morning. Thank you for joining our session today. I know we had a little bit of a kind of bleed over on the last one but today, in this session at least, we're going to talk a little bit about Kubernetes log data being generated in platform, how we can better manage it and understand that data at significant volume. Just to clarify, my name is Matt Miller. I'm with Edge Delta. This is technically a sponsored session but I did everything I could to try to keep the sponsored part out of the session. We'll talk a little bit more, once again, about how we typically see customers and organizations handle this particular challenge.
I wanted to start this morning, had a chance to chat with a few folks and capture some pretty interesting quotes that we thought around Kubernetes logging, data in Kubernetes, and you know how you might leverage some of this stuff. No, I'm just kidding, all this came from ChatGPT so, you know, I think it kind of hit the nail on the head, in terms of what we look to accomplish with our goals of understanding application runtime, understanding performance. There's a lot of different ways that we can tackle this challenge but I think at the core we're looking to essentially understand how that cluster is operating, how the applications on the cluster are operating, and everything we have configured in that manner. Those kind of high-level mission goals of anyone collecting, using, emitting log data at scale are, you know, pretty specific and easy to state.
One of the things that we typically see is, yes, it's easy to run, you know, a K3s instance on your local machine and collect some log data and understand it. When you're running, you know, a couple gigabytes, okay, maybe a couple hundred gigabytes, things get a little squirrely. Once we exceed that, get into the terabyte level, things just kind of fall off the rails. So, performance and maintainability of this entire practice are paramount in order to have any use of the system as we scale the massive volumes of data up.
So, how might we tackle this kind of challenge today? Well, there's really four main ways to handle this type of data, each with their own kind of drawbacks and positives. The first is to use Kubernetes native tools, built-in solutions that allow you to gain visibility into the systems. The second would be to leverage some of our friends in the storage room to store volumes of that data, whether it's with persistent volumes or network attached storage. A very common approach, I would say honestly the most common approach that the market is doing today is using some open-source log collection capabilities to pull data out of somewhere, or all of your cluster, and do something with that data. Toss it into Logstash to be viewed later, or you know, once again another way to do it would be to use a centralized log vendor. The Datadogs of the world, the Splunks of the world, Elastic and others. So, we'll kind of talk through these here and understand once again the drawbacks and where we can kind of help you with scaling.
Let's talk about built-in tools first. Everyone's favorite, invoking that command line, looking at a pod, understanding the performance. The best thing about this approach is it's totally free. It's just your time and your command line, your terminal of choice, right? Whether you're using K9s or just, you know, the standard CLI, you can run things like kubectl describe, kubectl logs, and gain some visibility into what's happening. The bad news is, it creates essentially infinite data silos. Think about running an application that could be executing on 200 different nodes, right? Maybe 10,000 different pods that potentially could be emitting data. It's pretty impossible to try to manage that. There's no detection built into that system, so you have to know where to look. You have to rely on an external system to try to give you some sort of alert something's going on, maybe it's a resource based alert or something in that realm, and it's pretty individual when it comes to insights, right? So you're gaining insights as an individual looking at your terminal. It's tough to share that with the team unless you're copying, pasting or downloading a log and sending it to somebody. It just, it can't scale beyond, once again, maybe that local cluster or a tiny sandbox.
So, next, we try using a persistent volume. We're going to store log data from all of my application pods, as an example, in one place. We'll store it for later. We'll use that, maybe that's going to help us on this journey because storage for most organizations is relatively cheap, right? It can scale out.
Whether you're using cloud native or local storage, for the most part, that isn't a traditional drawback of managing massive amounts of log data. It's very easy to specify how long you want to keep these things for because you have complete control. It's your storage, right? You can access it. But once again, when it comes to getting insights from the data, when it comes to actually trying to understand what's going on, there's no baselining or machine learning running. Obviously, we're just storing logs in a particular destination. And once again, better collaboration than local but still limited. No console to take a look at, no analyzing with team members. So, once again, if you need to share this data, it gets a little harder. So, let's look at what we think is probably the most popular approach to this challenge. Of course, it's using some sort of agent deployed to a pod, to a node, to collect the data and use that information in a more collaborative manner.
Once again, for the most part, a big check, no vendor license costs. Things like Fluentd, Filebeat, Promtail, yeah, they're open source, right? We don't have to pay anybody to use them. They do require some resources and a limited capacity on your cluster, so there is a minimal resource cost there. But you do have to use that data in context. So, you're going to send that data somewhere. You're going to centralize it. It could be on the cluster, could be running Loki, you could be running Elastic on the cluster. You might want to leverage that data and aggregate all your logs into one particular cluster-specific view of the world.
And then, if you do that, it's going to take some time to get it up and running. If you want to do things like create a dashboard, create a metric on your data and create an alert even, you have to know things like regex. You have to potentially get a little deeper into SQL queries. This is just, you know, a little hard sometimes. It can take a while and it doesn't allow us to continue to scale this process. Also, a lot of times there is somebody figuring all this out and then a bunch of people on the team who want to know, "Hey, how'd you figure this query out? Tell me more about how you configured Filebeat to send this data into Logstash. And which attributes did you set up? What was the regex you used to do that?" So that also takes a lot of time that, typically speaking, most people don't have.
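To make that regex point concrete, here's a minimal sketch of the kind of parsing you end up writing before you can build a metric or alert. The log format and field names here are hypothetical, not any particular agent's configuration:

```python
import re

# Hypothetical log format, e.g.:
#   2023-09-07T10:15:32Z ERROR checkout-svc "payment timeout"
LOG_PATTERN = re.compile(
    r'(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>\S+)\s+"(?P<message>[^"]*)"'
)

def parse_line(line):
    """Extract structured attributes from one raw log line, or None if it doesn't match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

parsed = parse_line('2023-09-07T10:15:32Z ERROR checkout-svc "payment timeout"')
```

Every dashboard panel or alert rule built directly on raw logs needs something like this behind it, which is exactly the expertise bottleneck the speaker is describing.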
So, you know, then in that case, a lot of vendors have come in and say, "Hey, we'll wrap up Grafana. We'll wrap up Elastic. Or we have our own solutions for managing those types of data sets to make it easier to filter, to make it easier to use and spend less time, and collaborate." But the problem with most of these vendors is, you know, they cost money, right? So, I don't know if anyone's used one of these. Just kind of a couple that we've tossed up here that are known by some analysts somewhere that we see out there as vendors who support central log aggregation. But for the most part, each of these are going to require either some sort of licensing cost or they're going to require you to put some effort in to run them. Just quick show of hands, does anyone use any of these today? Yeah, me too. Okay, that's great, wonderful solutions.
What we feel has really occurred with this, when we increase to massive scale, is cost. So unfortunately, all of these vendors, or the centralized approach, do require the capability of having unlimited or near unlimited budget to manage significant volumes of Kubernetes log data. Kubernetes is not a quiet way to run your applications. So, obviously, a lot of these vendors are getting some details on how much data you're sending, and then there's licensing on the back end. Really, what we also see at scale, I mean we're talking in this case even petabytes of data per day, is that organizations have to essentially establish what data they want to use to fit into their licensing requirements. So, attribute-based filtering is great, alerts are great, baselines are phenomenal. The teams are collaborating, the Slack channel is up and running. I've got my AI practice ready to go.
But then actually getting data requires setting up, in most cases, a central vendor-based agent. So you deploy their agent, or you try to use an open source one and still experience the problems of managing, you know, Fluentd or Filebeat at scale. Then you need to specify what data actually should be included. So it means things like drop filters. It means things like analyzing only small sets of that Kubernetes cluster. "Oh, I can only afford to send these application pod logs today," as an example.
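A drop filter of the kind described might be sketched like this. The attribute names (`namespace`, `level`) and the allow-list are hypothetical illustration, not any vendor's schema:

```python
# Illustrative drop filter: ship only events from a short allow-list of
# namespaces, and drop DEBUG-level noise everywhere. Field names are hypothetical.
ALLOWED_NAMESPACES = {"payments", "checkout"}

def should_ship(event):
    """Decide whether one log event is worth paying to send downstream."""
    if event.get("namespace") not in ALLOWED_NAMESPACES:
        return False
    if event.get("level") == "DEBUG":
        return False
    return True

events = [
    {"namespace": "payments", "level": "ERROR", "msg": "timeout"},
    {"namespace": "payments", "level": "DEBUG", "msg": "retrying"},
    {"namespace": "kube-system", "level": "ERROR", "msg": "restart"},
]
shipped = [e for e in events if should_ship(e)]  # only the first event survives
```

The trade-off is exactly the one in the talk: everything the filter drops is invisible downstream, so choosing the allow-list becomes a budget decision rather than an observability one.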
And then there's still the high cost, right? So even after all that, you've selected what data you want to route, you've managed how you're collecting it, you're still sending in and paying for every single log event, even if no one ever uses that. I was talking to a customer recently that mentioned that typically speaking, they will only analyze and view like five percent of the data that they're sending into Datadog, as an example. Like they just brought it in, it sits there, anyone can access it in Log Explorer, but no one ever actually does. Because a lot of times, there's no reason to look at that log data. But it's just there, in case anybody needs it. Of course, you've paid for that, which isn't always the most cost advantageous to the finance team.
So, in our world, I think we are typically looking to kind of change this paradigm. And Edge Delta's idea would be to own your data. Don't require a third party vendor to house that for you. Be able to search any data you want and not be charged for it. And then, obviously, simplify how you get data anywhere.
If you think about our approach to your observability solution, it might look something like this. This is a typical deployment for, you know, one of our larger organizations that we work with. And the idea, once again, is to collect wherever your data is being created. In our case here, we're talking Kubernetes data. So we'll collect data, anything that comes from that node. Doesn't matter which application or component it happens to be a part of, but collect it and then make intelligent decisions about where that data should be routed. Whether it's specific indexes for teams, whether it's some teams using Splunk, some teams using Datadog, some teams using Grafana, others using Elastic. Be vendor agnostic with that data and make it easy to leverage.
The other thing that we like to do is make sure that your information is owned by you. So teams still have that capability, in complete control. For example, you think option number two, using your own storage, we think that's great. You keep your own data, keep your own storage. You have S3, you have Blob Storage, you have GCS, great. Make it super simple and from our perspective, if you ever need to dip into that, make your practice more sustainable because you have access to your data at any time with no up charges, right? No additional costs.
You may have something like this internally, you may not, and that's totally okay. But a lot of organizations have classifications of data that they look at, whether it's for observability practices, security practices, just generic development use cases. These events and logs need to be leveraged somehow. So one way that organizations do that is create a tier system, right? Where, when I'm using this data, I'm going to use it for observability, I'm going to use it for deep dive troubleshooting, or maybe I'm going to use it for specific use cases that are significant when it comes to fidelity.
In our world, for the most part, this is not raw log data. You don't need raw logs to be able to create a metric in a downstream destination, in a Datadog, and in a Grafana. You don't need to be sending all that information there, paying to send it there, when you can actually create it locally. Make it super simple.
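The "create the metric locally" idea can be sketched as follows: collapse raw log lines at the source into a tiny metric payload, and ship only that. The log format and payload shape are hypothetical, assumed for illustration:

```python
from collections import Counter

def logs_to_metrics(lines):
    """Collapse raw log lines into a small metric payload instead of shipping every line."""
    levels = Counter()
    for line in lines:
        # Hypothetical format: "<timestamp> <LEVEL> <message>"
        parts = line.split(maxsplit=2)
        if len(parts) >= 2:
            levels[parts[1]] += 1
    total = sum(levels.values())
    return {
        "log_count": total,
        "error_count": levels.get("ERROR", 0),
        "error_rate": levels.get("ERROR", 0) / total if total else 0.0,
    }

metrics = logs_to_metrics([
    "t1 INFO started",
    "t2 ERROR payment timeout",
    "t3 INFO done",
    "t4 ERROR db unreachable",
])
# Ship only `metrics` downstream; keep the raw lines in your own S3/blob storage.
```

Four raw events become one three-field payload here; at petabyte-per-day volumes that ratio is where the cost savings the talk describes come from.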
If you do need to troubleshoot, kind of this middle tier, we like to call it the troubleshooting tier, where teams are looking to analyze something. We can be intelligent about this data at the application layer. We can baseline when there's a reason to send that into my downstream destination. And certainly, from that perspective, make it a little more simple when it comes to analyzing this data.
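One naive way to baseline "when there's a reason to send": compare the current interval's error count against a rolling baseline and only forward data when it looks anomalous. The thresholds here are arbitrary illustration, not Edge Delta's actual method:

```python
from statistics import mean, stdev

def is_anomalous(history, current, sigmas=3.0):
    """Flag the current interval's error count if it sits far above the recent baseline."""
    if len(history) < 2:
        return False  # not enough history to establish a baseline yet
    mu = mean(history)
    sd = stdev(history)
    # With a perfectly flat history sd is 0; fall back to a simple multiple of the mean.
    if sd == 0:
        return current > 2 * mu and current > 0
    return current > mu + sigmas * sd

baseline = [3, 4, 2, 3, 5, 4, 3, 4]   # errors per minute, recent history
quiet = is_anomalous(baseline, 5)      # within normal variation
spike = is_anomalous(baseline, 40)     # well above the baseline
```

Only the `spike` case would trigger shipping the underlying logs into the troubleshooting tier; the `quiet` case stays local.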
I want to bring up, a couple of you raised your hands, I'm guessing someone in this room has some experience with Datadog. A great solution, top of the market. Personally, I think it's probably one of the best, if not the best, when it comes to the holistic observability suite. But Datadog is not cheap, if anyone has used it, they know that. So a use case that is typical of Datadog is to route data in, pay to ingest that data, just to send it to Datadog. Here's 10 cents, here's how much it costs to store these millions of events, as an example. So we're paying some costs just to send it, then we're going to pay some costs to create a metric on that data, and then we're going to put it up in a dashboard and we're going to alert off of it. Once again, no one's ever going to consume the raw log data that powers the KPIs of this dashboard. We're just sending the data there so that we can create a metric from it.
We think that's ridiculous, right? It's your data, you own it. Why not keep it in an S3 bucket, in case anybody ever needs it. Keep it in local storage, in case anyone needs it, and just create a metric locally and route that metric data into Datadog. Cut your bill down significantly when it comes to observability budgets, make it much easier to scale. Let's say instead of Datadog, this is an Elastic cluster you're running locally that requires your own compute and storage to run. Now you're cutting down, you know, 80-90 percent of the data that you actually have to send to get the same insights. It's going to make it much, much easier to run that particular approach.
I told our friends at Civo here, we'd wrap things up early. So, you know, from my perspective, I'd like to make sure everyone understands, Edge Delta aside, the key tenets of analyzing your Kubernetes log data at scale. One, always control your own data. Make sure that you have complete control over where that information is routed. If you're paying somebody to route the data there, you're probably paying too much. Two, optimize your use of the central vendors. So not saying don't use them, right, just saying, "Hey, let's be careful about what data we send here and make sure that people are actually gaining value from this." And then three, use intelligence to make the troubleshooting process easier. If you want to jump in, and you want to be able to dive into individual pods or daemon sets or deployments to troubleshoot what's happening at the individual node or cluster level, great. Do so, but have somebody pointing you in that direction. Don't just try to dive in without any intelligence, without any alerting, without any machine learning.
And last but not least, we think it's in this, you know, 2023 day and age, it makes sense to analyze data at the source, right? Typically speaking, like almost 80 percent of all Kubernetes clusters are over-provisioned from a CPU and from a memory perspective. You're not going to get that money back unless you're using a FinOps tool to try to do so. So use those resources that are already provisioned, and paid for, and available. Execute, understand that data locally, and then make a decision, route that data to some place that you can analyze it. That's about all I had, so I'll certainly open it up for some questions here. We've got about three minutes left on our track. But yeah, any questions we can answer?
Hello, yeah, is Edge Delta trying to replace Elastic or Splunk, whatever we are using, or is it trying to... I got your point that we have to control our own data, but what is Edge Delta doing here is... I'm trying to understand.
Basically, that's a great question. Let me go back to one visual here. So this capability here is Edge Delta sitting as an intelligent agent layer. So we replace your Fluentd, your Filebeat, your Promtail. We're not replacing Elastic, we're not replacing Datadog or any downstream destination. We want to make the use of those solutions more cost-efficient and more efficient with resources.
Absolutely, great question. Any others that we can answer for you here?
And so we need to do many of the things that you've described, but also to then split that up based on regions; in some of the regions we're not allowed to commingle the data.
So yeah, absolutely, I think one of the most important pieces of using a local collection capability is that it allows you to choose where you want to route that data. For example, some of our customers in the EU need to keep data in the EU. That's a very common example that we run into for GDPR concerns. Great, they probably have S3 in the EU, they probably have Blob in the EU or wherever they're supporting from a region perspective, we route the data there. That's a very critical tenet of any practice at scale.
So, one minute left then. Last thing I'll mention, we do have a booth here, just Edge Delta, has a more interactive capability. If you do want to learn more, feel free to stop by. We're in that main hallway before the keynote area. So if you have more questions, feel free to stop by, we'll be happy to answer them for you. But other than that, enjoy the rest of your Navigate journey, and we appreciate your time today.