Do You Really Know What Your Database Is Doing Right Now?
Speaker: Barrett Chambers
Summary
In this talk, Barrett Chambers from Percona delves into the crucial topic of database observability. He highlights the essential aspects of database performance, including query execution, relationships, and durability. With a focus on MySQL and Postgres databases, he introduces the powerful open-source tool Percona Monitoring and Management (PMM), which helps identify slow queries and track resource usage. Discover how to optimize your database's efficiency and gain valuable insights to improve overall performance.
Transcription
So, I am not DGB; he was supposed to be here, but he had a family emergency. My name is Barrett Chambers, and I'll be talking about 'Do you really know what your database is doing right now?' At Percona, this is extremely important to us. We have global customers running MySQL and Postgres databases, and we've built a product around observability that is entirely open source. So we'll get into what you need to be looking at in terms of database performance, as well as how this tool can help you see things that, without a detailed observability tool, you might be missing.
The first thing to understand about what's going on in your database is its core functions. This is pretty simplistic, but we did break it down a little. First, obviously, you want your database to execute queries, as many as possible, as fast as possible, while maintaining data integrity. Some systems will sacrifice the latter for the former, and even MySQL and Postgres will let you do that with certain settings. So you need to keep in mind what the purpose of the database is. You can always squeeze out extra performance by tweaking that data integrity and data quality knob, but you run the risk of split brain or dirty data, so that's something to keep in mind.
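To make that knob concrete, here's a minimal sketch using two real durability settings, MySQL's innodb_flush_log_at_trx_commit and Postgres's synchronous_commit. The host, credentials, and the mysql-connector-python dependency are illustrative assumptions, not anything from the talk.

```python
# Sketch: trading durability for throughput with well-known server settings.
# Host/credentials are placeholders; assumes mysql-connector-python is installed.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="admin", password="secret")
cur = conn.cursor()

# InnoDB: 1 = flush the redo log to disk on every commit (fully durable);
# 2 = flush roughly once per second: faster, but a crash can lose ~1s of commits.
cur.execute("SET GLOBAL innodb_flush_log_at_trx_commit = 2")

# The rough Postgres equivalent (run via psql or a Postgres driver) would be:
#   ALTER SYSTEM SET synchronous_commit = off;
# which acknowledges commits before the WAL is flushed to disk.
```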
The next thing when it comes to analyzing your database is relationships, and that means several things. The first is users' relationship with the database, which is the security perspective of managing your database: accounts, privileges, and access. Then you've got its relationship with other databases, so replication natively within the database, or foreign data wrappers in Postgres, where you can interact with heterogeneous database environments. And then you've got sharding; MongoDB has built-in sharding, so that's the relationship between different types of data within the same database cluster. And then you've got your teams, who are interested in what's going on with the database, so you've got your logs and your status info. All of those are relationships that matter to your database.
And the third thing a database is supposed to do is survive; that's the durability aspect of a database. In order to do that, it needs to interact with its underlying server: disk usage, CPU, and memory allocation. So all of these matter both to how a database functions and to what you need to be observing.
So now we're going to talk more about the query aspect side of it. So what factors determine whether a query is considered fast or considered slow? So, the first factor is the construction of the query itself. The second factor is the availability of the shared resources. And from experience, these two things are where you butt heads when you're diagnosing performance problems within your database. There's a lot of finger-pointing that happens when there's a slowdown.
When we're talking about these two things, on the construction side we're talking about the schemas the query is using, the indexing of the tables it's accessing, whether you have the correct data types for the queries you're trying to run, and whether you've partitioned the data. On the shared resource side, and I come from a DevOps background, so we were more focused on this, it's things like free connections, free threads, and whether the data is synced to avoid stale reads. You can measure both of these concepts with time: query time is how long the query actually runs, and latency is the time spent waiting for access to resources.
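As a quick illustration of the "construction" side, here's a sketch of how an index shows up directly in query time. The orders table, column names, and connection details are made up for this example.

```python
# Sketch: measuring the effect of an index on a single query's run time.
# Table, column, and credentials are hypothetical.
import time
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="admin", password="secret", database="app"
)
cur = conn.cursor()

def timed(sql: str) -> float:
    """Run a query, drain the results, and return elapsed seconds."""
    start = time.perf_counter()
    cur.execute(sql)
    cur.fetchall()
    return time.perf_counter() - start

before = timed("SELECT * FROM orders WHERE customer_id = 42")  # full table scan
cur.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = timed("SELECT * FROM orders WHERE customer_id = 42")   # index lookup
print(f"before index: {before:.4f}s, after index: {after:.4f}s")
```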
But there's a lot of disagreement, at least among our customers, over how you should define a slow query. Does somebody just say it's slow, and we accept that? Our support team deals with this all the time: we get a ticket that says, 'Hey, this is slow.' Well, slow compared to what? Compared to when? And compared to what expected output?
So you really want to understand whether you're taking a subjective or an objective approach. A more objective approach is comparing a query's execution time against the overall execution time of the workload. There are several strategies we see people take when deciding whether a query is an outlier in terms of slowness: should we use the maximum amount of time it takes, the minimum, or the average across all queries with the same fingerprint? Maybe you're running it with different IDs or something, but in general you should expect some level of performance out of that fingerprint, and that's your baseline for determining whether something's slow or not.
The answer is basically: don't use any of these. They're not very good metrics for determining whether a query is slow, because they can give you misleading results. If your slowest execution takes, say, 300 milliseconds, your fastest could be just below that, so the maximum doesn't tell you what's normal, and that's not really a good metric to use. The fastest one isn't good either: the slowest execution could be taking any amount of time, and you won't be aware of it. And if you use the average, you've got both of those problems playing against each other; we all know how averages work: it can be pulled toward the high end or the low end.
So what's the answer? What's the right way to determine whether you have slow queries? At Percona, we like to use percentiles, and specifically the P99 is a good standard. What that means is that 99% of your queries complete in less than that amount of time, so you know that anything beyond it can be considered an outlier. I'll show you within Query Analytics in PMM how you can look at that for different query fingerprints, so we're going to give you the tools you need to use that metric.
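To see why the P99 behaves better than min, max, or average, here's a small self-contained sketch with made-up latencies for one query fingerprint; the numbers are invented purely for illustration.

```python
# Sketch: min/max/avg vs. P99 on a synthetic latency distribution (ms).
import random
import statistics

random.seed(1)
# 995 "normal" executions between 10 and 20 ms, plus 5 outliers at 300 ms.
latencies = [random.uniform(10, 20) for _ in range(995)] + [300.0] * 5

p99 = statistics.quantiles(latencies, n=100)[98]  # the 99th percentile
print(f"min={min(latencies):.1f}  max={max(latencies):.1f}  "
      f"avg={statistics.fmean(latencies):.1f}  p99={p99:.1f}")

# The max (300 ms) reflects only the worst outlier; the average (~16 ms)
# barely moves and hides that outliers exist; the p99 (~20 ms) is a usable
# baseline: the handful of 300 ms runs sit far beyond it and stand out.
```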
So then you have to determine whether the cause of that slowness, once you've determined you have an outlier, is it due to a poorly written query, or is it due to something else? So, it could be both of these things, or it could be one or the other.
High intrinsic latency applies to the query itself: either it's poorly written, or it's using poorly designed tables or a poor schema design, and no matter how much hardware you throw at it, it'll still be slow. Maybe you spin up a larger instance type within AWS and it appears faster because your problem's gone; however, under load, that query is going to pop up again and strain the system. We see this a lot in cloud environments: people keep vertically scaling to solve intrinsic query problems, and eventually those come back to haunt you, because you can only go up so high, not to mention the bill you get from the cloud provider for those larger instance types. On the other side, not intrinsic to the query itself, there's queue latency. That means the query is fine, but something else is going on within that database environment. This is the 'it runs fine in my environment' type of thing: the query is good, but maybe there are resource issues, or it's queued up waiting on something.
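One rough, hand-rolled way to tell the two apart from a MySQL slow-log entry is to compare its Query_time and Lock_time fields (both real slow-log fields). The 50% cutoff below is an arbitrary illustration, not a Percona recommendation.

```python
# Sketch: separating intrinsic latency from queue latency using the
# Query_time / Lock_time fields of a MySQL slow-log entry.
def classify(query_time_s: float, lock_time_s: float) -> str:
    """Crude heuristic: if most of the elapsed time was spent waiting on
    locks, the query itself probably isn't the problem."""
    wait_fraction = lock_time_s / query_time_s if query_time_s else 0.0
    if wait_fraction > 0.5:  # arbitrary cutoff for illustration
        return "queue latency: time mostly spent waiting on shared resources"
    return "intrinsic latency: time mostly spent executing the query"

print(classify(query_time_s=2.0, lock_time_s=1.8))   # -> queue latency
print(classify(query_time_s=2.0, lock_time_s=0.01))  # -> intrinsic latency
```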
Now we're going to move on to PMM and focus on how to look at all of these things. If you want to follow along, you can; I don't know if your laptops are open, but this will be live forever if you want to visit later. It's pmmdemo.percona.com. You don't have to create an account or anything; you can just click around the various dashboards.
The reason we recommend people use this is that we've built a lot into PMM, and we built it based on what we saw with our customers. Our customers were using things like Grafana and Prometheus to assemble time series storage and a visual display layer out of open-source software. We wanted to make that easier, as well as incorporate our own custom software that we've been developing over time, like tools that help our support and consulting teams do their jobs better.
If you look on the left over here, this is the server component, which runs as a centralized monitoring host. There's a more simplified view of the server over there. It ships in a Docker container, and you don't have to worry about connecting any of these components yourself. We're using three different types of databases. At the bottom, the Postgres one is just metadata storage for the API calls and for which systems are running. The Victoria Metrics server down there is for our time series data; it's very similar to Prometheus, and you can still use Prometheus exporters, but we've found it performs better at scale. So that's our database backend for time series.
And then there's the backend for our Query Analytics section, which is the custom Percona part of this. You could build the rest on your own if you wanted to, but this part is built with our toolkit and other things we've developed over time. It's all open source, so you could technically build it yourself; you can go on our GitHub and get the Query Analytics code. It uses a ClickHouse database, which is better for full-text searching, and that's helpful in the Query Analytics section of the app.
The top part is the client side. There's an expanded view of the client over there, and that's everything that lives on the database server itself. There's both pull and push; there are diagrams in our documentation, but essentially the exporters push metrics back to be stored in Victoria Metrics. And then we've got the server UI, so if somebody else is managing all of this for you, all you have to worry about is accessing the UI, which is Grafana-based.
All right, I wanted to explain that so you can understand why I'm talking about PMM and why I think it's valuable: you're combining the best of open-source software with our 16 years of expertise in the open-source space.
Okay, so: visibility of intrinsic latency, and intrinsic latency is the query itself. If you do have the PMM demo open, you can click on the Query Analytics section at the top right, and you'll see something like this. This screen displays the queries running in your system, ordered primarily by total execution time. You can change that, say to sort by count, but we find this is the best way to look at things as you're starting out, because even a fairly fast query, if it's running millions of times, will bubble up to the top. And if you can get a little more performance out of it and optimize it, that's a drastic reduction in resource usage on that database server.
You can also filter over here by cluster, replication topology, environment, that sort of thing; if you only want to look at dev, say you're testing in dev, you can do that. You can also drill down into individual queries. We were talking about looking at query counts and times; without touching the command line, you can come in here, drill down, and get your total query count, lock time, the fingerprint, and the explain plan. That really helps with intrinsic diagnostics: is the query written well? So here I'm showing our P99 value over here. If there's a fingerprint you're looking for, you can search for it up there; then if you go to the right and hover, you'll see your min, your max, your average, and your P99 value. This lets you know that if you see this type of query running longer than, say, the 123 milliseconds shown over here, that would be considered an outlier: something that shouldn't be happening on this database server and would be worth investigating. And if somebody tells you it's running slow and it's running at 100 milliseconds, you can confidently tell them, 'That's not slow, that's how it's always run,' at least over the period of time selected at the top. It gives you a little more evidence and weight when you're doing diagnostics and trying to figure out what a problem is.
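For anyone unfamiliar with the term, a "fingerprint" is the query with its literal values normalized away, so executions with different parameters group as one query. Here's a toy version of the idea; Percona's own pt-fingerprint handles far more cases than this regex sketch.

```python
# Sketch: a toy query fingerprinter - literals become placeholders so that
# executions with different parameters share one fingerprint.
import re

def fingerprint(sql: str) -> str:
    sql = sql.strip().lower()
    sql = re.sub(r"'[^']*'", "?", sql)  # quoted string literals -> ?
    sql = re.sub(r"\b\d+\b", "?", sql)  # numeric literals -> ?
    sql = re.sub(r"\s+", " ", sql)      # collapse whitespace
    return sql

assert fingerprint("SELECT * FROM orders WHERE id = 42") == \
       fingerprint("SELECT  *  FROM orders WHERE id = 7")
print(fingerprint("SELECT * FROM orders WHERE id = 42"))
# -> select * from orders where id = ?
```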
This is just showing changing the time range up here. What I wanted to point out is that a lot of times, at least in the past when I was working in DevOps, the question was, 'Do you know what happened between this time and that time?' You can easily filter down to that window, and Query Analytics will filter as well, so you can see which queries were putting the most stress on the workload; they'll bubble up to the top. It's very easy to click through and see whether any were running errantly or slowly.
All right, now we're going to talk about queue latency, which is the resource side: intrinsic is how the query itself is written, and queue latency is the resources underneath it. If you have PMM open, you can click back to the home dashboard; you can either use the search at the top and type 'home', or click the Percona icon in the upper left-hand corner, and that'll take you back to the home screen. What we've built in here is a CPU anomalies graph. This tells you if there's been any unusual activity on any of the servers PMM is monitoring within the last seven days; whether it's low CPU utilization or high, if it's abnormal over the last seven days, it'll appear in this graph. We also have that for disk queue anomalies, as well as high disk and high memory use. So it's very easy to get a top-level view of whether those anomalies are happening within the queue latency component of a slow query.
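The talk doesn't describe PMM's detection internals, but the general idea behind an anomaly panel like this can be sketched with a robust z-score over a week of samples. The data, threshold, and method below are illustrative assumptions, not PMM's actual algorithm.

```python
# Sketch: flagging CPU samples that sit far outside a week's distribution,
# using the "modified z-score" (median/MAD), which resists outlier skew.
import statistics

def anomalies(samples, threshold=3.5):
    """Return (index, value) pairs whose modified z-score exceeds threshold."""
    med = statistics.median(samples)
    mad = statistics.median(abs(v - med) for v in samples) or 1e-9
    return [(i, v) for i, v in enumerate(samples)
            if 0.6745 * abs(v - med) / mad > threshold]

week_of_cpu = [35, 38, 36, 34, 37, 36, 35, 97, 36, 35, 4, 36]  # % util, made up
print(anomalies(week_of_cpu))  # flags both the 97% spike and the 4% dip
```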
One thing I may not have pointed out: in the top left, if you're over in the Query Analytics section and you jump back to this home section, it holds whatever time range you set, so we're still looking at the same time range we were trying to troubleshoot.
You can also, if you're following along, click up where it says 'MySQL', type 'MySQL Instance Summary', and you'll see these graphs. This shows the thread count and the other things we were talking about when it comes to analyzing potential contention within the database server itself. If you see a spike here, that might indicate you were hitting your maximum number of connections, things like that. You can also set up alerting on any of these if you want to be alerted on various thresholds; I'll show that a little later as well. So yeah, go ahead.
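(As an aside before the questions: the raw check behind a connection-spike alert is simple to reproduce by hand. Here's a sketch comparing MySQL's real Threads_connected status counter against max_connections; the connector, credentials, and 80% threshold are illustrative choices, not PMM defaults.)

```python
# Sketch: the raw check behind a "connection spike" alert - how close are we
# to max_connections? Credentials are placeholders.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="monitor", password="secret")
cur = conn.cursor()

cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
threads = int(cur.fetchone()[1])
cur.execute("SHOW GLOBAL VARIABLES LIKE 'max_connections'")
max_conns = int(cur.fetchone()[1])

usage = threads / max_conns
if usage > 0.8:  # alert threshold is a policy choice
    print(f"warning: {threads}/{max_conns} connections in use ({usage:.0%})")
```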
Going back to the slide where you're showing the highest queries: do you give any suggestions or indications of which part of the query was the most expensive, and how to improve it? Like, advice on how to tune the query?
Not built into it yet. We do have advisors that we're working on, and I'll show that in a little bit. Right now, those are more related to security advice or configuration advice. They're not down to specific queries. But that's a good suggestion.
Okay, and a couple more questions. On the next slide, where you mention high CPU usage, do you let the user know which particular server was showing the high CPU usage?
Yeah, so if something did appear here, you could drill down and it would say, 'Average CPU is this, we saw this in this server.'
All right, and then, this is like a free product from Percona, right?
Yeah, this is free. I'm not trying to sell anything. I do work on the sales side, but this is all free. You can use it and never pay us anything.
And does it work with all databases, or just certain ones?
So, out of the box, if we go back to the architecture diagram quickly, we only support MySQL, MongoDB, and PostgreSQL. However, you can extend this: there are Prometheus exporters that will work, so you can expand this persistence layer if you want. I think there are Redis exporters and MS SQL exporters, and you can do other things as well.
Okay, all right. Cool, thank you.
All right, I'm just showing here that we also have those client thread activity dashboards for PostgreSQL as well as MongoDB, and these are the three most popular open-source database technologies on the market today, although we can debate the openness of MongoDB's license.
So, alerting. I talked about that a little earlier; we have it integrated. On pretty much any dashboard we looked at in these previous slides, you can create an alert based on thresholds, and that can all be managed within PMM. Again, this is not a paid feature; it's part of the normal product. Our managed services team actually uses this to support our customers, and they get paged through Slack and PagerDuty, so there are a lot of integrations you can work into the alerting within PMM.
Here's an example of what an alert template might look like. It's a little complex initially, but it's easy to read if you're used to automation through Ansible or anything like that. The reason I'm showing this is that once you have it fleshed out, it's very easy to automate: you don't have to interact with PMM's GUI at all. It's API driven, so you can have the template stored somewhere and send it up. That makes it very easy to manage on an ongoing basis as you're adding new servers or new alerts.
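As a sketch of that workflow, not PMM's documented API: the route, payload shape, and file path below are placeholders meant to show the "templates in version control, pushed over HTTP" pattern; check the PMM documentation for the real endpoints and template schema.

```python
# Sketch: keep alert templates in version control and push them to the PMM
# server over HTTP instead of clicking through the GUI.
# NOTE: the endpoint path and payload are hypothetical placeholders.
import pathlib
import requests

PMM_URL = "https://pmm.example.com"                          # hypothetical server
template = pathlib.Path("alerts/high_cpu.yml").read_text()   # hypothetical file

resp = requests.post(
    f"{PMM_URL}/v1/alerting/templates",  # placeholder route - see PMM docs
    json={"yaml": template},
    auth=("admin", "secret"),            # placeholder credentials
)
resp.raise_for_status()
print("template uploaded")
```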
The advisors. This is where I think it would be a good idea to have some query advisors within PMM, but right now they're more server-level, config-level, or security-level. On the home screen, you'll see this icon right here, and it lets you know about critical-, medium-, or low-severity findings. All you have to do is connect your servers and you'll get these advisors. This is the only part where you'd need a subscription to get all the advisors; you get some out of the box for free, and you also have the ability to make your own. Once you see how they work, you can extend this on your own, and that's totally fine. You'll see we're calling out things like a bad config in terms of the number of cores configured on your MongoDB nodes, which will likely lead to bad performance if you don't correct it, or an unsafe network binding config on a MongoDB instance. It's really helpful if you have other people adding servers to your environment who aren't following best practices.
We're also adding Database as a Service. I know everybody's doing this at this point; however, this is free within ours. The reason we partnered with Civo is that this happens in a matter of a minute or two, and it would take forever anywhere else. Essentially, you just plug in your Kubernetes cluster over here, and then it'll give you access to our operators. These operators manage high availability, backups, and lots of automation for the database itself, just from the operator. You can try MySQL or MongoDB; we're working on PostgreSQL, which should land within the next quarter or so.
And this is just what it's like to provision a cluster: you pick your version, and it shows you in real time, based on whatever Kubernetes cluster you have provisioned, what the resource consumption is going to be.
And if you're going to max it out, it'll let you know before you launch it.
All right, that was it. I'm happy to take any questions if you guys have them.
On PMM, which is very interesting, I will definitely try it. I was looking at the clients on the architecture slide, and I was wondering: since the backend is essentially just storage, a time series database, and Grafana, if I used only the client and connected it to a separate solution, would I get similar information to what you were showing, especially the query listings, which are what interest me? Or do you do some magic tricks in the backend as well?
So the client would be interacting with the Query Analytics API and ClickHouse. You could figure out how it's all connected, but you would probably have to tweak the client code to send data somewhere else. It's all open source, though, so you can figure it out.
And also I was wondering, how do you get the queries that you were displaying? Is it through the logs, or is it...?
There are a couple of options. I think we recommend Performance Schema, but there are some other options as well. You do have to enable advanced logging, which technically impacts the performance of your database servers. A lot of people say, 'I don't want to lose five percent of my database performance,' but that's the trade-off: do you want to know what's happening?
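(For reference, on the Performance Schema route in MySQL, the per-fingerprint statistics live in a real table, performance_schema.events_statements_summary_by_digest; here's a sketch of reading it directly. The connector and credentials are illustrative, and the timer columns are in picoseconds, hence the division.)

```python
# Sketch: the kind of per-digest data a query analytics tool can collect when
# Performance Schema is the source. Credentials are placeholders.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="monitor", password="secret")
cur = conn.cursor()
cur.execute("""
    SELECT DIGEST_TEXT,
           COUNT_STAR,
           AVG_TIMER_WAIT / 1e12 AS avg_s,   -- timers are in picoseconds
           MAX_TIMER_WAIT / 1e12 AS max_s
    FROM performance_schema.events_statements_summary_by_digest
    ORDER BY SUM_TIMER_WAIT DESC
    LIMIT 10
""")
for digest, count, avg_s, max_s in cur.fetchall():
    print(f"{count:>8}x  avg {avg_s:.4f}s  max {max_s:.4f}s  {(digest or '')[:60]}")
```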
Okay, thank you.