Down to the Dollar: Turning Logs into Serverless Estimates
Speaker: David Strauss
Summary
Learn the nuances of serverless cost estimation in "Down to the Dollar: Turning Logs into Serverless Estimates" with David Strauss. Discover actionable strategies for cost reduction and understand how effective log analysis can lead to better financial models. From predicting your serverless expenses to understanding the impact of architecture choices on cost, this video offers a comprehensive guide for any IT professional.
Transcription
Our crew here from Pantheon is in the room, representing, so I'm excited to see some people that I don't always get to see given that we're so remote now. This is really a peek behind the curtain at what we're doing at Pantheon in terms of how we're making engineering and architectural decisions for our own back-ends and infrastructure, in pursuit of minimizing our Cost of Goods Sold (COGS), as we often refer to it.
COGS is the concept that for every dollar of product you sell, you spend a certain amount on the infrastructure and operations behind it. Ultimately, everyone has to answer for that, whether you're doing e-commerce or running a products company like we do. You ultimately have to measure cash in versus cash out. When it comes down to infrastructure, we have a lot of choices today about how we deploy things, and it's often very difficult to compare them in terms of the actual cost of that infrastructure.
What we've done at Pantheon, when we're looking at where we can deploy our infrastructure, is start with a luxury we have as a company that processes a lot of traffic: by some back-of-the-envelope calculations, we handle about one percent of page views on the internet. We have a lot of data on the actual, empirical traffic and behavior of these sites. We have data coming through our edge and flowing into our logs, and all of that lands in BigQuery on Google Cloud. That log captures literally every single request on the platform: whether we're hitting the cache or missing it, and whether the request runs on a serverless back-end or on the container orchestration layer we use for the rest of the platform.
So we can approach this question not just at the theoretical level of how things ought to behave, but empirically: take the actual traffic on the platform, simulate it running on a cloud environment that's more serverless, and look at what that would actually cost.
We have request records arriving like this; how do we actually look at the cost of building infrastructure behind them? At a certain point it's really a question of capacity planning, especially at the lower end of the spectrum, from bare metal and virtual machines up to some of the container orchestration, depending on whether you're using auto-scaling or not.
When you're deploying bare metal and virtual machines up through static container orchestration, as in not auto-scaling, you have to approach this from the perspective of how much headroom you want to give the application, and that defines a fixed cost for the infrastructure. But as you move up from there to dynamic infrastructure, where you have container auto-scaling and things like serverless foundations, you start paying for that infrastructure on demand.
It becomes less a question of capacity planning for what fixed infrastructure you put in place, and more a question of whether it's economical to run that infrastructure in a dynamic, auto-scaling sort of way. Because as you move up this spectrum, the cost per unit time goes up a lot. At least from what we've seen on, say, Google Cloud, comparing virtual machines against a technology like Cloud Run, we see a two to three-fold difference in the actual cost per second. In other words, we pay a lot more per second for the serverless technology than we do for the virtual machines. But those resources are also much more ephemeral in terms of when they're actually around.
What we need to figure out is resource shaping around the actual requests. We can approach this both as a static resource provisioning question, from a capacity planning perspective, and as a cost-modeling exercise for the serverless resources. Ultimately what we're doing here is finding the area under the curve, an integral of sorts. But it's not quite like the integrals you might have seen in calculus, because we're not finding the exact area under the traffic curve; we're looking for the area under the curve in terms of what resources we have to deploy.
Let's say we have the simplest possible case, where we're deploying static resources, whether that's bare metal, virtual machines, or just a fixed number of containers on Kubernetes. What does it look like when your infrastructure is simply fixed? It's very common when you deploy fixed infrastructure to want a lot of headroom, in the sense that if your normal traffic peaks at a concurrency of four, for example, you really want to have infrastructure deployed that can handle a concurrency of, say, eight or twelve, or some headroom on top of that. Because when your infrastructure is fixed in its deployment, you can't size it too close to the actual workload if you're building something like a web application that has to respond in fairly real time to varying levels of traffic.
This is very different from something like working through a queue. But this is roughly what it looks like. I've extended the infrastructure well beyond the expected concurrency we're seeing here, because when you deploy at a fixed capacity you have to do that. But it also means that you're paying for all that capacity on a constant basis. It's really easy to estimate the cost of this: you just multiply the number of minutes or seconds in a month by the cost per unit time, and that's your cost. The risk here is all on the side of your capacity planning: are you deploying too little capacity?
And then, you run out sometimes when your traffic spikes. Or do you deploy too much capacity, and you're overspending on it? This is the most traditional way to look at infrastructure, and it still is a valid approach, especially if the rate of traffic is very predictable and doesn't spike that much. Let's say you have API calls coming to a system. You might be able to have an extremely predictable API request rate for some things. So this is still valid. I just wanted to present it as the most basic case.
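To make that fixed-capacity arithmetic concrete, here's a minimal sketch in Python; the per-second VM price and machine count are illustrative assumptions, not Pantheon's actual numbers.

```python
# A minimal sketch of the fixed-capacity estimate described above: you pay for
# the full capacity every second of the month whether or not it is used.
# The per-second price and machine count are illustrative assumptions.
SECONDS_PER_MONTH = 60 * 60 * 24 * 30
VM_COST_PER_SECOND = 0.000011   # assumed price for one VM
VM_COUNT = 4                    # sized for peak concurrency plus headroom

monthly_cost = SECONDS_PER_MONTH * VM_COST_PER_SECOND * VM_COUNT
print(f"Fixed monthly cost: ${monthly_cost:,.2f}")
```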
Then, you have a more auto-scaling sort of setup. This is an example where, let's say, you want to deploy infrastructure where each container can process up to two requests concurrently, and you want at least one additional container sitting around so that you have extra headroom and reduce the chance of a cold start on the application. As you start getting into auto-scaling and other dynamic provisioning, cold starts become a much bigger question.
A cold start is the time from a request coming in to downloading what the runtime needs, starting the runtime, and actually serving the request. That time can sometimes exceed several seconds, which is way too long for people to wait on, say, a web request. So it's very common to deploy infrastructure like this: when you have no requests in flight, you have one container deployed, and when that container becomes occupied, you deploy a second container, so that you always have a container that is hot and ready to take additional requests.
This is a very simple case of it. Sometimes you actually deploy more than just n plus one containers; you end up having to balance how fast your spikes arrive against your cold starts, because if you expect, say, a tsunami of traffic to come in, adding one additional container at a time is not enough to keep up with it and avoid the cold starts affecting it. But this shows how it maps: the yellow is the capacity that's deployed, and the pink is the active requests. What I'm going to get into is how, as we refine these concepts and map them to serverless platforms, we've actually started to map our own logs onto estimating this.
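As a rough sketch of that n-plus-one scaling rule, something like the following captures the shape of it; the per-container concurrency of two and the single spare container are assumptions for illustration, not a recommendation.

```python
import math

def containers_needed(concurrent_requests: int,
                      requests_per_container: int = 2,
                      spare_containers: int = 1) -> int:
    """Containers to keep deployed: enough to serve the current concurrency,
    plus spare hot containers so the next request avoids a cold start."""
    active = math.ceil(concurrent_requests / requests_per_container)
    return active + spare_containers

print(containers_needed(0))  # one warm container even with no traffic -> 1
print(containers_needed(3))  # two active containers plus one spare -> 3
```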
If you've ever looked at the estimators on, say, Amazon or Google Cloud for this stuff, they ask you to answer the question in a way that's not based on information you really have. They'll often ask, 'How many containers do you want deployed at the same time?' It's like, 'Well, if I knew the answer to that question, I wouldn't be using this calculator.'
So I've had to approach this from a data science perspective. This is a much more tightly hemmed-in case: it's actually how we have our front-end sites application deployed at Pantheon, which is our decoupled platform for things like Node runtimes and static sites. We got the cold starts down so fast that we don't keep any spare capacity deployed for processing those requests. So we can model that as the containers being active only when a request is actually coming in, and scaling those containers as necessary.
Serverless services don't bill quite like that, though, because you can't actually turn containers on and off at the single-millisecond level and scale them that precisely. So this is actually much closer to the way the billing model works for something like Google Cloud Run, which is analogous as a product to, say, Fargate on AWS.
So if you look at the yellow background here, you see when the containers are deployed. In this case a container can start on any millisecond, but it has to run for a multiple of 100 milliseconds before it can actually get shut down. So whenever at least one request bleeds over into another multiple of 100 milliseconds, the container lasts that long. You'll see there's a trailing bit of yellow here where the container is still around and you're paying for it, but it's not actually processing an active request.
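A tiny sketch of that rounding behavior, assuming the 100-millisecond billing increment described here; the helper is hypothetical and just illustrates how busy time rounds up to billable time.

```python
import math

def billable_ms(busy_ms: float, granularity_ms: int = 100) -> int:
    """Round a container's busy time up to the billing increment: any request
    that bleeds into the next 100 ms window keeps the container billable for
    that whole window."""
    return math.ceil(busy_ms / granularity_ms) * granularity_ms

print(billable_ms(230))  # bills as 300 ms
print(billable_ms(40))   # bills as 100 ms
```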
This is much closer to the actual billing model that gets used for these platforms. But it's also extremely hard to simulate with standard data science tools, because the containers start at arbitrary milliseconds, which means there's no alignment in the data set whatsoever. Each one of these containers could start in totally staggered ways, so you basically end up running a full simulation of the traffic on an auto-scaling cluster rather than something you can process with standard data science tools over large-scale logs.
And we're not just running about one percent of page views on the internet, we're running many individual websites that run on things like Drupal, WordPress and Node that all make up different clusters of containers that are serving requests. So imagine you have this problem but multiply it by about a hundred thousand in terms of the scale of trying to align these things.
I've worked with some open source tooling to create a simulation that relies on aligned deployment of the containers, which is slightly inaccurate. It can overcount the actual incidence of running the containers: if you have a request that runs, say, 99 milliseconds but straddles two of these boundaries, it gets simulated as a container that was around for 200 milliseconds, which overcounts. But at least it gives us confidence in the numbers.
This is where the real data science tools come into it. This is where we start working out how to process many gigabytes of logs in a way that gives us an accurate idea of what our cost structure would be for running this on our container clusters, versus on an equivalent in Kubernetes, versus on something like Cloud Run.
So, this is ultimately the analysis path I ended up taking. It's a very high-level view, and all of these steps reference facilities in Pandas. Our log format gives us the timestamp for the beginning of the request and the duration of the request, which is actually not great for finding overlaps, because a duration isn't the right shape of data for finding overlaps.
I had to do a few things. For one, we had to improve our logs. We were logging only at the second level, and since these serverless container runtimes bill at sub-second intervals, second-level logging wasn't accurate enough. So we upgraded our logging to record the arrival time of each request down to the millisecond or better. We were already tracking request durations at the millisecond level, because the vast majority of our responses are under a second.
So, the first step was to convert all of those into start and end datetimes. You can do that in basically a single line of Pandas: given the arrival time and the duration, you add the duration to the start time of the request and you get boundary timestamps for the beginning and the end.
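As a sketch of that first step, assuming a log layout with an arrival timestamp and a millisecond duration (the column names and sample values are invented for the example):

```python
import pandas as pd

# Hypothetical log sample: one row per request, with an arrival timestamp and
# a duration in milliseconds (column names invented for the example).
logs = pd.DataFrame({
    "site": ["site-a", "site-a", "site-b"],
    "arrival": pd.to_datetime(["2023-01-01 00:00:00.020",
                               "2023-01-01 00:00:00.150",
                               "2023-01-01 00:00:00.090"]),
    "duration_ms": [95, 40, 210],
})

# The one-liner: the end boundary is just the start plus the duration.
logs["start"] = logs["arrival"]
logs["end"] = logs["start"] + pd.to_timedelta(logs["duration_ms"], unit="ms")
print(logs[["site", "start", "end"]])
```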
Then I converted that, and this is really where the magic starts happening in terms of collating the data for the purpose of creating an estimate. There's a date range tool in Pandas that lets you set a frequency, which effectively rounds timestamps down to that interval. So I rounded all of them to their containing intervals.
Then, to start thinking in terms of concurrency for containers, there's a function in Pandas called explode. Instead of having one request as a range from, say, zero milliseconds to 200, explode turns it into individual chunks: one piece from zero to 99 milliseconds and another from 100 to 199. That way everything gets chunked, and you end up with stacks of concurrency for every 100-millisecond window.
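Here's a sketch of that chunking step under the same assumptions, flooring each request's boundaries to a 100-millisecond grid with date_range and then using explode to get one row per request per slice:

```python
import pandas as pd

# Same hypothetical sample, now with start/end boundaries per request.
logs = pd.DataFrame({
    "site": ["site-a", "site-a", "site-b"],
    "start": pd.to_datetime(["2023-01-01 00:00:00.020",
                             "2023-01-01 00:00:00.150",
                             "2023-01-01 00:00:00.090"]),
    "end": pd.to_datetime(["2023-01-01 00:00:00.115",
                           "2023-01-01 00:00:00.190",
                           "2023-01-01 00:00:00.300"]),
})

# Floor both boundaries to the 100 ms grid and enumerate every 100 ms slice
# the request touches, then explode() to get one row per (request, slice).
logs["slice"] = logs.apply(
    lambda r: list(pd.date_range(r["start"].floor("100ms"),
                                 r["end"].floor("100ms"),
                                 freq="100ms")),
    axis=1,
)
per_slice = logs.explode("slice")
print(per_slice[["site", "slice"]])
```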
Then we could group by the individual site, representing the service pool serving that traffic. That turned into something where we could actually see how many requests were concurrent for a given website in each 100-millisecond slice of the clock, which can then be divided by the concurrency of the containers; say each container can serve two requests, or 20 requests.
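And a sketch of the final aggregation; the per-container concurrency and the per-slice price below are placeholders, not Cloud Run's actual rates:

```python
import math
import pandas as pd

# Continue from a hypothetical per_slice frame of (site, slice) rows, like the
# one produced by the explode step above.
per_slice = pd.DataFrame({
    "site": ["site-a"] * 3 + ["site-b"] * 4,
    "slice": pd.to_datetime([
        "2023-01-01 00:00:00.000", "2023-01-01 00:00:00.100",
        "2023-01-01 00:00:00.100",
        "2023-01-01 00:00:00.000", "2023-01-01 00:00:00.100",
        "2023-01-01 00:00:00.200", "2023-01-01 00:00:00.300",
    ]),
})

REQUESTS_PER_CONTAINER = 2          # assumed per-container concurrency
PRICE_PER_CONTAINER_SLICE = 2.4e-6  # assumed price per container per 100 ms

# Concurrent requests per site per 100 ms slice, then containers per slice.
concurrency = per_slice.groupby(["site", "slice"]).size()
containers = (concurrency / REQUESTS_PER_CONTAINER).apply(math.ceil)

estimated_cost = containers.sum() * PRICE_PER_CONTAINER_SLICE
print(containers)
print(f"Estimated cost for this window: ${estimated_cost:.7f}")
```

Summing the per-slice container counts and multiplying by a unit price is what turns the log-derived concurrency into an estimated bill.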
You can plug that in, and what you end up with is request concurrency for every 100-millisecond chunk, which translates directly into billing cost. And the result was remarkably close. I can't share the actual numbers, but the simulation came out at four percent higher cost than our existing static deployment. Well, we don't actually orchestrate the containers using Kubernetes; we use Kubernetes to orchestrate the virtual machines, plus our own internal orchestration system that predates Kubernetes for actually deploying the containers. And we have our own kind of not-quite-scale-to-zero internal tech around this that is probably a couple of generations behind the sort of stuff you're seeing with tech like Cloud Run. So it's not an entirely different world, in the sense that we're not comparing an entirely static deployment against an entirely dynamic one.
But what was really interesting here is that on a raw per-second basis, Cloud Run is about two and a half times as expensive as the virtual machines we're using in the top bar, in terms of the actual infrastructure. But because we only pay for the serverless infrastructure when it's actually getting used, those slices might be more expensive, but we only buy them on demand. It's the difference between buying whole pizzas and buying by the slice: buying by the slice is more expensive per slice, but if you only need one or two slices at a time, it might be cheaper than buying a whole pizza every time you want a slice.
So, with that, I'm happy to answer other questions about our approach on this, or how you might want to look at your own estimations of cost structures for serverless versus self-administered auto-scaling and static deployment.
Question from the audience: which one usually works out cheaper if the load is consistent? Well, it really depends on your workload. Part of why I started investigating this is that there's a bit of a trend in some of the media covering companies like Dropbox, where they repatriated some of their systems off the cloud to more on-prem and co-located setups. Dropbox moved their storage out of S3 onto their own systems, and they saved a lot of money by doing that. So there's been a lot of talk around the idea of, 'What if pulling things out of the cloud could save us money?'
But I think it's also really interesting to look at it from this perspective of 'What if going deeper into the cloud actually can save you money?' Because you're paying for those resources only when you need them. And really, what I'm saying here is it's not possible to know just as a guess in many cases, which one's going to be less expensive. The more that your application has sharp spikes in need and the more that it has a low to no traffic behavior for a lot of the time, the more you're going to be advantaged by taking on a serverless deployment.
What this is also not showing you is any of the internal operational costs. You need far more employees monitoring and operating the system to run something like Kubernetes for this application than something like Cloud Run. So this is just comparing the actual check you write to, say, Amazon or Google in one case versus the other, not what the payroll needs to be to operate it. Or, in the case of a small organization struggling to hire enough people, where people spend their time.
I think part of what appeals to me about the latter case is that engineers can often get more done, more quickly, by working with higher-level tools. So a somewhat higher cost may still be worth it if it gets you to market faster and lets you maintain the application more easily. This is about knowing what each path costs, not necessarily choosing the cheapest one every time. We're still probably moving in the direction of Cloud Run for our future architecture here, because a four percent increase in cost is not that much for the gains we get from that sort of isolation and transition.
Okay, I'll take the follow-up. The question is: one of the cloud providers' goals is to make their solution so sticky that customers can't get out of it. For example, once you're on X cloud provider and you start using their serverless offering, it's so sticky that you cannot get out of it. So do you think their strategy is to increase prices later? Because there are a lot of network and I/O transfer costs that show up which people don't anticipate from the estimator tools.
That's a really good question. So, dealing with, say, the lock-in... Now, not every provider is exactly the same for this, but I'll share how I think about it and what I know to be true about, say, Cloud Run specifically. So, Cloud Run specifically mitigates some of this issue by having validated compatibility with Knative. So any container that is possible to deploy successfully to Cloud Run is a container image that can also be successfully deployed to Knative. So, you have the option to be able to say, run something like Kubernetes, set up a Knative environment, and then transition to something like Ingress and Knative for actually having the traffic set up in a similar way to Cloud Run.
Another way that, at least Google, provides portability is through their Anthos product which allows you to deploy a Cloud Run-compatible infrastructure on top of any other on-prem, co-lo, or cloud environment. And then you still interact with it as Cloud Run, but it's not actually running on Google Cloud. Now, getting to other infrastructure like Fargate, I don't know if it has as strong of guarantees in terms of being able to migrate the workload.
But what I'd say is that you should compare the cost of a potential migration against the cost of avoiding the need for migration. In many cases, I think you're going to spend more time and money getting your application deployed and operated on Kubernetes, in the hope that you'll someday need that portability, than you would spend migrating an app from, say, Cloud Run to Fargate if you wanted to switch clouds.
This is also the case for things like databases. It's probably faster and more effective to use the managed, higher-level cloud databases a provider offers than to set up your own database farm in the hope that it saves you from lock-in and from getting exploited on price.
In practice, I don't see these cloud providers exploiting price that much once you get infrastructure running there. Things like Lambda can be very expensive if you're running at large scale and your deployment isn't that efficient, but I've never heard of a company getting forced by Amazon into much worse terms over time because they leaned heavily on it.
We're also seeing a lot less of the cloud provider wars around virtual machine pricing. In the sense that, the advantage for going with VMs used to be at least the idea that you were getting a product that was almost at or below cost with these cloud providers and they kept warring over reducing the price. But that kind of ended a few years ago.
So, I think that there's more and more compelling reason to look at serverless infrastructure as possibly more cost effective and not just more productive.
And I think that's time. Thank you.