Multi-cluster Failover With A Service Mesh
Speaker: Jason Morgan
Summary
Jason, a technical evangelist with Buoyant, introduces Linkerd, a lightweight, fast, and security-focused service mesh for Kubernetes. He explains the concept of a service mesh, emphasizing its role in enhancing security, observability, and reliability. Jason demonstrates a live failover, where he shifts traffic from one cluster to another without any disruption, showcasing the resilience and efficiency of Linkerd in a Kubernetes environment.
Transcription
So, we got 25 minutes. I'm going to make it as short and sweet as I possibly can. It's a little bit interactive. If you get a chance, go over to emoji.v.o.59 and just check out my awesome voting application that's hosted in two clusters in Civo's New York region, I'm pretty sure. And we're going to do a live demo today where I'm going to take the web front end, we're going to fail it in one cluster, it's going to automatically come up in another, and our app will keep working. So, that's the exciting news for today. Sound good? Anyone need that URL again?
Alright, so just to start, my name is Jason. I am a technical evangelist with Buoyant. So, it's my job to talk to folks about the Linkerd project and why you should use it. So, for those that are here, who all knows what Kubernetes is, just to start? Who is familiar with it? Great, awesome. And service mesh? Heard that terminology? There we go. Alright, we're good. Now, what about the Linkerd project? You familiar with that? Yes? Alright, I've got a solid three folks. Perfect. That's me. You can't actually find me on Twitter anymore because I got rid of social media, and I'm happier for it.
So, let's talk about Linkerd. What is it? It is an extremely lightweight, fast, and security-focused service mesh for Kubernetes. It's been around a long time, and it's in use by all sorts of folks. If you're looking for a pretty interesting talk about Linkerd at scale, the folks at Xbox Cloud did one at KubeCon EU last year where they talked about their setup and how they use Linkerd. It was a good one. And we're a CNCF project, and we're the only service mesh with graduated status within the CNCF.
Just for those that aren't aware, what is a service mesh? There are a couple of different definitions. My definition is it's essentially a bunch of little load balancers that are going to sit effectively beside your application in your Kubernetes clusters. They are part of your pod, and those load balancers are going to intercept all the traffic in and out of your application. They're going to do some stuff with it, right? Specifically, they're going to provide you some additional security, observability, and reliability features. In this case today, all we're really going to talk about is the reliability side. So today, we're going to use our service mesh to watch the web front end for that emoji vote application, and when the web front end is unavailable in our primary cluster, it's going to fail us over to the secondary cluster.
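A rough sketch of how those per-pod proxies typically get attached: Linkerd's standard injection annotation on a namespace, followed by a restart so the proxy sidecar is added to each pod. The emojivoto namespace name here is an assumption for illustration, not taken from the demo.

```bash
# Mark the namespace so Linkerd's injector adds a proxy sidecar to each new pod.
# ('linkerd.io/inject: enabled' is the standard switch; namespace name assumed)
kubectl annotate namespace emojivoto linkerd.io/inject=enabled

# Restart the workloads so existing pods are recreated with the proxy attached.
kubectl rollout restart deploy -n emojivoto
```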
Alright, a little bit about what makes Linkerd special. Well, we're — probably not the only one, but the only one that I know of — the only service mesh that doesn't use the Envoy proxy. We instead have our own very lightweight proxy called the Linkerd2 proxy. It's written in Rust, which gives us some security and performance advantages that we think are really important. If you want to know more, there's a good article on why we decided to write our own proxy instead of using something established.
So, at a very high level, how does it work? This is effectively the application that we're looking at today. We have an ingress with a proxy attached to it. We have a web front end and two back ends. In this case, it's going to be 'emoji' and 'vote', not 'fu' and 'bar', but you get the basic concept. All these proxies together, they make up what we call the data plane. And then the management interface, that's the control plane. Tracking? Awesome.
So, I'm going to show you the Linkerd failover operator. If you've ever used Flagger, it's very similar to the way Flagger works. We're building on top of constructs in Kubernetes, and we're using them to handle traffic management. We're going to use something called a traffic split object, and our failover operator is going to watch that traffic split object. And it's going to watch whatever services we tell it. If it sees that a given service no longer has healthy endpoints, it's going to flip that traffic split from going to the primary service to its backup service. And then, when it sees that service has members again, it's going to flip it back. And if that doesn't make sense, we're going to have a lot more slides and a demo to show you.
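For a concrete picture, here's a minimal sketch of the kind of TrafficSplit the operator manages. The service names and weights, and the exact failover.linkerd.io / managed-by markers, are assumptions modeled on the linkerd-failover extension's documentation, not the demo's actual manifest.

```bash
# A minimal TrafficSplit for the failover operator to manage (a sketch;
# check the linkerd-failover README for the real label/annotation names).
kubectl apply -f - <<EOF
apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: web
  namespace: emojivoto
  labels:
    app.kubernetes.io/managed-by: linkerd-failover   # opts this split into the operator (assumed marker)
  annotations:
    failover.linkerd.io/primary-service: web-svc     # service whose endpoints are watched (assumed marker)
spec:
  service: web-apex              # the "apex" service callers actually address
  backends:
    - service: web-svc           # local primary: 100% while it has healthy endpoints
      weight: 100
    - service: web-svc-prod2     # mirrored remote service: takes over on failure
      weight: 0
EOF
```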
My clusters are 'prod one' and 'prod two' instead of 'east' and 'west', but I didn't feel like changing my slides at the last minute, so you're going to get some bad labeling. The basic setup is, in each cluster, we have our ingress and our full application. Every deployment is fronted by its own service. So, we use Kubernetes services in Linkerd. There are no custom resources that you need to get this working. Well, at least none yet. We're about to get to that.
Then, we do a multi-cluster connection with Linkerd. So, we deploy a multi-cluster component in our cluster. It gives us this multi-cluster gateway, and then we connect those gateways together. We build a persistent TCP connection between them, and then we can flow traffic from one cluster to the other in a way that's transparent to your applications. Still sounding good?
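The linking step looks roughly like this with the linkerd CLI's multicluster subcommands; the prod1/prod2 kubectl context names are assumptions.

```bash
# Install the multicluster components (including the gateway) in each cluster.
linkerd --context=prod1 multicluster install | kubectl --context=prod1 apply -f -
linkerd --context=prod2 multicluster install | kubectl --context=prod2 apply -f -

# Generate a Link credential from prod2 and apply it in prod1, so prod1 can
# discover and route to services exported from prod2 through the gateway.
linkerd --context=prod2 multicluster link --cluster-name prod2 \
  | kubectl --context=prod1 apply -f -
```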
But what we're doing here is, in order to make a service available from, you know, 'prod 2' over to 'prod one', we have to actually tell it that it's advertised. So, we set a label on your web service which just says, 'Hey, this is going to be exported', meaning it's going to be available in two different clusters instead of just its own.
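Concretely, exporting is one label on the service. The web-svc name and emojivoto namespace are assumptions based on the stock Emojivoto app.

```bash
# Mark the web service in prod2 as exported; Linkerd then mirrors it into
# linked clusters under a cluster-suffixed name like 'web-svc-prod2'.
kubectl --context=prod2 -n emojivoto label svc web-svc mirror.linkerd.io/exported=true
```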
And so, we get a second service called 'web service west', or in our case 'web service prod 2', running in our 'prod one' cluster. We can send traffic to it at any point. If I were to tell my ingress to talk to 'web service prod 2', my traffic would flow through, and I'd actually be serving Emojivoto from this cluster. And that's what's going to happen at the end of today.
But in our case, we're going to build the traffic split. We're going to create a service called 'Web Apex' or just 'web', but it's an apex service. It just means it's an empty service with no members, and it's either going to point you at the local in-cluster web front end or the remote web front end. That's the whole story.
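A sketch of what that apex service can look like: a plain Service with no selector, so it never has endpoints of its own, and the traffic split decides where calls to it actually land. Name and ports here are assumptions.

```bash
# Selector-less "apex" service: callers address it, the TrafficSplit routes it.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: web-apex
  namespace: emojivoto
spec:
  ports:
    - port: 80         # port callers use (assumed)
      targetPort: 8080 # backend container port (assumed)
EOF
```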
We have our failover operator, which is a Kubernetes operator that just watches our primary service, 'web service'. As long as it has healthy endpoints, it does nothing. And if it sees no more healthy endpoints, it's going to shift the traffic over. So, when it sees this one go down, it's going to flip the traffic from 'web service' to 'web service west', and then we're going to flow through.
Y'all want to see it live? So, if you got a chance to go to 'Emoji v.o. 59', you're going to see our web app. I'm going to set a little auto-refresh here, go to 3 seconds, and hope that my live demo works well for everybody.
Alright, so this is my live terminal. On the right-hand side, I've just got a watch that's looking at that traffic split. It's going to tell you where things are going. So right now, the weight is 100% on local, 0% on remote. We have the two services.
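Roughly what that right-hand pane is doing — printing each backend's weight from the split. Resource and namespace names are assumed; wrap the command in `watch -n3` to poll it.

```bash
# Print each backend service and its current weight from the TrafficSplit.
kubectl get trafficsplit web -n emojivoto \
  -o jsonpath='{range .spec.backends[*]}{.service}{"\t"}{.weight}{"\n"}{end}'
```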
So, let's just take a look at our services. These are standard Kubernetes services. So, you didn't have to do anything special with your application to make it work in this environment. Right, we took a normal Kubernetes app, we added our Linkerd proxies beside them, and everything continued to behave as normal. But now, we have mutual TLS, metrics, and the ability to do things like multi-cluster failover.
See, I've got a couple of services. The important thing here is I've got a 'web service' and 'web service prod 2'. That 'web service prod 2' points to an entirely different cluster.
There, alright. And we're just going to do a little live failover and see how it works. So, in our case, we're just going to scale down. I'm going to scale my web deployment down to zero, so we won't have any more healthy endpoints. And if you're able to refresh, now is a good time to try it out, because we're going to see if we can keep it from breaking.
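The failure and the later recovery both amount to a kubectl scale command; the deployment and namespace names are assumptions.

```bash
# Simulate the outage: zero replicas means no healthy endpoints, so the
# failover operator flips the split to the remote backend.
kubectl --context=prod1 -n emojivoto scale deploy web --replicas=0

# Later, simulate recovery: once a pod is ready again, the operator flips back.
kubectl --context=prod1 -n emojivoto scale deploy web --replicas=1
```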
Alright, so, well, you can't see it because I shrunk the thing a little too small, but let's just look here. We see our web front end is terminating. So, while it can still serve traffic, pretty soon it won't. But my traffic split, my failover operator, noticed that it was no longer a valid target. So now, the local web service has no healthy endpoints, and it shifted us over to web service prod 2, which is our remote cluster.
If we go... oops, not this one. If we go to our app, we can do a refresh. Come on, buddy. There we go. We can do a refresh, and everything still works. So, this is a nice, boring Linkerd demo, right? Which I think highlights for me the strength of Linkerd, which is things kind of just work or are relatively straightforward to do.
We'll stop our auto-refresh just so my screen doesn't keep flashing. We can vote on our components, we can view our leaderboard, we can see traffic. Sorry, it's been running for a while, so we have a lot of votes going through. And we've now moved to serving traffic over in prod 2 instead of prod 1. Now, once whatever incident brought down our web front end goes away, it comes back healthy — I'm just simulating that by scaling the replicas back up from 0 to 1. Once it comes online — right, we have one of two containers ready in my pod — the second one came up, and almost immediately, we saw the traffic split flip back. So, from the app's point of view, nothing happened the entire time, but we just did a live shift from one cluster to the other and back, and it was relatively straightforward.
So, we're going to go ahead and finish the slides. Any questions before I go any further? Okay, great.
Like, under more intelligent circumstances? Yeah, absolutely. Right, so this failover operator is just an example of how you would do it. It's fine for a simple scenario, but what if you add more intelligent logic? That's a great thing for you to explore. I like the Ambassador ingress; I think they've got a nice, straightforward, easy gateway to work with. And you could put logic in there like, 'Hey, if I start seeing whatever errors here, I want to shift over.' Linkerd multicluster would support that just as well. It's just a question of where you're putting that logic.
One interesting bit of trivia: today, we're using the traffic split object in order to accomplish this shift. We're looking in Linkerd 2.13 to finish our move over to the Gateway API specification. It's going to mean that traffic splits, our last vestige of the Service Mesh Interface (SMI) implementation, are going away. And instead, we'll be using HTTPRoutes, which are native Kubernetes objects. So, even fewer custom resources to have to deal with.
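Purely as an illustration of that direction, a weighted split expressed as a Gateway API HTTPRoute might look something like this. The group/kind/field shapes follow the upstream Gateway API spec, but the exact form Linkerd 2.13 accepts may differ, so treat this as a hypothetical sketch.

```bash
# Hypothetical Gateway API equivalent of the TrafficSplit above: an HTTPRoute
# parented to the apex Service, splitting traffic by backend weight.
kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: web-failover
  namespace: emojivoto
spec:
  parentRefs:
    - kind: Service
      group: core
      name: web-apex
      port: 80
  rules:
    - backendRefs:
        - name: web-svc
          port: 80
          weight: 100
        - name: web-svc-prod2
          port: 80
          weight: 0
EOF
```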
Alright, so just to show you in slides, in case that wasn't super clear from all the text: as web went down, it shifted traffic over to our West cluster. When web came back up, it shifted it right back for us. And that's it. I appreciate you all coming.
So, if you're going to KubeCon EU, we've got our first ever Linkerd Day, which we're really excited about. If you're interested in doing talks, please talk to me. The call for proposals ends Friday, and we would love to have y'all give a talk. We expect Linkerd Day to be a zero-vendor event, so it's all going to be user talks, which we're really excited about. And we'd love to hear from you and your stories with Linkerd.
If you're looking for more on this, or you want to learn other Linkerd stuff, there's lots of good material out there. The Buoyant Service Mesh Academy has hour-long sessions where we dive deep into different technical topics. You can learn about policy or how to do zero trust with Linkerd. You can learn about how mutual TLS works and why it's important. There are great cert-manager talks if you want to look at how you can do automated certificate rotation with Linkerd. I'd recommend it highly.
And of course, if you're using Linkerd in production and you're like, 'This is great, but I want to pay somebody,' you can go over to Buoyant's demo page and get a demo of our management solution, which builds on top of Linkerd. And we're happy to help you with that. And that's it. Thanks. Thanks so much for coming.
Any other questions before we roll out? Sorry, you know, you can go. Yeah, please. Okay, I've got a...
So, the traffic is still going through one ingress controller in one region. How are you handling failover of regions, as in moving an ingress over into the other region? Is that done yet, or...? I'm sorry, I couldn't hear you. So, moving from one ingress in your primary instance or region — your ingress is still coming into that Kubernetes cluster. In the case of a complete disaster, failing over the ingress isn't occurring there, right? So, are there other mechanisms within the service mesh to do that?
Yeah, so that's a great question. What was your name? John. John, that's a great question. Thank you very much. So, what John pointed out is that in this whole demo, only one service in your cluster is failing over. I've got two prod clusters, which is cool for redundancy. But what if prod one goes down entirely? Well, it's kind of a cruddy answer: we don't. We can't look beyond the cluster scope. The service mesh is an in-cluster thing. So, we have one service mesh instance in prod one and one in prod two. They're connected, and they can share traffic, but they don't have some sort of global controller between them. So, failing over the Emojivoto URL because prod one is down entirely is out of scope for what this tool can do. Yeah, but thank you. I appreciate it. Anything else?
Additional layer in... Yeah. So, what was your name? George. George, thank you for that question. George is asking about latency. So, yeah, definitely there's additional latency introduced. It's fairly fast, but whether or not it works for you is going to be very dependent on where your clusters are and what the needs of your particular application are. Like, would you rather have the latency or the failure? And that's not a sarcastic question. Those are the sorts of things you're going to want to answer.
What about the network? Yeah, so the Linkerd multicluster connection, all it needs is... He's asking about what you need to do from a networking perspective for the multicluster stuff. Linkerd doesn't care. As long as I can build a TCP connection from cluster A to cluster B, there are no other networking requirements. You can have the same IP space for pods in both clusters. You can be using private VPC peering, or you can be going over the open internet. We're going to respect whatever the routing rules are in your environment. And really, the only considerations I care about are availability and performance: is the route that you're using highly available, and do you get the right latency for your app? Hey, thanks. Great questions, guys. Yeah. Oh, thank you.