Realtime Collaboration in Kubernetes to Supercharge Team Productivity
Speaker: Narayan Sainaney
Summary
Narayan Sainaney, co-founder and CTO of Codezero, discusses how real-time collaboration can boost team productivity in a Kubernetes environment. Through his anecdotal journey, he shares the challenges his team faced when demand for their platform exceeded expectations, driving enormous compute costs and a major outage. Sainaney emphasizes the complexities of managing infrastructure and the importance of anticipating issues as scale grows. The talk also introduces Codezero's tooling, which aims to simplify the process of working within a Kubernetes cluster.
Transcription
Hi, I'm Narayan Sainaney. I'm the co-founder and CTO at Codezero, and my talk today is on real-time collaboration to supercharge team productivity around Kubernetes environments. When I spoke to Mark and Dinesh, I understood this conference is about education and learning. Before we get started, just looking around the room, who's already using Kubernetes? Okay. And in terms of workloads, how many of you are handling over 100 services in Kubernetes? Wow, okay, pretty sizable workload. Great! And anyone below that? Okay, yeah, great. I'm curious to know how many services Civo runs; we'll get to that. That helps, because the number of services and how they explode is something we'll discuss as part of this conversation.
There's going to be a long list of distinguished speakers here today, and they've had various degrees of success. I'm sure they'll be talking about their successes, but I'm going to start with a story where success didn't happen, at least not from an infra perspective. Then, we'll get into how team collaboration and working around infrastructure is really important, and how tools like ours can help.
For a couple of decades, I've been working on web applications and mobile applications with big backends. I've had experience with several products, mainly three-tier architectures. But in 2012, which is now almost 11 years ago, I started working on a connected car platform. Apparently, the term IoT had been coined back in 1999. When we began building in IoT, the challenge of going to market with a product where data came from a car, went to a central cloud, and then went out to devices was significant. These IoT platforms didn't exist back then, so we had to build a lot of the underlying plumbing. Version one was primarily about collecting vehicle location data and events such as when an engine started or stopped. We were a tiny, scrappy startup of 20 people. Kubernetes wasn't publicly available, and even when it came out, it took a few years before it was ready for prime time.
We started getting attention from many cellular carriers. There are only about 30 major cellular carriers worldwide. We also began getting noticed by Amazon, and Microsoft was a good partner. They gave us credits, initially $10,000, then $50,000, and finally half a million in credits to try out Azure. So, we had a lot of cloud compute available. We also ended up in front of the Amazon Marketplace. That blue logo, that's us. The Alexa Fund and the Amazon Alexa team invested in our company. Despite all this marketing, the rollout was slow. In about two or three years, we acquired a few thousand customers.
Now, this was an unproven area in terms of cellular technology and reliability, but we were consistently adding more features and making the product more stable. Then in 2015, when we were a three-year-old company, Samsung pulled out of the connected car market due to issues with batteries exploding, creating a vacuum in the market. We'd been in talks with T-Mobile, a major American and international carrier; we'd become good friends and gotten to know each other. This $52 billion company, working with just 20 of us from Vancouver, decided to launch with us. They took our product, branded it as SyncUP DRIVE, and launched nationally for the 2016 Christmas sales season.
By this time, our platform had really matured. We'd gone from just detecting vehicle location to monitoring engine health. We were analyzing driver behavior and fuel efficiency, and could determine whether you were where you were supposed to be or whether you were falling asleep at the wheel. What started as one service had become 32 major subsystems. We were using machine learning to ingest this data and determine what was going on with users.
With 3,000 users and a national launch across 4,000 stores, we made conservative estimates: 50,000 sales would have been a great year. So we started doing all our load testing, deciding to double or triple that number to make sure the systems could handle the forecasts. Those 3,000 users were what we had reached on our own, selling directly without a partner like T-Mobile. We went live in November 2016, and our sales began to exceed our expectations daily. Basically, as many users as we had acquired over three years began to sign up every hour.
We had a hit on our hands and were super excited. However, our load testing had accounted for about 150,000 concurrent drivers, and we were seeing problems at a fraction of that. Alerts indicated the system was heavily degraded, with latencies of six to seven minutes. That's problematic if your car is being towed or stolen. The surprising part was that we weren't even near our predicted load capacity; we were only at 8,000 users. So why was the system going down?
The other funny thing at this time was that I had my personal credit card on file as the corporate card with Azure. It was Christmas, and I started to get these alerts from Azure that said, "You're running out of money. We're going to shut down your system." We went from a few tens of thousands a month to hundreds of thousands of dollars in compute costs every week. Over Christmas, the joke was that we were getting these automated bot emails saying your system is about to be shut down. We had to call the office of Scott Guthrie, who leads Azure at Microsoft, and say, "Please don't shut us down. We're not joking; we'll drive down to Seattle and slide a certified check under your doorstep." They laughed and said, "Ignore the messages. We're not going to shut you down. We trust you." They were intimately involved in working with us over that Christmas period.
I was spending days with my whole team in front of screens. The challenge was trying to decipher what was going on. It didn't feel like computing science anymore; it felt more like physics, as if I were a particle physicist looking at collider data trying to decipher what was going on. There was a terrifying night when we had a 300-node monolithic cluster and the system went down. We were getting calls from customer support and senior VPs from both our partner companies. Up until this point, there was little on the platform that any individual engineer, including myself, couldn't fix. Now we were out of our element. This is me: I once showed up to work in the unfortunate t-shirt that says, "I'm here because you broke something." Even the support staff at Microsoft weren't able to figure out what was happening. We had very senior engineers from the Azure team drive up to Vancouver to spend days with us. They had the Azure codebase with them, and as we tried to decipher why the system was deadlocking, we would look at our code and they would go through theirs. We found issues in the Azure SDKs and drivers, but those fixes would take two weeks to ship.
We then got a call saying there would be a meeting at 8:00 in the morning. We had a clause in our contract with T-Mobile that if we were out of SLA, they could cancel the project. We would have been on the hook for buying back every device sold in the market, a significant eight-figure amount. My co-founders and I had seven figures in our bank account, so we were dreading that 8 a.m. call. I want to stress that up until this point, we hadn't truly grasped the system's complexity. We had been adding more services and features, which is natural: you start simply and grow into hundreds of services.
In the end, our tooling and instrumentation completely failed us. We knew one of the 32 services was deadlocking, and it was cascading, bringing the entire cluster down. Half the team was looking through code, logs, and telemetry, trying to figure out how to remedy this. The other half did a binary search. It took us 60 minutes at the time to deploy our application. So, with 32 systems, you shut down half of them, determine whether the deadlock went away, and keep halving. Eventually you'll find the problematic system. If you're lucky, it's one system; if not, there are multiple culprits. We figured that by 5:00 a.m. we'd at least know what the deadlock issue was and then maybe be able to fix it. This wasn't even about finding a solution; this was about identifying where the problem originated. Thankfully, at 2 in the morning, we found the single line of code that brought down our entire 600-node cluster in a cascade. It took us eight hours to diagnose, find that line of code, deploy, and verify. We were just barely four hours away from a crucial phone call that, fortunately, never happened.
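For a sense of why the team budgeted until 5:00 a.m. just to isolate the culprit, here is a rough sketch of that bisection arithmetic, using the numbers mentioned in the talk; it is purely illustrative.

```typescript
// Bisection over 32 services at roughly 60 minutes per redeploy:
// halve the set, redeploy, check whether the deadlock persists, repeat.
const services = 32;
const deployMinutes = 60;

// Number of halvings needed to narrow 32 services down to one.
const rounds = Math.ceil(Math.log2(services)); // 5

console.log(`${rounds} rounds x ${deployMinutes} min = ${rounds * deployMinutes} minutes`);
// => 5 rounds x 60 min = 300 minutes, about five hours just to locate one culprit,
//    before any fix is written or verified.
```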
Instead, we got a message that, even despite the outage, they had record-breaking sales numbers the next day. We were working with some of the best partners and some of the smartest people. Our customers and channel partners were great. Part of us was celebrating, but the other part realized that this was just going to keep getting worse. The sales engine was not going to slow down, and those 6,000 sales would have been six years of sales for us on our previous trajectory.
Why am I telling this story? Even though it starts with that platform, the experience profoundly changed not just me but my entire engineering team. Up until that point, we'd been so focused on features, bugs, and delivering value to customers. Infrastructure was secondary to us, and how we worked was also secondary. I found myself becoming obsessed with this problem in a way I'd never been before. The challenge wasn't just about architecting better platforms; it was about teamwork. This moved to the forefront of my mind, so I left the company I co-founded to start Codezero and feed this obsession.
One thing that was gnawing at me was this ROI curve. When you look at a lot of technology today, or at new processes, there's an S curve of ROI against growth. In the beginning, whatever you do impacts your ROI: introduce a new piece of technology or a new process and it doesn't immediately pay dividends; instead, it can cause your ROI to go down. This is due to the learning the organization has to absorb. There's a learning curve, a culture shift, and procurement that has to happen.
There's a talk going on right now about whether Kubernetes is too complicated; some of you might be regretting that you're here and not there. This was at the back of my mind the entire time once I left. Another realization was that there isn't one S curve; there are many curves. Our role as leaders and managers is to figure out the lifespan of a process or a piece of technology within our company's ambitions. Will it support this phase of growth, the next phase of growth, and so on?
When we started Codezero, we were thinking about how to build something that spans multiple life cycles for engineering teams. Even tools like Redis, which are battle-tested and resilient, have limits; we tore Redis to shreds. There were architectures that books and experts said were the way to do things, but they only take you up to a certain point. As a leader, when do you decide to shift? Do it too early and there are challenges. When you're getting huge ROI and everything's smooth, there's resistance to change because nothing's broken. But as a leader, you have to anticipate that the good days will come to an end. What's there then? You need to have those investments ready. So with that foundation, we said we're not going to build a platform.
A platform becomes part of your SLA. It becomes part of your runtime. It locks your company in. So there are challenges with a platform. In the beginning at Codezero, while I was thinking about this, we unfortunately did build a platform, but I'm going to hide that slide because it's not relevant. We began to build a prototype of a hybrid cloud platform, something that would help DevOps teams bridge to engineers who don't understand infra.
In late 2021, we launched this platform. My co-founder Reed and I started to get interest in it. The thing we found was people liked the platform, but they loved our tooling. They asked, "How did you do that demo? Show that to me again." It reached the point where customers who were interested in the platform actually wanted to buy our tooling instead. Recognizing that signal from the market, we started to package the tooling we developed.
Before packaging that tooling, given our experiences, we returned to certain first principles in cloud development. These principles had become innate to us and had influenced our tooling and processes. We didn't realize that this was where the value was. Many companies preferred to build something more bespoke to their product needs, and our tooling fit right in.
What were these first principles? One of the first principles was the notion of a remote system and a local system. Remote could be your production environment, staging, long-running dev environments, even preview environments. Local is what the developer is most comfortable with, where your tooling resides, and where tools like debuggers operate.
In many organizations, this differentiation is significant. According to the CNCF, 80% of companies have more than 50 microservices, and 67% have more than 200. Even if you're at five today, your organization will grow. As you add more features, there's a natural inflation of services.
At Mojio and other companies, we observed organizations investing heavily in miniaturizing the production environment, either by mocking things or by other methods. At Mojio, we couldn't crash a car to test crash detection, so we had a mock service for crash detection. That meant engineering had to spend time not just on the system but also on creating these mocks. However, as our outage showed, the issue was that users could change their Wi-Fi password in the car, and our load testing missed that. Over the course of the day, hundreds of users were changing their Wi-Fi passwords, and that's where the deadlock occurred. Your mocked services, even your staging and pre-production environments, don't necessarily represent how users will use your system. So you will miss things; some things are only caught in production.
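As an illustration of the kind of mock described here, below is a minimal sketch of a stand-in crash-detection service; the names, port, and payload are hypothetical and not taken from Mojio's actual system.

```typescript
// Hypothetical mock crash-detection service: it returns a canned event so the
// rest of the platform can be exercised without crashing a real car. It also
// shows the limitation the talk points out: a mock only covers the behaviour
// someone thought to script.
import { createServer } from "node:http";

const cannedCrash = {
  vehicleId: "demo-vehicle",
  severity: "major",
  detectedAt: new Date().toISOString(),
};

createServer((_req, res) => {
  res.writeHead(200, { "Content-Type": "application/json" });
  res.end(JSON.stringify(cannedCrash));
}).listen(4000, () => console.log("mock crash-detection listening on :4000"));
```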
This approach of replicating everything locally was untenable. It wasn't feasible to equip every developer with a powerful computer with abundant RAM. What if we could keep the pre-production environment remote and run locally only the specific service a developer is working on? This was the first principle we aimed to solve, and I'll demonstrate how we achieved it.
The second principle was the feedback loop. Some of the best organizations have deployment times down to 15 minutes; others achieve 5 minutes. Tools like Tilt enable what we call micro-deployments. But there's always a wait: you deploy, then you test your software. Software is built iteratively. Deployments often happen because there's a backend system and a frontend system that need to coordinate and integrate seamlessly.
So that's really why we're deploying. It's not realistic that you'll get it right the first time unless it's a trivial fix. We were doing this all day long, and it's just considered normal practice; a whole week can go by, and this is how you engineer. Instead, we said: what if we could code and immediately test, and have that be the feedback loop, and only deploy once we're certain we've got things working the way we want? That was the second principle our tooling had to address.
Lastly, I want to touch on testing, diagnosing, and fixing. The difference between diagnosing, testing, debugging, and fixing is really about tooling. On my local machine, I have certain command-line tools and certain git tools. I like git on the command line; my VP of engineering uses Sourcetree. All developers have different tooling preferences.
There's something called Omer's Paradox, a logical fallacy in thinking that logs will let you diagnose issues in production: if I had known in advance the information I would need to solve the bug, I probably wouldn't have written the bug in the first place. In subsequent outages, a lot of our deploys were not to fix the bug but to instrument our code to determine what the problem was.
What if we could use our first-class tools and the ergonomics we're used to during development all the time? That was the third principle we set out to solve. The outcome was seven primitives baked into the Codezero toolchain, available on our website.
With that, I'd like to jump into a demo. After the demo, I'll do Q&A and post a link. If anyone's interested, we have a version two coming out, and a private preview is available.
Great, it didn't work when I tried it earlier, but now it's working. To illustrate our toolset, we have a sample project. It's really a toned-down project with about six services: a front end, a core API that has no dependencies, and a leaf API that has lots of dependencies. The project shows HTTP traffic, WebSockets, and TCP traffic. I've also gone ahead and deployed this project to a Civo Kubernetes cluster.
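To picture the shape of such a sample project, here is a minimal sketch of one of its HTTP services; the route, port, and payload are hypothetical rather than Codezero's actual sample code.

```typescript
// Hypothetical "core"-style API from a sample project like the one in the demo.
// A service with downstream dependencies would fan out to other services or a
// database from a handler like this one.
import { createServer } from "node:http";

createServer((req, res) => {
  if (req.url === "/api/core") {
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ service: "core", time: new Date().toISOString() }));
    return;
  }
  res.writeHead(404);
  res.end();
}).listen(3000, () => console.log("core API listening on :3000"));
```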
Now, to illustrate that first point: what would I do if I needed to work locally? I would typically use Docker Compose and try to run all of this. If I were a front-end developer, not only would I need that core API, I would also have to think about its entire blast radius: the core API needs a database, the database needs to be seeded with data, and all of that has to be part of my setup process. But really, at the end of the day, as a developer all I care about is that API.
If I go into a browser and try accessing the API, I get an error because my computer doesn't know what that hostname is. If I take that URL into a command line and try to curl it, that's not going to work either. What I've done is install the Codezero tooling. The command here is "teleport", where your local machine temporarily becomes part of the Kubernetes cluster. This is not joining your computer as a new node; this is us taking over the DNS layer and the network layer on your computer. It's also not a VPN; there are problems with a VPN in that all your traffic goes through the Kubernetes cluster.
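As a sketch of what the teleport session changes, the snippet below makes an ordinary HTTP request to a cluster-internal DNS name. The service name, namespace, and port are hypothetical; the point is simply that the name resolves from the laptop while the session is active and not otherwise.

```typescript
// Hypothetical in-cluster address for the core API. Without an active teleport
// session this hostname does not resolve from a developer laptop; with one, the
// request reaches the in-cluster service as if the machine sat inside the
// cluster's network.
const url = "http://core-api.sample-project.svc.cluster.local:3000/api/core";

const res = await fetch(url); // a browser, curl, or any other client behaves the same way
console.log(res.status, await res.json());
```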
The other thing about this is that it's a capability we find developers all over the world need: it has to be fast and responsive to what's happening inside the cluster. Once I'm teleported, if I run this curl command, the service running in the Kubernetes cluster appears as if it's on my local machine. It doesn't matter what tooling I use, whether it's a browser, curl, or whatever.
Now, as a front-end developer, I can interact with the core API as if the local front end is running inside the cluster. That's the first experience we wanted to address. Oops, I see what's going on; it's PowerPoint that's the issue. We won't go back to PowerPoint. Let's see if I can grab the signal back. There we go. Okay, killed PowerPoint.
So that addresses the issue where we said you don't have to replicate that remote environment locally. If I come back to my terminal, you'll see there are no Docker containers running on my local machine to make this happen.
I'm now going to close that session and illustrate the second item here. This is not just something you'd use in an outage. What I'm about to show you next would have been crucial during those six or eight hours of diagnosing what was going on in production: with this next feature, we would have been able to assess the issue within 20 minutes. But Reed and I were chatting about that: that's a fire extinguisher. How often do you have an outage, and is Codezero really a tool for outages? You might have, if you're lucky, one, two, three a year, and they won't all be so severe.
But when we first put this tooling out, we thought there'd be 20, 30, 40 sessions a week going on, and we actually started to see two sessions a week per developer. We thought, "Oh my God, something's wrong; no one's using the product." And as good product managers, we picked up the phone, called these folks, and said, "Hey, what's going on?" They said, "No, your tooling is so good. We turn it on Monday morning, and this is just how we work." So for a lot of our customers, and they range from 20 services to hundreds, this is now their way of working.
I was also embedded in one of our client organizations. We had a situation where the front-end engineer was in India, I was on the west coast of Canada, and we had product managers on the East Coast, all of us trying to figure out how to work on this together. Here's what I'll show you now. I've deployed this application into Civo; let's get the cluster name and show you what that looks like. Here's the IP address; I'm just going to bring that up in a browser. This sample application is just the front end, and every 3 seconds it polls the backend core and leaf services.
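The front end's polling loop looks roughly like the sketch below; the endpoint paths are hypothetical stand-ins for the core and leaf APIs.

```typescript
// Poll the core and leaf APIs every 3 seconds, as the demo front end does, so
// backend changes show up in the UI within a few seconds.
async function poll(): Promise<void> {
  for (const path of ["/api/core", "/api/leaf"]) {
    try {
      const res = await fetch(path);
      console.log(path, res.status, await res.json());
    } catch (err) {
      console.error(path, "unreachable:", err);
    }
  }
}

setInterval(poll, 3_000);
```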
I'm also going to bring up a private browser window and put the two side by side. This time what I'm going to do is called an intercept, and what an intercept does is basically put a proxy in front of whatever service I want inside the cluster. This could happen in production; in this case it's a pre-prod environment, whatever you want. And once you intercept a service, well, basically, nothing happens. And you don't want anything to happen, because if you were to do this in production and route all that traffic (in the case of Mojio, it was 20,000 requests per second, over a billion requests a day), your local laptop would blow up.
What I'm going to do now is use a third-party tool to inject a header. Now the browser on the right has its traffic for that service redirected to my machine, while the browser on the left is still going into production. So in an outage scenario, I could do this and carve out a single user's traffic, then use my local tooling to figure out what's going on. At this point I'm getting an error because nothing's running locally, so I'm just going to start the service locally. I only need to run the one service I've intercepted; I don't have to worry about the entire application. Now the service is resolving on my local machine. If I go into a debugger, set a breakpoint, and attach, in a few moments that breakpoint hits, and I'm able to step through code instead of looking at logs and diagnostics and staring at graphs trying to work out what's going on.
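To make the header-based carve-out concrete, here is a rough sketch of the two halves of the flow: a request tagged with a routing header, and a local stand-in for the intercepted service where a breakpoint can be set. The header name, host, and port are hypothetical; this is the shape of the idea, not Codezero's implementation.

```typescript
import { createServer } from "node:http";

// 1) Local copy of the intercepted service. Only requests carrying the matching
//    header are proxied here; untagged traffic keeps hitting the in-cluster
//    deployment untouched.
createServer((req, res) => {
  // A debugger breakpoint on the next line pauses only the tagged user's
  // request while everyone else's traffic flows on normally.
  const body = { service: "core", servedFrom: "laptop", path: req.url };
  res.writeHead(200, { "Content-Type": "application/json" });
  res.end(JSON.stringify(body));
}).listen(3000);

// 2) A client tagging its requests with the routing header (in the demo, a
//    third-party tool injects it) so the proxy carves out just that session.
const tagged = await fetch("http://sample-app.example.com/api/core", {
  headers: { "x-intercept-session": "demo-user" }, // hypothetical header name
});
console.log(await tagged.json());
```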
If I come back to the browser and hit refresh, and assuming I was doing this in production, my production users would have no clue I'm doing this. So this resolves, and if I go to the other window, where the service is being intercepted, that one is now stalled because I'm sitting at a breakpoint. So it's just that one user.
Now, how do we use this every day? The front-end engineer is making changes to the UI; I'm making changes to the API. We're working in a dev cluster in real time, and as I make API changes, they verify in the UI that things are working. In doing this, no one is sitting around waiting for something to deploy. Instead of those 15 or however many minutes, if I come in here and just add a "hi" there, my test time is effectively whatever my hot-reload or recompile time is.
So here it is with the changes. We're effectively able to just work as needed, and we're taking people to a place where they stop thinking about the deploy-to-test cycle and instead work in a way where they and the production or pre-production cluster are one.

Where we're going from here: we've got this working as a command-line tool with some UI, but it still requires some infra knowledge for the initial setup, and there are technical terms like teleport and intercept. We've learned a lot from how about a hundred companies are using our tooling, and our goal now is to make it dead simple for a front-end engineer who's busy learning the latest React version, or a backend engineer who's looking at Node.js or Go. The other thing is that everything we're doing works at the operating-system level, so you do not need to re-instrument your code in any way. That's what's coming with 2.0: making it extremely simple for non-infra developers to work and collaborate in real time with each other outside of production environments, or in production environments if that's what's needed. Let me see if I can bring up that QR code. If not, I'll pause here and open the floor to questions. Go ahead.
Sorry, go ahead. Audience: Thank you. You mentioned you were actively developing the back end while someone else was developing the front end against your API changes. Were you deploying those rapidly to the cluster, or was the traffic being routed through the cluster to you? Sainaney: Traffic's being routed through the cluster to us. We're not a micro-deployment service; that is a valid technique, but we said don't move containers, move traffic, and that's really what our solution is about. It's purely a layer-four solution. Yeah, awesome. Thank you.
I'm here with my colleague Reed; stand up and wave so they know who you are. We're here for the next couple of days and very happy to meet everyone and grab a pint. If there are any questions, or if you want a sneak peek at the next version and things like that, come find us. We're really pleased to be here. Thank you.