Things We've Learned About Better Software Delivery Principles Through A Pandemic
Speaker: Jeremy Meiss
Summary
Jeremy Meiss from CircleCI delves into the characteristics of high-performing engineering teams, emphasizing the importance of software delivery in today's tech landscape. He highlights key metrics like duration, mean time to recovery, success rate, and throughput, underscoring the balance between speed and stability.
Transcription
Hey, thanks for joining! Come on in, have a seat. For those getting settled, we're going to talk about high-performing engineering teams and the Holy Grail. I know the schedule lists a different talk title, so if you're here looking for the 'Things We've Learned During a Pandemic' talk, don't worry: it's the same talk, just adjusted with more up-to-date data. High-performing engineering teams and the Holy Grail, let's go ahead and get started.
Software delivery has never been more critical to the success of businesses in pretty much every industry. Every company is a tech company now. It's also never been more complex. Expectations for delivering high-quality software experiences have skyrocketed, with the need to deliver fast. Yesterday's exceptional experiences that took some time are now the expected norm. But they also have to be more beautiful, more intuitive, more powerful, and throw in affordable there as well.
The landscape of all the available tools, services, platforms, and architectures has never been more complex than it is today and is constantly evolving. This probably needs updating because the CNCF just tends to keep adding more and more. With this rapid pace of change and the challenges of all the different complexities, the question comes up: how can engineering teams not only succeed but beat out the competition?
So, my name is Jeremy Meiss. I am from CircleCI. I'm a Director of Developer Relations there, and you can follow me on Twitter at 'IAmJerdog'. I apologize in advance for anything you might see; it's just me.
Back to the tech industry, we've entered a period of immense change. Global economic uncertainties have presented complexities impacting companies across the spectrum. In this environment, stability and reliability have become increasingly important to businesses globally. The ability to provide a reliable, stable platform to customers has become a crucial value metric for engineering teams. Teams that prioritize robust testing as part of their continuous integration practices can save millions of dollars over time. A study Forrester did a couple of years ago, on the total economic impact, talked about that very same thing. Not 'impackt', that's not some new metric that's out there.
Since 2019, at CircleCI, we've been building and analyzing a lot of the anonymized data from the different jobs and workflows developers run on our platform. We've identified a set of recommended baseline engineering metrics related to CI/CD and painted a picture of what it means to have a high-performing engineering team. We don't believe in 'one size fits all' metrics. Every company is different. You are not the Googles or the Netflixes of the world, unless you work at one of those companies. Complexity will be different for your business. You can't just try to be like those other companies; you have to be like you. As we've looked at the data, we've tried to come up with some baseline software delivery metrics, looking at data points from high-performing teams in the industry that show some key similarities. We're going to go through a lot of that now.
We probably won't find that 'holy hand grenade' for your team. If you do, I'd love to talk to you about it later, really just to shake your hand and give you a high-five. We're going to be looking at metrics and benchmarks built around four core measures: duration, mean time to recovery, success rate, and throughput.
As CI adoption has expanded over the years, teams have had a 'grow at all costs' mentality. But as we've seen over the last few months, the growing complexity of software systems has increased the cognitive load developers have to deal with, made systems harder to maintain, and become overwhelming for many teams. We see situations where teams start to cut corners, systems go down, or companies lay off staff and then wonder why things don't work. But we're also at an inflection point where stability has become the main goal. DevOps teams have to take away some of that accumulated complexity and help systems recover quickly from failures. Many companies have started to institute a 'platform team' idea, recognizing the need for a dedicated team to support that stability.
So, you probably came to this talk with the question of how to achieve elite status through a holistic software delivery practice. It could have also been for the Holy Grail, if you saw my tweet earlier. I promise that while that's a good question, the answer will hopefully not involve any concussions, but I make no promises.
Now, let's start diving into the data. Duration is the foundation of the velocity software engineering teams want to achieve. It measures the average time in minutes required to move a unit of work through your delivery pipelines. It's important to note that the unit of work doesn't always mean deploying to production; it could be as simple as running a few unit tests on a development branch. But getting ready to move things into production is crucial. Duration is a proxy for how efficiently your pipeline is working.
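To make the metric concrete, here's a minimal Python sketch of how a team might track duration on their own: it computes the 25th, 50th, and 75th percentile durations, in minutes, across a set of workflow runs and counts how many exceed the 10-minute benchmark discussed below. The records and field layout are invented for illustration; this is not CircleCI's API.

```python
from statistics import quantiles

# Hypothetical workflow runs: (workflow_name, duration_minutes).
# The names, values, and layout are invented for illustration only.
workflow_runs = [
    ("build-and-test", 2.8),
    ("build-and-test", 3.5),
    ("deploy", 11.2),
    ("build-and-test", 4.1),
    ("deploy", 27.9),
    ("build-and-test", 1.7),
]

durations = sorted(duration for _, duration in workflow_runs)

# Quartiles of duration: 25th, 50th (median), and 75th percentiles, in minutes.
p25, p50, p75 = quantiles(durations, n=4)
over_benchmark = sum(d > 10 for d in durations)

print(f"p25: {p25:.1f} min, p50: {p50:.1f} min, p75: {p75:.1f} min")
print(f"runs over the 10-minute benchmark: {over_benchmark} of {len(durations)}")
```

Looking at percentiles rather than a single average keeps a few very long workflows from hiding how the typical run behaves.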
The core promise of most software delivery practices, from Agile to DevOps, is speed and agility. It's about taking information from customers and stakeholders and responding quickly and effectively to those points. Rapid feedback and delivery cycles don't just benefit the organization or users; they're also crucial to keeping developers engaged and in a state of flow. When more complexities and breakages arise, it's not as enjoyable for developers.
But we've seen this exclusive focus on speed, and as I mentioned earlier, that starts to come at the expense of stability. A pipeline optimized to deliver unverified changes is highly efficient, but it's a highly efficient way of shipping bugs to your users, exposing your organization to unnecessary risk, everything from security to customer satisfaction. An ideal duration, based on the benchmark we see in the data, is 10 minutes or less. This comes from the accepted benchmark dating back to Paul Duvall's work; he was very instrumental in this idea of continuous integration. In that range of 10 minutes or less, it is possible to generate enough information to feel confident in what you're trying to build.
Among the workflows we've seen across our platform, the data showed that 50% were completing in around 3.3 minutes, far below that 10-minute benchmark and about 30 seconds faster than in 2022. The fastest 25% finished in under a minute, and the 75th percentile was getting close to the nine-minute mark. At the higher percentiles we're seeing longer-running workflows: teams have started focusing on more testing and integration tests to ensure better deployments, and that's where you see 27 minutes or more at the 95th percentile.
Many of the teams that we've observed continue to favor speed over robust testing. So, the number one opportunity we've identified for software delivery teams is to expand their test suites for more robust test coverage. This means adding integration, unit, and UI testing across all application layers, incorporating code coverage tools into your pipeline, and including static and dynamic security scans. Incorporating test-driven development practices from the beginning will help improve test coverage and hopefully achieve better duration times.
While these changes may result in longer durations, they are hallmarks of high-performing development teams. When you add more testing, your duration time might go up, but everything else will likely go down. The next step is to maximize the efficiency of your pipelines. This involves test splitting, parallelism of jobs, caching dependencies, using custom images for your CI environment, and choosing the right machine size.
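To illustrate the test-splitting idea, here's a small Python sketch of a greedy split of test files across parallel nodes based on historical timings. CI platforms, CircleCI included, provide built-in test splitting, so treat this only as a sketch of the underlying principle; the file names and timings are invented.

```python
import heapq

# Hypothetical historical timings (seconds) per test file; names are made up.
test_timings = {
    "test_api.py": 95,
    "test_models.py": 70,
    "test_auth.py": 60,
    "test_ui.py": 45,
    "test_utils.py": 20,
    "test_cli.py": 15,
}

def split_by_timings(timings: dict[str, float], nodes: int) -> list[list[str]]:
    """Greedily assign each test file to the currently least-loaded node."""
    heap = [(0.0, i) for i in range(nodes)]          # (total_seconds, node_index)
    buckets: list[list[str]] = [[] for _ in range(nodes)]
    # Place the slowest files first so the nodes stay roughly balanced.
    for name, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
        load, idx = heapq.heappop(heap)
        buckets[idx].append(name)
        heapq.heappush(heap, (load + seconds, idx))
    return buckets

for i, bucket in enumerate(split_by_timings(test_timings, nodes=3)):
    print(f"node {i}: {bucket}")
```

Balancing the nodes by expected runtime, rather than by file count, is what lets parallelism actually shorten wall-clock duration.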
The path to optimizing workflow duration involves combining comprehensive testing practices with efficient workflow orchestration. Teams that focus solely on speed will spend more time rolling back broken updates, debugging in production, and facing greater risks to their organization's bottom line and reputation.
Developers naturally gravitate towards speed. But platform teams need to identify and eliminate those impediments that hinder developers from achieving velocity. They should set guidelines, standardize test suites, welcome fast failures in development testing branches, and monitor all pipelines across the organization.
Mean time to recovery (MTTR) is indicative of a team's resilience and ability to respond quickly. From an end-user perspective, there's nothing more important than a team's ability to recover from a broken build. If you've done the work to set up a robust pipeline, it becomes easier to address issues. An ideal MTTR, based on data from high-performing teams, is 60 minutes or less on your default branch.
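One simple way to approximate MTTR from default-branch build history is to measure the time from the first failing build to the next passing one. The sketch below uses made-up timestamps and statuses and is not CircleCI's exact methodology, but it shows the shape of the calculation.

```python
from datetime import datetime

# Hypothetical default-branch build history: (finished_at, status).
builds = [
    (datetime(2023, 5, 1, 9, 0),   "success"),
    (datetime(2023, 5, 1, 10, 15), "failed"),
    (datetime(2023, 5, 1, 10, 40), "failed"),
    (datetime(2023, 5, 1, 11, 5),  "success"),   # recovered after 50 minutes
    (datetime(2023, 5, 2, 14, 0),  "failed"),
    (datetime(2023, 5, 2, 16, 30), "success"),   # recovered after 150 minutes
]

recoveries = []      # minutes from first failure to the next green build
broken_since = None
for finished_at, status in sorted(builds):
    if status == "failed" and broken_since is None:
        broken_since = finished_at                # the branch just broke
    elif status == "success" and broken_since is not None:
        recoveries.append((finished_at - broken_since).total_seconds() / 60)
        broken_since = None                       # the branch is green again

mttr = sum(recoveries) / len(recoveries)
print(f"MTTR: {mttr:.0f} minutes over {len(recoveries)} incidents")
```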
Driving improvement in MTTR involves understanding economic pressures, prioritizing product stability, leveraging platform engineering teams, treating your default branch with the utmost importance, optimizing pipeline duration, setting up instant alerts, writing clear error messages, and automating processes as much as possible.
Success rate is another indicator of the stability of your organization, and it impacts both customers and development teams. Factors to consider include whether the failure occurred on the default branch or a development branch, whether the workflow involved a deployment, and the importance of the application or service being tested. A failed signal doesn't necessarily indicate a significant problem that you have to address at a deeper level, but it is essential to understand your team's ability to take in that signal and respond accordingly.
This relates back to the MTTR. There are many scenarios where a broken build might be tolerated or even welcomed, especially in the development branches. I try to focus on the fact that failure is acceptable. Any engineering organization that says you should never have a failed build and that you should always have a 100% success rate is a bad organization. I can't stress that enough. Those processes kill innovation within your teams. They don't have the freedom to make a mistake or try something new because that will reduce their success rate. There's this idea that if you stop failure, you'll be more successful, but time and time again, we see that's not the case. It's vital to have organizations open to failure, but it's also essential that such failures deliver fast, valuable signals to your developer team.
The ideal success rate we've observed is 90% or more on the default branches. Failures on your topic branches or feature branches, depending on your flow, are probably less disruptive to your software delivery practice than issues on your main line. Pushing something broken to production will be more disruptive to your company and users than a failure on a topic branch. So, it's crucial to emphasize the default branch, but it's also good to track different branches to see trends.
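A per-branch success rate is straightforward to compute once workflow results are grouped by branch. The sketch below uses invented run data purely to illustrate tracking the default branch against the 90% benchmark alongside topic branches.

```python
from collections import defaultdict

# Hypothetical workflow results per branch; the data is invented.
runs = [
    ("main", "success"), ("main", "success"), ("main", "failed"),
    ("main", "success"), ("feature/login", "failed"),
    ("feature/login", "success"), ("feature/search", "failed"),
]

totals = defaultdict(int)   # total runs per branch
passes = defaultdict(int)   # successful runs per branch
for branch, status in runs:
    totals[branch] += 1
    passes[branch] += status == "success"

for branch, total in totals.items():
    rate = 100 * passes[branch] / total
    note = " (default branch, 90%+ benchmark)" if branch == "main" else ""
    print(f"{branch}: {rate:.0f}% success over {total} runs{note}")
```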
From the data we've seen, success rates on the default branch averaged 77%, while non-default was 67%. These rates in the default branches have remained steady. Neither number meets our benchmark, but the pattern of non-default branches having more failures shows companies are more accepting of failure, which we encourage. Also, the recovery times have fallen sharply year over year. The significance of the success rate to your team will depend largely on collaboration, the type of work, and your ability to recover quickly from failures. Prioritizing resilience will help mitigate the effects of low success rates.
Platform engineers have a responsibility to look beyond surface-level metrics and uncover the most meaningful data. They should understand the baseline success rate, then look for continuous improvement opportunities. If you notice that the success rate drops on Fridays, that might be because people are gearing up for the weekend. During holidays, both success rates and mean times to resolve tend to decrease. Many companies avoid deployments over the holidays for this reason.
Throughput reflects the number of changes your developers commit in a 24-hour period. It is useful as a measurement to help understand team flow and track units of work as they move through your CI system. However, throughput doesn't tell you about the quality of work being performed. It's important to look at how many builds you're pushing each day, but that alone won't tell you how well you're performing. High throughput doesn't mean much if the code's quality isn't there.
Each organization will have its throughput goals based on its specific needs. It's important to set goals aligned with business requirements. From our dataset, we noticed an average of 1.5 deployments per day, a slight increase from previous metrics.
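Throughput is similarly easy to measure once you decide what a unit of work is. The sketch below counts hypothetical workflow runs per active day; a real measurement would count commits, workflows, or deployments over calendar days, according to whatever definition your team settles on.

```python
from collections import Counter
from datetime import date

# Hypothetical default-branch workflow runs: one date per run (invented data).
run_dates = [
    date(2023, 5, 1), date(2023, 5, 1),
    date(2023, 5, 2),
    date(2023, 5, 3), date(2023, 5, 3), date(2023, 5, 3),
    date(2023, 5, 4),
]

runs_per_day = Counter(run_dates)
throughput = len(run_dates) / len(runs_per_day)   # average runs per active day

print(f"throughput: {throughput:.1f} runs per active day")
for day, count in sorted(runs_per_day.items()):
    print(f"  {day}: {count} runs")
```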
Lastly, the organization's choice of programming language can affect some metrics, but it's not a significant factor. The most-used languages on the platform are consistent with the broader industry. Python is a dominant language and appears in the top 25 across various metrics. Config languages, Perl, Go, and Vue offer excellent package management. Hack, a PHP superset focused on developer speed, shows higher throughput.
In conclusion, the last two reports are available online, and a new one will be released soon. I invite everyone to check them out for a more in-depth look into the data we've discussed. Thank you.