8 Steps to Build an Open Source GitOps Cloud Native Platform
Speaker: John Dietz
Summary
In this Navigate NA 2023 talk, John Dietz, co-founder of Kubeshop, explores the process of building an Open Source GitOps Cloud Native Platform. He outlines the 8 crucial steps required to establish the platform from scratch, covering aspects such as Git provider selection, Infrastructure as Code (IaC) implementation with tools like Terraform and Atlantis, and the significance of GitOps in handling microservice ecosystems.
The talk also delves into essential components like secrets management with Vault, Artifact Repository options, Continuous Integration, and Authentication/Authorization setups using Kubernetes and RBAC.
Transcription
We've got a lot to cover and I am very excited to be here today. Before I get started, I wanted to take just a quick moment to thank our sponsors for hosting this event, the folks at Civo and our parent company, Kubeshop, for basically buying my project and sending me out here and paying me a salary. It means a lot to me.
It was supposed to be me and my buddy Jared Edwards presenting today, but he had to bail on me last second. Nothing to worry about, just a little sniffles, but too much for him to make it down here. So, I'm going to be presenting solo today. My name is John Dietz. I am a co-founder of Kubefirst. I've spent decades in the computing space, largely centered around automation in all capacities. Back before DevOps, I was deep into test automation, and then I got into development and cloud engineering. After a long enough adventure, I stumbled upon Kubernetes. I fell in love and started a company centered on enabling people to get started on Kubernetes fast.
In the middle of that adventure, I stumbled upon GitOps. GitOps was a game changer for us. It really messed us up, frankly. We had to throw a lot of code out because it's clearly the right way to do Kubernetes in our minds. So, we rebuilt our platform from the ground up and here we are today.
So, I can't really start talking about a Kubernetes platform without setting the table about Kubernetes itself. Kubernetes is the de facto container orchestration engine. I'm sure if you're here, you're well aware of that. Over the last seven or eight years, it's produced this enormous CNCF landscape of tools because the cloud native ecosystem is flourishing. Micro products are the way of the future. Companies need to be able to build their platforms using these micro components, these micro products. Before you know it, it gets to be a little bit too much to handle.
What we love about GitOps in general is that it's able to wrangle that all back in. So, what we're going to talk about today are the eight steps in sequence that you have to go through in order to establish a cloud native ecosystem from the ground up, from scratch, you're starting with nothing at all. The first thing that you're going to have to do is you're going to have to pick a Git provider. There are a handful of Git providers out there, the two big ones in the room are GitLab and GitHub. I highly recommend either of them. The right one for you depends a little bit on what it is that you do in my mind.
If you're doing anything open source, it's very valuable for you to at least have a GitHub presence, in my mind. Now, that's not to say you can't use GitLab; you absolutely can host your open source product in GitLab, people just won't find it as easily. GitLab, however, has a niche in the market where self-hosting your Git provider is very core to their product. They've had for many years this Omnibus GitLab ecosystem that you can self-host: you can have your own Git provider, you can have your own runners, and you can run without any SaaS at all. So, if you're in any type of high-privacy, high-compliance, air-gapped type of environment, GitLab might be a really good choice for you if you want ultimate control of your Git provider and need the ability to cut off the outside world. Beyond that, GitHub and GitLab largely mirror each other's feature sets. If you're somewhere in between high compliance and open source, you could really just flip a coin, pick one, and have a pretty good experience.
The next thing that we're going to talk about is Infrastructure as Code. It's 2023, last I checked, and if you're running any infrastructure, any cloud resources with any complexity about it, you should consider buying into Infrastructure as Code. Infrastructure as Code is not just for infrastructure, despite its term. The IaC tools today are very well suited for configuration as code as well.
Terraform is an open source IaC product and it's basically the de facto standard, but there are some challengers in the space now. One of the technologies that we wanted to call out in this presentation is Atlantis. Atlantis is an open source tool you can self-host, and you can leverage it as a mechanism to automate your Terraform. Terraform itself is a command line tool, which makes it kind of not a great option for GitOps. GitOps is an ecosystem where you want to be constantly evaluating the desired state, comparing it against the actual state, and automatically applying changes to reconcile the delta, if there is one.
With Atlantis, you're able to sort of emulate that ecosystem. Let's go ahead and take a look at what that looks like. Here, I have this console that I'm going to keep coming back to. This is the Kubefirst console. You get it at the end of an installation. You basically just run a single command and you get this output at the end and a console window that lets you into all your new cloud native applications. It's all free and open source, it's incredible, I love Kubefirst.
But let's get our GitOps repository. When you install Kubefirst, you're automatically going to get a GitOps repository that gets hosted in your GitHub or your GitLab ecosystem, and it's pre-hydrated with all of these very popular open-source free tier tools that are all pre-configured to work with each other. It comes with a single sign-on and blah, blah, blah. You should look us up, it's really cool. This isn't a product pitch though.
Let's go to the Terraform directory. I opened a pull request before I showed up here, so there's a pull request waiting that shows what it looks like to use Infrastructure as Code not for infrastructure, but as Configuration as Code. One of the directories that we have in our GitOps repository is a GitHub entry point for Terraform, and it allows us to manage our GitHub repositories as Infrastructure as Code.
A lot of organizations start out the same way every organization starts out. Somebody has an ownership role in GitHub and they just start creating repos. Before you know it, some developer whispers that they should have some teams, so maybe some teams get clicked off together. Before you know it, you've got repos everywhere and no real good organization. Wrangling that back together with something like Terraform and establishing it in code makes a lot of sense.
To create a GitHub repository, I just added these six lines of code. I said, 'Hey, I want to have a new repository called Civo Demo Repo. I want it to auto-init with a README, don't archive it on destroy and add my Developer and Admin team to it, please.' When I open this pull request, something really cool happens in Atlantis where it automatically runs your Terraform plan. If we go down below this line here, this is what would happen if we were to apply the Terraform change. It'll show you, 'Hey, if we're going to apply this plan, it's going to create a GitHub repository and it's going to add your admins and developers to that GitHub organization.'
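As a rough illustration, the six lines described might look something like this sketch using the Terraform GitHub provider; the repository name, team slug, and permission level are assumptions, not the exact demo code:

```hcl
# Hypothetical sketch: look up an existing team, then create and share a repo.
data "github_team" "developers" {
  slug = "developers"
}

resource "github_repository" "civo_demo_repo" {
  name               = "civo-demo-repo"
  auto_init          = true  # initialize with a README
  archive_on_destroy = false # actually delete on terraform destroy
}

resource "github_team_repository" "developers" {
  team_id    = data.github_team.developers.id
  repository = github_repository.civo_demo_repo.name
  permission = "push"
}
```

An `Admin` team would get a second `github_team_repository` block with `permission = "admin"`.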
That plan looks great to me, so I'm going to type 'Atlantis apply' and hit go. Just like that, Atlantis is going to receive a web hook from GitHub and it's going to say, 'Hey, somebody with access to this repository says that this plan looks good, let's go ahead and execute this plan.' If the plan's successful, then it's automatically gonna apply the Terraform change, close your pull request, merge it in with the main branch and do all that stuff automatically.
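Atlantis decides which pull requests to plan from a repo-level config file. A minimal sketch might look like this; the project name and directory are assumptions:

```yaml
# Hypothetical atlantis.yaml at the root of the GitOps repository.
version: 3
projects:
  - name: github-management
    dir: terraform/github   # the Terraform entry point for GitHub resources
    autoplan:
      when_modified: ["*.tf"]  # auto-run `atlantis plan` when .tf files change
      enabled: true
```

With this in place, the plan comment appears automatically on the pull request, and `atlantis apply` does the rest.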
As a result, you get this full audit log in your GitOps repository of every single change that's ever happened. If you can combine your Infrastructure as Code with your GitOps configurations, then you get the complete system of everything that's ever changed across your organization forevermore. You can really lock down who has access. All of a sudden, your SREs don't necessarily need access to the system because Atlantis is conducting that plan.
So, it's the Atlantis role that needs the permission to make those changes, and it's all auditable. Think about what happens when you're running Terraform as an engineer: a lot of people use it as a command line tool, because it is one. But that means when you go to run a terraform apply, you get into these fights with Terraform because you have a name collision or because you don't have access to something. Nobody gets to see that fight going on; you're making changes on the fly, and there's no auditability, no visibility that the fight is happening. Something like Atlantis gives Terraform that extra superpower of being able to say, 'Hey, everything that's ever happened, there is a space for it.'
From an engineering standpoint, it's all predicated on pull requests in GitHub or GitLab or whatever your Git provider is. So, it allows you to explain to your engineering team, 'Hey, if you want to know what happened in the system, just go to the GitOps repository and look at the closed pull requests; that's everything that happened, in chronological order.' It's really easy to understand. A lot of people tend to lean on cloud logs and the like, and those are really hard to use, the engineers don't like them, and half of them don't have access. So, it's a much better paradigm with a tool like Atlantis.
So, I guess I should show you the proof there. I have this GitHub organization, YourCompany.io, and as of just a moment ago, it now has a Civo demo repo. Very cool! So that's Terraform, that's Atlantis. Let's get back to this slide.
There are some other technologies that you should be considering as well. There's Pulumi. Pulumi has an interesting angle on Infrastructure as Code: Terraform requires you to learn a language called HCL, the HashiCorp Configuration Language, which is a pretty simple language, but it's not the language your engineers are going to know by default. So, if you're a really heavy Node.js shop, or a really heavy GoLang shop, Pulumi lets you write your IaC in those languages, and that may be attractive to some folks.
I also want to call out Crossplane. Crossplane holds incredible promise for GitOps specifically. Crossplane works as an Infrastructure as Code control plane, so it's always reconciling. With Atlantis and Terraform, as I was showing you a moment ago, you basically have to take a proactive action to get it to measure what the change would be: you have to open that pull request. With Crossplane, that goes away, and there's just this engine that's always watching your configurations and understands how to apply Infrastructure as Code the same way Terraform does. Crossplane also has some neat integrations with Terraform available to it, so if you're using a Terraform provider, you can shift that into a Crossplane provider, or wrap that Terraform in a Crossplane provider. I think Crossplane is going to be an up-and-comer. We're certainly going to be looking at adding it to the Kubefirst platform. But for today, I'd say Terraform is probably the safest bet if you're starting something new.
Next, you want to consider the GitOps driver. Now, GitOps is incredible. Like I was saying, you have all these different micro components that are going to blossom and double and triple and quadruple in your ecosystem. You start out with 10 microservices and before you know it, you have 100. You blink your eyes again, and there's a thousand of them.
Keeping asset management under control is hard in a Kubernetes environment where you have multiple clusters and multiple environments, each of those clusters has hundreds of micro components, and you have application developers who are constantly building applications and delivering them to development, staging, and production environments. Sometimes those applications break and the pipelines don't work. GitOps says no to all of that chaos.
If you take nothing away from this talk at all, take away that if you're starting Kubernetes, you are crazy not to buy into GitOps. It's a complex microservice ecosystem that you're buying into, and GitOps is able to get that back under control. So much so that we took the company we were founding, threw away everything we had built one year into our adventure, and started over on the GitOps discipline. It's a passionate discipline of mine, with good purpose.
When you're evaluating a GitOps driver, you have some decisions to make. There are some different options. There's Flux CD, which is CNCF graduated. There's Argo CD, which is also CNCF graduated. Both are very, very good products. GitLab I understand is also getting into the game. They all have different strengths and weaknesses.
When it comes time to evaluate what you want out of your GitOps tool, it's important to take a vendor-agnostic view. OpenGitOps is a CNCF workgroup that is working to establish a vendor-agnostic definition of the GitOps discipline and how it works. Then, with those eyes, you can evaluate Argo CD, Flux CD, GitLab, etc.
I'm going to show you, real quick, an example of how GitOps works. So here we have Argo CD. Argo CD is our favorite GitOps engine. Let me sign in here. With Argo CD, the way we have it organized is we like to have a registry for every cluster in our ecosystem. Now, our registry, in technical terms, is just a folder in a Git repository. You could argue semantics and whether it has to be Git in GitOps or whatever, but practically speaking, you're talking about a folder in a Git repository.
Let's take a look at that registry folder here. So here we have a registry, and it has all our applications in it. If you want to look at the Datadog configuration, for example, this one actually points at components/datadog. So let's go there.
And there's a Datadog.yaml, and it says, 'Hey, let's install this version of Datadog, let's add the APM service, and let's create a secret so that the Datadog agent can have access to your Datadog API key and app key.' Pretty simple, it pulls that secret from Vault. And all of the applications work this way.
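A registry entry like the one described is typically just an Argo CD `Application` manifest. Here's a hedged sketch; the repo URL, paths, and sync settings are placeholders, not Kubefirst's actual layout:

```yaml
# Hypothetical registry entry pointing Argo CD at the components/datadog folder.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: datadog
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/gitops.git  # your GitOps repository
    path: components/datadog
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: datadog
  syncPolicy:
    automated:
      prune: true     # remove resources deleted from Git
      selfHeal: true  # revert out-of-band cluster changes
```

The automated sync policy is what gives you the continuous desired-state reconciliation described above.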
So if you want to change any configuration of any application in your ecosystem on a GitOps paradigm, it's just a matter of pull-requesting a change to your GitOps repository. Then, within a three-minute sync period, Argo CD is going to automatically apply that to your infrastructure. If there was a problem with it, it's just a matter of reverting that pull request, and you can, generally speaking, get back to the state that you were just in.
Now, if you have stateful applications, there's some devils in that detail. You want to make sure that you're not just blindly hoping that magic's gonna happen. There's no magic to be had. If you have a database with stateful data that's going to be destroyed, you know, you would want to make sure that you were attending to that requirement for your rollbacks. That's Argo CD, in a nutshell. And your GitOps driver is probably the most important decision that you're going to have when you're building up your architecture.
Secrets management: generally speaking, I think about secrets like this: you have a password to connect an application to a database, and you have to keep it somewhere safe. You can keep it in your cloud's secrets manager; that's a fine answer. The problem is that there are a lot of different types of secrets, and they all have different workflows. Leveraging Git for the implementation of your secrets is, I don't think, the right workflow, because you have to be able to rotate your secrets without impacting your applications.
Vault does a really great job at that. We don't like to bind our applications directly to Vault. We use a tool called External Secrets Operator that serves as a shim layer to make sure that the secrets in Vault are kept in sync with the secrets that you have in Kubernetes.
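A minimal sketch of that shim layer with External Secrets Operator, assuming a hypothetical Vault-backed `ClusterSecretStore` named `vault-kv` and a made-up secret path:

```yaml
# Hypothetical ExternalSecret: syncs a Vault value into a Kubernetes Secret.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: datadog-secret
spec:
  refreshInterval: 1h        # re-sync cadence; rotation in Vault flows through
  secretStoreRef:
    name: vault-kv           # assumed ClusterSecretStore pointing at Vault
    kind: ClusterSecretStore
  target:
    name: datadog-secret     # the Kubernetes Secret that gets created
  data:
    - secretKey: api-key
      remoteRef:
        key: secret/datadog  # assumed Vault path
        property: api-key
```

Because applications only ever read the Kubernetes Secret, you can rotate the value in Vault without rewiring anything.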
Vault's open source offering also includes an OIDC provider, which is very important for single sign-on. I'll give you a quick look at what that looks like. Let's hop in.
It sounds complicated. You've probably heard about Vault for many, many years, and there used to be a lot of horror stories about how hard it was to set up and configure. That's not the case anymore. Since they implemented a Helm chart, it's been very, very stable for the last four years or so.
You can keep some secrets in here. I'll show you: this is just a kind of sample development secret, and you can keep a secret for each one of your technologies, your Datadog API keys, and whatever else. That's Vault, long and short. We use Vault as an SSO OIDC provider for all the tools on our platform. We think that's a really good approach. And that's all I have time to say about Vault.
Artifact Repository. This is going to fluctuate from shop to shop because every shop has different artifacts that it needs to produce. But in a Kubernetes environment, you're always going to at least need a container registry, somewhere, somehow. Typically, I like to recommend that folks use the cloud provider's container registry because, generally speaking, they have really high availability (lots of nines), and you need your containers to always be available. If they're not available, it's kind of crippling.
But in an SRE-oriented environment, which is the nature of this talk, you would probably want to consider leveraging GitHub or GitLab; they each have their own container registries built in. Harbor is also a really great CNCF project if you want to self-host your container registry. There are no pretty demo pictures to show you with registries. They're pretty boring.
Continuous Integration is the next step. You have to have it. If you pick GitHub, GitHub Actions is going to be really convenient for you. If you have GitLab, GitLab Runners are going to be really convenient for you. But we believe that having a Kubernetes native CI engine is very powerful in the way that you can define what secrets are being used, how they're being used, what has access to those secrets. From a rotation policy standpoint, it's far superior and you can leverage something like Argo Workflows, as an example, or Tekton is also a great one. Just have a shim layer in your Git provider that invokes those workflows and you get that same developer experience that you have in your pull requests. But you're using a Kubernetes native technology. We've found that to work really well.
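As a sketch of that shim-layer idea, the Git-provider side of CI can do nothing but submit an Argo Workflow and wait on it. The `WorkflowTemplate` name and parameter are hypothetical, and the cluster-access setup is omitted:

```yaml
# Hypothetical GitHub Actions shim that delegates CI to Argo Workflows.
name: ci
on: [pull_request]
jobs:
  submit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Submit Argo Workflow
        # Assumes the runner already has kubeconfig access and the argo CLI.
        run: |
          argo submit --from workflowtemplate/app-ci \
            -p git-sha=${{ github.sha }} \
            --wait
```

The pull request still gets a pass/fail status, but the secrets, the build logic, and the RBAC around them all live inside the cluster.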
Authentication and authorization: you have to have auth built into your IDP, your Identity Provider. You have to have it built into your Kubernetes engine, from an RBAC standpoint. You have to have it built into your cloud, from a role standpoint. And establishing all of that upfront and early is really important, to at least start somewhere. Start with just admins and devs. Even if that's as simple as you want to start out, at least start out that complex, so that you're not handing out admin credentials to everyone and dismissing that part of the discipline. It's hard to wrangle that back under control once everybody's an admin in your organization.
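On the Kubernetes RBAC side, a minimal starting point for that admins-and-devs split, assuming your OIDC provider passes a `developers` group claim:

```yaml
# Hypothetical sketch: give the "developers" OIDC group read-only access.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: developers-view
subjects:
  - kind: Group
    name: developers          # group claim from your OIDC provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: view                  # built-in read-only ClusterRole
  apiGroup: rbac.authorization.k8s.io
```

An admins binding would look the same but reference the built-in `cluster-admin` ClusterRole and your admin group.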
The last thing that we'll call out is observability. Now, we really like Datadog as an observability layer. It provides a lot of stuff out of the box, especially in Kubernetes. You can install an agent and you just get a lot of really cool stuff. So, if I pull this up as an example, this is the view. I literally installed this entire environment, everything that I showed you, at 11:40 last night. And all I did was install a Datadog agent, give it a Datadog API key, and say go. That was it. And you get all of this observability. You get great dashboards for free. You can drop into your infrastructure host map; let me pull up the containers and break it down by namespace.
So here you can see, these are representations of pods running in my Kubernetes cluster, and each one of these is a different component with an application running in it. That's something you're going to have to track logging on, set up monitors for, etc., and it's really important that this is always up and operational for your organization. So, offloading that burden to a SaaS provider like Datadog is a smart idea. From a disaster standpoint, if something's going really wrong, you can't have your observability layer going down too. Datadog is also rather fair about providing configuration tools to keep costs under control.
Now, you don't strictly need Datadog. You can go with an open source layer like InfluxDB; I think there's a talk right after this one that you should stick around for. You could use the ELK Stack for logging, Grafana for pretty graphs, and Jaeger for tracing. There are all these different tools that you could implement. The problem is that it takes a lot of engineering care: log rotation, making sure everything is always up, highly available, and operational. Just not having that as a problem you have to care about is very, very valuable to a lot of organizations.
So, that's all I'm going to say about Datadog. As a quick recap... Well, I don't need to recap. That was a quick enough session. Take a quick picture of this slide. Scenarios to avoid: don't establish CI without a secrets manager. People will put secrets in your CI and you'll never get them back. They'll become deeply ingrained in your CI ecosystem, and when it comes time to rotate those secrets, you'll have a nightmare on your hands. Definitely set up a secrets manager first.
Don't start scripting; leverage GitOps. Don't run Terraform locally; automate it with something like Atlantis, or buy into Crossplane. But definitely automate it and make it transparent. And don't delay single sign-on. Single sign-on and roles are really important to your organization.
So, lastly, I wanted to mention, I do have a product. It's free, it's open source. We literally can't take your money. So, check us out. It's Kubefirst. All of the technologies that we showed you today, you can have in 30 minutes with a single command and that's our story. So, thank you for joining our talk today. We love GitHub stars because we don't receive any money from anybody and GitHub stars are incredibly valuable to us. So, github.com/kubefirst/kubefirst. My whole team is waiting to see if we get some stars after this. So, thank you guys so much for seeing my talk and I hope you enjoy the rest of the conference.