Building a Secure By-Design Pipeline with an Open Source Stack
Speaker: Rotem Refael
Summary
Rotem Refael, Director of Engineering at Armo, delves into building a Secure by Design pipeline using an open-source stack. Emphasizing the importance of the "Shift Left" strategy in DevSecOps, she identifies the gaps in the traditional CI/CD pipeline and recommends open-source tools to close them. Rotem showcases the benefits of Kubescape, Armo's open-source solution for Kubernetes security, and highlights the need for security gates throughout the development process.
Transcription
Hi everyone. Today we're going to talk about how to build a Secure by Design pipeline with an open-source stack. My name is Rotem Refael, and I've been a developer for the past 15 years. I've also been in the DevOps industry for the past five to six years; I practice yoga and love basketball. Currently, I'm working at Armo as Director of Engineering. Armo creates Kubescape, an open-source solution for security in Kubernetes that checks for misconfigurations and vulnerabilities. We have around 8K stars on GitHub, which is exciting. We were recently accepted into the CNCF as a Sandbox project and hope to reach incubation this year. We're based in Israel, with offices in Tel Aviv and Jerusalem.
Let's get to business. As we are familiar with the DevOps industry, it's rapidly developing. Developers embrace DevOps and the CI/CD pipeline because it aids in quick development. However, security often gets left behind. A world without CI/CD is recognizable by many of us, and the real goal of the CI/CD pipeline is to reduce the risk in software development. We need to plan for potential issues.
Let's examine a typical CI/CD pipeline. I develop my code using tools like Terraform and other infrastructure-as-code, YAML files, Helm charts, Visual Studio Code, and Go. I then open a PR into GitHub, which triggers a CI/CD service such as GitHub Actions, Jenkins, or CircleCI. After that, I store the artifact in the container registry and deploy it using tools like Ansible and Terraform. The cycle closes with continuous monitoring.
Developers are quickly creating and deploying applications, indicating a shift right in the development paradigm.
Looking at this pipeline, the process involves integrating into the repo, running GitHub Actions, storing artifacts in the container registry, deploying across clusters with Argo CD, and monitoring with Grafana and Prometheus. This cycle repeats.
What can possibly go wrong? We have code misconfigurations, dependency management, and third-party projects. Even with tools like Prometheus and Grafana, which everyone uses, we don't investigate their contents, their CVEs, or their misconfigurations, especially when installing health metrics in Grafana. There's code verification; developers often see warnings but don't always pay attention to them. There are production misconfigurations and RBAC violations. When I mention "RBAC" to many DevOps engineers and developers, they often ask, "RBAC? Do you mean Kubernetes RBAC?" We also have package CVEs and zero-days, which we can't always address immediately, but they're present.
The solution? Shifting left. It's a term that's gaining traction. Shifting left means the DevOps team ensures application security early in the development life cycle. This is integral to the organizational approach known as DevSecOps, which emphasizes security. Developers may be rapidly deploying and shifting right, but security needs to be forefront, shifting left. It's crucial to identify and address security issues early.
Most know the concept: pinpointing bugs and issues early in the design and requirement phase of the SDLC is cheaper and less impactful than addressing them in production. The same applies to security. Identifying security concerns on the left side of the process reduces costs and impacts down the line. Hence, the emphasis on shifting left.
Now, looking at this from a practical standpoint, let's begin from the right side and explore securing our pipeline using open-source tools, spanning from production to code.
Let's start. In the cluster deployment phase, we have vulnerability scanners and misconfiguration scanners. For vulnerability scanning, there are Clair and Aqua Trivy. For misconfiguration and in-cluster scanning, there's kube-bench, among others. And there's Kubescape, which scans for both vulnerabilities and misconfigurations, as well as compliance frameworks.
With it, we can see which vulnerabilities exist. In about two months, we will introduce a component that doesn't just scan for vulnerabilities but also tells users which CVEs can actually harm their cluster at runtime. For instance, when scanning Redis, there might be 100 vulnerabilities, but only two are critical.
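That triage idea, many reported vulnerabilities but only a handful that matter, can be sketched in a few lines of Python. The finding structure and field names here are purely illustrative and are not Kubescape's actual output schema:

```python
# Hypothetical sketch of severity triage: out of all findings from a
# scan, surface only the critical ones. The dict keys below are
# illustrative, not a real scanner's JSON schema.

def critical_findings(findings):
    """Return only the findings marked with critical severity."""
    return [f for f in findings if f.get("severity") == "Critical"]

# Pretend this is the output of scanning a Redis image.
scan = [
    {"cve": "CVE-2022-0001", "severity": "Critical"},
    {"cve": "CVE-2022-0002", "severity": "Low"},
    {"cve": "CVE-2022-0003", "severity": "Medium"},
    {"cve": "CVE-2022-0004", "severity": "Critical"},
]

print(len(critical_findings(scan)))  # 2 of the 4 findings need attention
```

The same filter generalizes to "reachable at runtime" once a scanner exposes that signal alongside severity.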
For misconfigurations, we use OPA and Rego libraries to define controls and tests for Kubernetes misconfigurations.
Let's move on to the container registry and its scanning process. As with the vulnerability scanner, we also scan the container registry. The goal is to identify images with critical or high-severity CVEs, including those with remote code execution (RCE) potential. It's essential to understand the security of our container registry before deploying those images into our clusters. Tools like Grype, Aqua Trivy, and Kubescape perform these tasks. While many platforms offer these services, my focus is on open-source solutions and those I've personally tested and verified.
Moving on to the CI/CD server, I'm referring to platforms like Jenkins, CircleCI, and GitHub Actions that manage the entire development pipeline. The main players in this arena are Kubescape, OWASP, and WhiteSource. I believe WhiteSource may have changed its name recently, but I'm not entirely certain. The purpose here is to scan the infrastructure-as-code (IaC) repository for any misconfigurations in the pipeline. For instance, when I draft my Helm chart or YAML file, I run tests during or after my pull request (PR) to identify any potential misconfigurations.
For example, with GitHub Actions using Kubescape, you simply add another GitHub Action, initiate the workflow, and monitor the results. Here's a practical example: in one of my test repositories, I can see the number of tests that failed or passed. This information is broken down further by severity levels. There's also an option to set thresholds - say if a single resource fails, then the entire pipeline shouldn't proceed.
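As a rough illustration, a workflow along these lines adds the scan as a PR check. The action reference and input names below are assumptions on my part and are not verified against the current Kubescape action; check the project's documentation for the real interface:

```yaml
# Hypothetical workflow sketch. The action reference
# (kubescape/github-action) and the severityThreshold input are
# assumed names; consult the Kubescape docs for the actual interface.
name: kubescape-scan
on: [pull_request]

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Scan repository manifests with Kubescape
        uses: kubescape/github-action@main
        with:
          severityThreshold: high   # fail the check above this severity
```

The threshold input is what gives you the gating behavior described above: a single failing resource above the bar stops the pipeline.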
Let's delve into the code. When I'm drafting a YAML file or a Helm chart, I want to validate it before opening my pull request (PR). In response, we developed a Visual Studio Code plugin. You can run it directly within Visual Studio Code on your YAML files or Helm charts to instantly see which tests didn't pass. For instance, if I missed specifying resource requests or limits in my YAML, it would be immediately evident.
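For context, this is the kind of fix such a plugin would prompt for: a container spec with explicit resource requests and limits. These are standard Kubernetes fields; the names and values below are only illustrative:

```yaml
# Standard Kubernetes container spec fields. A misconfiguration scanner
# typically flags their absence; the values here are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 1
  selector:
    matchLabels: {app: example-app}
  template:
    metadata:
      labels: {app: example-app}
    spec:
      containers:
        - name: app
          image: example/app:1.0
          resources:
            requests:        # scheduler reserves this much
              cpu: 100m
              memory: 128Mi
            limits:          # container is capped at this much
              cpu: 500m
              memory: 256Mi
```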
In the realm of scanners, I categorize them into Infrastructure-as-Code and application scanners. Kubescape is predominantly on the IaC front, whereas tools like SonarQube, Checkmarx, and more cater to the application side of things. Other notable security plugins for source code include WhiteSource and Snyk, serving both IaC and application code domains.
Here's an example of running Kubescape on your code: if you forget to specify CPU limits, an alert pops up right there.
Monitoring is an ongoing task. Tools like Prometheus and Grafana serve as monitoring solutions. We, alongside many others, often develop exporters. In Kubescape, we created a Prometheus exporter. This integrates seamlessly with Grafana dashboards, showing data points like the number of resource failures and vulnerabilities. For instance, if I launch a new application into my production cluster, these tools allow me to assess any increased risks using the graphs.
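To make the exporter idea concrete, here is a minimal standard-library sketch that serves scan counters in the Prometheus text format. The metric names are made up for illustration and are not the real Kubescape exporter's metrics:

```python
# Minimal sketch of a Prometheus text-format exporter for scan results,
# using only the Python standard library. The metric names
# (kubescape_resources_failed, kubescape_vulnerabilities_total) are
# illustrative, not the actual exporter's metrics.
from http.server import BaseHTTPRequestHandler, HTTPServer

# Pretend these counters came from the latest scan.
SCAN_METRICS = {
    "kubescape_resources_failed": 7,
    "kubescape_vulnerabilities_total": 42,
}

def render_metrics(metrics):
    """Render a dict of gauges in the Prometheus exposition format."""
    lines = []
    for name, value in metrics.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(SCAN_METRICS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

# To serve it, Prometheus would scrape e.g. http://localhost:9100/metrics:
# HTTPServer(("", 9100), MetricsHandler).serve_forever()
```

Grafana then graphs these series, which is how a newly deployed application's added risk shows up on a dashboard.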
All of that brings me to what everyone is calling security gates across the whole pipeline. I want a security gate at each step I just mentioned: prior to the code, within the GitHub Actions, with the ability to block someone from pushing code that doesn't pass the security tests. I want these gates in the container registry and in my cluster. Of course, I frequently set up cron jobs that constantly check the environment, so I know the daily status of my setups.
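At its core, a security gate is a small policy decision: count the findings that matter and fail the step if they exceed a budget. Here is a hedged sketch with an illustrative findings structure; the field names and threshold are not from any real tool:

```python
# Hypothetical security-gate policy: block the pipeline when critical
# findings exceed an allowed budget. Field names are illustrative.

def gate_passes(findings, max_criticals=3):
    """Return True if the scan stays within the allowed critical budget."""
    criticals = sum(1 for f in findings if f["severity"] == "Critical")
    return criticals <= max_criticals

findings = [{"severity": s} for s in
            ["Critical", "High", "Critical", "Low", "Critical", "Critical"]]

print(gate_passes(findings))                    # False: 4 criticals > 3
print(gate_passes(findings, max_criticals=5))   # True: within budget
```

In a CI job, a False result would translate to a non-zero exit code, which is what actually stops the merge.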
I hope I didn't alarm you too much. There are numerous tools available, and being from Armo and Kubescape, that's where my emphasis lies. We pride ourselves on providing a solution that spans the entire development pipeline, which not all platforms offer. I'm grateful for your time, and I invite you to our booth, where my colleagues Eric and Amalia can assist further. If you have questions or want a demo, now's the time.
Thank you.
I'm not a developer, I'm more on the cloud architect side. My question is around your product and its capabilities for Kubernetes. Does it change for managed instances, serverless fargate, or does it go into AWS ECS?
That's a great question. Kubescape works regardless of whether you're running EKS, AKS, vanilla Kubernetes, or OpenShift. We connect to the Kubernetes API to get the data. We have plugins for Jenkins and GitHub Actions. You can set thresholds; for example, if you have more than three criticals, stop the pipeline and don't merge the PR.
Thanks, Rotem, for that insightful talk. I've been pondering: how far do we want to shift left? With the rising emphasis on 'shift left' methodologies, I feel there's an increased cognitive load for those on the left side. They're burdened not only with delivering business logic to keep stakeholders happy, but now there's added pressure to 'shift left' for security, auditing, and more. Some of my developer colleagues voice their frustrations: "Now I have to run this on my Kubernetes manifest, I need to learn this new system, I'm getting feedback on resource settings, etc." So, what's your take on this?
My perception of 'shifting left' is similar to the principle of unit testing. You wouldn't write code without having unit tests in place for that code. In the same vein, when you're authoring code, be it a YAML file or a Helm chart, if you're not mindful of the security aspects and considerations, it's akin to neglecting unit tests. It's undoubtedly challenging to centralize everything on the left side, but that's why the approach is about incorporating these checks at every step along the way, including within the cluster. Suppose you craft some code and an issue manifests in a live production environment. You identify it in production since you have monitoring tools in place. But, realistically, nobody wants to uncover these issues in a production setting, especially not at midnight. It's far more efficient and less stressful to spot them earlier in the process. I believe it's an integral responsibility of both developers and DevOps. But naturally, things can fall through the cracks, so it's vital to have mechanisms that detect these gaps throughout the pipeline. That's my viewpoint on the matter. How about others?
Between your runtime scans and your build scans on container registries, there must be overlap. How do you handle the duplication?

We have a platform named ARMO Platform. If you're interested, I can demonstrate it briefly. Right here, you can see that the first four stages reflect what you'd see in a live production environment, while the next two represent what you'd see in your CI/CD pipelines, like the repository scan and the registry scan. Everything is integrated. For instance, after scanning my cluster, I might discover a vulnerable image. To address this preemptively, I'd like to know about such vulnerabilities before they ever reach production, so I simply add that image to the registry scanning process. This way there's no overlap; it's a continuous cycle.
Regarding the CVEs: I understand this platform scans during the pipeline progression. But what about CVEs that are identified post-deployment, especially once they've been operational? Does this tool have the capability to intermittently scan the images to pinpoint new vulnerabilities? Let's consider a scenario where a zero-day exploit is identified on the 10th day post-deployment. Can this platform detect that?
Absolutely, and I can give a relevant example. About six months ago, there was the "Log4j" CVE that garnered widespread attention and concern. We promptly identified it simply by scanning our systems. The Kubescape Helm chart has a built-in cron-job feature that runs periodic scans, daily or at any predefined interval, so you're constantly updated about potential vulnerabilities. There's also a parameter you can set so that each time a new image is detected, an immediate scan is triggered. That proactive approach is, I think, what you're looking for.
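For illustration only, the same periodic-scan idea could be expressed as a plain Kubernetes CronJob. The image name and arguments here are assumptions; the Kubescape Helm chart ships its own cron configuration, so consult its values for the real settings:

```yaml
# Hypothetical sketch of a daily scan as a plain Kubernetes CronJob.
# The container image and args are assumed, not taken from the
# Kubescape Helm chart, which configures this differently.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-kubescape-scan
spec:
  schedule: "0 2 * * *"          # every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: scanner
              image: quay.io/kubescape/kubescape:latest   # assumed image
              args: ["scan", "framework", "nsa"]          # assumed args
```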
Yeah, you can. Let me navigate to this section right here. I'm generally not a fan of on-the-spot demonstrations, but I'll make an exception this time. Essentially, you can export the information: conduct a scan, take the results, export them, and distribute them to the necessary parties. The process is intuitive. Here, you can apply filters based on your requirements and then forward the exported data to the respective individuals.
Now, because I'm using minikube for this example, I only have a single scan to showcase. But this interface offers a graphical representation of all your image scans, detailing when each scan was conducted and its severity. And if you delve deeper, specifically under the workloads, you get a more granular perspective.
Do you have any white papers on this for MLOps?
The principle remains consistent across different operational sectors. Be it ML Ops, DevOps, or any other segment, the focus is on identifying vulnerabilities in the complete range of images. The tools or software might vary, like specific databases or stateful sets, but our scanning capability encompasses all. So, irrespective of the ops discipline, once someone registers with their user credentials, their clusters get scanned, and they can proceed with the next steps.
Your mention of the RBAC visualizer resonates with me. Although I may not be entirely impartial, I do concur that it's among the most valuable tools we offer. The visual representation it provides is insightful, allowing users to instantly discern high-privilege entities, spot misconfigurations, or even identify service accounts with no associated workloads. Plus, the option to craft custom queries amplifies its utility. Let's say you're keen on identifying entities with deletion privileges on deployments; this tool can visualize that for you. Seeing all of this represented graphically truly facilitates a deeper investigation.
Thanks, everyone. It was fun.