Monitoring MLOps Workflows with Flyte
Speaker: Shivay Lamba
Summary
Learn how to monitor MLOps workflows effectively with Flyte-powered Grafana dashboards in this talk from Navigate NA 2023. Shivay Lamba discusses the importance of monitoring in the MLOps journey, highlighting the unique challenges in monitoring machine learning models compared to standard software development. Discover how Flyte, an open-source project, can help manage and monitor ML tasks efficiently, and see a live demonstration of setting up Grafana dashboards to visualize system metrics like CPU and GPU utilization. Take advantage of this opportunity to enhance your MLOps monitoring skills and optimize your machine learning workflows.
Transcription
Welcome everyone, I hope everyone is having a great second day. I think it's just been lunch, so I hope everyone is having a great time. Today, I'll be talking about MLOps and how monitoring is such an important aspect of MLOps. Now, you might have been to a few machine learning specific talks at Civo Navigate because there is a dedicated ML track. So my specific talk will be based on the monitoring side of MLOps.
One thing to keep in mind is that when we talk about MLOps, there are multiple facets to this entire journey. We'll be understanding why monitoring is such an integral part of it, and then how you can use something like Flyte, which is an open-source project, to help you with that journey for MLOps. We'll be covering all of that today.
In reality, I was supposed to present this talk with my friend Aakash, but he couldn't make it because of visa issues, so I'll be presenting the talk alone. A quick introduction about myself: I'm a Developer Relations Engineer at Meilisearch, which is an open-source, Rust-based search engine, and I'm also a Civo Ambassador, which is basically a Civo Developer Advocate. I like to work a lot with cloud-native, WebAssembly, and MLOps.
Moving forward, when we talk about monitoring in general, in terms of your standard software development, you usually use monitoring for things like being able to see your logs or the performance of your application at various times, and you might combine it with Grafana to visualize this on a dashboard. But monitoring in terms of machine learning paints a different picture.
Normally, when you're training your machine learning models to get them productionized, you work with a limited dataset, and you might not take into consideration all of the edge cases that might actually come up when you are working in a live production environment. So when the time comes to actually productionize your machine learning model, that's where some of the issues come up.
It could be that the data distribution has shifted; there are a number of different phenomena here, especially things like data drift and concept drift, which you might be aware of if you are in the ML ecosystem. And of course, the biggest one is that when you productionize your model, the type of data it's being exposed to is a lot different than the data you trained it with.
So there are a lot of hidden edge cases that might degrade the performance of your machine learning model that you might not have taken into consideration. That is why, when you productionize your machine learning model, it's really important to continuously monitor its overall performance on a number of different parameters.
Your model can still make predictions, but you're not sure whether those predictions are actually accurate or not. In terms of your standard software development lifecycle, if there are errors inside of your application, your Prometheus-based monitoring could pick that up. But with machine learning, it's not very easy to just say whether your model is performing well or not: the monitoring will give you the logs, but it's up to the machine learning scientists and data scientists to look at whether those logs actually make sense, and to evaluate the performance of the model in production. That is why monitoring is really essential in ML, and that is how it differs from your standard software development.
Now if you take a look at the standard machine learning lifecycle, you might have come across this multiple times during this series, but I'd just like to reiterate it very quickly. You start with the collection of your data, then you clean up your data, and once you have cleaned up your data and done some feature extraction and feature engineering, the time comes to go ahead and train your model. Once you have done your training, you'll do evaluation, and finally, you will go ahead and deploy. Now, once you deploy, that's where observability also comes into the picture, because that allows you to ensure whether the performance of the machine learning model is good or not.
Let's move ahead and look at what you can actually monitor when it comes to machine learning. The first one is the system metrics. One key thing you have to ensure is that whenever you're deploying your machine learning model and it's making predictions, the CPU utilization and GPU utilization for your machine learning model do not go off the charts. Machine learning can be very power intensive, so whenever you are continuously monitoring the performance of your machine learning model, you want to ensure that your CPU utilization does not go beyond what's expected.
So you need to continuously monitor your system metrics whenever you're dealing with machine learning in production. And then, of course, there are the resource metrics: because your machine learning models are running on compute resources, you need to ensure that those resources are not being exhausted, so that the models you are running keep running efficiently as well.
The other main thing I would like to highlight is the ML metrics, which are essentially your model metrics: how does your model perform over time, and does its performance actually degrade over time? So these are the three classical things you have to take into consideration when you are monitoring any machine learning workload that you're running in production.
Now, of course, as with any standard software development, Prometheus and Grafana come to the rescue. The combination of Prometheus and Grafana provides a really amazing way to not only look at the different logs but also keep track of time series data, because a lot of the time with machine learning you might be working with a time series database.
So whether it's your time series database, or you're just looking at the logs for your CPU utilization, your GPU utilization, or even your resource metrics, you can use the combination of Prometheus and Grafana to achieve that, and that's what we'll also see in today's demo.
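To make the metrics side a bit more concrete (this is not part of the Flyte demo, just an illustrative sketch): a model-serving process can expose its own counters and gauges for Prometheus to scrape, using the prometheus_client Python library. The metric names, the drift score, and the helper function below are assumptions for illustration only.

```python
import time

from prometheus_client import Counter, Gauge, start_http_server

# Illustrative metric names; pick whatever fits your own model service.
PREDICTIONS = Counter("model_predictions_total", "Number of predictions served")
LATENCY = Gauge("model_prediction_latency_seconds", "Latency of the last prediction")
DRIFT = Gauge("model_feature_drift_score", "Current drift score per feature", ["feature"])


def record_prediction(latency_seconds: float, drift_scores: dict) -> None:
    """Update the Prometheus metrics after each prediction (hypothetical helper)."""
    PREDICTIONS.inc()
    LATENCY.set(latency_seconds)
    for feature, score in drift_scores.items():
        DRIFT.labels(feature=feature).set(score)


if __name__ == "__main__":
    # Expose the metrics on http://localhost:8000/metrics for Prometheus to scrape.
    start_http_server(8000)
    while True:
        time.sleep(60)  # keep the process alive; a real server would be handling requests here
```

Grafana can then chart these model metrics right alongside the CPU and GPU dashboards.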
Now this is where I'd like to introduce Flyte. Flyte is a Kubernetes-native workflow automation platform for machine learning and even for data processing. If I had to break it down in very layman's language, Flyte is a platform that not only allows you to run machine learning workloads very efficiently but also provides you a Platform as a Service for managing these workloads. It helps your different ML teams work together, and on the ops side you can keep an eye on CPU utilization and scale your resources up or down, because it's built natively on top of Kubernetes. So you can manage both the ML and the ops side from Flyte itself.
Now, the major goal for Flyte, as I mentioned, is to provide that Platform as a Service for everything related to machine learning. As you can see from this diagram, it provides a level of segregation between your ML teams, which are essentially your data scientists who might be working with different types of machine learning frameworks, and your ops teams, who can very easily scale up your clusters in order to execute your machine learning workloads.
At the core of Flyte are primarily your workflows and tasks. This is very similar to components and pipelines in Kubeflow. If you are aware of what Kubeflow is, it's also an open-source platform that allows you to run machine learning workloads at production level with the help of Kubernetes. Tasks are the analogue of components and are the basic building blocks of your entire machine learning pipelines. One task could be related to, let's say, the training of the model, and another task could be related to just doing the feature engineering. Workflows are essentially the combination of multiple tasks that you run in sequence; internally they use a DAG, or dependency graph, through which you are able to connect your multiple tasks and run them one by one. And we'll be seeing, with the help of an example, how you can actually visualize these tasks. As you can see over here in the diagram, we start with an input task, then there are some tasks in between, and you get an output, and all of this is encapsulated inside of a workflow.
And here is an example of how you can actually write your tasks and workflows. Here we have taken an example of two different tasks. One great advantage of Flyte over Kubeflow is that Flyte resembles Pythonic code a lot more than Kubeflow does. So if you are well-versed with decorators, you can see how we have defined these tasks and then encapsulated them inside the workflow. You can also do a lot of different things, like executing these tasks locally on your systems and enabling caching for better performance.
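As a rough illustration of what that Pythonic style looks like, here is a minimal sketch using the flytekit SDK; it is not the exact code from the slide, and the task names and logic are made up.

```python
from flytekit import task, workflow


@task(cache=True, cache_version="1.0")  # caching is opt-in, per task
def double(x: int) -> int:
    return x * 2


@task
def add_one(x: int) -> int:
    return x + 1


@workflow
def example_workflow(x: int = 3) -> int:
    # Passing one task's output into the next is what builds the dependency graph.
    return add_one(x=double(x=x))


if __name__ == "__main__":
    # Workflows can be executed locally as plain Python for quick iteration.
    print(example_workflow(x=5))
```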
And of course, the main idea is that you can create different types of jobs and pipelines, so end-to-end machine learning pipelines directly inside of Flyte, and then execute these jobs based on your business requirements. Now, when it comes to monitoring, because we are looking at Flyte, there is out-of-the-box integration between Prometheus and Flyte. A lot of different metrics, such as your task execution and your workflow execution, come out of the box, and Flyte exposes these as Prometheus metrics.
If you're running your machine learning workloads, you might have to monitor the performance. So you might want to see how your ML tasks and workflows are performing, and you might also want to look at the CPU utilization. You can very easily do that with the help of these Flyte-powered dashboards, and we'll see how easy it is to monitor these with the help of Grafana.
That is what we'll quickly take a look at in our demonstration. We will be creating an account on Grafana Cloud, and we'll be configuring Grafana Loki, which allows you to keep track of all of your logs. As the next step, we'll take a look at our demo. It's a very short demo, but it should help you understand what we are trying to do.
Right at the beginning, Flyte out of the box allows you to run your machine learning workloads locally. You can very easily just create a Flyte sandbox ('Flyte sandbox create'), and that allows you to manage everything directly inside of your cluster; you just need Docker and 'kubectl' installed. So if I take a look at my dashboard over here, which is basically my command prompt.
Before I actually run anything, I'll quickly show what my current 'kubectl' output looks like. So let me quickly go ahead and show that. You can take a look at my current pods, and of course, you'll also see that Flyte is installed as well.
Now what I'll do is quickly show you an example file. This is the workflow file that I'll be running. As you can see, we have basically defined three different tasks, and the first task is where I actually get my data. Again, what we are trying to do is replicate a very simple end-to-end machine learning project and break it down into tasks and workflows, which are again analogous to your pipelines inside of Kubeflow.
Over here, you can see that we have defined three different tasks. The first task just gets our data, which is basically loading the wine dataset. Then we process our data to turn the three-class dataset into a binary classification problem. And finally, we have also defined a task for actually training our machine learning model, where we are going to be using logistic regression. All of these different tasks are carefully integrated inside of our workflow; as I mentioned, your workflow encapsulates all the different tasks that you have.
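Here is a minimal sketch of what such a three-task workflow file could look like with flytekit and scikit-learn. It mirrors what the demo describes (wine dataset, binary target, logistic regression), but the exact function names, signatures, and hyperparameters are illustrative assumptions rather than the code shown on screen.

```python
import pandas as pd
from flytekit import task, workflow
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression


@task
def get_data() -> pd.DataFrame:
    """Load the wine dataset as a DataFrame."""
    return load_wine(as_frame=True).frame


@task
def process_data(data: pd.DataFrame) -> pd.DataFrame:
    """Collapse the three wine classes into a binary target (class 0 vs. the rest)."""
    return data.assign(target=lambda df: df["target"].where(df["target"] == 0, 1))


@task
def train_model(data: pd.DataFrame, hyperparameters: dict) -> LogisticRegression:
    """Fit a logistic regression model on the processed data."""
    features, target = data.drop("target", axis="columns"), data["target"]
    return LogisticRegression(max_iter=3000, **hyperparameters).fit(features, target)


@workflow
def training_workflow(hyperparameters: dict) -> LogisticRegression:
    """Chain the three tasks into one end-to-end pipeline."""
    data = get_data()
    processed = process_data(data=data)
    return train_model(data=processed, hyperparameters=hyperparameters)
```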
Now what I'll do is run this with the help of the 'pyflyte' command, which comes with the Python SDK for Flyte. Another great advantage of using something like Flyte is that there's a very easy-to-use Python SDK, so you don't need to know what's happening behind the scenes or under the hood, especially on the Kubernetes side, because you can use the Python SDK directly.
Here what you are seeing is that I'm going to be using the 'pyflyte run' command with that 'example.py' file, and I'm specifically calling my training workflow, which has been defined over here, as you can see. I'll also be passing some hyperparameters for this execution.
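For reference, the invocation looks roughly like the sketch below; the workflow name and the '--hyperparameters' value are assumptions carried over from the illustrative file above, and the exact flags depend on how the workflow's inputs are declared in 'example.py'.

```python
# From the shell (illustrative):
#   pyflyte run example.py training_workflow --hyperparameters '{"C": 0.1}'
#
# The same workflow can also be exercised as plain Python, which is handy for a
# quick local check before submitting it anywhere:
from example import training_workflow  # assumes the sketch above lives in example.py

if __name__ == "__main__":
    model = training_workflow(hyperparameters={"C": 0.1})
    print(model)
```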
Now, as soon as I run this, we'll just wait for a few seconds and we should see over here that our execution has become ready, and we can go to our Flyte dashboard. Over here, you can see that this is my latest workflow, and if I go into the workflow, these are the three separate tasks that I had: 'get data', 'process data', and 'train model'. As these tasks are running right now, they will run one by one, since they are all dependent on each other. Our execution started with fetching the data, and then we have separate tasks for processing the data and for training the model.
Now, in order to get these metrics into Grafana, you can very simply go ahead and create an account on Grafana Cloud. What we are using is Grafana Loki, which is basically a log aggregation platform that allows you to manage all of the different logs for your machine learning workloads. I have already done that, and as soon as you create an account on Grafana Cloud, you'll be able to see the instance I created, 'device17lama.grafana.net'. If you also sign up for Grafana Cloud, you will be able to generate these dashboards on your own. And again, the way we do it is by setting up the Prometheus and Grafana operator.
Once you do that, if I take a look at my pods right now, there is one for the Grafana Agent, and this confirms that it's running fine on my system. Once you set up your Grafana Cloud account or Loki, there are some basic instructions you can follow to configure your Kubernetes cluster locally so that Grafana can take a look at all the different resources running inside of it.
In this case, what I'll do is go back to my Grafana Cloud, and you can see all the different workloads that are currently there. The one I am specifically interested in is the Flyte sandbox, because that's what is actually running locally on my system. Let me go over to the view where I can see the logs. Yeah, so these are all the different logs: as I create, run, and execute these workflows, the logs for each of them show up over here among the latest logs.
Another dashboard I can see over here shows the CPU utilization and memory utilization for running these workloads. In fact, if you look inside the CPU quota, you can see 'Flyte Sandbox Development'; these are all the different pods that get created when I run my workloads. You can monitor the health of your pods, or basically each and every task, so you can monitor how your tasks or even your workflows are executing with the help of Loki, and this gives you a very good understanding of how your machine learning workflows are actually running.
Now, of course, as I mentioned, there are primarily three different types of metrics that you're looking at: your system metrics, your resource metrics, and your model metrics. All three can be very easily managed and looked at with the combination of Prometheus and Grafana.
I would also like to share how you can actually run Flyte on Civo, because the demonstration I showcased was primarily for a Flyte cluster running locally, that is, the Flyte sandbox. To do that, you can just create a new cluster on Civo and ensure that you have Helm and Traefik installed. In my case, I'm over here inside of a cluster that I already created yesterday, and if you look at my installed applications, I do have Helm and Traefik installed. Helm will be required to install the Helm charts for running Flyte on top of your Civo cluster.
You can find the information for that inside the GitHub repository I have linked. If I go to this GitHub repository, you can see that there are a number of different Helm charts there, and the ones we are going to be installing are primarily the dependencies and the core chart. Once you install these on your Civo cluster, then with just a basic configuration of Traefik it should be very easy to set it up. And this is an example of my Civo cluster running Flyte.
You can see the URL here; I'll just go ahead and refresh and try to execute one task that I had already run, so I'll relaunch that particular task. The idea is that now you could very easily also configure Grafana and Prometheus logging for your Civo cluster directly with the help of the installed applications, and you'll be able to manage this entire lifecycle of running and executing your machine learning workloads, alongside tracking the metrics, directly with the help of Civo as well.
So yeah, this is my managed Flyte running on my managed Civo cluster. And it should work; it's just running, so let's wait for a few seconds. Yeah, as you can see, the tasks have already started to run again, and these are not running locally. I would have loved to show this being run locally as well, but right now my 'kubeconfig' is configured with my local cluster. Of course, you can very easily configure your kubeconfig with your Civo cluster too.
But yeah, as you can see, my workloads ran successfully. The next step would be to create an account on Grafana Cloud, connect your Prometheus and Grafana operator, and then view all of your logs on the Grafana dashboard that I showcased in today's presentation.
With that, I will conclude. Before I stop, because we are nearing the end of this entire Civo Navigate and since we had a number of different talks on MLOps, I just wanted to share some best practices when it comes to MLOps. It's not directly related to the talk, but since monitoring is such a huge aspect of your entire MLOps lifecycle, these are some tips that I'd recommend everyone follow if you are into MLOps. But yeah, with that, I'll conclude, and thanks for watching. I'll be open to questions now. Thank you.