Serverless computing simplifies the deployment process by managing and scaling resources on demand. In this tutorial, we will demonstrate how to deploy a Large Language Model (LLM) like Llama 3.1 serverlessly on Kubeflow, using KServe on Civo Kubernetes with CPU-only nodes. Throughout, we will use Civo’s Kubeflow as a Service, a fully managed environment that makes machine learning workflows simpler, more portable, and scalable.

What is Llama 3.1?

Llama 3.1 is the latest iteration of Meta’s open-source LLM, boasting significant improvements over its predecessor in language understanding and generation. Available in 8 billion, 70 billion, and 405 billion parameter variants, Llama 3.1 is designed to excel in applications ranging from complex problem-solving in scientific research to content generation and multilingual translation.

Why Kubeflow on Kubernetes?

Kubeflow is a popular open-source project that significantly enhances the capabilities of Kubernetes for machine learning projects. By using Civo’s fully managed Kubeflow as a Service, you gain access to a robust environment that integrates various components of machine learning workflows into a cohesive system, simplifying the process from model training to deployment. This not only reduces operational overhead but also allows you to focus more on model optimization and effective deployment.

This tutorial will guide you through deploying the popular open-source LLM, Llama 3.1, on a CPU-only instance provided by Civo’s Kubeflow as a Service, making it an ideal choice for efficiently managing and scaling machine learning models.

While this tutorial focuses on deploying Llama 3.1 in a production environment using a CPU-only example, the steps provided can easily be adapted to deploy other models as well. Although we've concentrated on CPU-based deployments here, it's important to note that these methods are also perfectly suitable for running models on GPU clusters, which can significantly enhance performance for more compute-intensive tasks. For an in-depth look at deploying Llama 3.1 with a GPU-focused approach, leveraging the power of Terraform on Civo, see our companion guide: Deploy your own Private ChatGPT with Llama 3.1 and Civo GPU Clusters.

Prerequisites

Before following along with this tutorial, you will need to have the following prerequisites in place:

Developing a container image to serve the model

To deploy our LLM in a serverless manner, we must first create a container image. This image will facilitate the model's inference operations and manage API requests. Once developed, we'll deploy this image on Civo's Kubeflow platform.

A pre-built container image is available for those who prefer a quicker setup. If you choose to use this, you can skip the building process and proceed to the next section, titled Serve the Model with Kubeflow.

To begin the setup process, we'll clone an example repository that contains the necessary inference code and the definition for our container image. This repository serves as a base for deploying the LLM.

Here’s how to clone the repository and navigate to the appropriate directory:

git clone https://github.com/civo-learn/kubeflow-examples
cd kubeflow-examples/llm-deployments/llama3/q4/

Currently, our primary focus is on the app.py and Dockerfile. The app.py contains the model serving code, and the Dockerfile provides the necessary configurations for building our container image. But before diving into these files, it's important to address the preparation needed to run our model efficiently on a CPU cluster, specifically concerning the model's weight quantization.

Model Optimization through Quantization

Quantization is a process that reduces the precision of the model's weights, which can significantly speed up inference times, especially on CPUs, where computational resources are more limited compared to GPUs. This process can make deploying large models more feasible by decreasing memory usage and computational demands. For more detailed information on quantization's effects on model performance and quality, consider reading the paper on GPTQ.
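
To make the idea concrete, here is a small, purely illustrative Python sketch of symmetric 8-bit quantization. It is not the scheme llama.cpp/GGML actually uses (their block-wise k-quant formats are more sophisticated); it only shows why lower-precision weights shrink memory use at the cost of some accuracy:

import numpy as np

# Illustrative symmetric 8-bit quantization of a weight tensor.
weights = np.random.randn(4096).astype(np.float32)

scale = np.abs(weights).max() / 127.0                  # map the largest weight into the int8 range
q_weights = np.round(weights / scale).astype(np.int8)  # 4x smaller than float32 storage

dequantized = q_weights.astype(np.float32) * scale     # approximate reconstruction at inference time
print("max absolute error:", np.abs(weights - dequantized).max())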

To implement quantization, we will follow the steps outlined in the llama.cpp repository to download the Llama 3.1 weights and quantize them using the GGML library, which is designed for efficient CPU execution.

Depending on your needs, you can choose 2-, 4-, or 8-bit quantization (noted as q2, q4, and q8 in the repository). Keep in mind that lower-bit quantization increases speed but may reduce the quality of the model's outputs. You can follow the GGML/llama.cpp steps to produce weights at any of these sizes.
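
As a rough outline, the llama.cpp workflow for producing these weights looks something like the commands below. The exact script and binary names (convert_hf_to_gguf.py vs. convert.py, llama-quantize vs. quantize) vary between llama.cpp versions, so treat this as a sketch and follow the repository's README for the precise steps:

# Convert the downloaded Llama 3.1 weights to a 16-bit GGUF file (script name varies by version)
python convert_hf_to_gguf.py /path/to/Meta-Llama-3.1-8B-Instruct --outfile llama-3.1-8b-f16.gguf

# Quantize to 4-bit; use a Q2 or Q8 type instead for the q2/q8 variants
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4.gguf Q4_K_M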

After quantizing the weights, you should change directory (cd) into the corresponding folder (q2, q4, or q8) to proceed with setting up the deployment:

cd {chosen quantization directory}

In app.py, we utilize the llama-cpp-python library, a Python wrapper around llama.cpp, to serve our model. This Python file uses the Flask framework to establish a web server that listens at the endpoint /v1/models/serving:predict. Here's a breakdown of how this is set up (a minimal sketch follows the list below):

  • Flask routes API calls to this endpoint.
  • A POST request to this endpoint triggers the model to run and return predictions.
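
The snippet below is a minimal sketch of what such a serving script can look like with Flask and llama-cpp-python. The model path, generation parameters, and response shape are illustrative assumptions; refer to app.py in the repository for the actual implementation:

from flask import Flask, jsonify, request
from llama_cpp import Llama

app = Flask(__name__)

# Path to the quantized weights baked into the container image (illustrative).
llm = Llama(model_path="/models/llama-3.1-8b-q4.gguf", n_ctx=2048)

@app.route("/v1/models/serving:predict", methods=["POST"])
def predict():
    payload = request.get_json()
    # Run the model on the incoming prompt and return the generated text.
    output = llm(payload["prompt"], max_tokens=256)
    return jsonify({"predictions": [output["choices"][0]["text"]]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)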

The Dockerfile plays a crucial role in setting up our environment (a rough sketch follows the list below):

  • It initializes the Flask service environment.
  • Downloads the quantized model weights.
  • Configures the service to start serving predictions at the specified endpoint.
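
As a sketch, a Dockerfile for this kind of image might look like the following; the base image, weight URL, and port are placeholders, so refer to the Dockerfile in the repository for the real build:

# Illustrative only -- see the repository's Dockerfile for the actual build.
FROM python:3.11-slim

WORKDIR /app

# Build tools needed to compile llama.cpp, plus the serving dependencies.
RUN apt-get update && apt-get install -y --no-install-recommends build-essential cmake \
    && rm -rf /var/lib/apt/lists/* \
    && pip install --no-cache-dir flask llama-cpp-python

# Fetch the quantized weights (placeholder URL) and copy the serving code.
ADD https://example.com/llama-3.1-8b-q4.gguf /models/llama-3.1-8b-q4.gguf
COPY app.py .

EXPOSE 8080
CMD ["python", "app.py"]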

You can build and push the Docker container using the following commands, replacing {username} with your Docker Hub username and {quantization level} with q2, q4, or q8 based on your chosen quantization:

docker build -t {username}/llama3-kf-civo-{quantization level}:latest .
docker push {username}/llama3-kf-civo-{quantization level}:latest

Alternatively, if you prefer not to build the image yourself, you can use a pre-built image from a container registry:

docker pull ghcr.io/civo-learn/llama3-flask-kf-{quantization level}:0.1.0

Serve the Model with Kubeflow

Begin by setting up your infrastructure with a new Civo Kubeflow as a Service cluster, following the instructions provided on how to create a Kubeflow cluster.

Note: For deploying models with medium complexity (like the q4 model), select at least a medium-sized cluster. For more complex models (such as q8), opt for at least a large cluster to ensure adequate resources.

Why use KServe?

KServe is a serverless framework used within Kubeflow to deploy and serve machine learning models efficiently. It simplifies model management in Kubernetes environments, making it easier to roll out and manage machine learning models at scale. For more comprehensive details on KServe, check out the KServe documentation.

Serverless LLM Deployment Flow with KServe on Kubernetes

Deploying the model with KServe

To deploy your serving container image using KServe, begin with the kserve.yaml file. This YAML file contains the necessary configuration to deploy your serving container.

Update the image field in the YAML file to point to the container image you have just built (or the pre-built image). You may also adjust the CPU and RAM allocations to match the model's demands, but keep them within your cluster's capacity; do this by updating the cpu and memory fields in kserve.yaml.
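
For reference, a minimal InferenceService manifest of this shape might look like the example below. The name and resource figures are illustrative assumptions, and the image uses the pre-built q4 container from earlier; use the kserve.yaml shipped in the repository as the source of truth:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-serving
spec:
  predictor:
    containers:
      - name: kserve-container
        image: ghcr.io/civo-learn/llama3-flask-kf-q4:0.1.0
        resources:
          requests:
            cpu: "4"
            memory: 8Gi
          limits:
            cpu: "6"
            memory: 12Gi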

Here’s how you can deploy your KServe configuration using the Kubeflow dashboard:

  1. Navigate to the Kubeflow Dashboard. You can access it by following this link to the Kubeflow Dashboard overview.
  2. Locate and click on the "endpoint" section within the dashboard.
  3. Copy and paste your kserve.yaml content into the provided interface to deploy your model.

After deployment, the KServe dashboard will update the status of your endpoint to ‘Ready’, indicating it is prepared to receive traffic and perform inference tasks. You are now set to begin model inference on your newly deployed serverless architecture.
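
If you also have kubectl access to the underlying cluster, you can optionally confirm the same state from a terminal. The namespace below is an assumption; use the profile namespace assigned to your Kubeflow user:

kubectl get inferenceservices -n kubeflow-user-example-com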


We're now ready to begin inference with our deployed Llama 3.1 model.

Inference and Managing Deployments

By Creating a Kubeflow Notebook

To start, create a new Kubeflow Notebook and run the inference.ipynb notebook within this instance. This setup allows you to manage and execute your model inference tasks seamlessly.

Through a terminal

Alternatively, you can execute the inference process directly from a terminal within a Kubeflow notebook environment. To do this, install the necessary libraries by running pip install -r requirements.txt and run the inference script with a sample prompt.

Here are the commands to get started:

pip install -r requirements.txt
python infer.py --prompt "I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples do I have left? Let's think step by step."

Note: You can serve Llama 2 in a similar manner. Start by changing into the llama2 directory and choosing the quantized variant you want to work with; you will find a kserve.yaml file for each quantized variant, which you can use in Kubeflow in the same way. Other LLMs can be deployed with the same structure.

Troubleshooting Common Issues

Both of the above methods will initially output the KServe inference URL, which typically follows the format {servicename}.{namespace}.svc.cluster.local/{endpoint}. This URL is essential for sending requests to your deployed model.

For instance, your inference script might display a reasoning process like the one below:

python infer.py --prompt "I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples do I have left? Let's think step by step."
Step 1: Start with 10 apples.
Step 2: After giving 2 apples each to the neighbor and the repairman, you have 10 - 2 - 2 = 6 apples left.
Step 3: Buying 5 more apples brings your total to 6 + 5 = 11 apples.
Step 4: Eating 1 apple leaves you with 11 - 1 = 10 apples.

If you've followed the setup instructions without deviation, your endpoint URL should match the one mentioned in the guide.

However, if you customized your setup or named your resources differently, your endpoint URL would reflect those changes. It's important to use the correct URL when sending requests to your model. Here is how you can send a POST request to your model using the Python requests library:

import requests

# POST the prompt to the KServe predict endpoint exposed by the Flask service.
response = requests.post(
    "http://your-custom-service-name.your-namespace.svc.cluster.local/v1/models/serving:predict",
    json={"prompt": "I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples do I have left? Let's think step by step.", "stream": "True"},
)

Replace your-custom-service-name.your-namespace.svc.cluster.local with your actual KServe inference URL to ensure successful model communication.

Deleting a Kubeflow cluster

A running Kubeflow cluster continues to accrue charges, so when you have finished, tear it down by deleting the cluster following the instructions here to avoid being billed while you are not using it.

Summary

In this tutorial, we explored deploying a production-ready Large Language Model (LLM), specifically Llama 3.1, on CPU-only infrastructure using Civo’s Kubeflow as a Service. The service harnesses Kubernetes and KServe to provide a serverless framework, simplifying the deployment process while improving the manageability and scalability of machine learning workflows in production environments.

Additional Resources

  • Deploy your own Private ChatGPT with Llama 3.1 and Civo GPU Clusters: Explore how to supercharge your Llama 3.1 deployments using GPUs and Terraform on Civo with this tutorial.
  • Find ready-to-use configurations and deployment insights for Llama 3.1 on our GitHub repository here.
  • Hardware choice guide: Whether to use CPUs or GPUs affects both speed and cost. For advice on selecting the appropriate hardware, visit our CPU vs GPU guide.
  • Django and Kubeflow: Build interactive applications using Django with Kubeflow. Learn how by accessing this guide.