Following our tutorial on CPU-focused serverless deployment of Llama 3.1 with Kubeflow on Kubernetes, we created this guide, which takes a leap into high-performance computing using Civo's best-in-class NVIDIA GPUs.
We explore the deployment of Llama 3.1, a Large Language Model, using GPUs, which are crucial for processing intensive machine learning tasks. We leverage the infrastructure automation capabilities of Terraform with Civo's NVIDIA GPU-enabled Kubernetes clusters. This powerful combination of Terraform's infrastructure management, open-source machine learning tooling, and robust GPU resources facilitates the deployment of state-of-the-art language models, greatly enhancing speed and efficiency over traditional local deployments.
Alternatively, if you prefer a fully automated cloud-based approach that bypasses local setup, you can use GitHub Actions. This method automates the deployment process directly from your GitHub repository, allowing for seamless, script-driven deployment without any local execution. To explore this option, skip ahead to the “Deploy Llama 3.1 through GitHub Actions” section below.
With this guide, you can get your own private ChatGPT setup today.
What is Llama 3.1?
Image source: Meta Llama 3.1
Meta's Llama 3.1 is the pinnacle of open-source Large Language Models, representing a milestone in the evolution of natural language processing. It stands out with its ability to understand and generate human-like text, pushing the envelope further than Llama 2. Its architecture, trained on a staggering 15 trillion tokens, allows for nuanced understanding and sophisticated generation of language. Llama 3.1's diverse applications stem from its ability to process and generate coherent and contextually relevant text, whether it be composing complex reports, translating between multiple languages with high accuracy, or even writing code.
With two main variants, Llama 3.1 is versatile: the 8 Billion parameter version shines with its agility in providing quick responses, making it ideal for interactive applications, while the 70 Billion parameter heavyweight excels in deep analytics and generating detailed content, albeit with a higher computational cost. Such power and flexibility make Llama 3.1 a prime candidate for deployment on robust GPU infrastructures, where the model can operate to its fullest potential.
Why Deploy with GPUs?
When it comes to deploying large models like Llama 3.1, GPUs offer unparalleled computational power. This guide will show you how to leverage Civo's Kubernetes clusters equipped with GPUs to deploy Llama 3.1, resulting in faster inference times and the ability to handle more complex inference requests.
Example Use Cases for Llama 3.1
Harnessing GPUs' computational strength can amplify Llama 3.1's capabilities across a variety of use cases:
- Customer Service Automation: GPU acceleration enables Llama 3.1 to power real-time, responsive chatbots and virtual assistants that can process customer queries more rapidly, thereby elevating the customer experience to new heights.
- Document Analysis: The intensive task of analyzing vast volumes of text is markedly expedited with GPUs, allowing Llama 3.1 to offer deeper insights and faster summaries, crucial for timely decision-making in business and research.
- Creative Writing Aids: When it comes to creativity, speed and responsiveness fuel the fire. GPUs allow Llama 3.1 to offer instantaneous writing assistance, from ideation to draft completion, helping creators stay in their creative flow without interruption.
- Data Mining and Scraping: With the sheer force of GPU processing, Llama 3.1 becomes an even more potent tool for data mining, capable of scraping and processing large datasets from the web with unprecedented efficiency, paving the way for robust analytical applications.
In these scenarios, the GPU's prowess not only enhances the model's response times but also enables more complex and demanding tasks to be performed at scale, which is particularly beneficial for enterprise-level deployments of Llama 3.1.
Prerequisites
Before you begin, make sure you have the following in place, each of which is used in the steps that follow:
- A Civo account and a Civo API key, which Terraform uses to authenticate.
- Terraform installed on your local machine (unless you follow the GitHub Actions route).
- Git installed, to clone the boilerplate repository.
- Optionally, kubectl and the Civo CLI for inspecting the cluster after deployment.
Deploying Llama 3.1 on Civo using Terraform
Deploying with Terraform, an open-source infrastructure as code software tool, allows you to automate the provisioning of resources required for your LLM deployment, including GPU-enabled Kubernetes clusters.
The technology stack we will be deploying today brings together the best of open-source technology:
Ollama: Ollama is a tool designed to streamline the deployment of open-source large language models by managing the complexities of their configuration. It packages model weights, configurations, and associated data into a single, manageable unit, significantly enhancing GPU utilization. This optimization is crucial for deployments on GPU-intensive infrastructures, as it ensures that the computational power of the GPUs is used effectively, reducing overhead and increasing performance.
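To illustrate the lifecycle Ollama manages, the same steps can be driven manually with the Ollama CLI. The commands below are a generic sketch for local experimentation, not part of this deployment, and assume the ollama CLI is installed:
# Pull a model's weights, configuration, and metadata as a single packaged unit
ollama pull llama3
# Start a session (or pass a one-off prompt) against the pulled model
ollama run llama3 "Explain what a Kubernetes node pool is."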
Open-Web-UI: Open-Web-UI provides a ChatGPT-esque interface over the models that Ollama serves. It gives you easy access to administrative tasks such as installing new models and configuring retrieval-augmented generation (RAG), and it secures access to your hosted models.
Now, we will walk you through the steps to deploy Llama 3.1 on Civo by running Terraform commands locally.
- Cloning the Required Repository
- Configuring Terraform Variables
- Initializing and Applying Terraform Plans
Start by cloning the repository that contains a default LLM deployment using Ollama, Nvidia Device Manager, and a GPU-enabled Kubernetes Cluster.
git clone https://github.com/civo-learn/civo-llm-boilerplate
Within this repository, we've built a one-click deployment that enables best-in-class LLMs to run on the latest open-source infrastructure.
To begin configuring your deployment, change your working directory to the Terraform directory from the root of the cloned repository. You can do this by entering the following command in your terminal:
cd infra/tf
This step positions you in the correct directory where you'll find Terraform configuration files necessary for deploying the model on Civo's Kubernetes platform.
Once in the infra/tf directory, you'll encounter several configuration files, including variables.tf, which you can customize to suit your deployment needs.
You must update the terraform.tfvars file with your Civo API key for authentication purposes. This is essential for allowing Terraform to manage resources, including enabling GPU support, on your behalf within the Civo cloud platform.
To do so, run the following command to rename the terraform.tfvars.example file to terraform.tfvars:
mv terraform.tfvars.example terraform.tfvars
Once you’ve renamed terraform.tfvars.example to terraform.tfvars, you can update the contents to match:
civo_token = "YOUR CIVO API TOKEN"
Initialize your Terraform setup and apply your deployment plan, which Terraform will use to create a GPU-powered Kubernetes cluster on Civo.
First, initialize your Terraform setup:
terraform init
Once you have initialized your Terraform configuration, you can plan your deployment using:
terraform plan
This command displays the deployment plan, showing what resources will be created or modified.
Finally, apply the Terraform deployment plan using:
terraform apply
This command applies the deployment plan. Terraform will prompt for confirmation before proceeding with the creation of resources.
Deployment takes around 10 minutes: Terraform stands up the Civo Kubernetes cluster, assigns a GPU node, and deploys the Helm charts and GPU configuration before downloading the models and running them on your NVIDIA GPU.
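Once the apply completes, you can check that the GPU node has joined the cluster and that the NVIDIA device plugin is advertising the GPU to Kubernetes. The commands below are a sketch: they assume you have the Civo CLI and kubectl installed, and the cluster name placeholder should match the name set in your Terraform variables:
# Download and merge the cluster's kubeconfig (cluster name is a placeholder)
civo kubernetes config <your-cluster-name> --save
# Confirm the GPU node has joined the cluster
kubectl get nodes -o wide
# Check that the node advertises an nvidia.com/gpu resource
kubectl describe nodes | grep -i "nvidia.com/gpu"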
Troubleshooting
If you experience any issues during the deployment (for example, a timeout), you can reattempt the deployment by rerunning:
terraform apply
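If the re-run still fails, inspecting the workloads in the cluster usually shows whether the problem lies with the GPU node, the Helm releases, or the model download. This is a generic sketch; pod and namespace names are placeholders for whatever kubectl reports in your cluster:
# List all pods and their status across namespaces
kubectl get pods -A
# Inspect events for a pod stuck in Pending or CrashLoopBackOff
kubectl describe pod <pod-name> -n <namespace>
# Follow the logs of the Ollama pod while it downloads models
kubectl logs -f <ollama-pod-name> -n <namespace>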
Deploy Llama 3.1 through GitHub Actions
For those who prefer a fully automated cloud-based approach, GitHub Actions offers a powerful solution. As part of GitHub's CI/CD platform, Actions lets you automate your software workflows, including deployments. This method simplifies the deployment process and makes it repeatable and far less error-prone, which is particularly beneficial for managing and updating large-scale machine learning models like Llama 3.1 without manual intervention.
First, navigate to the repository: https://github.com/civo-learn/civo-llm-boilerplate and then use the template to create a new repository.
After doing so, go to the settings of your newly created repository and make sure GitHub Actions are allowed to run.
In the settings for your repository, create a new secret called CIVO_TOKEN and set it to your Civo account API token.
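If you prefer the command line, the same secret can be created with the GitHub CLI; the repository slug below is a placeholder and assumes gh is already authenticated:
gh secret set CIVO_TOKEN --repo <your-username>/<your-repository> --body "YOUR CIVO API TOKEN"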
Now you can head to the Actions tab and run the deployment workflow.
Accessing and Managing Your Deployment
Once you have successfully deployed Llama 3.1 using either Terraform or GitHub Actions, the next step is to verify and utilize the deployment:
Checking the Load Balancers
After deployment, you can check the load balancers attached to your Kubernetes cluster to locate the Open Web UI endpoint. Navigate to the load balancer section in your Civo Dashboard and find the DNS name labeled “ollama-ui-open-webui.”
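You can also look the endpoint up directly from the cluster. The service name below matches the load balancer label mentioned above, but treat it as an assumption if you have customised the charts:
# Find the external address of the Open Web UI load balancer service
kubectl get svc -A | grep ollama-ui-open-webui
# Quick check that the UI is responding (replace the placeholder with the DNS name returned above)
curl -I http://<load-balancer-dns>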
Completing the initial Open Web UI setup, which involves registering an initial administrator account and configuring the deployment options, will grant you access to a “ChatGPT-like” interface where you can interact with the deployed LLM directly.
From this window, you can further configure your environment, such as setting security and access preferences and choosing what newly registered users can access. You can also promote additional users to administrators alongside the first registered account.
Deploying Additional Models
If you wish to expand your LLM capabilities, simply navigate to the settings menu found in the top right-hand corner of the Open Web UI screen. Select “models” from the left-hand menu to add or manage additional models. This feature allows for versatile deployment configurations and model management, ensuring that your setup can adapt to various requirements and tasks.
If you would like to change the default models deployed or disable GPU support, simply modify the ollama-values.yaml file in the infra/tf/values folder.
ollama:
  gpu:
    # -- Enable GPU integration
    enabled: true
    # -- Specify the number of GPUs
    number: 1
  # -- List of models to pull at container startup
  models:
    - llama3
    - gemma
    # - llava
    # - mixtral
    # Get more models from: https://ollama.com/library
persistentVolume:
  enabled: true
  size: 250Gi # file size of model repository
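Because these Helm values are applied by the Terraform configuration, re-running the apply from the Terraform directory should be enough to roll the change out (an assumption based on the repository layout described above):
cd infra/tf
terraform apply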
Summary
Deploying Llama 3.1 on Civo's Kubernetes clusters using GPUs and Terraform or GitHub Actions marks a breakthrough in handling intensive machine learning tasks efficiently. This guide has equipped you with the tools to automate and optimize large model deployments, enhancing performance across various applications. As you leverage these technologies, continue to explore and innovate, harnessing the full potential of Llama 3.1 to drive advancements in machine learning and artificial intelligence.
Additional Resources
For further reading and to enhance your understanding of deploying machine learning models efficiently and predicting future trends, explore these resources:
- Reducing the Cost of Machine Learning: Discover strategies for minimizing expenses while maximizing the efficiency of machine learning projects on the cloud.
- AI & Machine Learning 2024 Predictions: Dive into expert forecasts for the future of artificial intelligence and machine learning, preparing you for upcoming trends and technological shifts.