`meta-llama/Llama-3.1-8B-Instruct` model from Hugging Face.
Requirements
You’ll need:

- A Runpod account with an API key.
- Python 3.8 or higher installed on your local machine.
- `pip` (or `pip3` on macOS).
- Basic utilities like `curl`.
Windows users: Use WSL (Windows Subsystem for Linux) or Git Bash to follow along with the Unix-like commands in this guide. Alternatively, use PowerShell or Command Prompt and adjust commands as needed.
Set up dstack
Install and configure the server
1. Prepare your workspace

Open a terminal and create a new directory for this tutorial:
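The command itself isn’t reproduced above; any directory name works, for example (`dstack-vllm` is just an illustration):

```sh
mkdir dstack-vllm
cd dstack-vllm
```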
2. Set up a Python virtual environment

Create and activate a virtual environment; the commands differ slightly between macOS, Linux, and Windows.
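The per-OS commands aren’t reproduced above; a typical macOS/Linux setup (the environment name `.venv` is an assumption) looks like:

```sh
python3 -m venv .venv
source .venv/bin/activate
```

On Windows, the equivalent is `python -m venv .venv` followed by `.venv\Scripts\activate`.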
3. Install dstack

Install dstack using `pip`:
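The command itself isn’t reproduced above; dstack’s standard pip installation is (substitute `pip3` on macOS if needed):

```sh
pip install "dstack[all]" -U
```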
Configure dstack for Runpod
1. Create the global configuration file

Create a `config.yml` file in the dstack configuration directory. This file stores your Runpod credentials for all dstack deployments.

- Create the configuration directory (see the commands after this list).
- Navigate to the configuration directory.
- Create `config.yml` with the content shown after this list, replacing `YOUR_RUNPOD_API_KEY` with your actual Runpod API key.
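The tutorial’s exact commands and file contents aren’t reproduced above. dstack reads its server configuration from `~/.dstack/server/config.yml` on macOS and Linux (the equivalent path under your user profile on Windows), so creating and entering the directory looks roughly like:

```sh
# Create dstack's default server configuration directory and move into it
mkdir -p ~/.dstack/server
cd ~/.dstack/server
```

A `config.yml` following dstack’s documented Runpod backend format would then look like the sketch below; treat the project name as an assumption rather than the tutorial’s exact file.

```yaml
# ~/.dstack/server/config.yml
projects:
  - name: main
    backends:
      - type: runpod
        creds:
          type: api_key
          api_key: YOUR_RUNPOD_API_KEY
```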
2. Start the dstack server

Start the dstack server:
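The command isn’t reproduced above; with dstack installed, starting the server is simply:

```sh
dstack server
```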
The startup output includes an `ADMIN-TOKEN`. Save it; you’ll need it to access the dstack web UI.
3. Access the dstack web UI
Open your browser and go to `http://127.0.0.1:3000`. Enter the `ADMIN-TOKEN` from the server output to access the web UI, where you can monitor and manage deployments.
Deploy vLLM
Configure the deployment
1. Prepare for deployment

Open a new terminal, navigate to your tutorial directory, and activate the Python virtual environment:
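The commands aren’t reproduced above; assuming the directory and environment names from the earlier sketches:

```sh
cd dstack-vllm
source .venv/bin/activate   # On Windows: .venv\Scripts\activate
```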
2. Create a directory for the task

Create a new directory for the deployment:
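For example (the directory name is illustrative):

```sh
mkdir vllm-task
cd vllm-task
```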
3. Create the dstack configuration file

Create a file named `.dstack.yml` with the following content:
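The tutorial’s exact file isn’t reproduced above. A sketch of a dstack task configuration that serves the model with vLLM is shown below; the task name, GPU size, context length, and vLLM install step are illustrative assumptions rather than the tutorial’s exact values.

```yaml
type: task
name: vllm-llama31

python: "3.11"

# The Hugging Face token is required because the model is gated.
env:
  - HF_TOKEN=YOUR_HUGGING_FACE_HUB_TOKEN

commands:
  - pip install vllm
  - vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192

# Forward the vLLM server port to your local machine.
ports:
  - 8000

resources:
  gpu: 24GB
```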
Replace `YOUR_HUGGING_FACE_HUB_TOKEN` with your Hugging Face access token. The model is gated and requires authentication to download.

Initialize and deploy
1. Initialize dstack

In the directory with your `.dstack.yml` file, run:
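The command isn’t reproduced above; initializing the repo for dstack is:

```sh
dstack init
```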
2. Apply the configuration
Deploy the task by running `dstack apply`, as sketched below. You’ll see the deployment configuration and available instances. When prompted, type `y` and press Enter.

The `ports` configuration forwards the deployed Pod’s port to `localhost:8000` on your machine.
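The apply command isn’t reproduced above; assuming the configuration file is named `.dstack.yml`, it typically looks like:

```sh
dstack apply -f .dstack.yml
```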
Monitor the deployment
dstack will provision the Pod, download the Docker image, install packages, download the model, and start the vLLM server. You’ll see progress logs in the terminal.To view logs at any time, run:Wait until you see logs indicating the server is ready:
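The logs command isn’t reproduced above; assuming the run name from the task sketch earlier, it would be:

```sh
dstack logs vllm-llama31
```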
Test the deployment
The vLLM server is now accessible at `http://localhost:8000`. Test it with `curl` (on Windows, adjust the quoting for PowerShell or Command Prompt):
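The request isn’t reproduced above; a typical test against vLLM’s OpenAI-compatible chat completions endpoint (the prompt and `max_tokens` are illustrative) looks like:

```sh
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 64
  }'
```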
Clean up
Stop the task when you’re done to avoid charges. Press Ctrl + C in the terminal where you ran `dstack apply`. When prompted, type `y` and press Enter.
The instance will terminate automatically. To ensure immediate termination, run:
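The exact command isn’t shown above. Assuming the run name from the task sketch, one option is `dstack stop` with the abort flag, which skips the graceful shutdown wait (treat the flag choice as an assumption about what the tutorial intends):

```sh
dstack stop vllm-llama31 -x
```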
Use volumes for persistent storage
Volumes let you store data between runs and cache models to reduce startup times.

Create a volume
Create a file named `volume.dstack.yml`:
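The file contents aren’t reproduced above; a sketch following dstack’s volume configuration format (the name, region, and size are illustrative) looks like:

```yaml
type: volume
name: llama-cache
backend: runpod
region: EU-RO-1
size: 100GB
```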
The `region` field ties your volume to a specific region, which also ties your Pod to that region.

Use the volume in your task
Modify your `.dstack.yml` file to include the volume:
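For example, a `volumes` section like the following (the volume name matches the sketch above, and the mount path is the `/data` directory the text refers to):

```yaml
volumes:
  - name: llama-cache
    path: /data
```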
The volume is mounted at the `/data` directory inside your container, letting you store models and data persistently. This is useful for large models that take time to download.
For more information, see the dstack blog on volumes.