dstack is an open-source tool that automates Pod orchestration for AI and ML workloads. It lets you define your application and resource requirements in YAML files, then handles provisioning and managing cloud resources on Runpod so you can focus on your application instead of infrastructure. This guide shows you how to set up dstack with Runpod and deploy vLLM to serve the meta-llama/Llama-3.1-8B-Instruct model from Hugging Face.

Requirements

You’ll need a Runpod account with an API key, a Hugging Face access token with access to the meta-llama/Llama-3.1-8B-Instruct model, and Python 3 installed locally. These instructions work on macOS, Linux, and Windows.
Windows users: Use WSL (Windows Subsystem for Linux) or Git Bash to follow along with the Unix-like commands in this guide. Alternatively, use PowerShell or Command Prompt and adjust commands as needed.

Set up dstack

Install and configure the server

1. Prepare your workspace

Open a terminal and create a new directory:
mkdir runpod-dstack-tutorial
cd runpod-dstack-tutorial
2. Set up a Python virtual environment

python3 -m venv .venv
source .venv/bin/activate
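As a quick sanity check that the environment is active, confirm that the shell now resolves python from your .venv directory:
which python
# Should print a path ending in runpod-dstack-tutorial/.venv/bin/python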
3. Install dstack

Install dstack using pip:
pip3 install -U "dstack[all]"
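Optionally, verify the CLI is available before continuing. This assumes the standard --version flag; the exact output depends on the installed release:
# Print the installed dstack CLI version
dstack --version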

Configure dstack for Runpod

1. Create the global configuration file

Create a config.yml file in the dstack configuration directory. This file stores your Runpod credentials for all dstack deployments.
  • Create the configuration directory:
    mkdir -p ~/.dstack/server
  • Navigate to the configuration directory:
    cd ~/.dstack/server
Create a file named config.yml with the following content:
projects:
  - name: main
    backends:
      - type: runpod
        creds:
          type: api_key
          api_key: YOUR_RUNPOD_API_KEY
Replace YOUR_RUNPOD_API_KEY with your actual Runpod API key.
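If you prefer to create the file straight from the terminal, a heredoc works as well. This is just a sketch of the same configuration shown above; substitute your real API key first:
# Write ~/.dstack/server/config.yml in one step (overwrites an existing file)
cat > ~/.dstack/server/config.yml <<'EOF'
projects:
  - name: main
    backends:
      - type: runpod
        creds:
          type: api_key
          api_key: YOUR_RUNPOD_API_KEY
EOF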
2. Start the dstack server

Start the dstack server:
dstack server
You’ll see output like this:
[INFO] Applying ~/.dstack/server/config.yml...
[INFO] The admin token is ADMIN-TOKEN
[INFO] The dstack server is running at http://127.0.0.1:3000
Save the ADMIN-TOKEN to access the dstack web UI.
3. Access the dstack web UI

Open your browser and go to http://127.0.0.1:3000. Enter the ADMIN-TOKEN from the server output to access the web UI where you can monitor and manage deployments.

Deploy vLLM

Configure the deployment

1. Prepare for deployment

Open a new terminal and navigate to your tutorial directory:
cd runpod-dstack-tutorial
Activate the Python virtual environment:
source .venv/bin/activate
2. Create a directory for the task

Create a new directory for the deployment:
mkdir task-vllm-llama
cd task-vllm-llama
3. Create the dstack configuration file

Create a file named .dstack.yml with the following content:
type: task
name: vllm-llama-3.1-8b-instruct
python: "3.10"
env:
  - HUGGING_FACE_HUB_TOKEN=YOUR_HUGGING_FACE_HUB_TOKEN
  - MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
  - MAX_MODEL_LEN=8192
commands:
  - pip install vllm
  - vllm serve $MODEL_NAME --port 8000 --max-model-len $MAX_MODEL_LEN
ports:
  - 8000
spot_policy: on-demand
resources:
  gpu:
    name: "RTX4090"
    memory: "24GB"
  cpu: 16..
Replace YOUR_HUGGING_FACE_HUB_TOKEN with your Hugging Face access token. The model is gated and requires authentication to download.
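If you’d rather not hard-code the token in the file, dstack lets you declare an environment variable without a value so that it is read from the shell where you run dstack apply. The snippet below is a sketch of that approach; it assumes the token is exported in your terminal first:
# .dstack.yml excerpt: no value means dstack takes it from your local environment
env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
  - MAX_MODEL_LEN=8192
Then, in the terminal where you will run dstack apply, export the token (placeholder value shown):
export HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxx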

Initialize and deploy

1. Initialize dstack

In the directory with your .dstack.yml file, run:
dstack init
2. Apply the configuration

Deploy the task:
dstack apply
You’ll see the deployment configuration and available instances. When prompted:
Submit the run vllm-llama-3.1-8b-instruct? [y/n]:
Type y and press Enter. The ports configuration forwards the deployed Pod’s port 8000 to localhost:8000 on your machine.
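If you’re scripting the deployment, recent dstack releases also accept a flag to skip the confirmation prompt (assuming -y/--yes is available in your version):
# Apply without the interactive confirmation
dstack apply -y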
3. Monitor the deployment

dstack will provision the Pod, download the Docker image, install packages, download the model, and start the vLLM server. You’ll see progress logs in the terminal. To view logs at any time, run:
dstack logs vllm-llama-3.1-8b-instruct
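From another terminal, you can also list runs and their current status. This assumes the dstack ps command available in current releases:
# List runs; add -a to include finished runs
dstack ps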
Wait until you see logs indicating the server is ready:
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Test the deployment

The vLLM server is now accessible at http://localhost:8000. Test it with curl:
curl -X POST http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
          "model": "meta-llama/Llama-3.1-8B-Instruct",
          "messages": [
             {"role": "system", "content": "You are Poddy, a helpful assistant."},
             {"role": "user", "content": "What is your name?"}
          ],
          "temperature": 0,
          "max_tokens": 150
        }'
You’ll receive a JSON response similar to this (the id, timestamp, and content will vary):
{
  "id": "chat-f0566a5143244d34a0c64c968f03f80c",
  "object": "chat.completion",
  "created": 1727902323,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "My name is Poddy, and I'm here to assist you with any questions or information you may need.",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 49,
    "total_tokens": 199,
    "completion_tokens": 150
  },
  "prompt_logprobs": null
}
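Because vLLM serves an OpenAI-compatible API, another quick way to confirm the endpoint is healthy is to list the models it’s serving (assuming the standard /v1/models route):
# Should return a JSON list containing meta-llama/Llama-3.1-8B-Instruct
curl http://localhost:8000/v1/models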

Clean up

Stop the task when you’re done to avoid charges. Press Ctrl + C in the terminal where you ran dstack apply. When prompted:
Stop the run vllm-llama-3.1-8b-instruct before detaching? [y/n]:
Type y and press Enter. The instance will terminate automatically. To ensure immediate termination, run:
dstack stop vllm-llama-3.1-8b-instruct
Verify termination in your Runpod dashboard or the dstack web UI.
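You can also confirm from the CLI; assuming the same dstack ps command mentioned above, listing all runs should show this run as stopped or terminated:
# -a includes finished runs
dstack ps -a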

Use volumes for persistent storage

Volumes let you store data between runs and cache models to reduce startup times.

Create a volume

Create a file named volume.dstack.yml:
type: volume
name: llama31-volume

backend: runpod
region: EUR-IS-1

# Required size
size: 100GB
The region field pins the volume to a specific Runpod region, which in turn pins any Pod that mounts the volume to that region.
Apply the volume configuration:
dstack apply -f volume.dstack.yml

Use the volume in your task

Modify your .dstack.yml file to include the volume:
volumes:
  - name: llama31-volume
    path: /data
This mounts the volume to the /data directory inside your container, letting you store models and data persistently. This is useful for large models that take time to download. For more information, see the dstack blog on volumes.
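To make model downloads actually land on the volume, you can redirect the Hugging Face cache to the mounted path. The excerpt below is a sketch: it assumes the standard HF_HOME environment variable, which huggingface_hub (used by vLLM for downloads) respects, so repeat runs reuse the cached files under /data instead of re-downloading the model:
# .dstack.yml excerpt: mount the volume and cache Hugging Face downloads on it
env:
  - HUGGING_FACE_HUB_TOKEN=YOUR_HUGGING_FACE_HUB_TOKEN
  - MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
  - MAX_MODEL_LEN=8192
  - HF_HOME=/data/huggingface
volumes:
  - name: llama31-volume
    path: /data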