
LLM Engine - Multi-Model GPU Serving

Run multiple LLMs concurrently on a single GPU using Ray Serve and the KubeRay operator.

Quick Start

1. Install KubeRay Operator

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator --version 1.4.0

2. Deploy LLM Engine

kubectl apply -f ray-serve.yaml

3. Use the API

# Chat completion
curl -X POST "http://<service-ip>:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# List available models
curl "http://<service-ip>:8000/v1/models"

What It Does

  • Multiple Models: Run several LLMs on one GPU simultaneously
  • GPU Efficiency: Uses vLLM sleep mode to share GPU memory
  • OpenAI Compatible: Works with any OpenAI-compatible client
  • Auto-scaling: Automatically scales based on load

Important: When using sleep mode, ensure the node has enough free CPU RAM to hold the offloaded weights of all models.
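For context, this is roughly how vLLM's sleep mode behaves at the Python level. The engine drives this internally, so the snippet below is only an illustrative sketch (the model name is just the example used above), not how LLM Engine invokes it:

from vllm import LLM

# Sleep mode must be enabled when the engine is constructed.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_sleep_mode=True)

llm.generate(["Hello!"])  # weights and KV cache live on the GPU

# Level 1 offloads the weights to CPU RAM and discards the KV cache,
# freeing GPU memory for another model; this is why free RAM must be
# large enough to cover all offloaded models.
llm.sleep(level=1)

# ... another model can occupy the GPU here ...

llm.wake_up()  # weights are copied back to the GPU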

Configuration

Edit ray-serve.yaml to change models:

env_vars:
  MODELS: "model1,model2,model3"  # Comma-separated list of models to serve

Python Client

from openai import OpenAI

client = OpenAI(
    base_url="http://<service-ip>:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
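
To print the reply and enumerate the served models (mirroring the curl examples above):

print(response.choices[0].message.content)

# Equivalent of GET /v1/models
for model in client.models.list():
    print(model.id)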

Check Status

# See if it's running
kubectl get rayservice llm-engine

# View logs
kubectl logs <pod-name>

# Access dashboard
kubectl port-forward service/llm-engine-head-svc 8265:8265
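
You can also probe the OpenAI-compatible endpoint from Python; a minimal check, assuming the requests package is installed and <service-ip> is filled in as above:

import requests

resp = requests.get("http://<service-ip>:8000/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # served model IDs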

Requirements

  • Kubernetes cluster with GPU
  • KubeRay operator
  • NVIDIA drivers
  • Hugging Face model access

Troubleshooting

  • Out of memory: Reduce the number of models or use smaller ones
  • Model not loading: Check model names and HF_TOKEN
  • Connection issues: Verify service IP and ports

Development

Local Setup

  1. Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

  2. Install dependencies:

pip install -r requirements.txt

Running on Remote Ray Cluster

To run the LLM Engine on a remote Ray cluster for development:

  1. Create and activate a virtual environment (as in Local Setup above)

  2. Install requirements:

pip install -r requirements.txt

  3. Run Serve against the remote cluster:

serve run --address ray://127.0.0.1:10001 --runtime-env-json='{"env_vars": {"VLLM_USE_V1": "1"}, "pip":["runai-model-streamer"], "working_dir": "./"}' engine:app
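
The same deployment can be driven from Python instead of the CLI; a minimal sketch, assuming engine:app refers to an app object importable from engine.py, as in the command above:

import ray
from ray import serve

from engine import app

# Mirror the runtime environment passed via --runtime-env-json above.
ray.init(
    address="ray://127.0.0.1:10001",
    runtime_env={
        "env_vars": {"VLLM_USE_V1": "1"},
        "pip": ["runai-model-streamer"],
        "working_dir": "./",
    },
)

serve.run(app)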
