
LLM Engine - Multi-Model GPU Serving

Run multiple LLMs concurrently on a single GPU using Ray Serve and the KubeRay operator.

Quick Start

1. Install KubeRay Operator

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator --version 1.4.0

2. Deploy LLM Engine

kubectl apply -f ray-serve.yaml

3. Use the API

# Chat completion
curl -X POST "http://<service-ip>:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# List available models
curl "http://<service-ip>:8000/v1/models"

What It Does

  • Multiple Models: Run several LLMs on one GPU simultaneously
  • GPU Efficiency: Uses vLLM sleep mode to share GPU memory
  • OpenAI Compatible: Works with any OpenAI-compatible client
  • Auto-scaling: Automatically scales based on load

Important: When using sleep mode, ensure the node has enough free CPU RAM to hold the offloaded weights of all models.
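For context, this is roughly how vLLM's sleep mode behaves at the Python level. The engine drives this internally, so the snippet below is only an illustrative sketch (the model name is just the example used above), not how LLM Engine invokes it:

from vllm import LLM

# Sleep mode must be enabled when the engine is constructed.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_sleep_mode=True)

llm.generate(["Hello!"])  # weights and KV cache live on the GPU

# Level 1 offloads the weights to CPU RAM and discards the KV cache,
# freeing GPU memory for another model; this is why free RAM must be
# large enough to cover all offloaded models.
llm.sleep(level=1)

# ... another model can occupy the GPU here ...

llm.wake_up()  # weights are copied back to the GPU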

Configuration

Edit ray-serve.yaml to change models:

env_vars:
  MODELS: "model1,model2,model3"  # Comma-separated list of models to serve

Python Client

from openai import OpenAI

client = OpenAI(
    base_url="http://<service-ip>:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
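
To print the reply and enumerate the served models (mirroring the curl examples above):

print(response.choices[0].message.content)

# Equivalent of GET /v1/models
for model in client.models.list():
    print(model.id)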

Check Status

# See if it's running
kubectl get rayservice llm-engine

# View logs
kubectl logs <pod-name>

# Access dashboard
kubectl port-forward service/llm-engine-head-svc 8265:8265
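
You can also probe the OpenAI-compatible endpoint from Python; a minimal check, assuming the requests package is installed and <service-ip> is filled in as above:

import requests

resp = requests.get("http://<service-ip>:8000/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # served model IDs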

Requirements

  • Kubernetes cluster with GPU
  • KubeRay operator
  • NVIDIA drivers
  • Hugging Face model access

Troubleshooting

  • Out of memory: Reduce the number of models or use smaller ones
  • Model not loading: Check model names and HF_TOKEN
  • Connection issues: Verify service IP and ports

Development

Local Setup

  1. Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

  2. Install dependencies:

pip install -r requirements.txt

Running on Remote Ray Cluster

To run the LLM Engine on a remote Ray cluster for development:

  1. Create and activate a virtual environment (as in Local Setup above)

  2. Install requirements:

pip install -r requirements.txt

  3. Run Serve against the remote cluster:

serve run --address ray://127.0.0.1:10001 --runtime-env-json='{"env_vars": {"VLLM_USE_V1": "1"}, "pip":["runai-model-streamer"], "working_dir": "./"}' engine:app
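
The same deployment can be driven from Python instead of the CLI; a minimal sketch, assuming engine:app refers to an app object importable from engine.py, as in the command above:

import ray
from ray import serve

from engine import app

# Mirror the runtime environment passed via --runtime-env-json above.
ray.init(
    address="ray://127.0.0.1:10001",
    runtime_env={
        "env_vars": {"VLLM_USE_V1": "1"},
        "pip": ["runai-model-streamer"],
        "working_dir": "./",
    },
)

serve.run(app)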
