
Serving Models

Deploy models with HTTP APIs for production use.


🎯 What You'll Learn

  • ✅ Starting the HTTP server
  • ✅ Making API requests
  • ✅ API endpoints
  • ✅ Production deployment
  • ✅ Scaling strategies

🚀 Quick Start

Start the HTTP server:

gpux serve model-name --port 8080

Output:

INFO: Started server on http://0.0.0.0:8080
INFO: Using provider: CoreMLExecutionProvider
INFO: Model loaded: model-name v1.0.0


📡 API Endpoints

Health Check

curl http://localhost:8080/health

Response:

{
  "status": "healthy"
}
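
In automated deployments it is common to poll this endpoint until the model has finished loading. A minimal Python sketch, assuming only the /health route shown above (the interval and timeout values are illustrative, not GPUX defaults):

import time

import requests

def wait_until_healthy(base_url, timeout=30.0):
    """Poll /health until the server reports healthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            response = requests.get(f"{base_url}/health", timeout=2)
            if response.ok and response.json().get("status") == "healthy":
                return True
        except requests.ConnectionError:
            pass  # Server not accepting connections yet
        time.sleep(0.5)
    return False

if wait_until_healthy("http://localhost:8080"):
    print("Server is ready")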

Model Info

curl http://localhost:8080/info

Response:

{
  "name": "model-name",
  "version": "1.0.0",
  "provider": "CoreMLExecutionProvider"
}

Prediction

curl -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "I love this!"}'

Response:

{
  "sentiment": [0.1, 0.9]
}


🔧 Server Configuration

Configure the server in gpux.yml:

serving:
  port: 8080
  host: 0.0.0.0
  batch_size: 1
  timeout: 5
  max_workers: 4

Command-Line Options

# Custom port
gpux serve model --port 9000

# Bind to localhost only
gpux serve model --host 127.0.0.1

# Multiple workers
gpux serve model --workers 4

📊 OpenAPI Documentation

The server automatically generates API documentation:

  • Swagger UI: http://localhost:8080/docs
  • ReDoc: http://localhost:8080/redoc
  • OpenAPI spec: http://localhost:8080/openapi.json
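
The machine-readable spec is also useful programmatically, for example to sanity-check which routes a deployment exposes. A small sketch against the /openapi.json endpoint listed above:

import requests

# Fetch the OpenAPI spec and list every route it declares
spec = requests.get("http://localhost:8080/openapi.json", timeout=5).json()
for path, methods in sorted(spec["paths"].items()):
    print(path, sorted(methods))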

🐍 Python Client

Make requests programmatically:

import requests

url = "http://localhost:8080/predict"
data = {"text": "This is great!"}

# POST the input as JSON; this mirrors the curl example above
response = requests.post(url, json=data)
result = response.json()
print(result)
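
In production code you will usually want a request timeout and explicit error handling. A minimal sketch building on the call above (the 5-second timeout is an arbitrary choice, not a GPUX default):

import requests

url = "http://localhost:8080/predict"
data = {"text": "This is great!"}

try:
    # Fail fast instead of hanging if the server stalls
    response = requests.post(url, json=data, timeout=5)
    response.raise_for_status()  # Surface 4xx/5xx responses as exceptions
except requests.RequestException as exc:
    print(f"Request failed: {exc}")
else:
    print(response.json())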

🚀 Production Deployment

Docker

Create Dockerfile:

FROM python:3.11-slim

WORKDIR /app

# Install GPUX
RUN pip install gpux

# Copy model and config
COPY model.onnx .
COPY gpux.yml .

# Expose port
EXPOSE 8080

# Start server
CMD ["gpux", "serve", "model-name", "--port", "8080"]

Build and run:

docker build -t my-model .
docker run -p 8080:8080 my-model
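
Once the container is running, verify it end to end with the health endpoint from earlier:

curl http://localhost:8080/health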

Reverse Proxy (nginx)

server {
    listen 80;
    server_name api.example.com;

    location / {
        proxy_pass http://localhost:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Load Balancing

Use multiple workers:

gpux serve model --workers 4

Or put an external load balancer such as nginx or HAProxy in front of several instances, as sketched below.
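
For the external option, a minimal nginx sketch (addresses and ports are illustrative) that spreads requests across two GPUX instances:

upstream gpux_backend {
    # nginx distributes requests round-robin by default
    server 127.0.0.1:8080;
    server 127.0.0.1:8081;
}

server {
    listen 80;
    server_name api.example.com;

    location / {
        proxy_pass http://gpux_backend;
    }
}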


📈 Monitoring

Metrics

Track performance across:

  • Request latency
  • Throughput (requests/sec)
  • Error rates
  • Memory usage
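
Request latency can be sampled from the client side without extra infrastructure. A rough sketch (the sample count and percentile indexing are simplistic; production setups typically export metrics to a system such as Prometheus):

import time

import requests

url = "http://localhost:8080/predict"
payload = {"text": "I love this!"}
latencies = []

# Sample 100 requests and report p50/p95 latency in milliseconds
for _ in range(100):
    start = time.perf_counter()
    requests.post(url, json=payload, timeout=5)
    latencies.append((time.perf_counter() - start) * 1000)

latencies.sort()
print(f"p50: {latencies[49]:.1f} ms, p95: {latencies[94]:.1f} ms")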

Logging

Enable verbose logging:

gpux serve model --verbose

💡 Key Takeaways

What You Learned

  • ✅ Starting the HTTP server
  • ✅ API endpoints and usage
  • ✅ Configuration options
  • ✅ Production deployment with Docker
  • ✅ Scaling and monitoring


🎉 Congratulations!

You've completed the GPUX tutorial! You now know how to:

  • ✅ Install and configure GPUX
  • ✅ Build and run models
  • ✅ Optimize performance
  • ✅ Deploy to production

Next steps:

  • User Guide - Deep dive into concepts
  • Examples - Real-world use cases
  • Deployment - Production guides

