gpux serve

Start HTTP server for model serving from registries or local projects.


Overview

The gpux serve command starts a FastAPI server that exposes REST API endpoints for model inference. It supports both registry models (pulled from Hugging Face) and local models with a gpux.yml configuration.

gpux serve MODEL_NAME [OPTIONS]

Arguments

MODEL_NAME (required)

Name of the model to serve. Can be:

  • Registry model: distilbert-base-uncased-finetuned-sst-2-english
  • Local model: sentiment-analysis (requires gpux.yml)
  • Model path: ./models/bert or /path/to/model

Examples:

# Registry models
gpux serve distilbert-base-uncased-finetuned-sst-2-english
gpux serve facebook/opt-125m
gpux serve sentence-transformers/all-MiniLM-L6-v2

# Local models
gpux serve sentiment-analysis
gpux serve image-classifier
gpux serve ./models/bert


Options

Server Options

--port, -p

Port to serve on.

  • Type: integer
  • Default: 8080
gpux serve sentiment --port 9000
gpux serve sentiment -p 3000

--host, -h

Host to bind to.

  • Type: string
  • Default: 0.0.0.0
gpux serve sentiment --host 127.0.0.1
gpux serve sentiment -h localhost

--workers

Number of worker processes.

  • Type: integer
  • Default: 1
gpux serve sentiment --workers 4

Configuration Options

--config, -c

Configuration file name.

  • Type: string
  • Default: gpux.yml
gpux serve sentiment --config custom.yml

--provider

Preferred execution provider.

  • Type: string
  • Choices: cuda, coreml, rocm, directml, openvino, tensorrt, cpu
gpux serve sentiment --provider cuda

Other Options

--verbose

Enable verbose output.

  • Type: boolean
  • Default: false
gpux serve sentiment --verbose

API Endpoints

The server exposes the following REST API endpoints:

POST /predict

Run inference on input data.

Request:

curl -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "I love this product!"}'

Response:

{
  "sentiment": [0.1, 0.9]
}

GET /health

Health check endpoint.

Request:

curl http://localhost:8080/health

Response:

{
  "status": "healthy",
  "model": "sentiment-analysis"
}

GET /info

Get model information.

Request:

curl http://localhost:8080/info

Response:

{
  "name": "sentiment-analysis",
  "version": "1.0.0",
  "format": "onnx",
  "inputs": [
    {
      "name": "text",
      "type": "string",
      "required": true
    }
  ],
  "outputs": [
    {
      "name": "sentiment",
      "type": "float32",
      "shape": [2]
    }
  ]
}
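
The field names below follow the /info response shown above. As an illustrative sketch (it assumes the declared inputs are strings, as in the sentiment example), a client can read /info to build a /predict payload:

import requests

BASE_URL = "http://localhost:8080"

# Discover the model's declared inputs from /info.
info = requests.get(f"{BASE_URL}/info").json()
input_names = [spec["name"] for spec in info["inputs"]]
print(f"Model {info['name']} expects inputs: {input_names}")

# Build a payload keyed by the declared input names (the string value here
# is example data; adapt it to your model's input types).
payload = {name: "I love this product!" for name in input_names}
print(requests.post(f"{BASE_URL}/predict", json=payload).json())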

GET /metrics

Get performance metrics and provider information.

Request:

curl http://localhost:8080/metrics

Response:

{
  "provider": {
    "name": "CUDAExecutionProvider",
    "available": true,
    "platform": "NVIDIA CUDA"
  },
  "available_providers": [
    "CUDAExecutionProvider",
    "CPUExecutionProvider"
  ]
}
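
To verify that the server picked a GPU provider rather than falling back to CPU, a client can inspect /metrics. This minimal sketch assumes the response shape shown above:

import requests

metrics = requests.get("http://localhost:8080/metrics").json()
provider = metrics["provider"]["name"]

# Warn if inference fell back to the CPU provider.
if provider == "CPUExecutionProvider":
    print("Warning: serving on CPU; check --provider and your GPU drivers")
else:
    print(f"Serving on {provider} ({metrics['provider']['platform']})")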


Examples

Basic Server

Start the server on the default port (8080):

gpux serve sentiment-analysis

Output:

Model Information
┌──────────┬────────────────────┐
│ Property │ Value              │
├──────────┼────────────────────┤
│ Name     │ sentiment-analysis │
│ Version  │ 1.0.0              │
│ Inputs   │ 1                  │
│ Outputs  │ 1                  │
└──────────┴────────────────────┘

Server Configuration
┌──────────┬──────────────────────────┐
│ Property │ Value                    │
├──────────┼──────────────────────────┤
│ Host     │ 0.0.0.0                  │
│ Port     │ 8080                     │
│ Workers  │ 1                        │
│ URL      │ http://0.0.0.0:8080      │
└──────────┴──────────────────────────┘

API Endpoints
┌────────┬───────────┬─────────────────────┐
│ Method │ Path      │ Description         │
├────────┼───────────┼─────────────────────┤
│ POST   │ /predict  │ Run inference       │
│ GET    │ /health   │ Health check        │
│ GET    │ /info     │ Model information   │
│ GET    │ /metrics  │ Performance metrics │
└────────┴───────────┴─────────────────────┘

🚀 Starting GPUX server...
Server will be available at: http://0.0.0.0:8080
Press Ctrl+C to stop the server

Custom Port

Serve on a custom port:

gpux serve sentiment --port 9000

Test:

curl -X POST http://localhost:9000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "Great product!"}'

Localhost Only

Serve on localhost only (not accessible externally):

gpux serve sentiment --host 127.0.0.1 --port 8080

Multiple Workers

Use multiple workers for better throughput:

gpux serve sentiment --workers 4

GPU Memory with Multiple Workers

Each worker loads the model into GPU memory. Ensure you have enough GPU memory:

  • 1 worker: ~256 MB
  • 4 workers: ~1 GB
  • 8 workers: ~2 GB
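
As a rough back-of-the-envelope check before choosing a worker count (the ~256 MB per-worker figure is the example above; substitute your model's actual footprint):

# Each worker loads its own copy of the model into GPU memory.
model_memory_mb = 256        # per-worker footprint (example value)
gpu_memory_mb = 8 * 1024     # total GPU memory, e.g. an 8 GB card

max_workers = gpu_memory_mb // model_memory_mb
print(f"Up to {max_workers} workers fit at ~{model_memory_mb} MB each")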

With Specific Provider

Serve with CUDA provider:

gpux serve sentiment --provider cuda --port 8080

Making Requests

Using cURL

Single Inference:

curl -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "I love GPUX!"}'

Health Check:

curl http://localhost:8080/health

Using Python

import requests

# Predict
response = requests.post(
    "http://localhost:8080/predict",
    json={"text": "I love GPUX!"}
)
result = response.json()
print(result)  # {"sentiment": [0.1, 0.9]}

# Health check
health = requests.get("http://localhost:8080/health")
print(health.json())  # {"status": "healthy", "model": "sentiment-analysis"}

Using JavaScript

// Predict
const response = await fetch('http://localhost:8080/predict', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ text: 'I love GPUX!' })
});
const result = await response.json();
console.log(result);  // {sentiment: [0.1, 0.9]}

// Health check
const health = await fetch('http://localhost:8080/health');
const healthData = await health.json();
console.log(healthData);  // {status: "healthy", model: "sentiment-analysis"}

OpenAPI Documentation

The server automatically generates interactive API documentation:

Swagger UI

Visit http://localhost:8080/docs for interactive API documentation.

ReDoc

Visit http://localhost:8080/redoc for alternative API documentation.
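
Because the server is FastAPI-based, the machine-readable OpenAPI schema should also be available as JSON at FastAPI's default path (assuming it has not been overridden). For example, to list the exposed routes:

import requests

# Fetch the OpenAPI schema (FastAPI's default location).
schema = requests.get("http://localhost:8080/openapi.json").json()
for path, methods in schema["paths"].items():
    print(path, list(methods.keys()))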


Production Deployment

Behind Nginx

Use Nginx as a reverse proxy:

server {
    listen 80;
    server_name api.example.com;

    location / {
        proxy_pass http://localhost:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

With Systemd

Create a systemd service:

[Unit]
Description=GPUX Model Server
After=network.target

[Service]
Type=simple
User=www-data
WorkingDirectory=/opt/models/sentiment
ExecStart=/usr/local/bin/gpux serve sentiment --port 8080 --workers 4
Restart=always

[Install]
WantedBy=multi-user.target

Docker Deployment

See Docker Deployment Guide for containerized deployment.


Error Handling

Model Not Found

Error: Model 'sentiment-analysis' not found

Solution: Ensure the model exists and gpux.yml is properly configured.

Port Already in Use

Error: [Errno 48] Address already in use

Solution: Use a different port or stop the process using the port:

gpux serve sentiment --port 9000
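
If you are unsure which ports are free, a small illustrative check (not part of gpux) can help pick one before starting the server:

import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        return sock.connect_ex((host, port)) != 0

print(port_is_free(8080))  # False while a server is bound to port 8080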

Missing Dependencies

Error: FastAPI and uvicorn are required for serving
Install with: pip install fastapi uvicorn

Solution: Install FastAPI dependencies:

uv add fastapi uvicorn


Best Practices

Use Multiple Workers

For production, use multiple workers to handle concurrent requests:

gpux serve model --workers 4

Health Check Monitoring

Poll the /health endpoint to monitor uptime:

*/5 * * * * curl -f http://localhost:8080/health || alert
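
If cron is not available, an equivalent standalone poller is easy to write. This is a minimal sketch; the alerting line is a placeholder to replace with your own hook:

import time
import requests

URL = "http://localhost:8080/health"

while True:
    try:
        resp = requests.get(URL, timeout=5)
        healthy = resp.ok and resp.json().get("status") == "healthy"
    except requests.RequestException:
        healthy = False
    if not healthy:
        print("ALERT: health check failed")  # replace with your alerting hook
    time.sleep(300)  # every 5 minutes, matching the cron schedule above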

Use Process Manager

In production, use a process manager like systemd, supervisord, or PM2.

Bind to 0.0.0.0 with Caution

Only bind to 0.0.0.0 if you need external access. For local development, use 127.0.0.1:

gpux serve model --host 127.0.0.1

Set Resource Limits

Configure timeout and memory limits in gpux.yml:

runtime:
  timeout: 30
  gpu:
    memory: 2GB


Performance Tips

  1. Multiple Workers: Use --workers for concurrent request handling
  2. GPU Provider: Use GPU providers (cuda, coreml) for best performance
  3. Batch Requests: Send batch requests when possible
  4. Connection Pooling: Use HTTP connection pooling in clients (see the sketch after this list)
  5. Load Balancing: Use multiple server instances behind a load balancer
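
The following client-side sketch illustrates tip 4, assuming the /predict contract shown earlier: a shared requests.Session provides connection pooling, and a thread pool issues requests concurrently (pair this with --workers on the server side, as in tip 1):

from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8080/predict"
texts = ["I love GPUX!", "Terrible experience", "Works as expected"]

# A Session keeps TCP connections alive and reuses them (connection pooling).
session = requests.Session()

def predict(text: str) -> dict:
    resp = session.post(URL, json={"text": text}, timeout=30)
    resp.raise_for_status()
    return resp.json()

# Issue requests concurrently across the pooled connections.
with ThreadPoolExecutor(max_workers=4) as pool:
    for text, result in zip(texts, pool.map(predict, texts)):
        print(text, "->", result)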


See Also