gpux serve¶
Start an HTTP server that serves models from registries or local projects.
Overview¶
The gpux serve command starts a FastAPI server that provides REST API endpoints for model inference. It supports both registry models (pulled from Hugging Face) and local models with gpux.yml configuration.
Arguments¶
MODEL_NAME (required)¶
Name of the model to serve. Can be:
- Registry model: distilbert-base-uncased-finetuned-sst-2-english
- Local model: sentiment-analysis (requires gpux.yml)
- Model path: ./models/bert or /path/to/model
Examples:
# Registry models
gpux serve distilbert-base-uncased-finetuned-sst-2-english
gpux serve facebook/opt-125m
gpux serve sentence-transformers/all-MiniLM-L6-v2
# Local models
gpux serve sentiment-analysis
gpux serve image-classifier
gpux serve ./models/bert
Options¶
Server Options¶
--port, -p¶
Port to serve on.
- Type: integer
- Default: 8080
--host, -h¶
Host to bind to.
- Type: string
- Default: 0.0.0.0
--workers¶
Number of worker processes.
- Type: integer
- Default: 1
Configuration Options¶
--config, -c¶
Configuration file name.
- Type: string
- Default: gpux.yml
--provider¶
Preferred execution provider.
- Type: string
- Choices: cuda, coreml, rocm, directml, openvino, tensorrt, cpu
Other Options¶
--verbose¶
Enable verbose output.
- Type: boolean
- Default: false
API Endpoints¶
The server exposes the following REST API endpoints:
POST /predict¶
Run inference on input data.
Request:
curl -X POST http://localhost:8080/predict \
-H "Content-Type: application/json" \
-d '{"text": "I love this product!"}'
Response:
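{
  "sentiment": [0.1, 0.9]
}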
GET /health¶
Health check endpoint.
Request:
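curl http://localhost:8080/health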
Response:
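{
  "status": "healthy",
  "model": "sentiment-analysis"
}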
GET /info¶
Get model information.
Request:
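curl http://localhost:8080/info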
Response:
{
"name": "sentiment-analysis",
"version": "1.0.0",
"format": "onnx",
"inputs": [
{
"name": "text",
"type": "string",
"required": true
}
],
"outputs": [
{
"name": "sentiment",
"type": "float32",
"shape": [2]
}
]
}
GET /metrics¶
Get performance metrics and provider information.
Request:
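curl http://localhost:8080/metrics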
Response:
{
"provider": {
"name": "CUDAExecutionProvider",
"available": true,
"platform": "NVIDIA CUDA"
},
"available_providers": [
"CUDAExecutionProvider",
"CPUExecutionProvider"
]
}
Examples¶
Basic Server¶
Start the server on the default port (8080):
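gpux serve sentiment-analysis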
Output:
Model Information
┌──────────┬────────────────────┐
│ Property │ Value │
├──────────┼────────────────────┤
│ Name │ sentiment-analysis │
│ Version │ 1.0.0 │
│ Inputs │ 1 │
│ Outputs │ 1 │
└──────────┴────────────────────┘
Server Configuration
┌──────────┬──────────────────────────┐
│ Property │ Value │
├──────────┼──────────────────────────┤
│ Host │ 0.0.0.0 │
│ Port │ 8080 │
│ Workers │ 1 │
│ URL │ http://0.0.0.0:8080 │
└──────────┴──────────────────────────┘
API Endpoints
┌────────┬───────────┬─────────────────────┐
│ Method │ Path │ Description │
├────────┼───────────┼─────────────────────┤
│ POST │ /predict │ Run inference │
│ GET │ /health │ Health check │
│ GET │ /info │ Model information │
│ GET │ /metrics │ Performance metrics │
└────────┴───────────┴─────────────────────┘
🚀 Starting GPUX server...
Server will be available at: http://0.0.0.0:8080
Press Ctrl+C to stop the server
Custom Port¶
Serve on a custom port:
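gpux serve sentiment-analysis --port 9000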
Test:
curl -X POST http://localhost:9000/predict \
-H "Content-Type: application/json" \
-d '{"text": "Great product!"}'
Localhost Only¶
Serve on localhost only (not accessible externally):
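gpux serve sentiment-analysis --host 127.0.0.1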
Multiple Workers¶
Use multiple workers for better throughput:
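gpux serve sentiment-analysis --workers 4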
GPU Memory with Multiple Workers
Each worker loads the model into GPU memory. Ensure you have enough GPU memory:
- 1 worker: ~256 MB
- 4 workers: ~1 GB
- 8 workers: ~2 GB
With Specific Provider¶
Serve with CUDA provider:
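gpux serve sentiment-analysis --provider cuda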
Making Requests¶
Using cURL¶
Single Inference:
curl -X POST http://localhost:8080/predict \
-H "Content-Type: application/json" \
-d '{"text": "I love GPUX!"}'
Health Check:
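curl http://localhost:8080/health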
Using Python¶
import requests
# Predict
response = requests.post(
"http://localhost:8080/predict",
json={"text": "I love GPUX!"}
)
result = response.json()
print(result) # {"sentiment": [0.1, 0.9]}
# Health check
health = requests.get("http://localhost:8080/health")
print(health.json()) # {"status": "healthy", "model": "sentiment-analysis"}
Using JavaScript¶
// Predict
const response = await fetch('http://localhost:8080/predict', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ text: 'I love GPUX!' })
});
const result = await response.json();
console.log(result); // {sentiment: [0.1, 0.9]}
// Health check
const health = await fetch('http://localhost:8080/health');
const healthData = await health.json();
console.log(healthData); // {status: "healthy", model: "sentiment-analysis"}
OpenAPI Documentation¶
The server automatically generates interactive API documentation:
Swagger UI¶
Visit http://localhost:8080/docs for interactive API documentation.
ReDoc¶
Visit http://localhost:8080/redoc for alternative API documentation.
Production Deployment¶
Behind Nginx¶
Use Nginx as a reverse proxy:
server {
listen 80;
server_name api.example.com;
location / {
proxy_pass http://localhost:8080;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
}
With Systemd¶
Create a systemd service:
[Unit]
Description=GPUX Model Server
After=network.target
[Service]
Type=simple
User=www-data
WorkingDirectory=/opt/models/sentiment
ExecStart=/usr/local/bin/gpux serve sentiment --port 8080 --workers 4
Restart=always
[Install]
WantedBy=multi-user.target
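Save the unit file (for example as /etc/systemd/system/gpux-serve.service; the file name here is only an assumption), then reload systemd and start the service:
sudo systemctl daemon-reload
sudo systemctl enable --now gpux-serve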
Docker Deployment¶
See Docker Deployment Guide for containerized deployment.
Error Handling¶
Model Not Found¶
Solution: Ensure the model exists and gpux.yml is properly configured.
Port Already in Use¶
Solution: Use a different port or stop the process using the port:
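# Serve on a different port
gpux serve sentiment-analysis --port 9000
# Or find and stop the process holding port 8080 (Linux/macOS)
lsof -i :8080
kill <PID>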
Missing Dependencies¶
Solution: Install FastAPI dependencies:
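# The exact package or extra may differ depending on how GPUX was installed;
# a typical FastAPI serving stack needs fastapi and uvicorn:
pip install fastapi uvicorn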
Best Practices¶
Use Multiple Workers
For production, use multiple workers to handle concurrent requests:
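gpux serve sentiment-analysis --workers 4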
Health Check Monitoring
Monitor the /health endpoint for uptime monitoring:
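# Poll the health endpoint from a monitoring system or cron job
curl -f http://localhost:8080/health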
Use Process Manager
In production, use a process manager like systemd, supervisord, or PM2.
Bind to 0.0.0.0 with Caution
Only bind to 0.0.0.0 if you need external access. For local development, use 127.0.0.1:
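gpux serve sentiment-analysis --host 127.0.0.1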
Set Resource Limits
Configure timeout and memory limits in gpux.yml:
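A hypothetical sketch (the key names below are assumptions; check the gpux.yml reference for the actual fields):
serving:
  timeout: 30          # request timeout in seconds (assumed key)
  memory_limit: 2GB    # per-worker memory cap (assumed key)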
Performance Tips¶
- Multiple Workers: Use --workers for concurrent request handling
- GPU Provider: Use GPU providers (cuda, coreml) for best performance
- Batch Requests: Send batch requests when possible
- Connection Pooling: Use HTTP connection pooling in clients
- Load Balancing: Use multiple server instances behind a load balancer
Related Commands¶
- gpux run - Run inference directly
- gpux build - Build models before serving
- gpux inspect - Inspect model details