gpux serve¶
Start an HTTP server that serves models from registries or local projects.
Overview¶
The gpux serve command starts a FastAPI server that provides REST API endpoints for model inference. It supports both registry models (pulled from Hugging Face) and local models with gpux.yml configuration.
Arguments¶
MODEL_NAME (required)¶
Name of the model to serve. Can be:
- Registry model: distilbert-base-uncased-finetuned-sst-2-english
- Local model: sentiment-analysis (requires gpux.yml)
- Model path: ./models/bert or /path/to/model
Examples:
# Registry models
gpux serve distilbert-base-uncased-finetuned-sst-2-english
gpux serve facebook/opt-125m
gpux serve sentence-transformers/all-MiniLM-L6-v2
# Local models
gpux serve sentiment-analysis
gpux serve image-classifier
gpux serve ./models/bert
Options¶
Server Options¶
--port, -p¶
Port to serve on.
- Type: integer
- Default: 8080
--host, -h¶
Host to bind to.
- Type: string
- Default: 0.0.0.0
--workers¶
Number of worker processes.
- Type: integer
- Default: 1
Configuration Options¶
--config, -c¶
Configuration file name.
- Type: string
- Default: gpux.yml
--provider¶
Preferred execution provider.
- Type: string
- Choices: cuda, coreml, rocm, directml, openvino, tensorrt, cpu
Other Options¶
--verbose¶
Enable verbose output.
- Type: boolean
- Default: false
API Endpoints¶
The server exposes the following REST API endpoints:
POST /predict¶
Run inference on input data.
Request:
curl -X POST http://localhost:8080/predict \
-H "Content-Type: application/json" \
-d '{"text": "I love this product!"}'
Response:
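{
  "sentiment": [0.1, 0.9]
}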
GET /health¶
Health check endpoint.
Request:
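curl http://localhost:8080/health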
Response:
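{
  "status": "healthy",
  "model": "sentiment-analysis"
}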
GET /info¶
Get model information.
Request:
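curl http://localhost:8080/info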
Response:
{
"name": "sentiment-analysis",
"version": "1.0.0",
"format": "onnx",
"inputs": [
{
"name": "text",
"type": "string",
"required": true
}
],
"outputs": [
{
"name": "sentiment",
"type": "float32",
"shape": [2]
}
]
}
GET /metrics¶
Get performance metrics and provider information.
Request:
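curl http://localhost:8080/metrics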
Response:
{
"provider": {
"name": "CUDAExecutionProvider",
"available": true,
"platform": "NVIDIA CUDA"
},
"available_providers": [
"CUDAExecutionProvider",
"CPUExecutionProvider"
]
}
Examples¶
Basic Server¶
Start the server on the default port (8080):
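gpux serve sentiment-analysis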
Output:
Model Information
┌──────────┬────────────────────┐
│ Property │ Value │
├──────────┼────────────────────┤
│ Name │ sentiment-analysis │
│ Version │ 1.0.0 │
│ Inputs │ 1 │
│ Outputs │ 1 │
└──────────┴────────────────────┘
Server Configuration
┌──────────┬──────────────────────────┐
│ Property │ Value │
├──────────┼──────────────────────────┤
│ Host │ 0.0.0.0 │
│ Port │ 8080 │
│ Workers │ 1 │
│ URL │ http://0.0.0.0:8080 │
└──────────┴──────────────────────────┘
API Endpoints
┌────────┬───────────┬─────────────────────┐
│ Method │ Path │ Description │
├────────┼───────────┼─────────────────────┤
│ POST │ /predict │ Run inference │
│ GET │ /health │ Health check │
│ GET │ /info │ Model information │
│ GET │ /metrics │ Performance metrics │
└────────┴───────────┴─────────────────────┘
🚀 Starting GPUX server...
Server will be available at: http://0.0.0.0:8080
Press Ctrl+C to stop the server
Custom Port¶
Serve on a custom port:
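gpux serve sentiment-analysis --port 9000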
Test:
curl -X POST http://localhost:9000/predict \
-H "Content-Type: application/json" \
-d '{"text": "Great product!"}'
Localhost Only¶
Serve on localhost only (not accessible externally):
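gpux serve sentiment-analysis --host 127.0.0.1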
Multiple Workers¶
Use multiple workers for better throughput:
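gpux serve sentiment-analysis --workers 4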
GPU Memory with Multiple Workers
Each worker loads the model into GPU memory. Ensure you have enough GPU memory:
- 1 worker: ~256 MB
- 4 workers: ~1 GB
- 8 workers: ~2 GB
With Specific Provider¶
Serve with CUDA provider:
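gpux serve sentiment-analysis --provider cuda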
Making Requests¶
Using cURL¶
Single Inference:
curl -X POST http://localhost:8080/predict \
-H "Content-Type: application/json" \
-d '{"text": "I love GPUX!"}'
Health Check:
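curl http://localhost:8080/health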
Using Python¶
import requests
# Predict
response = requests.post(
"http://localhost:8080/predict",
json={"text": "I love GPUX!"}
)
result = response.json()
print(result) # {"sentiment": [0.1, 0.9]}
# Health check
health = requests.get("http://localhost:8080/health")
print(health.json()) # {"status": "healthy", "model": "sentiment-analysis"}
Using JavaScript¶
// Predict
const response = await fetch('http://localhost:8080/predict', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ text: 'I love GPUX!' })
});
const result = await response.json();
console.log(result); // {sentiment: [0.1, 0.9]}
// Health check
const health = await fetch('http://localhost:8080/health');
const healthData = await health.json();
console.log(healthData); // {status: "healthy", model: "sentiment-analysis"}
OpenAPI Documentation¶
The server automatically generates interactive API documentation:
Swagger UI¶
Visit http://localhost:8080/docs for interactive API documentation.
ReDoc¶
Visit http://localhost:8080/redoc for alternative API documentation.
Production Deployment¶
Behind Nginx¶
Use Nginx as a reverse proxy:
server {
listen 80;
server_name api.example.com;
location / {
proxy_pass http://localhost:8080;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
}
With Systemd¶
Create a systemd service:
[Unit]
Description=GPUX Model Server
After=network.target
[Service]
Type=simple
User=www-data
WorkingDirectory=/opt/models/sentiment
ExecStart=/usr/local/bin/gpux serve sentiment --port 8080 --workers 4
Restart=always
[Install]
WantedBy=multi-user.target
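Save the unit file (for example as /etc/systemd/system/gpux-serve.service; the file name here is only an assumption), then reload systemd and start the service:
sudo systemctl daemon-reload
sudo systemctl enable --now gpux-serve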
Docker Deployment¶
See Docker Deployment Guide for containerized deployment.
Error Handling¶
Model Not Found¶
Solution: Ensure the model exists and gpux.yml is properly configured.
Port Already in Use¶
Solution: Use a different port or stop the process using the port:
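# Serve on a different port
gpux serve sentiment-analysis --port 9000
# Or find and stop the process holding port 8080 (Linux/macOS)
lsof -i :8080
kill <PID>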
Missing Dependencies¶
Solution: Install FastAPI dependencies:
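# The exact package or extra may differ depending on how GPUX was installed;
# a typical FastAPI serving stack needs fastapi and uvicorn:
pip install fastapi uvicorn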
Best Practices¶
Use Multiple Workers
For production, use multiple workers to handle concurrent requests:
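gpux serve sentiment-analysis --workers 4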
Health Check Monitoring
Monitor the /health endpoint for uptime monitoring:
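# Poll the health endpoint from a monitoring system or cron job
curl -f http://localhost:8080/health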
Use Process Manager
In production, use a process manager like systemd, supervisord, or PM2.
Bind to 0.0.0.0 with Caution
Only bind to 0.0.0.0 if you need external access. For local development, use 127.0.0.1:
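gpux serve sentiment-analysis --host 127.0.0.1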
Set Resource Limits
Configure timeout and memory limits in gpux.yml:
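A hypothetical sketch (the key names below are assumptions; check the gpux.yml reference for the actual fields):
serving:
  timeout: 30          # request timeout in seconds (assumed key)
  memory_limit: 2GB    # per-worker memory cap (assumed key)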
Performance Tips¶
- Multiple Workers: Use --workers for concurrent request handling
- GPU Provider: Use GPU providers (cuda, coreml) for best performance
- Batch Requests: Send batch requests when possible
- Connection Pooling: Use HTTP connection pooling in clients
- Load Balancing: Use multiple server instances behind a load balancer
Related Commands¶
- gpux run - Run inference directly
- gpux build - Build models before serving
- gpux inspect - Inspect model details