Serving Configuration¶
HTTP server configuration in gpux.yml.
Overview¶
The serving section configures the FastAPI HTTP server.
```yaml
serving:
  port: int         # Server port (default: 8080)
  host: string      # Server host (default: "0.0.0.0")
  batch_size: int   # Serving batch size (default: 1)
  timeout: int      # Request timeout in seconds (default: 5)
  max_workers: int  # Max worker processes (default: 4)
```
Fields¶
port¶
HTTP server port.
- Type: `integer`
- Required: No
- Default: `8080`
host¶
Server host/address.
- Type: `string`
- Required: No
- Default: `0.0.0.0` (all interfaces)
A YAML mapping cannot repeat a key, so pick one value for `host`:

```yaml
serving:
  host: 0.0.0.0      # All interfaces (public)
  # host: 127.0.0.1  # Localhost only (private)
  # host: localhost  # Localhost alias
```
batch_size¶
Maximum batch size for serving.
- Type: `integer`
- Required: No
- Default: `1`
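An illustrative sketch; the right value depends on the model and available GPU memory:

```yaml
serving:
  batch_size: 32  # Serve up to 32 inputs per batch
```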
timeout¶
Request timeout in seconds.
- Type: `integer`
- Required: No
- Default: `5`
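Slow models may need more than the 5-second default; an illustrative sketch:

```yaml
serving:
  timeout: 30  # Allow up to 30 seconds per request
```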
max_workers¶
Maximum number of worker processes.
- Type: `integer`
- Required: No
- Default: `4`
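Each worker loads its own copy of the model (see GPU Memory with Workers under Best Practices), so scale this against available GPU memory; an illustrative sketch:

```yaml
serving:
  max_workers: 2  # Two workers, two copies of the model in memory
```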
Examples¶
Minimal¶
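Since every field is optional, a minimal configuration can pin just the port and take the defaults for everything else; a sketch:

```yaml
serving:
  port: 8080  # All other fields use their defaults
```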
Development¶
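For local development, binding to localhost keeps the server private and a single worker keeps debugging simple; a plausible sketch:

```yaml
serving:
  host: 127.0.0.1  # Localhost only
  port: 8080
  max_workers: 1   # One process is easier to debug
```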
Production¶
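A production-leaning sketch (the values are illustrative): a public host, multiple workers, and batching enabled:

```yaml
serving:
  host: 0.0.0.0  # All interfaces
  port: 8080
  batch_size: 16
  timeout: 10
  max_workers: 4
```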
High-Throughput¶
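For throughput-oriented serving, larger batches and more workers help, at the cost of per-request latency and GPU memory; an illustrative sketch:

```yaml
serving:
  host: 0.0.0.0
  port: 8080
  batch_size: 64  # Better GPU utilization
  timeout: 30     # Batched requests may wait longer
  max_workers: 8  # Each worker loads the model; watch GPU memory
```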
Complete Example¶
```yaml
name: sentiment-api
version: 1.0.0

model:
  source: ./model.onnx

inputs:
  - name: text
    type: string

outputs:
  - name: sentiment
    type: float32
    shape: [2]

serving:
  port: 9000
  host: 0.0.0.0
  batch_size: 32
  timeout: 10
  max_workers: 4
```
Best Practices¶
Multiple Workers for Production
Use multiple workers for concurrency:
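An illustrative sketch:

```yaml
serving:
  max_workers: 4  # Serve up to 4 requests concurrently
```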
GPU Memory with Workers
Each worker loads the model. With 4 workers:

- Model size: 256 MB
- GPU memory: 256 MB × 4 = 1 GB
Adjust Batch Size for Throughput
Larger batches improve throughput at the cost of per-request latency:
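An illustrative sketch; tune the value against your model and GPU:

```yaml
serving:
  batch_size: 64  # Higher throughput, higher per-request latency
```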