Serving Configuration

HTTP server configuration in gpux.yml.


Overview

The serving section configures the FastAPI HTTP server.

serving:
  port: int               # Server port (default: 8080)
  host: string            # Server host (default: "0.0.0.0")
  batch_size: int         # Serving batch size (default: 1)
  timeout: int            # Request timeout in seconds (default: 5)
  max_workers: int        # Max worker processes (default: 4)
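
Once the server is running, a quick smoke test confirms it is reachable. A minimal sketch in Python; the /health path is an assumption here, not something gpux.yml defines, so substitute whatever endpoint your server actually exposes:

import requests

# Hypothetical health check; replace /health with the endpoint
# your gpux server actually exposes.
resp = requests.get("http://127.0.0.1:8080/health", timeout=5)
print(resp.status_code, resp.text)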

Fields

port

HTTP server port.

  • Type: integer
  • Required: No
  • Default: 8080
serving:
  port: 8080   # Default
  port: 9000   # Custom port

host

Host address the server binds to.

  • Type: string
  • Required: No
  • Default: 0.0.0.0 (all interfaces)
serving:
  host: 0.0.0.0      # All interfaces (public)
  host: 127.0.0.1    # Localhost only (private)
  host: localhost    # Localhost alias
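
The practical difference is which interfaces accept connections: 127.0.0.1 is reachable only from the same machine, while 0.0.0.0 listens on every interface. A small self-contained Python sketch showing the two bind targets:

import socket

# Demonstrates the binding difference: 127.0.0.1 accepts only local
# connections, 0.0.0.0 accepts connections on every interface.
for host in ("127.0.0.1", "0.0.0.0"):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((host, 0))  # port 0 = let the OS pick a free port
    print(host, "->", s.getsockname())
    s.close()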

batch_size

Maximum batch size for serving.

  • Type: integer
  • Required: No
  • Default: 1
serving:
  batch_size: 1    # Single inference
  batch_size: 32   # Batch up to 32

timeout

Request timeout in seconds.

  • Type: integer
  • Required: No
  • Default: 5
serving:
  timeout: 5    # 5 second timeout
  timeout: 30   # 30 second timeout
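
Clients should set their own timeout at least as large as the server's, otherwise they give up while the server is still working. A sketch using Python's requests, which takes a timeout in seconds; the /predict path is an assumption:

import requests

# Client-side timeout should be >= the server's `timeout` setting.
# The /predict path is hypothetical -- use your server's real endpoint.
try:
    resp = requests.post(
        "http://127.0.0.1:8080/predict",
        json={"text": "hello"},
        timeout=30,  # matches serving.timeout: 30
    )
    print(resp.json())
except requests.exceptions.Timeout:
    print("request exceeded the client-side timeout")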

max_workers

Maximum number of worker processes.

  • Type: integer
  • Required: No
  • Default: 4
serving:
  max_workers: 1   # Single worker
  max_workers: 8   # 8 workers
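
A common starting point is to tie the worker count to the host's CPU cores, then cap it by GPU memory (see GPU Memory with Workers below). A sketch of that heuristic; the cap of 8 is an arbitrary assumption:

import os

# Heuristic only: one worker per CPU core, capped at an assumed
# maximum of 8 to bound GPU memory use (each worker loads the model).
workers = min(os.cpu_count() or 1, 8)
print(f"max_workers: {workers}")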

Examples

Minimal

serving:
  port: 8080

Development

serving:
  port: 8080
  host: 127.0.0.1  # Localhost only
  max_workers: 1   # Single worker for debugging

Production

serving:
  port: 8080
  host: 0.0.0.0
  batch_size: 16
  timeout: 10
  max_workers: 8

High-Throughput

serving:
  port: 8080
  host: 0.0.0.0
  batch_size: 64
  timeout: 30
  max_workers: 16

Complete Example

name: sentiment-api
version: 1.0.0

model:
  source: ./model.onnx

inputs:
  - name: text
    type: string

outputs:
  - name: sentiment
    type: float32
    shape: [2]

serving:
  port: 9000
  host: 0.0.0.0
  batch_size: 32
  timeout: 10
  max_workers: 4
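
Against this configuration, a client would send text and read back a two-element sentiment vector. A sketch in Python; the /predict path and response shape are assumptions, so check the server's generated API docs:

import requests

# Hypothetical client for the sentiment-api above; note the custom
# port 9000 from the serving section. The endpoint path is an assumption.
resp = requests.post(
    "http://127.0.0.1:9000/predict",
    json={"text": "this movie was great"},
    timeout=10,  # matches serving.timeout: 10
)
print(resp.json())  # expected: two float32 scores, shape [2]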

Best Practices

Use Localhost in Development

Bind to localhost for local development:

serving:
  host: 127.0.0.1

Multiple Workers for Production

Use multiple workers for concurrency:

serving:
  max_workers: 8  # Based on CPU cores

GPU Memory with Workers

Each worker process loads its own copy of the model, so GPU memory use scales linearly with max_workers. With 4 workers:

  • Model size: 256 MB
  • GPU memory: 256 MB × 4 = 1 GB
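
A back-of-the-envelope check before raising max_workers, with the model size and GPU capacity as assumed example numbers:

# Assumed numbers: a 256 MB model on an 8 GB card.
model_size_mb = 256
max_workers = 4
gpu_capacity_mb = 8 * 1024

required_mb = model_size_mb * max_workers
print(f"{required_mb} MB required of {gpu_capacity_mb} MB available")
assert required_mb <= gpu_capacity_mb, "lower max_workers or shrink the model"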

Adjust Batch Size for Throughput

Larger batches improve throughput at the cost of per-request latency:

serving:
  batch_size: 32  # Process 32 items at once
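
The trade-off is latency: a batch waits to fill up before running, so larger batches raise per-request latency while raising items per second. A rough model with assumed costs:

# Rough throughput model with assumed costs: a fixed overhead per
# forward pass plus a marginal cost per batched item.
overhead_ms = 10.0   # assumption
per_item_ms = 1.5    # assumption

for batch in (1, 8, 32):
    latency_ms = overhead_ms + per_item_ms * batch
    throughput = 1000 * batch / latency_ms
    print(f"batch={batch:>2}  latency={latency_ms:5.1f} ms  "
          f"throughput={throughput:4.0f} items/s")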


See Also