NVIDIA GPUs¶
Optimize GPUX for NVIDIA GPUs with CUDA and TensorRT.
Overview¶
NVIDIA GPUs provide excellent performance for ML inference with two execution providers:
- TensorRT: Best performance (4-10x faster than CPU)
- CUDA: Good performance, easier setup
Supported GPUs¶
GeForce Series¶
- RTX 40 Series (Ada Lovelace)
- RTX 30 Series (Ampere)
- RTX 20 Series (Turing)
- GTX 16 Series (Turing)
- GTX 10 Series (Pascal)
Professional¶
- Quadro RTX Series
- Tesla/A-Series (A100, A10, etc.)
- H-Series (H100)
Minimum Requirements¶
- CUDA Compute Capability 6.0+
- 4GB VRAM (8GB+ recommended)
Installation¶
1. Install CUDA Toolkit¶
Ubuntu/Debian:
```bash
# Add the NVIDIA package repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

# Install CUDA 12.0
sudo apt-get install cuda-toolkit-12-0
```
Windows:
1. Download the CUDA Toolkit installer
2. Run the installer
3. Add CUDA to your PATH
2. Install GPUX with GPU Support¶
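The command for this step is missing from the page; a minimal install sketch (the `gpux` package name is an assumption, while `onnxruntime-gpu` is the real PyPI name for ONNX Runtime's GPU build):
```bash
# Assumed PyPI package name for GPUX
pip install gpux

# ONNX Runtime build with CUDA and TensorRT execution providers
pip install onnxruntime-gpu
```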
3. Verify Installation¶
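The verification snippet is also missing; one hedged check, assuming GPUX delegates to ONNX Runtime (the `onnxruntime` call below is real; you can additionally confirm the driver sees the GPU with `nvidia-smi`):
```python
import onnxruntime as ort

# "CUDAExecutionProvider" (and "TensorrtExecutionProvider", if TensorRT
# is installed) should appear when the GPU build is set up correctly.
print(ort.get_available_providers())
```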
Configuration¶
CUDA Provider¶
```yaml
name: my-model
model:
  source: ./model.onnx
runtime:
  gpu:
    backend: cuda
    memory: 4GB
    batch_size: 8
inputs:
  - name: input
    type: float32
    shape: [1, 10]
outputs:
  - name: output
    type: float32
    shape: [1, 2]
```
TensorRT Provider¶
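The example for this section is missing; the sketch below simply mirrors the CUDA configuration above with the backend switched, so the `tensorrt` value is an assumption:
```yaml
name: my-model
model:
  source: ./model.onnx
runtime:
  gpu:
    backend: tensorrt  # assumed value, mirroring the CUDA example
    memory: 8GB
    batch_size: 8
```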
Performance Optimization¶
1. Use TensorRT¶
TensorRT typically provides a 2-10x speedup over the CUDA provider:
```python
from gpux import GPUXRuntime

runtime = GPUXRuntime(
    model_path="model.onnx",
    provider="tensorrt",  # use TensorRT instead of CUDA
    memory_limit="8GB",
)
```
2. Enable FP16 Precision¶
For RTX GPUs with Tensor Cores:
```python
# Load a model that has already been converted to FP16 (roughly 2x speedup)
runtime = GPUXRuntime(
    model_path="model_fp16.onnx",
    provider="tensorrt",
)
```
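The snippet above assumes the model was converted to FP16 ahead of time. A hedged conversion sketch using the `onnxconverter-common` package (a standard tool for this step, though not mentioned on this page):
```python
import onnx
from onnxconverter_common import float16

# Convert an FP32 ONNX model to FP16
model = onnx.load("model.onnx")
model_fp16 = float16.convert_float_to_float16(model)
onnx.save(model_fp16, "model_fp16.onnx")
```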
3. Optimize Batch Size¶
```python
from gpux import GPUXRuntime

# Find the optimal batch size for your GPU
# (input_data: a representative input batch prepared elsewhere)
for batch_size in [1, 4, 8, 16, 32]:
    runtime = GPUXRuntime(
        model_path="model.onnx",
        batch_size=batch_size,
    )
    metrics = runtime.benchmark(input_data, num_runs=100)
    print(f"Batch {batch_size}: {metrics['throughput_fps']:.1f} FPS")
```
4. GPU Memory Management¶
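This section's example is missing; a minimal sketch based on the `memory_limit` parameter used elsewhere on this page (the 4GB cap is illustrative):
```python
from gpux import GPUXRuntime

# Cap GPU memory so the runtime can coexist with other workloads
runtime = GPUXRuntime(
    model_path="model.onnx",
    provider="cuda",
    memory_limit="4GB",
)
```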
Performance Benchmarks¶
Typical Performance (RTX 3080)¶
| Model | Provider | Batch Size | Throughput |
|---|---|---|---|
| BERT-base | TensorRT | 32 | 2,400 FPS |
| BERT-base | CUDA | 32 | 800 FPS |
| ResNet-50 | TensorRT | 16 | 1,800 FPS |
| ResNet-50 | CUDA | 16 | 600 FPS |
Provider Comparison¶
- TensorRT: 2-10x faster than CUDA, but requires an engine-build step
- CUDA: good performance, easier setup
- CPU: baseline, typically 10-50x slower than GPU execution
Troubleshooting¶
CUDA Not Available¶
If GPUX reports that CUDA is not available and falls back to CPU:
1. Install the CUDA Toolkit (see the installation steps above).
2. Install the GPU build of ONNX Runtime: `pip install onnxruntime-gpu`
3. Verify the driver can see the GPU: `nvidia-smi`
Out of Memory¶
If inference fails with a GPU out-of-memory error:
1. Reduce the batch size in your configuration.
2. Lower the `memory` limit in the GPU configuration.
3. Quantize the model to shrink its footprint (see the sketch below).
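A hedged quantization sketch using ONNX Runtime's built-in tooling (`onnxruntime.quantization` and `quantize_dynamic` are real; whether dynamic INT8 quantization suits your model is workload-dependent):
```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic quantization: store weights as INT8, roughly 4x smaller
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,
)
```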
TensorRT Build Errors¶
If TensorRT fails to build an engine for the model:
1. Fall back to the CUDA provider (see the sketch below).
2. Check that the model's operators are supported by TensorRT.
3. Update to a newer TensorRT version.
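A minimal fallback sketch, assuming `GPUXRuntime` raises an exception when the requested provider fails to initialize (that failure mode is an assumption):
```python
from gpux import GPUXRuntime

# Prefer TensorRT; fall back to CUDA if the engine build fails
try:
    runtime = GPUXRuntime(model_path="model.onnx", provider="tensorrt")
except Exception:
    runtime = GPUXRuntime(model_path="model.onnx", provider="cuda")
```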
Best Practices¶
Use TensorRT in Production
TensorRT provides the best performance for production workloads; see the TensorRT example under Performance Optimization above.
Enable Tensor Cores
Use FP16 precision on RTX GPUs:
- Roughly 2x speedup
- Minimal accuracy loss
- Requires model conversion
Monitor GPU Utilization
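No example accompanies this practice; one hedged option is the `nvidia-ml-py` package (imported as `pynvml`, whose calls below are real), or simply watching `nvidia-smi` in a terminal:
```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

# Query compute utilization and memory usage
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU: {util.gpu}% | VRAM: {mem.used / 1e9:.1f} / {mem.total / 1e9:.1f} GB")

pynvml.nvmlShutdown()
```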
Driver Compatibility
Ensure the CUDA driver matches the toolkit version (check the driver with `nvidia-smi`):
- CUDA 12.0 requires driver 525+
- CUDA 11.8 requires driver 520+
Advanced Configuration¶
Multi-GPU Setup¶
```python
import os

# Select a specific GPU before the runtime initializes CUDA;
# "0" exposes only the first GPU, "0,1" would expose two.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from gpux import GPUXRuntime

runtime = GPUXRuntime(
    model_path="model.onnx",
    provider="cuda",
)
```
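For multi-GPU throughput, a common pattern is one process per GPU, each pinned via `CUDA_VISIBLE_DEVICES`; a hedged sketch (how GPUX shards work across processes is an assumption):
```python
import multiprocessing as mp
import os

def worker(gpu_id: int) -> None:
    # Pin this process to a single GPU before CUDA initializes
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    from gpux import GPUXRuntime

    runtime = GPUXRuntime(model_path="model.onnx", provider="cuda")
    # ... run this process's share of the inference workload ...

if __name__ == "__main__":
    processes = [mp.Process(target=worker, args=(i,)) for i in range(2)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```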