Performance Optimization¶
Optimize models for maximum throughput and minimal latency.
🎯 Optimization Strategies¶
1. Model Optimization¶
Quantization¶
Convert models to lower precision for better performance:
```python
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    "model.onnx",
    "model_quant.onnx",
    weight_type=QuantType.QUInt8,
)
```
Benefits:

- 4x smaller models
- 2-4x faster inference
- Minimal accuracy loss

Types:

- Dynamic Quantization: Quantize weights to INT8 (easiest, good speedup)
- Static Quantization: Quantize weights and activations (best performance)
- FP16: Half precision (2x speedup on RTX GPUs; see the sketch below)
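For FP16 conversion, one option is the onnxconverter-common package (a separate install, not part of GPUX); a minimal sketch, assuming the model fits in memory:

```python
import onnx
from onnxconverter_common import float16

# Load the FP32 model and convert weights and ops to FP16
model = onnx.load("model.onnx")
model_fp16 = float16.convert_float_to_float16(model)
onnx.save(model_fp16, "model_fp16.onnx")
```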
2. Batch Size Tuning¶
Guidelines:

| Batch Size | Use Case         | Latency | Throughput |
|------------|------------------|---------|------------|
| 1          | Real-time        | Low     | Low        |
| 8-16       | Balanced         | Medium  | Medium     |
| 32-64      | Batch processing | High    | High       |
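As a rough illustration with plain ONNX Runtime (the input name `input` and the image shape are assumptions for the example), batch size is just the leading dimension of the input tensor:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Shape (1, 224, 224, 3) for real-time use, (32, 224, 224, 3) for batch processing
batch = np.random.rand(32, 224, 224, 3).astype(np.float32)
outputs = session.run(None, {"input": batch})
```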
3. Provider Selection¶
Performance ranking:

1. TensorRT (NVIDIA) - Best
2. CUDA (NVIDIA)
3. CoreML (Apple Silicon)
4. ROCm (AMD)
5. CPU - Fallback
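With ONNX Runtime directly, providers can be passed in priority order and the runtime falls back down the list; a minimal sketch (which providers are available depends on your build and hardware):

```python
import onnxruntime as ort

# Listed in priority order; ONNX Runtime falls back to the next
# provider that is actually available in this build.
session = ort.InferenceSession(
    "model.onnx",
    providers=[
        "TensorrtExecutionProvider",
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)
print(session.get_providers())  # providers actually in use
```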
4. Memory Optimization¶
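One common lever is ONNX Runtime's session and CUDA provider options; a minimal sketch with illustrative values (the 2 GB limit and the strategy choice are assumptions, not recommendations):

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.enable_mem_pattern = True    # reuse allocation patterns across runs
sess_options.enable_cpu_mem_arena = True  # arena allocator for CPU tensors

session = ort.InferenceSession(
    "model.onnx",
    sess_options=sess_options,
    providers=[
        ("CUDAExecutionProvider", {
            "gpu_mem_limit": 2 * 1024 ** 3,            # cap the GPU arena (illustrative value)
            "arena_extend_strategy": "kSameAsRequested",
        }),
        "CPUExecutionProvider",
    ],
)
```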
5. Graph Optimization¶
Enable ONNX Runtime optimizations:
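A minimal sketch using ONNX Runtime's session options (the optimized-model path is an example name):

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
# Apply all graph-level optimizations (constant folding, node fusion, layout changes)
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Optionally serialize the optimized graph so the optimization cost is paid once
sess_options.optimized_model_filepath = "model_optimized.onnx"

session = ort.InferenceSession("model.onnx", sess_options=sess_options)
```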
📊 Benchmarking¶
Compare optimizations:
```bash
# Baseline
gpux run model --benchmark --runs 1000

# Optimized
gpux run model_optimized --benchmark --runs 1000
```
💡 Key Takeaways¶
Success
✅ Quantization reduces size and improves speed

✅ Batch size affects the latency/throughput tradeoff

✅ Provider selection is critical for performance

✅ Memory allocation impacts stability

✅ Always benchmark before and after
Previous: Custom Providers | Next: Memory Management →