Performance Optimization

Optimize models for maximum throughput and minimal latency.


🎯 Optimization Strategies

1. Model Optimization

Quantization

Convert models to lower precision for better performance:

from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    "model.onnx",        # input: full-precision model
    "model_quant.onnx",  # output: quantized model
    weight_type=QuantType.QUInt8,
)

Benefits:

- 4x smaller models
- 2-4x faster inference
- Minimal accuracy loss

Types:

- Dynamic Quantization: Quantize weights to INT8 (easiest, good speedup)
- Static Quantization: Quantize weights and activations (best performance)
- FP16: Half precision (2x speedup on RTX GPUs); see the conversion sketch below
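
For the FP16 path, one common route is the onnxconverter-common package; a minimal sketch, assuming onnxconverter-common is installed alongside onnx:

import onnx
from onnxconverter_common import float16

model = onnx.load("model.onnx")
model_fp16 = float16.convert_float_to_float16(model)  # cast initializers and ops to FP16
onnx.save(model_fp16, "model_fp16.onnx")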

2. Batch Size Tuning

runtime:
  batch_size: 32  # Optimize for throughput

Guidelines:

| Batch Size | Use Case         | Latency | Throughput |
|------------|------------------|---------|------------|
| 1          | Real-time        | Low     | Low        |
| 8-16       | Balanced         | Medium  | Medium     |
| 32-64      | Batch processing | High    | High       |
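
To see the tradeoff concretely, the sketch below times one ONNX Runtime session at several batch sizes; it assumes a model with a dynamic batch dimension and a single 3x224x224 float32 input:

import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")
name = sess.get_inputs()[0].name

for batch in (1, 16, 64):
    x = np.random.rand(batch, 3, 224, 224).astype(np.float32)
    sess.run(None, {name: x})  # warm-up
    t0 = time.perf_counter()
    for _ in range(100):
        sess.run(None, {name: x})
    dt = (time.perf_counter() - t0) / 100
    print(f"batch={batch}: {dt * 1000:.1f} ms/call, {batch / dt:.0f} items/s")

Larger batches amortize per-call overhead, so throughput rises even as per-call latency grows.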

3. Provider Selection

Performance ranking:

1. TensorRT (NVIDIA) - Best
2. CUDA (NVIDIA)
3. CoreML (Apple Silicon)
4. ROCm (AMD)
5. CPU - Fallback
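
To pin this order in ONNX Runtime directly rather than through gpux configuration, a sketch; the runtime falls through the list until it finds an available provider:

import onnxruntime as ort

sess = ort.InferenceSession(
    "model.onnx",
    providers=[
        "TensorrtExecutionProvider",  # tried first if available
        "CUDAExecutionProvider",
        "CPUExecutionProvider",       # always-available fallback
    ],
)
print(sess.get_providers())  # reports the providers actually in use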

4. Memory Optimization

runtime:
  gpu:
    memory: 4GB  # Allocate sufficient memory
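
In ONNX Runtime's CUDA provider, the comparable setting is gpu_mem_limit, specified in bytes; a minimal sketch:

import onnxruntime as ort

sess = ort.InferenceSession(
    "model.onnx",
    providers=[
        ("CUDAExecutionProvider", {"gpu_mem_limit": 4 * 1024 ** 3}),  # cap at 4 GB
        "CPUExecutionProvider",
    ],
)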

5. Graph Optimization

ONNX Runtime applies graph-level optimizations such as constant folding and node fusion automatically; enable profiling to verify which optimizations were applied:

runtime:
  enable_profiling: true
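
To set the optimization level in ONNX Runtime itself, and optionally save the optimized graph for inspection, a minimal sketch:

import onnxruntime as ort

opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
opts.optimized_model_filepath = "model_opt.onnx"  # persist the optimized graph to disk
sess = ort.InferenceSession("model.onnx", sess_options=opts)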


📊 Benchmarking

Compare optimizations:

# Baseline
gpux run model --benchmark --runs 1000

# Optimized
gpux run model_optimized --benchmark --runs 1000
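
For raw numbers outside the CLI, a hand-rolled harness works too; this sketch assumes each model takes a single float32 input and fixes any dynamic dimensions to 1:

import time
import numpy as np
import onnxruntime as ort

def bench(path, runs=1000):
    sess = ort.InferenceSession(path)
    inp = sess.get_inputs()[0]
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]  # dynamic dims -> 1
    x = np.random.rand(*shape).astype(np.float32)
    for _ in range(10):  # warm-up
        sess.run(None, {inp.name: x})
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        sess.run(None, {inp.name: x})
        times.append(time.perf_counter() - t0)
    times.sort()
    print(f"{path}: p50={times[runs // 2] * 1000:.2f} ms, "
          f"p95={times[int(runs * 0.95)] * 1000:.2f} ms")

bench("model.onnx")
bench("model_quant.onnx")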

💡 Key Takeaways

✅ Quantization reduces size and improves speed
✅ Batch size affects the latency/throughput tradeoff
✅ Provider selection is critical for performance
✅ Memory allocation impacts stability
✅ Always benchmark before and after

