Testing Modern HuggingFace Audio Models¶
Complete guide for testing GPUX with real and modern HuggingFace audio models using the GPUX CLI.
🎯 What is this?¶
This guide shows you how to test GPUX with modern HuggingFace audio models using the gpux pull and gpux run commands, including:
- Whisper (OpenAI) - Speech recognition
- Wav2Vec2 (Facebook) - Speech recognition
- HuBERT (Facebook) - Speech recognition
- SpeechT5 (Microsoft) - Speech synthesis
- And more modern models
🚀 Quick Start¶
1. Download an Audio Model¶
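```bash
gpux pull openai/whisper-tiny
```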
This command:
- Downloads the model from HuggingFace Hub
- Automatically converts it to ONNX format
- Generates the gpux.yml configuration with audio preprocessing
- Saves everything in ~/.gpux/models/
2. Run Inference with Audio¶
# Run inference with an audio file
gpux run openai/whisper-tiny \
--input '{"audio": "/path/to/your/audio.wav"}'
Audio preprocessing is applied automatically according to the model configuration.
3. Complete Workflow¶
# 1. Download the model
gpux pull openai/whisper-base
# 2. Run inference
gpux run openai/whisper-base \
--input '{"audio": "path/to/audio.wav"}' \
--output results.json
📋 Available Models¶
Recommended Models to Get Started¶
| Model | Size | Description |
|---|---|---|
| `openai/whisper-tiny` | ~39MB | Whisper tiny - fast and lightweight |
| `openai/whisper-base` | ~74MB | Whisper base - good balance |
| `openai/whisper-small` | ~244MB | Whisper small - better accuracy |
| `facebook/wav2vec2-base-960h` | ~315MB | Wav2Vec2 base trained on 960h of speech |
| `facebook/hubert-base-ls960` | ~315MB | HuBERT base for speech recognition |
| `microsoft/speecht5_tts` | ~500MB | SpeechT5 for speech synthesis |
Advanced Models (larger)¶
| Model | Size | Description |
|---|---|---|
| `facebook/wav2vec2-large-960h-lv60-self` | ~1.1GB | Wav2Vec2 large with self-training |
| `facebook/seamless-m4t-medium` | ~1.2GB | SeamlessM4T multilingual |
| `facebook/mms-1b-all` | ~4GB | MMS, 1B parameters (very large) |
💻 Detailed Usage with CLI¶
gpux pull Command¶
Downloads and converts models from HuggingFace Hub:
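```bash
gpux pull <model-id> [OPTIONS]
```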
Available options:
- `--registry, -r`: Registry to use (default: `huggingface`)
- `--revision`: Model revision/branch (default: `main`)
- `--force, -f`: Force re-download even if the model exists locally
- `--opset`: ONNX opset version to use
- `--verbose, -v`: Verbose output
Examples:
# Download Whisper Tiny
gpux pull openai/whisper-tiny
# Download with specific revision
gpux pull openai/whisper-base --revision main
# Force re-download
gpux pull openai/whisper-small --force
gpux run Command¶
Runs inference on downloaded or local models:
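```bash
gpux run <model-name> [OPTIONS]
```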
Available options:
- `--input, -i`: Input data (a JSON string, or a file when prefixed with `@`)
- `--file, -f`: Input file
- `--output, -o`: Save results to a file
- `--provider, -p`: Execution provider (`cuda`, `coreml`, `rocm`, etc.)
- `--benchmark`: Run a benchmark instead of a single inference
- `--verbose, -v`: Verbose output
Examples:
Example 1: Test Whisper with your own audio¶
# Download the model first
gpux pull openai/whisper-tiny
# Run inference
gpux run openai/whisper-tiny \
--input '{"audio": "/path/to/my/recording.wav"}'
Example 2: Use JSON input file¶
Create an input.json file:
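```json
{
  "audio": "/path/to/my/recording.wav"
}
```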
Run:
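```bash
gpux run openai/whisper-tiny --file input.json
```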
Or with the @ prefix:
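```bash
gpux run openai/whisper-tiny --input @input.json
```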
Example 3: Save results to file¶
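Use `--output` to write the inference results to a JSON file:

```bash
gpux run openai/whisper-tiny \
  --input '{"audio": "my_audio.wav"}' \
  --output results.json
```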
Example 4: Test multiple models with the same audio¶
# Whisper Base
gpux pull openai/whisper-base
gpux run openai/whisper-base \
--input '{"audio": "my_audio.wav"}' \
--output whisper_base_results.json
# Wav2Vec2
gpux pull facebook/wav2vec2-base-960h
gpux run facebook/wav2vec2-base-960h \
--input '{"audio": "my_audio.wav"}' \
--output wav2vec2_results.json
Example 5: Use audio URL¶
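If the runtime can fetch remote files (as this example's title implies), you can pass a URL in place of a local path; the URL below is a placeholder:

```bash
gpux run openai/whisper-tiny \
  --input '{"audio": "https://example.com/sample.wav"}'
```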
Example 6: Performance benchmark¶
gpux run openai/whisper-tiny \
--input '{"audio": "audio.wav"}' \
--benchmark \
--runs 100 \
--warmup 10
Example 7: Specify GPU provider¶
# Use CoreML on Apple Silicon
gpux run openai/whisper-tiny \
--input '{"audio": "audio.wav"}' \
--provider coreml
# Use CUDA on NVIDIA
gpux run openai/whisper-tiny \
--input '{"audio": "audio.wav"}' \
--provider cuda
🔍 What GPUX Does¶
When you use gpux pull and gpux run, GPUX automatically performs:
- Model Pull (`gpux pull`): Downloads the model from HuggingFace Hub
- ONNX Conversion: Automatically converts the PyTorch model to ONNX format
- Configuration Generation: Creates `gpux.yml` with audio preprocessing configured
- Audio Preparation (`gpux run`): Automatically loads and preprocesses the audio file
- Inference: Runs inference using GPUX Runtime with the best available GPU provider
- Results: Displays results in the console or saves them to a file
Complete Flow¶
┌─────────────────┐
│ HuggingFace │
│ Hub │
└────────┬────────┘
│ Pull
▼
┌─────────────────┐
│ PyTorch Model │
│ (local cache) │
└────────┬────────┘
│ Convert
▼
┌─────────────────┐
│ ONNX Model │
│ (optimized) │
└────────┬────────┘
│ Load
▼
┌─────────────────┐
│ GPUX Runtime │
│ (with GPU) │
└────────┬────────┘
│ Infer
▼
┌─────────────────┐
│ Results │
└─────────────────┘
📊 Results¶
Console Output¶
GPUX shows detailed information during execution:
When pulling:
Downloading openai/whisper-tiny...
✓ Model downloaded successfully
Converting to ONNX...
✓ Model converted to ONNX
✓ Configuration generated: gpux.yml
When running inference, GPUX prints similar progress information along with the inference results.
Save Results to File¶
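```bash
gpux run openai/whisper-tiny \
  --input '{"audio": "audio.wav"}' \
  --output results.json
```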
The results.json file will contain the complete inference results.
🎵 Supported Audio Formats¶
GPUX supports the following audio formats:
- WAV (recommended)
- MP3
- FLAC
Create a Test Audio File¶
If you don't have an audio file, you can create a test one:
import numpy as np
import soundfile as sf
# Generate test signal (1 second, 440 Hz)
duration = 1.0
sample_rate = 16000
t = np.linspace(0, duration, int(sample_rate * duration))
audio = np.sin(2 * np.pi * 440 * t).astype(np.float32)
# Save
sf.write("test_audio.wav", audio, sample_rate)
Alternatively, run the `scripts/test_huggingface_audio_models.py` test script (see below) without `--audio` and it will generate a test file automatically.
⚙️ Advanced Configuration¶
Use Already Downloaded Models¶
If you already have a downloaded and converted model, you can skip the conversion:
uv run python scripts/test_huggingface_audio_models.py \
--model openai/whisper-tiny \
--skip-conversion \
--audio my_audio.wav
Specify Cache Directory¶
Models are saved in ~/.gpux/models/ by default. You can change this by modifying the script or using environment variables.
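For example, to inspect the local model cache:

```bash
ls ~/.gpux/models/
```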
🐛 Troubleshooting¶
Error: "Model not found"¶
Solution: Verify that the model ID is correct. You can search for audio models on HuggingFace Hub: https://huggingface.co/models?pipeline_tag=automatic-speech-recognition
Error: "Failed to convert to ONNX"¶
Solution: Some models may require additional configuration:
- Verify you have all dependencies: `uv sync`
- Try a smaller model first: `openai/whisper-tiny`
- Review the logs with `--verbose`:
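```bash
gpux pull openai/whisper-tiny --verbose
```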
Error: "Audio file not found"¶
Solution: Verify the path to the audio file is correct:
# Verify the file exists
ls -la /path/to/audio.wav
# Use absolute path
gpux run openai/whisper-tiny \
--input '{"audio": "/absolute/path/to/audio.wav"}'
Error: "Out of memory"¶
Solution: Try with a smaller model:
# Use whisper-tiny instead of whisper-large
gpux pull openai/whisper-tiny
gpux run openai/whisper-tiny \
--input '{"audio": "my_audio.wav"}'
Error: "Preprocessing failed"¶
Solution: Verify the audio file is valid and the format is compatible (WAV, MP3, FLAC):
# Verify file format
file audio.wav
# Try with another audio file
gpux run openai/whisper-tiny \
--input '{"audio": "other_audio.wav"}'
📚 Models by Task¶
Speech Recognition (ASR)¶
- `openai/whisper-tiny` - Fast and lightweight
- `openai/whisper-base` - Good balance
- `openai/whisper-small` - Better accuracy
- `facebook/wav2vec2-base-960h` - Trained on 960h of speech
- `facebook/hubert-base-ls960` - HuBERT base model
Speech Synthesis (TTS)¶
- `microsoft/speecht5_tts` - Text-to-speech
Multilingual¶
- `facebook/seamless-m4t-medium` - Multimodal multilingual
- `facebook/mms-1b-all` - MMS, 1B parameters
🎯 Next Steps¶
Once you've tested the models:
- Integrate into your application: Use the converted models in your code
- Optimize performance: Adjust configuration according to your hardware
- Test with real audio: Use real audio files from your use case
- Explore more models: Search for other models on HuggingFace Hub
📖 References¶
- HuggingFace Audio Models
- GPUX Preprocessing Guide
- GPUX Configuration Reference
- ONNX Runtime Documentation
💡 Tips¶
- Start small: Try `whisper-tiny` first
- Use real audio: Models work better with real audio than with synthetic signals
- Check the logs: If something fails, review the error messages for more details
- Save results: The JSON output file lets you compare results across models
Enjoy testing modern audio models with GPUX! 🎵