
Testing Modern HuggingFace Audio Models

A complete guide to testing GPUX with modern HuggingFace audio models using the GPUX CLI.


🎯 What is this?

This guide shows you how to test GPUX with modern HuggingFace audio models using the gpux pull and gpux run commands, including:

  • Whisper (OpenAI) - Speech recognition
  • Wav2Vec2 (Facebook) - Speech recognition
  • HuBERT (Facebook) - Speech recognition
  • SpeechT5 (Microsoft) - Speech synthesis
  • And more modern models

🚀 Quick Start

1. Download an Audio Model

# Download Whisper Tiny (small and fast model)
gpux pull openai/whisper-tiny

This command:

  • Downloads the model from HuggingFace Hub
  • Automatically converts it to ONNX format
  • Generates the gpux.yml configuration with audio preprocessing
  • Saves everything in ~/.gpux/models/

2. Run Inference with Audio

# Run inference with an audio file
gpux run openai/whisper-tiny \
  --input '{"audio": "/path/to/your/audio.wav"}'

Audio preprocessing is applied automatically according to the model configuration.
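
For a sense of what the model actually receives, Whisper's preprocessing is roughly what the model's HuggingFace feature extractor does: resample to 16 kHz mono, pad or trim to 30 seconds, and compute log-mel features. A sketch using librosa and transformers (GPUX's exact pipeline may differ):

import librosa
from transformers import WhisperFeatureExtractor

# Whisper expects 16 kHz mono audio; librosa resamples on load
audio, sr = librosa.load("audio.wav", sr=16000, mono=True)

# Pads/trims to 30 s and computes log-mel features
extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
features = extractor(audio, sampling_rate=sr, return_tensors="np")
print(features["input_features"].shape)  # (1, 80, 3000) for whisper-tiny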

3. Complete Workflow in One Step

# 1. Download the model
gpux pull openai/whisper-base

# 2. Run inference
gpux run openai/whisper-base \
  --input '{"audio": "path/to/audio.wav"}' \
  --output results.json

📋 Available Models

Model                          Size     Description
openai/whisper-tiny            ~39MB    Whisper tiny - fast and lightweight
openai/whisper-base            ~74MB    Whisper base - good balance
openai/whisper-small           ~244MB   Whisper small - better accuracy
facebook/wav2vec2-base-960h    ~315MB   Wav2Vec2 base, fine-tuned on 960h of LibriSpeech
facebook/hubert-base-ls960     ~315MB   HuBERT base for speech recognition
microsoft/speecht5_tts         ~500MB   SpeechT5 for speech synthesis

Advanced Models (larger)

Model                                     Size     Description
facebook/wav2vec2-large-960h-lv60-self    ~1.1GB   Wav2Vec2 large with self-training
facebook/seamless-m4t-medium              ~1.2GB   SeamlessM4T multilingual model
facebook/mms-1b-all                       ~4GB     MMS, 1B parameters (very large)

💻 Detailed Usage with CLI

gpux pull Command

Downloads and converts models from HuggingFace Hub:

gpux pull MODEL_ID [OPTIONS]

Available options:

  • --registry, -r: Registry to use (default: huggingface)
  • --revision: Model revision/branch (default: main)
  • --force, -f: Force re-download even if it exists locally
  • --opset: ONNX opset version to use
  • --verbose, -v: Verbose output

Examples:

# Download Whisper Tiny
gpux pull openai/whisper-tiny

# Download with specific revision
gpux pull openai/whisper-base --revision main

# Force re-download
gpux pull openai/whisper-small --force

gpux run Command

Runs inference on downloaded or local models:

gpux run MODEL_NAME [OPTIONS]

Available options:

  • --input, -i: Input data (a JSON string, or a file path prefixed with @)
  • --file, -f: Input file
  • --output, -o: Save results to a file
  • --provider, -p: Execution provider (cuda, coreml, rocm, etc.)
  • --benchmark: Run a benchmark instead of a single inference
  • --runs: Number of timed iterations when benchmarking (see Example 6)
  • --warmup: Number of warmup iterations before timing (see Example 6)
  • --verbose, -v: Verbose output

Examples:

Example 1: Test Whisper with your own audio

# Download the model first
gpux pull openai/whisper-tiny

# Run inference
gpux run openai/whisper-tiny \
  --input '{"audio": "/path/to/my/recording.wav"}'

Example 2: Use JSON input file

Create an input.json file:

{
  "audio": "/path/to/audio.wav"
}

Run:

gpux run openai/whisper-tiny --file input.json

Or with the @ prefix:

gpux run openai/whisper-tiny --input @input.json

Example 3: Save results to file

gpux run openai/whisper-tiny \
  --input '{"audio": "audio.wav"}' \
  --output results.json

Example 4: Test multiple models with the same audio

# Whisper Base
gpux pull openai/whisper-base
gpux run openai/whisper-base \
  --input '{"audio": "my_audio.wav"}' \
  --output whisper_base_results.json

# Wav2Vec2
gpux pull facebook/wav2vec2-base-960h
gpux run facebook/wav2vec2-base-960h \
  --input '{"audio": "my_audio.wav"}' \
  --output wav2vec2_results.json
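
To automate this comparison, a small Python driver can shell out to the same two commands (a sketch; the model list and file names are just the ones from this example):

import json
import subprocess

models = ["openai/whisper-base", "facebook/wav2vec2-base-960h"]

for model in models:
    out_file = model.split("/")[-1] + "_results.json"
    # Same CLI calls as above: pull once, then run with the shared audio file
    subprocess.run(["gpux", "pull", model], check=True)
    subprocess.run(
        ["gpux", "run", model,
         "--input", json.dumps({"audio": "my_audio.wav"}),
         "--output", out_file],
        check=True,
    )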

Example 5: Use audio URL

gpux run openai/whisper-tiny \
  --input '{"audio": "https://example.com/audio.mp3"}'

Example 6: Performance benchmark

gpux run openai/whisper-tiny \
  --input '{"audio": "audio.wav"}' \
  --benchmark \
  --runs 100 \
  --warmup 10
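
The --runs/--warmup split is the usual benchmarking pattern: discard the first iterations (caches and lazy initialization warming up) and aggregate timings over the rest. A minimal sketch of the idea, with a hypothetical run_once callable standing in for one inference:

import statistics
import time

def benchmark(run_once, runs=100, warmup=10):
    # Untimed warmup iterations
    for _ in range(warmup):
        run_once()
    # Timed runs, in milliseconds
    times_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        times_ms.append((time.perf_counter() - start) * 1000)
    return {
        "mean_ms": statistics.mean(times_ms),
        "p95_ms": sorted(times_ms)[int(0.95 * len(times_ms)) - 1],
    }

# Example with a dummy workload
print(benchmark(lambda: sum(range(100_000)), runs=20, warmup=5))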

Example 7: Specify GPU provider

# Use CoreML on Apple Silicon
gpux run openai/whisper-tiny \
  --input '{"audio": "audio.wav"}' \
  --provider coreml

# Use CUDA on NVIDIA
gpux run openai/whisper-tiny \
  --input '{"audio": "audio.wav"}' \
  --provider cuda
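
GPUX runs ONNX models, so which providers are usable depends on your ONNX Runtime build. Assuming GPUX sits on ONNX Runtime (it converts models to ONNX), you can check what your machine supports:

import onnxruntime as ort

# Execution providers compiled into your onnxruntime package, e.g.
# CUDAExecutionProvider, CoreMLExecutionProvider, CPUExecutionProvider
print(ort.get_available_providers())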

🔍 What GPUX Does

When you use gpux pull and gpux run, GPUX performs the following steps automatically (a snippet for browsing the local model cache follows the list):

  1. Model Pull (gpux pull): Downloads the model from HuggingFace Hub
  2. ONNX Conversion: Automatically converts the PyTorch model to ONNX format
  3. Configuration Generation: Creates gpux.yml with audio preprocessing configured
  4. Audio Preparation (gpux run): Automatically loads and preprocesses the audio file
  5. Inference: Runs inference using GPUX Runtime with the best available GPU provider
  6. Results: Displays results in console or saves them to file
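
Pulled models land in ~/.gpux/models/ by default. A quick way to browse the local cache from Python (assuming only that the directory exists; the layout inside it is up to GPUX):

from pathlib import Path

# Default cache location from the docs above
models_dir = Path.home() / ".gpux" / "models"
if models_dir.exists():
    for entry in sorted(models_dir.iterdir()):
        print(entry.name)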

Complete Flow

┌─────────────────┐
│  HuggingFace    │
│      Hub        │
└────────┬────────┘
         │ Pull
┌─────────────────┐
│  PyTorch Model  │
│   (local cache) │
└────────┬────────┘
         │ Convert
┌─────────────────┐
│   ONNX Model    │
│  (optimized)    │
└────────┬────────┘
         │ Load
┌─────────────────┐
│  GPUX Runtime   │
│  (with GPU)     │
└────────┬────────┘
         │ Infer
┌─────────────────┐
│    Results      │
└─────────────────┘

📊 Results

Console Output

GPUX shows detailed information during execution:

When pulling:

Downloading openai/whisper-tiny...
✓ Model downloaded successfully
Converting to ONNX...
✓ Model converted to ONNX
✓ Configuration generated: gpux.yml

When running inference:

{
  "logits": [[...]],
  "last_hidden_state": [[...]]
}

Save Results to File

gpux run openai/whisper-tiny \
  --input '{"audio": "audio.wav"}' \
  --output results.json

The results.json file will contain the complete inference results.
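
Note that these are raw model outputs (logits, hidden states), not decoded text. As an illustration, logits from a CTC model such as wav2vec2 can be greedily decoded with the model's HuggingFace processor. This sketch assumes the output key is "logits", as in the sample output above, that arrays are saved as nested lists, and uses the wav2vec2_results.json from Example 4:

import json
import numpy as np
from transformers import Wav2Vec2Processor

with open("wav2vec2_results.json") as f:
    results = json.load(f)

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
logits = np.asarray(results["logits"])  # (batch, time, vocab)
pred_ids = logits.argmax(axis=-1)       # greedy CTC decoding
print(processor.batch_decode(pred_ids)[0])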


🎵 Supported Audio Formats

The following audio formats are supported:

  • WAV (recommended)
  • MP3
  • FLAC
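
If your audio is in another format, one way to convert it to 16 kHz mono WAV (using librosa and soundfile) is:

import librosa
import soundfile as sf

# Decode the source file and resample to 16 kHz mono, as ASR models expect
audio, sr = librosa.load("input.mp3", sr=16000, mono=True)
sf.write("input.wav", audio, sr)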

Create a Test Audio File

If you don't have an audio file, you can create a test one:

import numpy as np
import soundfile as sf

# Generate a test signal: 1 second of a 440 Hz sine tone at 16 kHz
duration = 1.0
sample_rate = 16000
# endpoint=False keeps the sample spacing exactly 1/sample_rate
t = np.linspace(0, duration, int(sample_rate * duration), endpoint=False)
audio = np.sin(2 * np.pi * 440 * t).astype(np.float32)

# Save
sf.write("test_audio.wav", audio, sample_rate)

Alternatively, run the test script (see Advanced Configuration below) without --audio and it will create a test file automatically.


⚙️ Advanced Configuration

Use Already Downloaded Models

If you already have a downloaded and converted model, you can skip the conversion:

uv run python scripts/test_huggingface_audio_models.py \
  --model openai/whisper-tiny \
  --skip-conversion \
  --audio my_audio.wav

Specify Cache Directory

Models are saved in ~/.gpux/models/ by default. You can change this by modifying the script or using environment variables.


🐛 Troubleshooting

Error: "Model not found"

Solution: Verify that the model ID is correct. You can search for models on HuggingFace Hub:

# Verify the model exists
gpux pull openai/whisper-tiny --verbose

Error: "Failed to convert to ONNX"

Solution: Some models may require additional configuration:

  1. Verify you have all dependencies: uv sync
  2. Try with a smaller model first: openai/whisper-tiny
  3. Review logs with --verbose:

gpux pull openai/whisper-tiny --verbose

Error: "Audio file not found"

Solution: Verify the path to the audio file is correct:

# Verify the file exists
ls -la /path/to/audio.wav

# Use absolute path
gpux run openai/whisper-tiny \
  --input '{"audio": "/absolute/path/to/audio.wav"}'

Error: "Out of memory"

Solution: Try with a smaller model:

# Use whisper-tiny instead of whisper-large
gpux pull openai/whisper-tiny
gpux run openai/whisper-tiny \
  --input '{"audio": "my_audio.wav"}'

Error: "Preprocessing failed"

Solution: Verify the audio file is valid and the format is compatible (WAV, MP3, FLAC):

# Verify file format
file audio.wav

# Try with another audio file
gpux run openai/whisper-tiny \
  --input '{"audio": "other_audio.wav"}'

📚 Models by Task

Speech Recognition (ASR)

  • openai/whisper-tiny - Fast and lightweight
  • openai/whisper-base - Good balance
  • openai/whisper-small - Better accuracy
  • facebook/wav2vec2-base-960h - Fine-tuned on 960 hours of LibriSpeech
  • facebook/hubert-base-ls960 - HuBERT model

Speech Synthesis (TTS)

  • microsoft/speecht5_tts - Text-to-speech

Multilingual

  • facebook/seamless-m4t-medium - Multimodal multilingual
  • facebook/mms-1b-all - MMS 1B parameters

🎯 Next Steps

Once you've tested the models:

  1. Integrate into your application: Use the converted models in your code
  2. Optimize performance: Adjust configuration according to your hardware
  3. Test with real audio: Use real audio files from your use case
  4. Explore more models: Search for other models on HuggingFace Hub


💡 Tips

  • Start small: Try whisper-tiny first
  • Use real audio: Models work better with real audio than synthetic signals
  • Check logs: If something fails, review error messages for more details
  • Save results: The JSON file allows you to compare results between models

Enjoy testing modern audio models with GPUX! 🎵