
Testing Modern HuggingFace Audio Models

A complete guide to testing GPUX with modern HuggingFace audio models using the GPUX CLI.


🎯 What is this?

This guide shows you how to test GPUX with modern HuggingFace audio models using the gpux pull and gpux run commands, including:

  • Whisper (OpenAI) - Speech recognition
  • Wav2Vec2 (Facebook) - Speech recognition
  • HuBERT (Facebook) - Speech recognition
  • SpeechT5 (Microsoft) - Speech synthesis
  • And more modern models

🚀 Quick Start

1. Download an Audio Model

# Download Whisper Tiny (small and fast model)
gpux pull openai/whisper-tiny

This command:

  • Downloads the model from HuggingFace Hub
  • Automatically converts it to ONNX format
  • Generates the gpux.yml configuration with audio preprocessing
  • Saves everything in ~/.gpux/models/

2. Run Inference with Audio

# Run inference with an audio file
gpux run openai/whisper-tiny \
  --input '{"audio": "/path/to/your/audio.wav"}'

Audio preprocessing is applied automatically according to the model configuration.
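
For a sense of what the model actually receives, Whisper's preprocessing is roughly what the model's HuggingFace feature extractor does: resample to 16 kHz mono, pad or trim to 30 seconds, and compute log-mel features. A sketch using librosa and transformers (GPUX's exact pipeline may differ):

import librosa
from transformers import WhisperFeatureExtractor

# Whisper expects 16 kHz mono audio; librosa resamples on load
audio, sr = librosa.load("audio.wav", sr=16000, mono=True)

# Pads/trims to 30 s and computes log-mel features
extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
features = extractor(audio, sampling_rate=sr, return_tensors="np")
print(features["input_features"].shape)  # (1, 80, 3000) for whisper-tiny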

3. Complete Workflow in One Step

# 1. Download the model
gpux pull openai/whisper-base

# 2. Run inference
gpux run openai/whisper-base \
  --input '{"audio": "path/to/audio.wav"}' \
  --output results.json

📋 Available Models

Model                          Size     Description
openai/whisper-tiny            ~39MB    Whisper tiny - fast and lightweight
openai/whisper-base            ~74MB    Whisper base - good balance
openai/whisper-small           ~244MB   Whisper small - better accuracy
facebook/wav2vec2-base-960h    ~315MB   Wav2Vec2 base, fine-tuned on 960h of LibriSpeech
facebook/hubert-base-ls960     ~315MB   HuBERT base for speech recognition
microsoft/speecht5_tts         ~500MB   SpeechT5 for speech synthesis

Advanced Models (larger)

Model                                     Size     Description
facebook/wav2vec2-large-960h-lv60-self    ~1.1GB   Wav2Vec2 large with self-training
facebook/seamless-m4t-medium              ~1.2GB   SeamlessM4T multilingual model
facebook/mms-1b-all                       ~4GB     MMS, 1B parameters (very large)

💻 Detailed Usage with CLI

gpux pull Command

Downloads and converts models from HuggingFace Hub:

gpux pull MODEL_ID [OPTIONS]

Available options:

  • --registry, -r: Registry to use (default: huggingface)
  • --revision: Model revision/branch (default: main)
  • --force, -f: Force re-download even if it exists locally
  • --opset: ONNX opset version to use
  • --verbose, -v: Verbose output

Examples:

# Download Whisper Tiny
gpux pull openai/whisper-tiny

# Download with specific revision
gpux pull openai/whisper-base --revision main

# Force re-download
gpux pull openai/whisper-small --force

gpux run Command

Runs inference on downloaded or local models:

gpux run MODEL_NAME [OPTIONS]

Available options:

  • --input, -i: Input data (a JSON string, or a file path prefixed with @)
  • --file, -f: Input file
  • --output, -o: Save results to a file
  • --provider, -p: Execution provider (cuda, coreml, rocm, etc.)
  • --benchmark: Run a benchmark instead of a single inference
  • --runs: Number of timed iterations when benchmarking (see Example 6)
  • --warmup: Number of warmup iterations before timing (see Example 6)
  • --verbose, -v: Verbose output

Examples:

Example 1: Test Whisper with your own audio

# Download the model first
gpux pull openai/whisper-tiny

# Run inference
gpux run openai/whisper-tiny \
  --input '{"audio": "/path/to/my/recording.wav"}'

Example 2: Use JSON input file

Create an input.json file:

{
  "audio": "/path/to/audio.wav"
}

Run:

gpux run openai/whisper-tiny --file input.json

Or with the @ prefix:

gpux run openai/whisper-tiny --input @input.json

Example 3: Save results to file

gpux run openai/whisper-tiny \
  --input '{"audio": "audio.wav"}' \
  --output results.json

Example 4: Test multiple models with the same audio

# Whisper Base
gpux pull openai/whisper-base
gpux run openai/whisper-base \
  --input '{"audio": "my_audio.wav"}' \
  --output whisper_base_results.json

# Wav2Vec2
gpux pull facebook/wav2vec2-base-960h
gpux run facebook/wav2vec2-base-960h \
  --input '{"audio": "my_audio.wav"}' \
  --output wav2vec2_results.json
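
To automate this comparison, a small Python driver can shell out to the same two commands (a sketch; the model list and file names are just the ones from this example):

import json
import subprocess

models = ["openai/whisper-base", "facebook/wav2vec2-base-960h"]

for model in models:
    out_file = model.split("/")[-1] + "_results.json"
    # Same CLI calls as above: pull once, then run with the shared audio file
    subprocess.run(["gpux", "pull", model], check=True)
    subprocess.run(
        ["gpux", "run", model,
         "--input", json.dumps({"audio": "my_audio.wav"}),
         "--output", out_file],
        check=True,
    )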

Example 5: Use audio URL

gpux run openai/whisper-tiny \
  --input '{"audio": "https://example.com/audio.mp3"}'

Example 6: Performance benchmark

gpux run openai/whisper-tiny \
  --input '{"audio": "audio.wav"}' \
  --benchmark \
  --runs 100 \
  --warmup 10
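
The --runs/--warmup split is the usual benchmarking pattern: discard the first iterations (caches and lazy initialization warming up) and aggregate timings over the rest. A minimal sketch of the idea, with a hypothetical run_once callable standing in for one inference:

import statistics
import time

def benchmark(run_once, runs=100, warmup=10):
    # Untimed warmup iterations
    for _ in range(warmup):
        run_once()
    # Timed runs, in milliseconds
    times_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        times_ms.append((time.perf_counter() - start) * 1000)
    return {
        "mean_ms": statistics.mean(times_ms),
        "p95_ms": sorted(times_ms)[int(0.95 * len(times_ms)) - 1],
    }

# Example with a dummy workload
print(benchmark(lambda: sum(range(100_000)), runs=20, warmup=5))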

Example 7: Specify GPU provider

# Use CoreML on Apple Silicon
gpux run openai/whisper-tiny \
  --input '{"audio": "audio.wav"}' \
  --provider coreml

# Use CUDA on NVIDIA
gpux run openai/whisper-tiny \
  --input '{"audio": "audio.wav"}' \
  --provider cuda
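
GPUX runs ONNX models, so which providers are usable depends on your ONNX Runtime build. Assuming GPUX sits on ONNX Runtime (it converts models to ONNX), you can check what your machine supports:

import onnxruntime as ort

# Execution providers compiled into your onnxruntime package, e.g.
# CUDAExecutionProvider, CoreMLExecutionProvider, CPUExecutionProvider
print(ort.get_available_providers())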

🔍 What GPUX Does

When you use gpux pull and gpux run, GPUX performs the following steps automatically (a snippet for browsing the local model cache follows the list):

  1. Model Pull (gpux pull): Downloads the model from HuggingFace Hub
  2. ONNX Conversion: Automatically converts the PyTorch model to ONNX format
  3. Configuration Generation: Creates gpux.yml with audio preprocessing configured
  4. Audio Preparation (gpux run): Automatically loads and preprocesses the audio file
  5. Inference: Runs inference using GPUX Runtime with the best available GPU provider
  6. Results: Displays results in console or saves them to file
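
Pulled models land in ~/.gpux/models/ by default. A quick way to browse the local cache from Python (assuming only that the directory exists; the layout inside it is up to GPUX):

from pathlib import Path

# Default cache location from the docs above
models_dir = Path.home() / ".gpux" / "models"
if models_dir.exists():
    for entry in sorted(models_dir.iterdir()):
        print(entry.name)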

Complete Flow

┌─────────────────┐
│  HuggingFace    │
│      Hub        │
└────────┬────────┘
         │ Pull
┌─────────────────┐
│  PyTorch Model  │
│   (local cache) │
└────────┬────────┘
         │ Convert
┌─────────────────┐
│   ONNX Model    │
│  (optimized)    │
└────────┬────────┘
         │ Load
┌─────────────────┐
│  GPUX Runtime   │
│  (with GPU)     │
└────────┬────────┘
         │ Infer
┌─────────────────┐
│    Results      │
└─────────────────┘

📊 Results

Console Output

GPUX shows detailed information during execution:

When pulling:

Downloading openai/whisper-tiny...
✓ Model downloaded successfully
Converting to ONNX...
✓ Model converted to ONNX
✓ Configuration generated: gpux.yml

When running inference:

{
  "logits": [[...]],
  "last_hidden_state": [[...]]
}

Save Results to File

gpux run openai/whisper-tiny \
  --input '{"audio": "audio.wav"}' \
  --output results.json

The results.json file will contain the complete inference results.
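
Note that these are raw model outputs (logits, hidden states), not decoded text. As an illustration, logits from a CTC model such as wav2vec2 can be greedily decoded with the model's HuggingFace processor. This sketch assumes the output key is "logits", as in the sample output above, that arrays are saved as nested lists, and uses the wav2vec2_results.json from Example 4:

import json
import numpy as np
from transformers import Wav2Vec2Processor

with open("wav2vec2_results.json") as f:
    results = json.load(f)

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
logits = np.asarray(results["logits"])  # (batch, time, vocab)
pred_ids = logits.argmax(axis=-1)       # greedy CTC decoding
print(processor.batch_decode(pred_ids)[0])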


🎵 Supported Audio Formats

The following audio formats are supported:

  • WAV (recommended)
  • MP3
  • FLAC
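
If your audio is in another format, one way to convert it to 16 kHz mono WAV (using librosa and soundfile) is:

import librosa
import soundfile as sf

# Decode the source file and resample to 16 kHz mono, as ASR models expect
audio, sr = librosa.load("input.mp3", sr=16000, mono=True)
sf.write("input.wav", audio, sr)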

Create a Test Audio File

If you don't have an audio file, you can create a test one:

import numpy as np
import soundfile as sf

# Generate a test signal: 1 second of a 440 Hz sine tone at 16 kHz
duration = 1.0
sample_rate = 16000
# endpoint=False keeps the sample spacing exactly 1/sample_rate
t = np.linspace(0, duration, int(sample_rate * duration), endpoint=False)
audio = np.sin(2 * np.pi * 440 * t).astype(np.float32)

# Save
sf.write("test_audio.wav", audio, sample_rate)

Alternatively, run the test script (see Advanced Configuration below) without --audio and it will create a test file automatically.


⚙️ Advanced Configuration

Use Already Downloaded Models

If you already have a downloaded and converted model, you can skip the conversion:

uv run python scripts/test_huggingface_audio_models.py \
  --model openai/whisper-tiny \
  --skip-conversion \
  --audio my_audio.wav

Specify Cache Directory

Models are saved in ~/.gpux/models/ by default. You can change this by modifying the script or using environment variables.


🐛 Troubleshooting

Error: "Model not found"

Solution: Verify that the model ID is correct. You can search for models on HuggingFace Hub:

# Verify the model exists
gpux pull openai/whisper-tiny --verbose

Error: "Failed to convert to ONNX"

Solution: Some models may require additional configuration:

  1. Verify you have all dependencies: uv sync
  2. Try with a smaller model first: openai/whisper-tiny
  3. Review logs with --verbose:

gpux pull openai/whisper-tiny --verbose

Error: "Audio file not found"

Solution: Verify the path to the audio file is correct:

# Verify the file exists
ls -la /path/to/audio.wav

# Use absolute path
gpux run openai/whisper-tiny \
  --input '{"audio": "/absolute/path/to/audio.wav"}'

Error: "Out of memory"

Solution: Try with a smaller model:

# Use whisper-tiny instead of whisper-large
gpux pull openai/whisper-tiny
gpux run openai/whisper-tiny \
  --input '{"audio": "my_audio.wav"}'

Error: "Preprocessing failed"

Solution: Verify the audio file is valid and the format is compatible (WAV, MP3, FLAC):

# Verify file format
file audio.wav

# Try with another audio file
gpux run openai/whisper-tiny \
  --input '{"audio": "other_audio.wav"}'

📚 Models by Task

Speech Recognition (ASR)

  • openai/whisper-tiny - Fast and lightweight
  • openai/whisper-base - Good balance
  • openai/whisper-small - Better accuracy
  • facebook/wav2vec2-base-960h - Fine-tuned on 960 hours of LibriSpeech
  • facebook/hubert-base-ls960 - HuBERT model

Speech Synthesis (TTS)

  • microsoft/speecht5_tts - Text-to-speech

Multilingual

  • facebook/seamless-m4t-medium - Multimodal multilingual
  • facebook/mms-1b-all - MMS 1B parameters

🎯 Next Steps

Once you've tested the models:

  1. Integrate into your application: Use the converted models in your code
  2. Optimize performance: Adjust configuration according to your hardware
  3. Test with real audio: Use real audio files from your use case
  4. Explore more models: Search for other models on HuggingFace Hub


💡 Tips

  • Start small: Try whisper-tiny first
  • Use real audio: Models work better with real audio than synthetic signals
  • Check logs: If something fails, review error messages for more details
  • Save results: The JSON file allows you to compare results between models

Enjoy testing modern audio models with GPUX! 🎵