Preprocessing Configuration

Data preprocessing settings in gpux.yml.


Overview

The preprocessing section defines data preprocessing pipelines.

preprocessing:
  # Text preprocessing fields
  text_tokenizer: string  # Tokenizer name for text models
  text_max_length: int   # Max tokenization length
  text_padding: string   # Padding strategy ("max_length", "longest", "do_not_pad")
  text_truncation: bool   # Enable truncation

  # Image preprocessing fields
  image_resize: [int, int]     # Image resize dimensions [height, width]
  image_normalize: string      # Normalization method ("imagenet", "custom", or "none")
  image_mean: [float, float, float]  # Custom mean values [R, G, B]
  image_std: [float, float, float]   # Custom std values [R, G, B]

  # Audio preprocessing fields
  audio_sample_rate: int       # Target sample rate for resampling (e.g., 16000)
  audio_feature_extraction: string  # Feature extraction method ("mel_spectrogram", "raw", etc.)
  audio_n_mels: int            # Number of mel filter banks for spectrogram (default: 80)
  audio_n_fft: int             # FFT window size for spectrogram
  audio_hop_length: int         # Hop length for spectrogram

  # Cache configuration fields
  cache_enabled: bool          # Enable/disable caching (default: true)
  cache_max_memory_mb: int     # Maximum memory usage in MB (default: 100)
  cache_max_entries: int       # Maximum number of cache entries (default: 100)
  cache_ttl_seconds: int | null # Time to live in seconds (null = no expiration)

Fields

text_tokenizer

Tokenizer name for text preprocessing.

  • Type: string
  • Required: No
  • Examples: HuggingFace tokenizer names
preprocessing:
  text_tokenizer: bert-base-uncased
  # Alternatives (set only one):
  # text_tokenizer: gpt2
  # text_tokenizer: distilbert-base-uncased

text_max_length

Maximum sequence length for tokenization.

  • Type: integer
  • Required: No
preprocessing:
  text_max_length: 128
  # Alternative:
  # text_max_length: 512

text_padding

Padding strategy for tokenization.

  • Type: string
  • Required: No
  • Values: "max_length", "longest", or "do_not_pad"
preprocessing:
  text_padding: max_length

text_truncation

Enable truncation for long sequences.

  • Type: boolean
  • Required: No
preprocessing:
  text_truncation: true
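
How text_max_length, text_padding, and text_truncation interact can be illustrated with a minimal pure-Python sketch. This is not the gpux or tokenizer implementation; the helper name pad_and_truncate and the pad ID of 0 are assumptions made for illustration only.

```python
def pad_and_truncate(token_ids, max_length, padding="max_length",
                     truncation=True, pad_id=0):
    """Return (input_ids, attention_mask) for a single sequence.

    Hypothetical helper that mimics the effect of the text_* settings.
    """
    ids = list(token_ids)
    # text_truncation: cut sequences that exceed text_max_length.
    if truncation and len(ids) > max_length:
        ids = ids[:max_length]
    mask = [1] * len(ids)
    # text_padding "max_length": fill up to text_max_length with pad tokens.
    # "longest" pads to the longest sequence in a batch, so it is a no-op
    # for a single sequence; "do_not_pad" never pads.
    if padding == "max_length" and len(ids) < max_length:
        pad = max_length - len(ids)
        ids += [pad_id] * pad
        mask += [0] * pad
    return ids, mask


ids, mask = pad_and_truncate([101, 7592, 2088, 102], max_length=6)
print(ids)   # [101, 7592, 2088, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

With a fixed model input shape such as [1, 128], combining max_length padding with truncation is what guarantees that every sequence has exactly 128 positions.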

image_resize

Image resize dimensions [height, width].

  • Type: list[int, int] or list[int] (for square)
  • Required: No
preprocessing:
  image_resize: [224, 224]     # Square: height 224, width 224
  # Alternatives (set only one):
  # image_resize: [640, 480]   # Rectangular: height 640, width 480
  # image_resize: [224]        # Single value: square (224x224)
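
The two accepted forms can be reduced to one (height, width) pair. The sketch below shows one way to do that; resolve_resize is a hypothetical name, and gpux may validate the value differently.

```python
def resolve_resize(spec):
    """Expand an image_resize value into a (height, width) tuple."""
    if len(spec) == 1:
        # Single value: use it for both dimensions (square resize).
        return (spec[0], spec[0])
    if len(spec) == 2:
        # Two values are read in [height, width] order.
        return (spec[0], spec[1])
    raise ValueError("image_resize expects one or two integers")


print(resolve_resize([224]))       # (224, 224)
print(resolve_resize([640, 480]))  # (640, 480)
```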

image_normalize

Normalization method for images.

  • Type: string
  • Required: No
  • Values: imagenet, custom, or none
preprocessing:
  image_normalize: imagenet    # ImageNet normalization (default)
  # Alternatives (set only one):
  # image_normalize: custom    # Custom normalization (requires image_mean/image_std)
  # image_normalize: none      # No mean/std normalization (pixel values are only divided by 255.0)

image_mean

Custom mean values for normalization [R, G, B].

  • Type: list[float, float, float]
  • Required: Only when image_normalize is custom
preprocessing:
  image_normalize: custom
  image_mean: [0.5, 0.5, 0.5]

image_std

Custom standard deviation values for normalization [R, G, B].

  • Type: list[float, float, float]
  • Required: Only when image_normalize is custom
preprocessing:
  image_normalize: custom
  image_std: [0.5, 0.5, 0.5]
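
The arithmetic behind the three image_normalize modes can be sketched as follows. This assumes the conventional order (scale to [0, 1], then subtract the mean and divide by the standard deviation per channel) and the widely used ImageNet statistics; normalize_pixel is a hypothetical helper, not gpux code.

```python
# Commonly used ImageNet channel statistics (R, G, B).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb, mean=None, std=None):
    """Scale an 8-bit RGB pixel to [0, 1], then optionally standardize it."""
    scaled = [c / 255.0 for c in rgb]
    if mean is None or std is None:
        # Corresponds to image_normalize: none (only the division by 255.0).
        return scaled
    # Corresponds to imagenet or custom: per-channel (value - mean) / std.
    return [(v - m) / s for v, m, s in zip(scaled, mean, std)]


# Custom mean/std of 0.5 maps the [0, 1] range onto [-1, 1].
print(normalize_pixel((255, 0, 255), mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)))
# [1.0, -1.0, 1.0]
```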

audio_sample_rate

Target sample rate for audio resampling.

  • Type: integer
  • Required: No
  • Default: 16000 (common for speech models)
preprocessing:
  audio_sample_rate: 16000
  # Alternative:
  # audio_sample_rate: 8000

audio_feature_extraction

Feature extraction method for audio preprocessing.

  • Type: string
  • Required: No
  • Values: "mel_spectrogram", "raw", or other feature types
  • Default: "raw" (raw audio waveform)
preprocessing:
  audio_feature_extraction: mel_spectrogram  # For Whisper-like models
  # Alternative:
  # audio_feature_extraction: raw            # For Wav2Vec-like models

audio_n_mels

Number of mel filter banks for mel spectrogram extraction.

  • Type: integer
  • Required: No (applies only when audio_feature_extraction is mel_spectrogram)
  • Default: 80 (Whisper default)
preprocessing:
  audio_feature_extraction: mel_spectrogram
  audio_n_mels: 80

audio_n_fft

FFT window size for mel spectrogram extraction.

  • Type: integer
  • Required: No (applies only when audio_feature_extraction is mel_spectrogram)
  • Default: 400 (Whisper default)
preprocessing:
  audio_feature_extraction: mel_spectrogram
  audio_n_fft: 400

audio_hop_length

Hop length (stride) for mel spectrogram extraction.

  • Type: integer
  • Required: No (applies only when audio_feature_extraction is mel_spectrogram)
  • Default: 160 (Whisper default)
preprocessing:
  audio_feature_extraction: mel_spectrogram
  audio_hop_length: 160
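
Both audio_n_fft and audio_hop_length are counted in samples, so their duration depends on audio_sample_rate. The short sketch below converts the defaults to milliseconds; the helper name samples_to_ms is hypothetical and not part of gpux.

```python
def samples_to_ms(n_samples: int, sample_rate: int) -> float:
    """Convert a length in samples to milliseconds at the given sample rate."""
    return 1000 * n_samples / sample_rate


sample_rate = 16000
print(samples_to_ms(400, sample_rate))  # 25.0 (analysis window, audio_n_fft)
print(samples_to_ms(160, sample_rate))  # 10.0 (step between frames, audio_hop_length)
```

If audio_sample_rate changes, these two values usually need to be rescaled to keep the same window and step durations.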

cache_enabled

Enable or disable preprocessing cache.

  • Type: boolean
  • Required: No
  • Default: true

When enabled, tokenizers, processed images, and audio features are cached to improve performance.

preprocessing:
  cache_enabled: true

cache_max_memory_mb

Maximum memory usage for the cache in megabytes.

  • Type: integer
  • Required: No
  • Default: 100
preprocessing:
  cache_max_memory_mb: 200

cache_max_entries

Maximum number of entries in the cache.

  • Type: integer
  • Required: No
  • Default: 100
preprocessing:
  cache_max_entries: 200

cache_ttl_seconds

Time to live for cache entries in seconds. Set to null to disable expiration.

  • Type: integer or null
  • Required: No
  • Default: null (no expiration)
preprocessing:
  cache_ttl_seconds: 3600    # 1 hour
  # Alternative:
  # cache_ttl_seconds: null  # No expiration
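
To make the semantics of cache_max_entries and cache_ttl_seconds concrete, here is a minimal sketch of a cache that honors both. It is not the gpux implementation: the class name SketchCache and the least-recently-used eviction policy are assumptions, and the memory limit (cache_max_memory_mb) is left out for brevity.

```python
import time
from collections import OrderedDict


class SketchCache:
    """Hypothetical LRU cache with an optional time to live per entry."""

    def __init__(self, max_entries=100, ttl_seconds=None, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds  # None means entries never expire
        self._clock = clock
        self._items = OrderedDict()     # key -> (stored_at, value)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.ttl_seconds is not None and self._clock() - stored_at > self.ttl_seconds:
            del self._items[key]        # entry has expired
            return None
        self._items.move_to_end(key)    # mark as most recently used
        return value

    def put(self, key, value):
        self._items[key] = (self._clock(), value)
        self._items.move_to_end(key)
        while len(self._items) > self.max_entries:
            self._items.popitem(last=False)  # evict the least recently used entry


cache = SketchCache(max_entries=2, ttl_seconds=3600)
cache.put("bert-base-uncased", "tokenizer object")
print(cache.get("bert-base-uncased"))  # tokenizer object
```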

Examples

Text Preprocessing

preprocessing:
  text_tokenizer: bert-base-uncased
  text_max_length: 128
  text_padding: max_length
  text_truncation: true

Image Preprocessing

preprocessing:
  image_resize: [224, 224]
  image_normalize: imagenet

Custom normalization:

preprocessing:
  image_resize: [224, 224]
  image_normalize: custom
  image_mean: [0.5, 0.5, 0.5]
  image_std: [0.5, 0.5, 0.5]

Audio Preprocessing

Raw audio (for Wav2Vec-like models):

preprocessing:
  audio_sample_rate: 16000
  audio_feature_extraction: raw

Mel spectrogram (for Whisper-like models):

preprocessing:
  audio_sample_rate: 16000
  audio_feature_extraction: mel_spectrogram
  audio_n_mels: 80
  audio_n_fft: 400
  audio_hop_length: 160

Complete Examples

BERT Sentiment Analysis

name: sentiment-analysis
model:
  source: ./bert.onnx

inputs:
  - name: input_ids
    type: int64
    shape: [1, 128]
  - name: attention_mask
    type: int64
    shape: [1, 128]

outputs:
  - name: logits
    type: float32
    shape: [1, 2]

preprocessing:
  text_tokenizer: bert-base-uncased
  text_max_length: 128
  text_padding: max_length
  text_truncation: true

Image Classification

name: image-classifier
model:
  source: ./resnet50.onnx

inputs:
  - name: image
    type: float32
    shape: [1, 3, 224, 224]

outputs:
  - name: probabilities
    type: float32
    shape: [1, 1000]

preprocessing:
  image_resize: [224, 224]
  image_normalize: imagenet

Whisper Speech Recognition

name: whisper-transcription
model:
  source: ./whisper.onnx

inputs:
  - name: mel_spectrogram
    type: float32
    shape: [1, 80, 3000]

outputs:
  - name: logits
    type: float32
    shape: [1, 50257]

preprocessing:
  audio_sample_rate: 16000
  audio_feature_extraction: mel_spectrogram
  audio_n_mels: 80
  audio_n_fft: 400
  audio_hop_length: 160
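
The input shape [1, 80, 3000] is consistent with these preprocessing values if each input covers 30 seconds of audio (the window length Whisper works with) and one spectrogram frame is produced per hop. The sketch below checks that arithmetic; mel_shape is a hypothetical helper, and real extractors may add or drop a trailing frame.

```python
def mel_shape(duration_s: int, sample_rate: int, n_mels: int, hop_length: int):
    """Return (n_mels, n_frames), assuming exactly one frame per hop."""
    n_samples = duration_s * sample_rate
    return (n_mels, n_samples // hop_length)


# 30 s * 16000 Hz = 480000 samples; 480000 / 160 = 3000 frames.
print(mel_shape(30, 16000, 80, 160))  # (80, 3000)
```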

Using Human-Friendly Image Inputs

With image preprocessing configured, you can pass an image as a file path, a URL, or a base64 data URI:

# From file path
gpux run image-classifier --input '{"image": "/path/to/image.jpg"}'

# From URL
gpux run image-classifier --input '{"image": "https://example.com/image.png"}'

# From base64 data URI
gpux run image-classifier --input '{"image": "..."}'

The preprocessor automatically:

  • Loads the image from the specified source
  • Resizes to the target dimensions (from config or model input shape)
  • Normalizes using ImageNet or custom parameters
  • Converts to the correct tensor format
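
A base64 data URI for the --input payload can be built with the Python standard library. This is a sketch under assumptions: to_data_uri is a hypothetical helper, the MIME type is guessed from the file name, and the bytes below are only the PNG file signature standing in for real image data.

```python
import base64
import json
import mimetypes


def to_data_uri(data: bytes, filename: str) -> str:
    """Encode raw file bytes as a data URI, guessing the MIME type from the name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        mime = "application/octet-stream"
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# In practice, read the bytes from a file: open(path, "rb").read()
png_signature = b"\x89PNG\r\n\x1a\n"
payload = json.dumps({"image": to_data_uri(png_signature, "example.png")})
print(payload)
```

The printed JSON string is what would be passed as the value of --input.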

Using Human-Friendly Audio Inputs

With audio preprocessing configured, you can pass audio as a file path, a URL, or a base64 data URI:

# From file path
gpux run whisper-transcription --input '{"audio": "/path/to/audio.wav"}'

# From URL
gpux run whisper-transcription --input '{"audio": "https://example.com/audio.mp3"}'

# From base64 data URI
gpux run whisper-transcription --input '{"audio": "data:audio/wav;base64,UklGRiQAAABXQVZFZm10..."}'

The preprocessor automatically:

  • Loads the audio from the specified source (WAV, MP3, FLAC formats supported)
  • Resamples to the target sample rate (from config or model requirements)
  • Extracts features (mel spectrogram for Whisper, raw audio for others)
  • Converts to the correct tensor format
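
The three input forms shown in the commands (file path, URL, data URI) can be told apart by their prefix. The sketch below shows one plausible way to do so; classify_source is a hypothetical name, and gpux may detect sources differently.

```python
def classify_source(value: str) -> str:
    """Classify an input string as a data URI, a URL, or a local file path."""
    if value.startswith("data:"):
        return "data_uri"
    if value.startswith(("http://", "https://")):
        return "url"
    # Anything else is treated as a path on the local file system.
    return "path"


print(classify_source("/path/to/audio.wav"))             # path
print(classify_source("https://example.com/audio.mp3"))  # url
print(classify_source("data:audio/wav;base64,UklGRiQ"))  # data_uri
```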

Multimodal Preprocessing (Text + Image)

For models that accept multiple input types:

preprocessing:
  text_tokenizer: openai/clip-vit-base-patch32
  text_max_length: 77
  image_resize: [224, 224]
  image_normalize: imagenet
  cache_enabled: true
  cache_max_memory_mb: 200

Usage:

# Text + Image input
gpux run clip-model \
  --input '{"text": "a cat", "image": "/path/to/cat.jpg"}'

The multimodal preprocessor automatically:

  • Detects both text and image keys
  • Routes each to the appropriate preprocessor
  • Combines outputs into a single dictionary
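
The routing behavior can be pictured as a dispatch on input keys. The following sketch is illustrative only: route_inputs, the stand-in preprocessors, and the output names input_ids and pixel_values are assumptions, not gpux internals.

```python
def route_inputs(inputs, preprocessors):
    """Send each input key to its preprocessor and merge the results."""
    merged = {}
    for key, value in inputs.items():
        if key not in preprocessors:
            raise KeyError(f"no preprocessor registered for input {key!r}")
        merged.update(preprocessors[key](value))
    return merged


# Stand-in preprocessors that return placeholder values instead of real tensors.
preprocessors = {
    "text": lambda s: {"input_ids": [len(s)]},
    "image": lambda p: {"pixel_values": p},
}
result = route_inputs({"text": "a cat", "image": "/path/to/cat.jpg"}, preprocessors)
print(sorted(result))  # ['input_ids', 'pixel_values']
```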


See Also