Preprocessing Configuration¶

Data preprocessing settings in gpux.yml.

Overview¶

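The preprocessing section of gpux.yml configures how raw text, image, and audio inputs are converted into model tensors, and how preprocessed results are cached. The schema below lists every field with its type.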

preprocessing:
  # Text preprocessing fields
  text_tokenizer: string  # Tokenizer name for text models
  text_max_length: int   # Max tokenization length
  text_padding: string   # Padding strategy ("max_length", "longest", "do_not_pad")
  text_truncation: bool   # Enable truncation

  # Image preprocessing fields
  image_resize: [int, int]     # Image resize dimensions [height, width]
  image_normalize: string      # Normalization method ("imagenet", "custom", or "none")
  image_mean: [float, float, float]  # Custom mean values [R, G, B]
  image_std: [float, float, float]   # Custom std values [R, G, B]

  # Audio preprocessing fields
  audio_sample_rate: int       # Target sample rate for resampling (e.g., 16000)
  audio_feature_extraction: string  # Feature extraction method ("mel_spectrogram", "raw", etc.)
  audio_n_mels: int            # Number of mel filter banks for spectrogram (default: 80)
  audio_n_fft: int             # FFT window size for spectrogram
  audio_hop_length: int         # Hop length for spectrogram

  # Cache configuration fields
  cache_enabled: bool          # Enable/disable caching (default: true)
  cache_max_memory_mb: int     # Maximum memory usage in MB (default: 100)
  cache_max_entries: int       # Maximum number of cache entries (default: 100)
  cache_ttl_seconds: int | null # Time to live in seconds (null = no expiration)

Fields¶

`text_tokenizer`¶

Tokenizer name for text preprocessing.

Type: string
Required: No
Examples: Hugging Face tokenizer names

preprocessing:
  text_tokenizer: bert-base-uncased
  # Alternatives (set only one value):
  # text_tokenizer: gpt2
  # text_tokenizer: distilbert-base-uncased

`text_max_length`¶

Maximum sequence length for tokenization.

Type: integer
Required: No

preprocessing:
  text_max_length: 128
  # Alternative (set only one value):
  # text_max_length: 512

`text_padding`¶

Padding strategy for tokenization.

Type: string
Required: No
Values: "max_length", "longest", or "do_not_pad"

preprocessing:
  text_padding: max_length

`text_truncation`¶

Enable truncation for long sequences.

Type: boolean
Required: No

preprocessing:
  text_truncation: true
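
To make the interaction of text_max_length, text_padding, and text_truncation concrete, the following Python sketch applies the three settings to a single list of token IDs. The function name pad_and_truncate and the pad ID of 0 are illustrative assumptions, not part of gpux; a real tokenizer also inserts special tokens and pads "longest" relative to the other sequences in a batch.

```python
def pad_and_truncate(token_ids, max_length, padding="max_length",
                     truncation=True, pad_id=0):
    """Return (input_ids, attention_mask) for one sequence.

    Hypothetical helper: pad_id=0 is an assumption, real tokenizers
    define their own padding token.
    """
    ids = list(token_ids)

    # text_truncation: cut sequences that exceed text_max_length.
    if truncation and len(ids) > max_length:
        ids = ids[:max_length]

    # Real tokens get mask 1, padding gets mask 0.
    mask = [1] * len(ids)

    # text_padding: "max_length" always pads up to max_length.
    # "longest" and "do_not_pad" add nothing for a single sequence.
    if padding == "max_length" and len(ids) < max_length:
        missing = max_length - len(ids)
        ids += [pad_id] * missing
        mask += [0] * missing

    return ids, mask


ids, mask = pad_and_truncate([101, 7592, 2088, 102], max_length=8)
print(ids)   # [101, 7592, 2088, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

With text_padding: max_length every sequence has the same length, which is what a fixed input shape such as [1, 128] requires.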

`image_resize`¶

Image resize dimensions [height, width].

Type: list[int, int], or list[int] for a square resize
Required: No

preprocessing:
  image_resize: [224, 224]     # Square resize
  # Alternatives (set only one value):
  # image_resize: [640, 480]   # Rectangle resize
  # image_resize: [224]        # Square (224x224)

`image_normalize`¶

Normalization method for images.

Type: string
Required: No
Values: imagenet, custom, or none

preprocessing:
  image_normalize: imagenet    # ImageNet normalization (default)
  # Alternatives (set only one value):
  # image_normalize: custom    # Custom normalization (requires image_mean/image_std)
  # image_normalize: none      # No normalization (just divide by 255.0)

`image_mean`¶

Custom mean values for normalization [R, G, B].

Type: list[float, float, float]
Required: Only when image_normalize is custom

preprocessing:
  image_normalize: custom
  image_mean: [0.5, 0.5, 0.5]

`image_std`¶

Custom standard deviation values for normalization [R, G, B].

Type: list[float, float, float]
Required: Only when image_normalize is custom

preprocessing:
  image_normalize: custom
  image_std: [0.5, 0.5, 0.5]
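
The three image_normalize modes can be sketched per pixel as follows. This is a minimal illustration, not gpux's implementation: the function name normalize_pixel is hypothetical, and the ImageNet constants are the commonly used channel statistics, assumed here rather than taken from gpux.

```python
# Commonly used ImageNet channel statistics (R, G, B); assumed, not read from gpux.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb, method="imagenet", mean=None, std=None):
    """Scale an 8-bit (R, G, B) pixel to [0, 1], then optionally standardize it."""
    scaled = [channel / 255.0 for channel in rgb]

    if method == "none":
        return scaled  # only the division by 255.0
    if method == "imagenet":
        mean, std = IMAGENET_MEAN, IMAGENET_STD
    elif method == "custom":
        if mean is None or std is None:
            raise ValueError("image_normalize: custom requires image_mean and image_std")
    else:
        raise ValueError(f"unknown image_normalize value: {method!r}")

    # Standardize each channel: (value - mean) / std.
    return [(c - m) / s for c, m, s in zip(scaled, mean, std)]


# With mean = std = 0.5, the range [0, 255] maps to [-1, 1].
print(normalize_pixel((255, 0, 255), method="custom",
                      mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]))
# [1.0, -1.0, 1.0]
```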

`audio_sample_rate`¶

Target sample rate for audio resampling.

Type: integer
Required: No
Default: 16000 (common for speech models)

preprocessing:
  audio_sample_rate: 16000
  # Alternative (set only one value):
  # audio_sample_rate: 8000

`audio_feature_extraction`¶

Feature extraction method for audio preprocessing.

Type: string
Required: No
Values: "mel_spectrogram", "raw", or other feature types
Default: "raw" (raw audio waveform)

preprocessing:
  audio_feature_extraction: mel_spectrogram  # For Whisper-like models
  # Alternative (set only one value):
  # audio_feature_extraction: raw            # For Wav2Vec-like models

`audio_n_mels`¶

Number of mel filter banks for mel spectrogram extraction.

Type: integer
Required: No (used only when audio_feature_extraction is mel_spectrogram)
Default: 80 (Whisper default)

preprocessing:
  audio_feature_extraction: mel_spectrogram
  audio_n_mels: 80

`audio_n_fft`¶

FFT window size for mel spectrogram extraction.

Type: integer
Required: No (used only when audio_feature_extraction is mel_spectrogram)
Default: 400 (Whisper default)

preprocessing:
  audio_feature_extraction: mel_spectrogram
  audio_n_fft: 400

`audio_hop_length`¶

Hop length (stride) for mel spectrogram extraction.

Type: integer
Required: No (used only when audio_feature_extraction is mel_spectrogram)
Default: 160 (Whisper default)

preprocessing:
  audio_feature_extraction: mel_spectrogram
  audio_hop_length: 160

`cache_enabled`¶

Enable or disable preprocessing cache.

Type: boolean
Required: No
Default: true

When enabled, tokenizers, processed images, and audio features are cached to improve performance.

preprocessing:
  cache_enabled: true

`cache_max_memory_mb`¶

Maximum memory usage for the cache in megabytes.

Type: integer
Required: No
Default: 100

preprocessing:
  cache_max_memory_mb: 200

`cache_max_entries`¶

Maximum number of entries in the cache.

Type: integer
Required: No
Default: 100

preprocessing:
  cache_max_entries: 200

`cache_ttl_seconds`¶

Time to live for cache entries in seconds. Set to null to disable expiration.

Type: integer or null
Required: No
Default: null (no expiration)

preprocessing:
  cache_ttl_seconds: 3600    # 1 hour
  # Alternative (set only one value):
  # cache_ttl_seconds: null  # No expiration
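
The sketch below shows one plausible way cache_max_entries and cache_ttl_seconds could interact. It is an illustration under stated assumptions, not gpux's cache: the class name PreprocessCache is hypothetical, least-recently-used eviction is assumed (the documentation does not name the eviction policy), and the cache_max_memory_mb limit is left out for brevity.

```python
import time
from collections import OrderedDict


class PreprocessCache:
    """Hypothetical cache bounded by entry count, with an optional TTL."""

    def __init__(self, max_entries=100, ttl_seconds=None, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds  # None means entries never expire
        self._clock = clock
        self._items = OrderedDict()     # key -> (stored_at, value)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.ttl_seconds is not None and self._clock() - stored_at > self.ttl_seconds:
            del self._items[key]        # entry outlived its TTL
            return None
        self._items.move_to_end(key)    # mark as most recently used
        return value

    def put(self, key, value):
        self._items[key] = (self._clock(), value)
        self._items.move_to_end(key)
        while len(self._items) > self.max_entries:
            self._items.popitem(last=False)  # evict the least recently used entry


cache = PreprocessCache(max_entries=2, ttl_seconds=3600)
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)          # exceeds max_entries, so "a" is evicted
print(cache.get("a"))      # None
print(cache.get("c"))      # 3
```

Passing the clock as a parameter keeps expiration testable without sleeping.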

Examples¶

Text Preprocessing¶

preprocessing:
  text_tokenizer: bert-base-uncased
  text_max_length: 128
  text_padding: max_length
  text_truncation: true

Image Preprocessing¶

preprocessing:
  image_resize: [224, 224]
  image_normalize: imagenet

Custom normalization:

preprocessing:
  image_resize: [224, 224]
  image_normalize: custom
  image_mean: [0.5, 0.5, 0.5]
  image_std: [0.5, 0.5, 0.5]

Audio Preprocessing¶

Raw audio (for Wav2Vec-like models):

preprocessing:
  audio_sample_rate: 16000
  audio_feature_extraction: raw

Mel spectrogram (for Whisper-like models):

preprocessing:
  audio_sample_rate: 16000
  audio_feature_extraction: mel_spectrogram
  audio_n_mels: 80
  audio_n_fft: 400
  audio_hop_length: 160

Complete Examples¶

BERT Sentiment Analysis¶

name: sentiment-analysis
model:
  source: ./bert.onnx

inputs:
  - name: input_ids
    type: int64
    shape: [1, 128]
  - name: attention_mask
    type: int64
    shape: [1, 128]

outputs:
  - name: logits
    type: float32
    shape: [1, 2]

preprocessing:
  text_tokenizer: bert-base-uncased
  text_max_length: 128
  text_padding: max_length
  text_truncation: true

Image Classification¶

name: image-classifier
model:
  source: ./resnet50.onnx

inputs:
  - name: image
    type: float32
    shape: [1, 3, 224, 224]

outputs:
  - name: probabilities
    type: float32
    shape: [1, 1000]

preprocessing:
  image_resize: [224, 224]
  image_normalize: imagenet

Whisper Speech Recognition¶

name: whisper-transcription
model:
  source: ./whisper.onnx

inputs:
  - name: mel_spectrogram
    type: float32
    shape: [1, 80, 3000]

outputs:
  - name: logits
    type: float32
    shape: [1, 50257]

preprocessing:
  audio_sample_rate: 16000
  audio_feature_extraction: mel_spectrogram
  audio_n_mels: 80
  audio_n_fft: 400
  audio_hop_length: 160
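
The input shape [1, 80, 3000] in the example above follows from the audio settings: 80 mel bands, and one frame per hop over the clip. The arithmetic below is a sketch that assumes a 30-second clip and exactly one frame per hop; the exact frame count of a real STFT depends on padding and centering. The helper names are hypothetical.

```python
def mel_frame_count(duration_seconds, sample_rate=16000, hop_length=160):
    """Approximate number of spectrogram frames, assuming one frame per hop."""
    num_samples = int(duration_seconds * sample_rate)
    return num_samples // hop_length


def to_milliseconds(num_samples, sample_rate=16000):
    """Convert a length in samples to milliseconds."""
    return 1000.0 * num_samples / sample_rate


print(mel_frame_count(30))    # 3000 frames for an assumed 30-second clip
print(to_milliseconds(400))   # 25.0 ms analysis window (audio_n_fft)
print(to_milliseconds(160))   # 10.0 ms step between frames (audio_hop_length)
```

Changing audio_sample_rate or audio_hop_length changes the frame count, so the time dimension of the model's input shape has to change with them.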

Using Human-Friendly Image Inputs¶

With image preprocessing configured, you can use simple image inputs:

# From file path
gpux run image-classifier --input '{"image": "/path/to/image.jpg"}'

# From URL
gpux run image-classifier --input '{"image": "https://example.com/image.png"}'

# From base64 data URI
gpux run image-classifier --input '{"image": "data:image/jpeg;base64,/9j/4AAQ..."}'

The preprocessor automatically:

- Loads the image from the specified source
- Resizes to the target dimensions (from config or model input shape)
- Normalizes using ImageNet or custom parameters
- Converts to the correct tensor format
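
The three input forms shown above (file path, URL, base64 data URI) can be told apart from the string alone. The following sketch shows one way to do that; the function names classify_source and decode_data_uri are hypothetical, and gpux's own detection logic may differ.

```python
import base64
from urllib.parse import urlparse


def classify_source(value):
    """Return "data_uri", "url", or "path" for an input string."""
    if value.startswith("data:"):
        return "data_uri"
    if urlparse(value).scheme in ("http", "https"):
        return "url"
    return "path"  # anything else is treated as a local file path


def decode_data_uri(value):
    """Split a base64 data URI into (media_type, raw_bytes)."""
    header, _, payload = value.partition(",")
    if ";base64" not in header:
        raise ValueError("only base64 data URIs are handled in this sketch")
    media_type = header[len("data:"):].split(";")[0]
    return media_type, base64.b64decode(payload)


print(classify_source("/path/to/image.jpg"))             # path
print(classify_source("https://example.com/image.png"))  # url

# Build a small data URI from known bytes and decode it again.
uri = "data:image/png;base64," + base64.b64encode(b"abc").decode("ascii")
print(classify_source(uri))   # data_uri
print(decode_data_uri(uri))   # ('image/png', b'abc')
```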

Using Human-Friendly Audio Inputs¶

With audio preprocessing configured, you can use simple audio inputs:

# From file path
gpux run whisper-model --input '{"audio": "/path/to/audio.wav"}'

# From URL
gpux run whisper-model --input '{"audio": "https://example.com/audio.mp3"}'

# From base64 data URI
gpux run whisper-model --input '{"audio": "data:audio/wav;base64,UklGRiQAAABXQVZFZm10..."}'

The preprocessor automatically:

- Loads the audio from the specified source (WAV, MP3, FLAC formats supported)
- Resamples to the target sample rate (from config or model requirements)
- Extracts features (mel spectrogram for Whisper, raw audio for others)
- Converts to the correct tensor format

Multimodal Preprocessing (Text + Image)¶

For models that accept multiple input types:

preprocessing:
  text_tokenizer: openai/clip-vit-base-patch32
  text_max_length: 77
  image_resize: [224, 224]
  image_normalize: imagenet
  cache_enabled: true
  cache_max_memory_mb: 200

Usage:

# Text + Image input
gpux run clip-model \
  --input '{"text": "a cat", "image": "/path/to/cat.jpg"}'

The multimodal preprocessor automatically:

- Detects both text and image keys
- Routes each to the appropriate preprocessor
- Combines outputs into a single dictionary
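
The detect, route, and combine steps can be sketched with a plain dictionary of handlers. Everything named here is hypothetical: route_inputs, the stand-in preprocessors, and the output keys (input_ids, attention_mask, pixel_values) are assumptions for illustration, and passing unknown keys through unchanged is a design choice of this sketch, not documented gpux behavior.

```python
def route_inputs(inputs, preprocessors):
    """Send each recognized key to its preprocessor and merge the outputs."""
    merged = {}
    for key, value in inputs.items():
        handler = preprocessors.get(key)
        if handler is None:
            merged[key] = value            # no preprocessor: pass through unchanged
        else:
            merged.update(handler(value))  # each handler returns a dict of outputs
    return merged


def fake_text_preprocessor(text):
    """Stand-in for tokenization: one ID per whitespace-separated word."""
    tokens = text.split()
    return {"input_ids": list(range(len(tokens))),
            "attention_mask": [1] * len(tokens)}


def fake_image_preprocessor(path):
    """Stand-in for load, resize, and normalize."""
    return {"pixel_values": f"<tensor for {path}>"}


preprocessors = {"text": fake_text_preprocessor, "image": fake_image_preprocessor}
result = route_inputs({"text": "a cat", "image": "/path/to/cat.jpg"}, preprocessors)
print(sorted(result))  # ['attention_mask', 'input_ids', 'pixel_values']
```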
