Preprocessing Configuration¶
Data preprocessing settings in gpux.yml.
Overview¶
The preprocessing section defines data preprocessing pipelines.
preprocessing:
  # Text preprocessing fields
  text_tokenizer: string # Tokenizer name for text models
  text_max_length: int # Maximum tokenized sequence length
  text_padding: string # Padding strategy ("max_length", "longest", or "do_not_pad")
  text_truncation: bool # Enable truncation
  # Image preprocessing fields
  image_resize: [int, int] # Image resize dimensions [height, width]
  image_normalize: string # Normalization method ("imagenet", "custom", or "none")
  image_mean: [float, float, float] # Custom mean values [R, G, B]
  image_std: [float, float, float] # Custom std values [R, G, B]
  # Audio preprocessing fields
  audio_sample_rate: int # Target sample rate for resampling (e.g., 16000)
  audio_feature_extraction: string # Feature extraction method ("mel_spectrogram", "raw", etc.)
  audio_n_mels: int # Number of mel filter banks for spectrogram (default: 80)
  audio_n_fft: int # FFT window size for spectrogram (default: 400)
  audio_hop_length: int # Hop length for spectrogram (default: 160)
  # Cache configuration fields
  cache_enabled: bool # Enable/disable caching (default: true)
  cache_max_memory_mb: int # Maximum memory usage in MB (default: 100)
  cache_max_entries: int # Maximum number of cache entries (default: 100)
  cache_ttl_seconds: int | null # Time to live in seconds (null = no expiration)
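To make the schema and its documented defaults easier to see at a glance, the sketch below mirrors the section as a Python dataclass. The class name `PreprocessingConfig` and the helper `load_preprocessing` are hypothetical and are not part of GPUX; the defaults are the ones listed on this page, and fields with no documented default start as `None`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PreprocessingConfig:
    """Hypothetical mirror of the `preprocessing` section (not GPUX's own class)."""

    # Text fields: no defaults are documented, so they start unset
    text_tokenizer: Optional[str] = None
    text_max_length: Optional[int] = None
    text_padding: Optional[str] = None
    text_truncation: Optional[bool] = None
    # Image fields
    image_resize: Optional[List[int]] = None
    image_normalize: str = "imagenet"
    image_mean: Optional[List[float]] = None
    image_std: Optional[List[float]] = None
    # Audio fields
    audio_sample_rate: int = 16000
    audio_feature_extraction: str = "raw"
    audio_n_mels: int = 80
    audio_n_fft: int = 400
    audio_hop_length: int = 160
    # Cache fields
    cache_enabled: bool = True
    cache_max_memory_mb: int = 100
    cache_max_entries: int = 100
    cache_ttl_seconds: Optional[int] = None


def load_preprocessing(section: dict) -> PreprocessingConfig:
    """Build a config from a parsed mapping; unknown keys raise TypeError."""
    return PreprocessingConfig(**section)
```

Passing the parsed `preprocessing` mapping to `load_preprocessing` fills in every omitted field with its documented default and rejects misspelled keys early.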
Fields¶
text_tokenizer¶
Tokenizer name for text preprocessing.
- Type: string
- Required: No
- Examples: HuggingFace tokenizer names
preprocessing:
  text_tokenizer: bert-base-uncased
  # Alternatives (set only one value):
  # text_tokenizer: gpt2
  # text_tokenizer: distilbert-base-uncased
text_max_length¶
Maximum sequence length for tokenization.
- Type: integer
- Required: No
text_padding¶
Padding strategy for tokenization.
- Type: string
- Required: No
- Values: "max_length", "longest", or "do_not_pad"
text_truncation¶
Enable truncation for long sequences.
- Type: boolean
- Required: No
image_resize¶
Image resize dimensions [height, width].
- Type: list[int, int] or list[int] (for square)
- Required: No
preprocessing:
  image_resize: [224, 224] # Square resize
  # Alternatives (set only one value):
  # image_resize: [640, 480] # Rectangular resize
  # image_resize: [224] # Shorthand for a 224x224 square
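The one-element shorthand can be expanded as in the following sketch. The function name `parse_image_resize` is hypothetical; the only behavior taken from this page is that a single value means a square and that two values are read as [height, width].

```python
def parse_image_resize(value):
    """Return (height, width) from a one- or two-element list."""
    if len(value) == 1:
        side = value[0]
        return (side, side)  # [224] is shorthand for 224x224
    if len(value) == 2:
        return (value[0], value[1])  # documented order: [height, width]
    raise ValueError(f"image_resize expects 1 or 2 integers, got {value!r}")
```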
image_normalize¶
Normalization method for images.
- Type: string
- Required: No
- Values: imagenet, custom, or none
preprocessing:
  image_normalize: imagenet # ImageNet normalization (default)
  # Alternatives (set only one value):
  # image_normalize: custom # Custom normalization (requires image_mean/image_std)
  # image_normalize: none # No normalization (just divide by 255.0)
image_mean¶
Custom mean values for normalization [R, G, B].
- Type: list[float, float, float]
- Required: No (used only when image_normalize: custom)
image_std¶
Custom standard deviation values for normalization [R, G, B].
- Type: list[float, float, float]
- Required: No (used only when image_normalize: custom)
audio_sample_rate¶
Target sample rate for audio resampling.
- Type: integer
- Required: No
- Default: 16000 (common for speech models)
audio_feature_extraction¶
Feature extraction method for audio preprocessing.
- Type: string
- Required: No
- Values: "mel_spectrogram", "raw", or other feature types
- Default: "raw" (raw audio waveform)
preprocessing:
  audio_feature_extraction: mel_spectrogram # For Whisper-like models
  # Alternative (set only one value):
  # audio_feature_extraction: raw # For Wav2Vec-like models
audio_n_mels¶
Number of mel filter banks for mel spectrogram extraction.
- Type: integer
- Required: No (used only when audio_feature_extraction: mel_spectrogram)
- Default: 80 (Whisper default)
audio_n_fft¶
FFT window size for mel spectrogram extraction.
- Type: integer
- Required: No (used only when audio_feature_extraction: mel_spectrogram)
- Default: 400 (Whisper default)
audio_hop_length¶
Hop length (stride) for mel spectrogram extraction.
- Type: integer
- Required: No (used only when audio_feature_extraction: mel_spectrogram)
- Default: 160 (Whisper default)
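To relate audio_n_fft and audio_hop_length to time, it can help to convert them from samples to milliseconds. The helper below is an illustrative sketch (the name `frame_timing_ms` is hypothetical); it shows that the Whisper defaults correspond to a 25 ms analysis window advanced in 10 ms steps at 16 kHz.

```python
def frame_timing_ms(sample_rate, n_fft, hop_length):
    """Convert the FFT window size and hop length from samples to milliseconds."""
    window_ms = 1000 * n_fft / sample_rate
    hop_ms = 1000 * hop_length / sample_rate
    return window_ms, hop_ms


# Whisper defaults at a 16 kHz sample rate
window_ms, hop_ms = frame_timing_ms(sample_rate=16000, n_fft=400, hop_length=160)
```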
cache_enabled¶
Enable or disable preprocessing cache.
- Type: boolean
- Required: No
- Default: true
When enabled, tokenizers, processed images, and audio features are cached to improve performance.
cache_max_memory_mb¶
Maximum memory usage for the cache in megabytes.
- Type: integer
- Required: No
- Default: 100
cache_max_entries¶
Maximum number of entries in the cache.
- Type: integer
- Required: No
- Default: 100
cache_ttl_seconds¶
Time to live for cache entries in seconds. Set to null to disable expiration.
- Type: integer or null
- Required: No
- Default: null (no expiration)
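How cache_max_entries and cache_ttl_seconds might interact can be sketched with a small in-memory cache. This is not GPUX's implementation: the class `TTLCache` is hypothetical, least-recently-used eviction is an assumption (this page does not state the eviction policy), and the memory limit from cache_max_memory_mb is left out for brevity.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Hypothetical cache with an entry limit and optional expiry (LRU is assumed)."""

    def __init__(self, max_entries=100, ttl_seconds=None, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds  # None means entries never expire
        self.clock = clock
        self._items = OrderedDict()  # key -> (stored_at, value)

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        stored_at, value = entry
        expired = (
            self.ttl_seconds is not None
            and self.clock() - stored_at >= self.ttl_seconds
        )
        if expired:
            del self._items[key]
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._items[key] = (self.clock(), value)
        self._items.move_to_end(key)
        while len(self._items) > self.max_entries:
            self._items.popitem(last=False)  # evict the least recently used entry
```

Injecting the clock keeps the expiry logic testable without sleeping; with `ttl_seconds=None` the expiry check is skipped entirely, which matches the documented "no expiration" default.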
Examples¶
Text Preprocessing¶
preprocessing:
  text_tokenizer: bert-base-uncased
  text_max_length: 128
  text_padding: max_length
  text_truncation: true
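The effect of text_max_length, text_padding, and text_truncation on a batch can be illustrated without loading a real tokenizer. The sketch below works on already-tokenized ID lists; the function `pad_and_truncate`, the pad ID of 0, and the sample IDs are illustrative assumptions. It is also a simplification: real tokenizers usually keep the trailing special token when truncating.

```python
def pad_and_truncate(batch, max_length, padding="max_length", truncation=True, pad_id=0):
    """Return (input_ids, attention_mask) for a batch of token-ID lists."""
    if truncation:
        batch = [ids[:max_length] for ids in batch]  # cut sequences that are too long
    if padding == "max_length":
        target = max_length
    elif padding == "longest":
        target = max(len(ids) for ids in batch)
    elif padding == "do_not_pad":
        target = None
    else:
        raise ValueError(f"unknown padding strategy: {padding!r}")

    input_ids, attention_mask = [], []
    for ids in batch:
        pad = 0 if target is None else max(target - len(ids), 0)
        input_ids.append(list(ids) + [pad_id] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)  # 0 marks padding
    return input_ids, attention_mask


ids, mask = pad_and_truncate([[101, 7592, 102], [101, 7592, 2088, 999, 102]], max_length=4)
```

With `padding="max_length"` and truncation enabled, every row has exactly `max_length` entries, which is what a model with a fixed input shape such as [1, 128] needs.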
Image Preprocessing¶
Custom normalization:
preprocessing:
  image_resize: [224, 224]
  image_normalize: custom
  image_mean: [0.5, 0.5, 0.5]
  image_std: [0.5, 0.5, 0.5]
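What these values do to a pixel can be worked through in a few lines. The sketch assumes the common convention of scaling 8-bit values by 255.0 and then applying (value - mean) / std per channel; the function `normalize_pixel` is hypothetical, and the ImageNet constants are the widely used statistics rather than values confirmed by this page.

```python
# Widely used ImageNet channel statistics (R, G, B); assumed, not taken from this page
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb, method="imagenet", mean=None, std=None):
    """Scale an 8-bit RGB pixel to [0, 1], then apply per-channel normalization."""
    scaled = [channel / 255.0 for channel in rgb]
    if method == "none":
        return tuple(scaled)  # only the division by 255.0
    if method == "imagenet":
        mean, std = IMAGENET_MEAN, IMAGENET_STD
    elif method == "custom":
        if mean is None or std is None:
            raise ValueError("custom normalization needs image_mean and image_std")
    else:
        raise ValueError(f"unknown normalization method: {method!r}")
    return tuple((v - m) / s for v, m, s in zip(scaled, mean, std))


# A mean and std of 0.5 map the 0..255 range onto [-1, 1]
result = normalize_pixel((255, 0, 127.5), "custom", [0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
```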
Audio Preprocessing¶
Raw audio (for Wav2Vec-like models):
preprocessing:
  audio_sample_rate: 16000
  audio_feature_extraction: raw
Mel spectrogram (for Whisper-like models):
preprocessing:
  audio_sample_rate: 16000
  audio_feature_extraction: mel_spectrogram
  audio_n_mels: 80
  audio_n_fft: 400
  audio_hop_length: 160
Complete Examples¶
BERT Sentiment Analysis¶
name: sentiment-analysis
model:
  source: ./bert.onnx
inputs:
  - name: input_ids
    type: int64
    shape: [1, 128]
  - name: attention_mask
    type: int64
    shape: [1, 128]
outputs:
  - name: logits
    type: float32
    shape: [1, 2]
preprocessing:
  text_tokenizer: bert-base-uncased
  text_max_length: 128
  text_padding: max_length
  text_truncation: true
Image Classification¶
name: image-classifier
model:
  source: ./resnet50.onnx
inputs:
  - name: image
    type: float32
    shape: [1, 3, 224, 224]
outputs:
  - name: probabilities
    type: float32
    shape: [1, 1000]
preprocessing:
  image_resize: [224, 224]
  image_normalize: imagenet
Whisper Speech Recognition¶
name: whisper-transcription
model:
  source: ./whisper.onnx
inputs:
  - name: mel_spectrogram
    type: float32
    shape: [1, 80, 3000]
outputs:
  - name: logits
    type: float32
    shape: [1, 50257]
preprocessing:
  audio_sample_rate: 16000
  audio_feature_extraction: mel_spectrogram
  audio_n_mels: 80
  audio_n_fft: 400
  audio_hop_length: 160
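The input shape [1, 80, 3000] follows from the audio settings. As a rough sketch, assuming the audio is padded or trimmed to Whisper's 30-second window and that one spectrogram frame is produced per hop (the Whisper reference implementation drops the final STFT frame), the frame count is the number of samples divided by the hop length. The function name `mel_input_shape` is hypothetical.

```python
def mel_input_shape(duration_s, sample_rate=16000, n_mels=80, hop_length=160):
    """Estimate the (batch, n_mels, frames) shape of a mel spectrogram input."""
    num_samples = int(duration_s * sample_rate)  # 30 s * 16000 Hz = 480000 samples
    num_frames = num_samples // hop_length  # one frame per hop
    return (1, n_mels, num_frames)


shape = mel_input_shape(30.0)
```

If audio_sample_rate or audio_hop_length is changed, the model's declared input shape has to change with it, since the frame dimension depends on both.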
Using Human-Friendly Image Inputs¶
With image preprocessing configured, you can use simple image inputs:
# From file path
gpux run image-classifier --input '{"image": "/path/to/image.jpg"}'
# From URL
gpux run image-classifier --input '{"image": "https://example.com/image.png"}'
# From base64 data URI
gpux run image-classifier --input '{"image": "data:image/jpeg;base64,/9j/4AAQ..."}'
The preprocessor automatically:
- Loads the image from the specified source
- Resizes to the target dimensions (from config or model input shape)
- Normalizes using ImageNet or custom parameters
- Converts to the correct tensor format
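The three input forms shown above (file path, URL, base64 data URI) can be told apart by their prefix. The sketch below is one way to do that with the standard library; `classify_source` and `decode_data_uri` are hypothetical helpers and do not describe how GPUX itself detects the source.

```python
import base64


def classify_source(value):
    """Classify an input string as a data URI, a URL, or a local file path."""
    if value.startswith("data:"):
        return "data_uri"
    if value.startswith(("http://", "https://")):
        return "url"
    return "path"


def decode_data_uri(uri):
    """Split a base64 data URI into (media_type, raw_bytes)."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data URI")
    media_type = header[len("data:"):-len(";base64")]
    return media_type, base64.b64decode(payload)
```

The same prefix check applies to audio inputs, where the media type would be something like `audio/wav` instead of `image/jpeg`.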
Using Human-Friendly Audio Inputs¶
With audio preprocessing configured, you can use simple audio inputs:
# From file path
gpux run whisper-transcription --input '{"audio": "/path/to/audio.wav"}'
# From URL
gpux run whisper-transcription --input '{"audio": "https://example.com/audio.mp3"}'
# From base64 data URI
gpux run whisper-transcription --input '{"audio": "data:audio/wav;base64,UklGRiQAAABXQVZFZm10..."}'
The preprocessor automatically:
- Loads the audio from the specified source (WAV, MP3, FLAC formats supported)
- Resamples to the target sample rate (from config or model requirements)
- Extracts features (mel spectrogram for Whisper, raw audio for others)
- Converts to the correct tensor format
Multimodal Preprocessing (Text + Image)¶
For models that accept multiple input types:
preprocessing:
  text_tokenizer: openai/clip-vit-base-patch32
  text_max_length: 77
  image_resize: [224, 224]
  image_normalize: imagenet
  cache_enabled: true
  cache_max_memory_mb: 200
Usage:
The multimodal preprocessor automatically:
- Detects both text and image keys
- Routes each to the appropriate preprocessor
- Combines outputs into a single dictionary
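The detect, route, and combine steps listed above can be sketched as a small dispatcher. Everything here is illustrative: the function names, the `text` and `image` input keys, and the stub preprocessors (which return placeholder values instead of real tensors) are assumptions, not GPUX internals.

```python
def preprocess_text(text):
    """Stub: a real preprocessor would run the configured tokenizer."""
    return {"input_ids": f"tokens({text})"}


def preprocess_image(source):
    """Stub: a real preprocessor would load, resize, and normalize the image."""
    return {"pixel_values": f"pixels({source})"}


ROUTES = {"text": preprocess_text, "image": preprocess_image}


def preprocess_multimodal(inputs, routes=ROUTES):
    """Route each input key to its preprocessor and merge the results."""
    outputs = {}
    for key, value in inputs.items():
        if key not in routes:
            raise KeyError(f"no preprocessor registered for input key {key!r}")
        outputs.update(routes[key](value))  # combine into a single dictionary
    return outputs


combined = preprocess_multimodal({"text": "a photo of a cat", "image": "/path/to/image.jpg"})
```

Keeping the routing table as plain data makes it straightforward to register another modality, such as audio, without touching the dispatcher.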