Data Preprocessing

Preprocessing pipelines for text, images, and audio.


🎯 Overview

Learn preprocessing techniques for text, image, audio, and multimodal inputs, along with caching and batch-processing optimizations.


📝 Text Preprocessing

Tokenization

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer("Hello world", return_tensors="np")
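
With return_tensors="np" the tokenizer returns NumPy arrays. The values below are illustrative of what bert-base-uncased produces for this input:

# Inspect the tokenized output
print(tokens["input_ids"])       # e.g. [[ 101 7592 2088  102]]  ([CLS] hello world [SEP])
print(tokens["attention_mask"])  # e.g. [[1 1 1 1]]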

Configuration

preprocessing:
  tokenizer: bert-base-uncased
  max_length: 512
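
Assuming these keys map onto the standard tokenizer arguments (an illustration of the intent, not GPUX's exact internals), the equivalent direct call would be:

# Reusing the tokenizer loaded above
tokens = tokenizer(
    "Hello world",
    max_length=512,    # preprocessing.max_length
    truncation=True,   # assumed: inputs longer than max_length are truncated
    return_tensors="np",
)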

🖼️ Image Preprocessing

Resize and Normalize

import numpy as np
from PIL import Image

# Load image and force three RGB channels (grayscale or RGBA files
# would otherwise break the 3-channel normalization below)
img = Image.open("image.jpg").convert("RGB")

# Resize to the model's expected input size
img = img.resize((224, 224))

# Scale to [0, 1], then normalize with ImageNet statistics
img_array = np.array(img) / 255.0
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])
img_normalized = (img_array - mean) / std
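
Most ONNX vision models expect a float32 tensor in NCHW layout. Assuming your model follows that convention, the final step is a transpose plus a batch dimension:

# HWC -> CHW, add a leading batch dimension, cast to float32
input_tensor = img_normalized.transpose(2, 0, 1)[np.newaxis].astype(np.float32)
print(input_tensor.shape)  # (1, 3, 224, 224)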

Configuration

preprocessing:
  resize: [224, 224]
  normalize: imagenet

🎵 Audio Preprocessing

Resampling

import librosa

# Load the file and resample it to 16 kHz in one step
audio, sr = librosa.load("audio.wav", sr=16000)

Configuration

preprocessing:
  audio_sample_rate: 16000
  audio_feature_extraction: mel_spectrogram
  audio_n_mels: 80
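
The config above also requests 80-band mel features. A minimal sketch of that extraction with librosa (the exact window and hop parameters GPUX uses may differ):

import librosa
import numpy as np

audio, sr = librosa.load("audio.wav", sr=16000)

# 80-band mel spectrogram, converted to a log scale as most speech models expect
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=80)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (80, n_frames)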

🔄 Multimodal Preprocessing

GPUX supports multimodal models that accept multiple input types (e.g., text + image). The MultimodalPreprocessor automatically coordinates multiple specialized preprocessors.

Using Multimodal Inputs

# Text + Image input (e.g., CLIP models)
gpux run clip-model \
  --input '{"text": "a cat", "image": "/path/to/cat.jpg"}'

Configuration

preprocessing:
  # Text preprocessing
  text_tokenizer: openai/clip-vit-base-patch32
  text_max_length: 77

  # Image preprocessing
  image_resize: [224, 224]
  image_normalize: imagenet

The multimodal preprocessor automatically:

- Detects when both text and image keys are present
- Routes text to TextPreprocessor
- Routes image to ImagePreprocessor
- Combines the outputs into a single dictionary

Python API

from gpux import GPUXRuntime
from gpux.config.parser import PreprocessingConfig

config = PreprocessingConfig(
    text_tokenizer="openai/clip-vit-base-patch32",
    image_resize=[224, 224],
    image_normalize="imagenet",
)

runtime = GPUXRuntime(
    model_id="clip-model",
    preprocessing_config=config
)

result = runtime.infer({
    "text": "a cat sitting on a mat",
    "image": "/path/to/cat.jpg"
})

⚡ Performance Optimization

Caching

GPUX includes intelligent caching to improve preprocessing performance:

preprocessing:
  cache_enabled: true
  cache_max_memory_mb: 200  # Maximum cache size in MB
  cache_max_entries: 100    # Maximum number of cached items
  cache_ttl_seconds: 3600   # Cache expiration (optional)

What gets cached:

- Tokenizers (avoid reloading on each request)
- Processed images (avoid reprocessing the same image)
- Processed audio features (avoid recomputation)

Benefits:

- Significant speedup for repeated inputs
- Reduced memory-allocation overhead
- Better performance in production environments
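
Conceptually, this kind of cache keys processed results by a hash of the raw input. A simplified sketch of the idea (not GPUX's actual implementation, which also enforces the memory, entry-count, and TTL limits above):

import hashlib

_cache: dict[str, object] = {}

def cached_preprocess(raw: bytes, preprocess):
    """Return the cached result if these exact bytes were seen before."""
    key = hashlib.sha256(raw).hexdigest()
    if key not in _cache:
        _cache[key] = preprocess(raw)  # pay the preprocessing cost only once
    return _cache[key]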

Batch Processing

Process multiple inputs efficiently:

batch_inputs = [
    {"text": "Hello world"},
    {"text": "Another text"},
    {"image": "./image1.jpg"},
    {"image": "./image2.jpg"},
]

results = runtime.batch_infer(batch_inputs)

The preprocessing pipeline automatically:

- Detects which inputs need preprocessing
- Processes each input with the appropriate preprocessor
- Handles mixed batches (some preprocessed, some raw)
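
The dispatch amounts to inspecting each input's keys. A simplified sketch (the preprocessor objects here are hypothetical stand-ins, not GPUX's real class instances):

def route(item: dict):
    if "text" in item and "image" in item:
        return multimodal_preprocessor.process(item)   # hypothetical API
    if "text" in item:
        return text_preprocessor.process(item["text"])
    if "image" in item:
        return image_preprocessor.process(item["image"])
    return item  # already-preprocessed inputs pass through unchanged

batch = [route(item) for item in batch_inputs]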


💡 Key Takeaways

Success

✅ Text tokenization
✅ Image preprocessing
✅ Audio resampling
✅ Multimodal preprocessing (text + image)
✅ Configuration options
✅ Performance optimization with caching
✅ Batch processing support


Previous: Inputs & Outputs | Next: Batch Inference →