Data Preprocessing¶
Preprocessing pipelines for text, image, audio, and multimodal inputs.
🎯 Overview¶
This page covers preprocessing for each supported input type: text tokenization, image resizing and normalization, audio resampling, and multimodal (text + image) inputs. It also covers performance features such as caching and batch processing.
📝 Text Preprocessing¶
Tokenization¶
from transformers import AutoTokenizer

# Load a pretrained tokenizer and return NumPy arrays
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer("Hello world", return_tensors="np")
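The call returns a dict-like BatchEncoding of NumPy arrays (because of return_tensors="np"). Inspecting it for the example above:

# input_ids includes the special [CLS] and [SEP] tokens
print(tokens["input_ids"])       # e.g., [[ 101 7592 2088  102]]
print(tokens["attention_mask"])  # [[1 1 1 1]]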
Configuration¶
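The text configuration keys are not filled in on this page. Assuming the text_tokenizer and text_max_length keys from the multimodal example below also apply to text-only models, a configuration might look like:

preprocessing:
  text_tokenizer: bert-base-uncased  # Hugging Face tokenizer name
  text_max_length: 512               # assumed truncation length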
🖼️ Image Preprocessing¶
Resize and Normalize¶
import numpy as np
from PIL import Image

# Load the image and force 3 RGB channels (handles grayscale/RGBA files)
img = Image.open("image.jpg").convert("RGB")

# Resize to the model's expected input size
img = img.resize((224, 224))

# Scale pixel values to [0, 1]
img_array = np.array(img) / 255.0

# Normalize with ImageNet channel statistics
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])
img_normalized = (img_array - mean) / std

# Note: many ONNX vision models expect CHW layout:
# img_chw = img_normalized.transpose(2, 0, 1)
Configuration¶
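Assuming the image_resize and image_normalize keys from the multimodal example below also apply to image-only models:

preprocessing:
  image_resize: [224, 224]   # target size in pixels
  image_normalize: imagenet  # apply ImageNet mean/std, as in the code above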
🎵 Audio Preprocessing¶
Resampling¶
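The resampling details are not shown on this page. As a general-purpose sketch (librosa is an assumed dependency here, not part of GPUX), resampling a clip to 16 kHz looks like:

import librosa

# Load audio at its native sample rate (sr=None disables automatic resampling)
waveform, sr = librosa.load("speech.wav", sr=None)

# Resample to 16 kHz, the rate most speech models expect
waveform_16k = librosa.resample(waveform, orig_sr=sr, target_sr=16000)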
Configuration¶
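The audio configuration keys are not documented on this page. Following the text_/image_ naming pattern of the other options, a hypothetical configuration might be (the key name is an assumption, not a documented option):

preprocessing:
  audio_sample_rate: 16000  # hypothetical key name; check the GPUX reference for the real option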
🔄 Multimodal Preprocessing¶
GPUX supports multimodal models that accept multiple input types (e.g., text + image). The MultimodalPreprocessor automatically coordinates multiple specialized preprocessors.
Using Multimodal Inputs¶
# Text + Image input (e.g., CLIP models)
gpux run clip-model \
  --input '{"text": "a cat", "image": "/path/to/cat.jpg"}'
Configuration¶
preprocessing:
  # Text preprocessing
  text_tokenizer: openai/clip-vit-base-patch32
  text_max_length: 77

  # Image preprocessing
  image_resize: [224, 224]
  image_normalize: imagenet
The multimodal preprocessor automatically (a simplified sketch follows this list):
- Detects when both text and image keys are present
- Routes text to the TextPreprocessor
- Routes image to the ImagePreprocessor
- Combines the outputs into a single dictionary
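Conceptually, the routing can be pictured like this (a simplified illustration, not the actual GPUX source; text_pre and image_pre stand in for the specialized preprocessors):

# Simplified sketch of multimodal routing (illustration only)
def preprocess(inputs: dict, text_pre, image_pre) -> dict:
    outputs = {}
    if "text" in inputs:
        outputs.update(text_pre(inputs["text"]))    # e.g., input_ids, attention_mask
    if "image" in inputs:
        outputs.update(image_pre(inputs["image"]))  # e.g., pixel_values
    return outputs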
Python API¶
from gpux import GPUXRuntime
from gpux.config.parser import PreprocessingConfig

# Mirror the YAML configuration in Python
config = PreprocessingConfig(
    text_tokenizer="openai/clip-vit-base-patch32",
    image_resize=[224, 224],
    image_normalize="imagenet",
)

runtime = GPUXRuntime(
    model_id="clip-model",
    preprocessing_config=config,
)

# Pass raw text and an image path; preprocessing runs automatically before inference
result = runtime.infer({
    "text": "a cat sitting on a mat",
    "image": "/path/to/cat.jpg",
})
⚡ Performance Optimization¶
Caching¶
GPUX includes intelligent caching to improve preprocessing performance:
preprocessing:
  cache_enabled: true
  cache_max_memory_mb: 200   # maximum cache size in MB
  cache_max_entries: 100     # maximum number of cached items
  cache_ttl_seconds: 3600    # cache expiration (optional)
What gets cached:
- Tokenizers (avoid reloading on each request)
- Processed images (avoid reprocessing the same image)
- Processed audio features (avoid recomputation)

Benefits:
- Significant speedup for repeated inputs
- Reduced memory allocation overhead
- Better performance in production environments
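As a conceptual sketch of how such a cache can work (an illustration, not GPUX's actual implementation), consider a TTL cache keyed by a hash of the raw input:

import hashlib
import time

class PreprocessCache:
    """Tiny TTL cache keyed by a hash of the raw input (illustration only)."""

    def __init__(self, max_entries=100, ttl_seconds=3600):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self._store = {}  # key -> (timestamp, value)

    def _key(self, raw: bytes) -> str:
        return hashlib.sha256(raw).hexdigest()

    def get(self, raw: bytes):
        entry = self._store.get(self._key(raw))
        if entry is None:
            return None
        timestamp, value = entry
        if time.time() - timestamp > self.ttl_seconds:
            del self._store[self._key(raw)]  # expired
            return None
        return value

    def put(self, raw: bytes, value) -> None:
        if len(self._store) >= self.max_entries:
            # Evict the oldest entry (simple policy; real caches typically use LRU)
            oldest = min(self._store, key=lambda k: self._store[k][0])
            del self._store[oldest]
        self._store[self._key(raw)] = (time.time(), value)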
Batch Processing¶
Process multiple inputs efficiently:
batch_inputs = [
    {"text": "Hello world"},
    {"text": "Another text"},
    {"image": "./image1.jpg"},
    {"image": "./image2.jpg"},
]

results = runtime.batch_infer(batch_inputs)
The preprocessing pipeline automatically:
- Detects which inputs need preprocessing
- Processes each input with the appropriate preprocessor
- Handles mixed batches (some preprocessed, some raw)
💡 Key Takeaways¶
Success
✅ Text tokenization
✅ Image preprocessing
✅ Audio resampling
✅ Multimodal preprocessing (text + image)
✅ Configuration options
✅ Performance optimization with caching
✅ Batch processing support