Multimodal Models¶
A real-world example of running multimodal (text + image) models with GPUX.
🎯 Overview¶
This guide demonstrates how to use GPUX's multimodal preprocessing system with models that accept multiple input types, such as CLIP (Contrastive Language-Image Pre-training) models.
What you'll learn:
- Setting up multimodal models with GPUX
- Using MultimodalPreprocessor for text + image inputs
- Configuration for multimodal preprocessing
- Batch processing with multimodal inputs
- Performance optimization with caching
📋 Prerequisites¶
- Install GPUX (if not already installed); a sample install command is sketched after this list.
- Required dependencies:
    - transformers (for text tokenization)
    - Pillow (for image processing)
- A multimodal model (e.g., CLIP); see the Quick Start below for pulling one.
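A minimal install sketch (the PyPI package name gpux and the pip-based setup are assumptions here; adjust to however you normally install GPUX):

pip install gpux                    # assumed PyPI package name
pip install transformers pillow     # text tokenization and image processing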
🚀 Quick Start¶
1. Pull a Multimodal Model¶
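The pull subcommand shown here is an assumption inferred from the gpux run examples later in this guide; fetching the model should look roughly like this:

gpux pull openai/clip-vit-base-patch32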
This automatically:
- Downloads the model from HuggingFace
- Converts it to ONNX format
- Generates a gpux.yml configuration
- Sets up preprocessing for both text and image inputs
2. Run Inference with Multimodal Input¶
# Using both text and image inputs
gpux run openai/clip-vit-base-patch32 \
--input '{"text": "a cat sitting on a mat", "image": "/path/to/cat.jpg"}'
3. Test with Different Input Sources¶
# Local image file
gpux run openai/clip-vit-base-patch32 \
--input '{"text": "a red car", "image": "./test_image.jpg"}'
# Image from URL
gpux run openai/clip-vit-base-patch32 \
--input '{"text": "a beautiful sunset", "image": "https://example.com/sunset.jpg"}'
# Base64 encoded image
gpux run openai/clip-vit-base-patch32 \
--input '{"text": "a dog", "image": "data:image/jpeg;base64,/9j/4AAQ..."}'
📝 Configuration¶
Basic Multimodal Configuration¶
name: clip-model
version: 1.0.0

model:
  source: ./clip_model.onnx
  format: onnx

preprocessing:
  # Text preprocessing
  text_tokenizer: openai/clip-vit-base-patch32
  text_max_length: 77        # CLIP uses 77 tokens
  text_padding: max_length
  text_truncation: true

  # Image preprocessing
  image_resize: [224, 224]
  image_normalize: imagenet

  # Cache configuration (optional, for performance)
  cache_enabled: true
  cache_max_memory_mb: 200
  cache_max_entries: 100
  cache_ttl_seconds: 3600    # 1 hour

inputs:
  input_ids:
    type: int64
    shape: [1, 77]
  attention_mask:
    type: int64
    shape: [1, 77]
  image:
    type: float32
    shape: [1, 3, 224, 224]

outputs:
  text_embeddings:
    type: float32
    shape: [1, 512]
  image_embeddings:
    type: float32
    shape: [1, 512]
Advanced Configuration with Custom Settings¶
preprocessing:
  # Text settings
  text_tokenizer: openai/clip-vit-base-patch32
  text_max_length: 77
  text_padding: max_length
  text_truncation: true

  # Image settings
  image_resize: [224, 224]
  image_normalize: custom
  image_mean: [0.485, 0.456, 0.406]
  image_std: [0.229, 0.224, 0.225]

  # Performance optimization
  cache_enabled: true
  cache_max_memory_mb: 500    # Larger cache for production
  cache_max_entries: 200
  cache_ttl_seconds: 7200     # 2 hours
💻 Python API Usage¶
Basic Usage¶
from gpux import GPUXRuntime
from gpux.config.parser import PreprocessingConfig

# Create preprocessing configuration
config = PreprocessingConfig(
    text_tokenizer="openai/clip-vit-base-patch32",
    text_max_length=77,
    image_resize=[224, 224],
    image_normalize="imagenet",
    cache_enabled=True,
    cache_max_memory_mb=200,
)

# Initialize runtime
runtime = GPUXRuntime(
    model_id="openai/clip-vit-base-patch32",
    preprocessing_config=config,
)

# Run inference with multimodal input
result = runtime.infer({
    "text": "a cat sitting on a mat",
    "image": "/path/to/cat.jpg",
})

print(result)
Batch Processing¶
# Process multiple image-text pairs
batch_inputs = [
    {"text": "a red car", "image": "./car1.jpg"},
    {"text": "a blue bicycle", "image": "./bike1.jpg"},
    {"text": "a green tree", "image": "./tree1.jpg"},
]

results = runtime.batch_infer(batch_inputs)

for i, result in enumerate(results):
    print(f"Result {i}: {result}")
Direct Preprocessor Usage¶
from gpux.core.preprocessing.multimodal import MultimodalPreprocessor
from gpux.core.preprocessing.registry import PreprocessorRegistry
from gpux.core.preprocessing.text import TextPreprocessor
from gpux.core.preprocessing.image import ImagePreprocessor
from gpux.config.parser import PreprocessingConfig

# Set up preprocessors
registry = PreprocessorRegistry()
registry.register(TextPreprocessor())
registry.register(ImagePreprocessor())

# Create multimodal preprocessor
multimodal = MultimodalPreprocessor(registry=registry)

# Configure preprocessing
config = PreprocessingConfig(
    text_tokenizer="openai/clip-vit-base-patch32",
    image_resize=[224, 224],
    image_normalize="imagenet",
)

# Preprocess multimodal input
result = multimodal.preprocess({
    "text": "a red cat sitting on a mat",
    "image": "./test_image.jpg",
}, config)

print(result)
# Output: {
#     'input_ids': array([[...]]),
#     'attention_mask': array([[...]]),
#     'image': array([[[[...]]]])
# }
🎯 Use Cases¶
Image-Text Matching¶
Check if an image matches a text description:
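A minimal Python sketch, assuming the runtime setup from the Python API section above and the text_embeddings / image_embeddings outputs declared in the configuration; it scores a text/image pair with cosine similarity:

import numpy as np

from gpux import GPUXRuntime

# Assumption: same runtime setup as in the Python API section
runtime = GPUXRuntime(model_id="openai/clip-vit-base-patch32")

result = runtime.infer({
    "text": "a cat sitting on a mat",
    "image": "/path/to/cat.jpg",
})

# Cosine similarity between the embeddings declared in gpux.yml;
# values close to 1.0 indicate a good text/image match
text_emb = result["text_embeddings"][0]
image_emb = result["image_embeddings"][0]
similarity = float(
    np.dot(text_emb, image_emb)
    / (np.linalg.norm(text_emb) * np.linalg.norm(image_emb))
)
print(f"Match score: {similarity:.3f}")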
Zero-Shot Image Classification¶
Classify images into categories without training:
gpux run clip-model \
--input '{"text": "a photograph of a cat, a dog, or a bird", "image": "./pet.jpg"}'
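CLIP-style zero-shot classification is usually done by scoring one image against several candidate prompts and picking the best match. A hedged sketch of that pattern, reusing the runtime and embedding outputs assumed above:

import numpy as np

# Assumes `runtime` is the GPUXRuntime instance from the Python API section
labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = []
for label in labels:
    result = runtime.infer({"text": label, "image": "./pet.jpg"})
    scores.append((label, cosine(result["text_embeddings"][0],
                                 result["image_embeddings"][0])))

best_label, best_score = max(scores, key=lambda pair: pair[1])
print(f"Predicted label: {best_label} (score {best_score:.3f})")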
Image Search¶
Find images that match text queries:
query = "a sunset over mountains"
image_paths = ["./img1.jpg", "./img2.jpg", "./img3.jpg"]

results = []
for img_path in image_paths:
    result = runtime.infer({
        "text": query,
        "image": img_path,
    })
    results.append((img_path, result["similarity_score"]))

# Sort by similarity
results.sort(key=lambda x: x[1], reverse=True)
print(f"Best match: {results[0][0]}")
⚡ Performance Optimization¶
Enable Caching¶
The preprocessing system includes intelligent caching to improve performance:
preprocessing:
  cache_enabled: true
  cache_max_memory_mb: 200    # Maximum cache size
  cache_max_entries: 100      # Maximum number of cached items
  cache_ttl_seconds: 3600     # Cache expiration time (optional)
Benefits:
- Tokenizers are cached (avoid reloading on each request)
- Processed images are cached (avoid reprocessing same images)
- Significant speedup for repeated inputs
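A rough way to see the cache working is to time the same request twice; this sketch assumes the runtime from the Python API section with caching enabled:

import time

# Assumes `runtime` was created with cache_enabled=True (see Python API section)
sample = {"text": "a cat sitting on a mat", "image": "./cat.jpg"}

start = time.perf_counter()
runtime.infer(sample)   # cold call: loads the tokenizer and preprocesses the image
cold = time.perf_counter() - start

start = time.perf_counter()
runtime.infer(sample)   # warm call: preprocessing is served from the cache
warm = time.perf_counter() - start

print(f"cold={cold:.3f}s, warm={warm:.3f}s")

The second call should be noticeably faster whenever the same text and image are submitted again.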
Batch Processing¶
Process multiple inputs efficiently:
# Process 10 image-text pairs at once
batch = [
    {"text": f"description {i}", "image": f"./image_{i}.jpg"}
    for i in range(10)
]

results = runtime.batch_infer(batch)
🔧 Troubleshooting¶
Model Not Found¶
If the model doesn't exist on HuggingFace:
- Try an alternative CLIP checkpoint (e.g., laion/CLIP-ViT-B-32-laion2B-s34B-b79K)
- Check model name spelling
- Verify internet connection
Conversion Issues¶
If model conversion fails:
- Ensure torch is installed
- Try the --provider cpu flag (example after this list)
- Check model compatibility with ONNX
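For example, a hypothetical invocation assuming gpux run accepts the flag mentioned above:

gpux run openai/clip-vit-base-patch32 --provider cpu \
    --input '{"text": "a cat", "image": "./cat.jpg"}'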
Preprocessing Errors¶
If preprocessing fails:
- Verify image file exists and is readable
- Check image format (JPEG, PNG supported)
- Ensure text is a valid string
- Check that both text and image keys are present
Cache Issues¶
If caching causes problems:
- Disable cache: cache_enabled: false
- Reduce cache size: cache_max_memory_mb: 50
- Clear cache programmatically: preprocessor.clear_cache() (see the sketch below)
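The same switch is available from the Python API; a small sketch, assuming the PreprocessingConfig fields shown earlier (clear_cache() is the method mentioned above on the preprocessor object from the Direct Preprocessor Usage section):

from gpux.config.parser import PreprocessingConfig

# Debugging setup: identical fields to the earlier examples, with caching disabled
config = PreprocessingConfig(
    text_tokenizer="openai/clip-vit-base-patch32",
    image_resize=[224, 224],
    image_normalize="imagenet",
    cache_enabled=False,
)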
📚 Additional Resources¶
- Preprocessing Guide
- Preprocessing Configuration Reference
- Batch Inference Guide
- CLIP Model Documentation
💡 Key Takeaways¶
Success
✅ Multimodal preprocessing handles text + image automatically
✅ Supports file paths, URLs, and base64 encoded images
✅ Batch processing for efficient inference
✅ Intelligent caching for performance optimization
✅ Zero configuration needed for most models