Multimodal Models

Real-world example for multimodal models (text + image) using GPUX.


🎯 Overview

This guide demonstrates how to use GPUX's multimodal preprocessing system with models that accept multiple input types, such as CLIP (Contrastive Language-Image Pre-training) models.

What you'll learn:

- Setting up multimodal models with GPUX
- Using MultimodalPreprocessor for text + image inputs
- Configuration for multimodal preprocessing
- Batch processing with multimodal inputs
- Performance optimization with caching


📋 Prerequisites

  1. Install GPUX (if not already installed):

    pip install gpux

  2. Required dependencies:

     - transformers (for text tokenization)
     - Pillow (for image processing)

  3. A multimodal model (e.g., CLIP):

    gpux pull openai/clip-vit-base-patch32


🚀 Quick Start

1. Pull a Multimodal Model

# Pull CLIP model from HuggingFace
gpux pull openai/clip-vit-base-patch32

This automatically:

- Downloads the model from HuggingFace
- Converts it to ONNX format
- Generates a gpux.yml configuration
- Sets up preprocessing for both text and image inputs

2. Run Inference with Multimodal Input

# Using both text and image inputs
gpux run openai/clip-vit-base-patch32 \
  --input '{"text": "a cat sitting on a mat", "image": "/path/to/cat.jpg"}'

3. Test with Different Input Sources

# Local image file
gpux run openai/clip-vit-base-patch32 \
  --input '{"text": "a red car", "image": "./test_image.jpg"}'

# Image from URL
gpux run openai/clip-vit-base-patch32 \
  --input '{"text": "a beautiful sunset", "image": "https://example.com/sunset.jpg"}'

# Base64 encoded image
gpux run openai/clip-vit-base-patch32 \
  --input '{"text": "a dog", "image": "data:image/jpeg;base64,/9j/4AAQ..."}'

📝 Configuration

Basic Multimodal Configuration

name: clip-model
version: 1.0.0

model:
  source: ./clip_model.onnx
  format: onnx

preprocessing:
  # Text preprocessing
  text_tokenizer: openai/clip-vit-base-patch32
  text_max_length: 77  # CLIP uses 77 tokens
  text_padding: max_length
  text_truncation: true

  # Image preprocessing
  image_resize: [224, 224]
  image_normalize: imagenet

  # Cache configuration (optional, for performance)
  cache_enabled: true
  cache_max_memory_mb: 200
  cache_max_entries: 100
  cache_ttl_seconds: 3600  # 1 hour

inputs:
  input_ids:
    type: int64
    shape: [1, 77]
  attention_mask:
    type: int64
    shape: [1, 77]
  image:
    type: float32
    shape: [1, 3, 224, 224]

outputs:
  text_embeddings:
    type: float32
    shape: [1, 512]
  image_embeddings:
    type: float32
    shape: [1, 512]
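
To sanity-check a gpux.yml before running it, you can load it with PyYAML (assuming PyYAML is installed; the field names simply mirror the example above):

import yaml

with open("gpux.yml") as f:
    config = yaml.safe_load(f)

# Print the declared input shapes so they can be compared
# against what the preprocessor actually produces
for name, spec in config["inputs"].items():
    print(f"{name}: type={spec['type']} shape={spec['shape']}")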

Advanced Configuration with Custom Settings

preprocessing:
  # Text settings
  text_tokenizer: openai/clip-vit-base-patch32
  text_max_length: 77
  text_padding: max_length
  text_truncation: true

  # Image settings
  image_resize: [224, 224]
  image_normalize: custom
  image_mean: [0.485, 0.456, 0.406]
  image_std: [0.229, 0.224, 0.225]

  # Performance optimization
  cache_enabled: true
  cache_max_memory_mb: 500  # Larger cache for production
  cache_max_entries: 200
  cache_ttl_seconds: 7200  # 2 hours
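
The custom mean/std values above are the standard ImageNet statistics. As a rough sketch of what this normalization amounts to (not GPUX internals, just the usual formula applied with Pillow and NumPy on an illustrative file):

import numpy as np
from PIL import Image

mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)

# Load, resize, and scale pixel values to [0, 1]
img = Image.open("./test_image.jpg").convert("RGB").resize((224, 224))
pixels = np.asarray(img, dtype=np.float32) / 255.0  # (224, 224, 3)

# Normalize per channel, then move channels first: (1, 3, 224, 224)
normalized = (pixels - mean) / std
tensor = normalized.transpose(2, 0, 1)[np.newaxis, ...]
print(tensor.shape, tensor.dtype)  # (1, 3, 224, 224) float32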

💻 Python API Usage

Basic Usage

from gpux import GPUXRuntime
from gpux.config.parser import PreprocessingConfig

# Create preprocessing configuration
config = PreprocessingConfig(
    text_tokenizer="openai/clip-vit-base-patch32",
    text_max_length=77,
    image_resize=[224, 224],
    image_normalize="imagenet",
    cache_enabled=True,
    cache_max_memory_mb=200,
)

# Initialize runtime
runtime = GPUXRuntime(
    model_id="openai/clip-vit-base-patch32",
    preprocessing_config=config
)

# Run inference with multimodal input
result = runtime.infer({
    "text": "a cat sitting on a mat",
    "image": "/path/to/cat.jpg"
})

print(result)

Batch Processing

# Process multiple image-text pairs
batch_inputs = [
    {"text": "a red car", "image": "./car1.jpg"},
    {"text": "a blue bicycle", "image": "./bike1.jpg"},
    {"text": "a green tree", "image": "./tree1.jpg"},
]

results = runtime.batch_infer(batch_inputs)

for i, result in enumerate(results):
    print(f"Result {i}: {result}")

Direct Preprocessor Usage

from gpux.core.preprocessing.multimodal import MultimodalPreprocessor
from gpux.core.preprocessing.registry import PreprocessorRegistry
from gpux.core.preprocessing.text import TextPreprocessor
from gpux.core.preprocessing.image import ImagePreprocessor
from gpux.config.parser import PreprocessingConfig

# Setup preprocessors
registry = PreprocessorRegistry()
registry.register(TextPreprocessor())
registry.register(ImagePreprocessor())

# Create multimodal preprocessor
multimodal = MultimodalPreprocessor(registry=registry)

# Configure preprocessing
config = PreprocessingConfig(
    text_tokenizer="openai/clip-vit-base-patch32",
    image_resize=[224, 224],
    image_normalize="imagenet",
)

# Preprocess multimodal input
result = multimodal.preprocess({
    "text": "a red cat sitting on a mat",
    "image": "./test_image.jpg"
}, config)

print(result)
# Output: {
#   'input_ids': array([[...]]),
#   'attention_mask': array([[...]]),
#   'image': array([[[[...]]]])
# }
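
A quick way to confirm the arrays line up with the inputs section of gpux.yml (shapes and dtypes as declared in the configuration example above):

# Each key should match an input name declared in gpux.yml
for name, array in result.items():
    print(f"{name}: shape={array.shape}, dtype={array.dtype}")

# Expected, per the configuration above:
#   input_ids: shape=(1, 77), dtype=int64
#   attention_mask: shape=(1, 77), dtype=int64
#   image: shape=(1, 3, 224, 224), dtype=float32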

🎯 Use Cases

Image-Text Matching

Check if an image matches a text description:

gpux run clip-model \
  --input '{"text": "a dog playing in the park", "image": "./dog.jpg"}'

Zero-Shot Image Classification

Classify images into categories without training:

gpux run clip-model \
  --input '{"text": "a photograph of a cat, a dog, or a bird", "image": "./pet.jpg"}'

Image Search

Find images that match text queries:

query = "a sunset over mountains"
image_paths = ["./img1.jpg", "./img2.jpg", "./img3.jpg"]

results = []
for img_path in image_paths:
    result = runtime.infer({
        "text": query,
        "image": img_path
    })
    results.append((img_path, result["similarity_score"]))

# Sort by similarity
results.sort(key=lambda x: x[1], reverse=True)
print(f"Best match: {results[0][0]}")

⚡ Performance Optimization

Enable Caching

The preprocessing system includes intelligent caching to improve performance:

preprocessing:
  cache_enabled: true
  cache_max_memory_mb: 200  # Maximum cache size
  cache_max_entries: 100    # Maximum number of cached items
  cache_ttl_seconds: 3600    # Cache expiration time (optional)

Benefits:

- Tokenizers are cached (avoid reloading on each request)
- Processed images are cached (avoid reprocessing same images)
- Significant speedup for repeated inputs
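
A rough way to see the effect is to time the same request twice: with caching enabled, the second call should skip tokenizer loading and image reprocessing. An illustrative sketch using the runtime from the Python API section (the image path is hypothetical):

import time

request = {"text": "a cat sitting on a mat", "image": "/path/to/cat.jpg"}

# First call: cold cache (tokenizer load + image preprocessing)
start = time.perf_counter()
runtime.infer(request)
cold = time.perf_counter() - start

# Second call: warm cache (preprocessed inputs reused)
start = time.perf_counter()
runtime.infer(request)
warm = time.perf_counter() - start

print(f"cold: {cold * 1000:.1f} ms, warm: {warm * 1000:.1f} ms")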

Batch Processing

Process multiple inputs efficiently:

# Process 10 image-text pairs at once
batch = [
    {"text": f"description {i}", "image": f"./image_{i}.jpg"}
    for i in range(10)
]

results = runtime.batch_infer(batch)
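
To check whether batching actually helps for your model, a small comparison sketch (reusing the batch list above; timings are illustrative, not guaranteed):

import time

# Per-item loop
start = time.perf_counter()
loop_results = [runtime.infer(item) for item in batch]
loop_time = time.perf_counter() - start

# Single batched call
start = time.perf_counter()
batch_results = runtime.batch_infer(batch)
batch_time = time.perf_counter() - start

print(f"loop: {loop_time:.2f} s, batch: {batch_time:.2f} s")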

🔧 Troubleshooting

Model Not Found

If the model doesn't exist on HuggingFace:

- Try an alternative CLIP model (e.g., one of the laion CLIP checkpoints on HuggingFace)
- Check model name spelling
- Verify internet connection

Conversion Issues

If model conversion fails:

- Ensure torch is installed
- Try with the --provider cpu flag
- Check model compatibility with ONNX

Preprocessing Errors

If preprocessing fails:

- Verify the image file exists and is readable
- Check the image format (JPEG, PNG supported)
- Ensure the text is a valid string
- Check that both text and image keys are present
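
A small, hypothetical validation helper can catch most of these issues before the request reaches the runtime (the checks mirror the list above; names and supported formats are illustrative):

from pathlib import Path

SUPPORTED_FORMATS = {".jpg", ".jpeg", ".png"}

def validate_multimodal_input(data: dict) -> None:
    """Raise a ValueError if the text + image request looks malformed."""
    # Both keys must be present
    for key in ("text", "image"):
        if key not in data:
            raise ValueError(f"missing required key: {key}")

    # Text must be a non-empty string
    if not isinstance(data["text"], str) or not data["text"].strip():
        raise ValueError("'text' must be a non-empty string")

    image = data["image"]
    if not isinstance(image, str):
        raise ValueError("'image' must be a path, URL, or data URI string")

    # URLs and base64 data URIs are passed through as-is
    if image.startswith(("http://", "https://", "data:image/")):
        return

    # Otherwise treat it as a local file path
    path = Path(image)
    if not path.is_file():
        raise ValueError(f"image file not found: {image}")
    if path.suffix.lower() not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported image format: {path.suffix}")

validate_multimodal_input({"text": "a red car", "image": "./test_image.jpg"})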

Cache Issues

If caching causes problems:

- Disable cache: cache_enabled: false
- Reduce cache size: cache_max_memory_mb: 50
- Clear cache programmatically: preprocessor.clear_cache()
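
Programmatically, the same options look roughly like this (a sketch; clear_cache() is the method named above, and the multimodal instance comes from the Direct Preprocessor Usage section):

from gpux.config.parser import PreprocessingConfig

# Option 1: disable caching entirely
no_cache_config = PreprocessingConfig(cache_enabled=False)

# Option 2: keep caching but shrink its footprint (values are illustrative)
small_cache_config = PreprocessingConfig(
    cache_enabled=True,
    cache_max_memory_mb=50,
    cache_max_entries=25,
)

# Option 3: clear an existing preprocessor's cache
multimodal.clear_cache()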


💡 Key Takeaways

Success

✅ Multimodal preprocessing handles text + image automatically
✅ Supports file paths, URLs, and base64 encoded images
✅ Batch processing for efficient inference
✅ Intelligent caching for performance optimization
✅ Zero configuration needed for most models

