CLIP Model

The CLIP (Contrastive Language-Image Pre-training) model is a multimodal model that connects text and images. It embeds both visual and textual content in a shared space, allowing you to find the best-matching images for a given text description, or vice versa.

Features

  • Hardware-Optimized: Automatically detects your hardware and uses the appropriate implementation:
    • Jetson Devices (Orin Nano, AGX Orin): GPU-accelerated implementation with NVIDIA PyTorch wheel
    • Standard GPUs: GPU-accelerated implementation
    • CPU-only Systems: Optimized CPU implementation
  • Real-time Progress Indicators: Visual feedback with rotating spinners and timing information
  • Resource Monitoring: Built-in monitoring of system resources (memory usage, GPU utilization)
  • Automatic Dependency Management: Dependencies are installed only when needed

Installation

The CLIP model is included in the Exla SDK. No separate installation is required.

pip install exla-sdk

Basic Usage

from exla.models.clip import clip
import json

# Initialize the model (automatically detects your hardware)
model = clip()

# Run inference
results = model.inference(
    image_paths=["path/to/image1.jpg", "path/to/image2.jpg"],
    text_queries=["a photo of a dog", "a photo of a cat", "a photo of a bird"]
)

# Print results
print(json.dumps(results, indent=2))

Advanced Usage

Processing Multiple Images

# Process a list of images
images = [
    "path/to/image1.jpg",
    "path/to/image2.jpg",
    "path/to/image3.jpg"
]

# Alternatively, pass the path to a text file with one image path per line:
# images = "path/to/image_list.txt"

results = model.inference(
    image_paths=images,
    text_queries=["query1", "query2", "query3"]
)
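
The text-file option expects one image path per line. The snippet below is a minimal sketch (the directory and file names are placeholders, not part of the SDK) showing how such a file can be generated from a folder of images:

from pathlib import Path

# Collect image files from an example directory (hypothetical path)
image_dir = Path("path/to/images")
paths = sorted(
    str(p) for p in image_dir.iterdir()
    if p.suffix.lower() in {".jpg", ".jpeg", ".png"}
)

# Write one path per line, the format expected for the text-file input
Path("path/to/image_list.txt").write_text("\n".join(paths))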

Jetson-Specific Optimizations

For Jetson devices, you can explicitly install the NVIDIA PyTorch wheel for optimal performance:

from exla.models.clip import clip

model = clip()

# Install NVIDIA PyTorch wheel (only needed once)
model.install_nvidia_pytorch()

# Run inference with GPU acceleration
results = model.inference(...)

Performance Considerations

Hardware Comparison

Hardware                 Model Loading    Inference (2 images)    Total Time
Jetson AGX Orin (GPU)    ~4-6s            ~0.8s                   ~5-7s
Standard GPU             ~3-5s            ~0.5s                   ~4-6s
CPU                      ~8-10s           ~2-4s                   ~10-15s

Python Version

For Jetson devices, Python 3.10 is strongly recommended, as the NVIDIA PyTorch wheel is built specifically for that version. With other Python versions the model falls back to CPU-only inference, which is significantly slower.
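
If you want to catch a version mismatch early, a simple interpreter check can warn before inference silently falls back to the CPU. This is a minimal sketch, not part of the SDK:

import sys

# The NVIDIA PyTorch wheel for Jetson targets Python 3.10; other versions
# result in CPU-only inference.
if sys.version_info[:2] != (3, 10):
    print(
        f"Warning: Python {sys.version_info.major}.{sys.version_info.minor} detected; "
        "on Jetson devices CLIP will run CPU-only."
    )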

API Reference

clip()

Factory function that returns the appropriate CLIP model based on the detected hardware.

Returns:

  • A CLIP model instance optimized for the detected hardware

model.inference(image_paths, text_queries=[], timeout=300, debug=False)

Runs CLIP inference on the provided images and text queries.

Parameters:

  • image_paths (str or list): Path to a single image, list of image paths, or path to a text file containing image paths
  • text_queries (list): List of text queries to compare against the images
  • timeout (int): Maximum time in seconds to wait for inference (default: 300)
  • debug (bool): Whether to print detailed debug information (default: False)

Returns:

  • List of dictionaries containing predictions for each text query
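
Putting the documented parameters together, a call with a text-file input, a shorter timeout, and debug output enabled might look like this (paths and queries are placeholders):

results = model.inference(
    image_paths="path/to/image_list.txt",  # single path, list of paths, or text file of paths
    text_queries=["a photo of a dog", "a photo of a cat"],
    timeout=120,                           # abort if inference takes longer than 120 seconds
    debug=True                             # print detailed debug information
)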

model.install_nvidia_pytorch()

Explicitly installs NVIDIA’s PyTorch wheel for optimal performance on Jetson devices.

Returns:

  • bool: True if installation was successful, False otherwise
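
Because the method reports success as a boolean, you can check the result before relying on GPU acceleration. A minimal sketch:

if model.install_nvidia_pytorch():
    print("NVIDIA PyTorch wheel installed; inference will use the GPU.")
else:
    # Assumption: without the wheel, the model falls back to the CPU implementation
    print("Installation failed; expect CPU-only inference.")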

Example Output

[
  {
    "a photo of a dog": [
      {
        "image_path": "data/dog.png",
        "score": "23.1011"
      },
      {
        "image_path": "data/cat.png",
        "score": "17.1396"
      }
    ]
  },
  {
    "a photo of a cat": [
      {
        "image_path": "data/cat.png",
        "score": "25.3045"
      },
      {
        "image_path": "data/dog.png",
        "score": "18.7532"
      }
    ]
  }
]
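
Note that scores are returned as strings, so convert them to float before comparing. A minimal sketch that picks the best-matching image for each query from the structure above:

# Select the top-scoring image for every text query
for entry in results:
    for query, matches in entry.items():
        best = max(matches, key=lambda m: float(m["score"]))
        print(f"{query!r} -> {best['image_path']} (score {best['score']})")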

Visual Feedback

The CLIP model provides rich visual feedback during execution:

✨ EXLA SDK - CLIP Model ✨
🔍 Device Detected: AGX_ORIN

📊 Initial System Resources:
📊 Resource Monitor - NVIDIA Jetson AGX Orin
💻 System Memory: 4.2GB / 15.6GB (27%)

⠏ [0.5s] Initializing Exla Optimized CLIP model for AGX_ORIN [GPU Mode]
✓ [0.6s] Initializing Exla Optimized CLIP model for AGX_ORIN [GPU Mode]

🚀 Running CLIP inference on your images
✓ [0.2s] Processed 2 images
⠋ [1.2s] Loading CLIP model
Using GPU: Orin
✓ [4.1s] Model ready on CUDA
⠋ [0.0s] Running CLIP inference
✓ [0.8s] Inference completed successfully
✓ [0.1s] Processing results

✨ CLIP Inference Summary:
   • Model: openai/clip-vit-large-patch14-336
   • Device: CUDA
   • Images processed: 2
   • Text queries: 3
   • Total time: 5.17s