# CLIP Model

Contrastive Language-Image Pretraining model, optimized for Jetson and other devices.

The CLIP (Contrastive Language-Image Pretraining) model is a powerful multimodal model that connects text and images. It understands both visual and textual content, allowing you to find the best-matching images for a given text description, or vice versa.
## Features
- **Hardware-Optimized**: Automatically detects your hardware and uses the appropriate implementation:
  - **Jetson Devices (Orin Nano, AGX Orin)**: GPU-accelerated implementation with the NVIDIA PyTorch wheel
  - **Standard GPUs**: GPU-accelerated implementation
  - **CPU-only Systems**: Optimized CPU implementation
- **Real-time Progress Indicators**: Visual feedback with rotating spinners and timing information
- **Resource Monitoring**: Built-in monitoring of system resources (memory usage, GPU utilization)
- **Automatic Dependency Management**: Dependencies are installed only when needed
## Installation
The CLIP model is included in the Exla SDK. No separate installation is required.
## Basic Usage
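A minimal sketch of basic usage. The import path `exla.models.clip` and the prediction keys `query`/`score` are assumptions about the SDK's layout and result format, not confirmed by this document; adjust them to match your installed version.

```python
def best_prediction(predictions):
    """Return the prediction dict with the highest confidence.

    The key name 'score' is an assumption about the result format;
    adjust it to match your SDK version.
    """
    return max(predictions, key=lambda p: p.get("score", 0.0))


if __name__ == "__main__":
    # Hypothetical import path -- check your Exla SDK for the actual module.
    from exla.models.clip import clip

    model = clip()  # factory picks the implementation for the detected hardware
    results = model.inference(
        "images/dog.jpg",
        text_queries=["a photo of a dog", "a photo of a cat"],
    )
    print(best_prediction(results))
```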
## Advanced Usage
### Processing Multiple Images
### Jetson-Specific Optimizations
For Jetson devices, you can explicitly install the NVIDIA PyTorch wheel for optimal performance:
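A sketch of the explicit install, using the documented `model.install_nvidia_pytorch()` (returns `True` on success). The Python-version guard reflects the note below that the wheel is built for Python 3.10; the import path is an assumption.

```python
import sys


def wheel_python_ok(version_info=None):
    """The NVIDIA PyTorch wheel is built for Python 3.10 (see Python Version below)."""
    vi = sys.version_info if version_info is None else version_info
    return tuple(vi[:2]) == (3, 10)


if __name__ == "__main__":
    # Hypothetical import path -- check your Exla SDK for the actual module.
    from exla.models.clip import clip

    model = clip()
    if wheel_python_ok() and model.install_nvidia_pytorch():
        print("NVIDIA PyTorch wheel installed; GPU inference enabled")
    else:
        print("Falling back to the default (CPU) implementation")
```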
## Performance Considerations

### Hardware Comparison
| Hardware | Model Loading | Inference (2 images) | Total Time |
|---|---|---|---|
| Jetson AGX Orin (GPU) | ~4-6s | ~0.8s | ~5-7s |
| Standard GPU | ~3-5s | ~0.5s | ~4-6s |
| CPU | ~8-10s | ~2-4s | ~10-15s |
### Python Version
For Jetson devices, Python 3.10 is strongly recommended, as the NVIDIA PyTorch wheel is built specifically for that version. Other Python versions fall back to CPU-only inference, which is significantly slower.
## API Reference

### `clip()`

Factory function that returns the appropriate CLIP model based on the detected hardware.
**Returns:**

- A CLIP model instance optimized for the detected hardware
### `model.inference(image_paths, text_queries=[], timeout=300, debug=False)`
Runs CLIP inference on the provided images and text queries.
**Parameters:**

- `image_paths` (str or list): Path to a single image, a list of image paths, or the path to a text file containing image paths
- `text_queries` (list): List of text queries to compare against the images
- `timeout` (int): Maximum time in seconds to wait for inference (default: 300)
- `debug` (bool): Whether to print detailed debug information (default: False)
**Returns:**

- List of dictionaries containing predictions for each text query
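A small sketch of consuming the return value. It assumes `inference()` returns one dict per query, in query order; that ordering is not stated in this document, so verify it against your SDK version.

```python
def predictions_by_query(text_queries, predictions):
    """Pair each text query with its prediction dict.

    Assumes inference() returns one dict per query, in query order --
    verify this against your SDK version.
    """
    return dict(zip(text_queries, predictions))


# Example with mock predictions:
# predictions_by_query(["a dog", "a cat"], [{"score": 0.9}, {"score": 0.1}])
# -> {"a dog": {"score": 0.9}, "a cat": {"score": 0.1}}
```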
### `model.install_nvidia_pytorch()`
Explicitly installs NVIDIA’s PyTorch wheel for optimal performance on Jetson devices.
**Returns:**

- `bool`: True if installation was successful, False otherwise
## Example Output

### Visual Feedback
The CLIP model provides rich visual feedback during execution, including rotating spinners, timing information for each step, and resource-usage statistics (memory and GPU utilization).