CLIP Model

The CLIP (Contrastive Language-Image Pre-training) model is a multimodal model that connects text and images. It embeds both visual and textual content in a shared space, allowing you to find the best-matching images for a given text description, or vice versa.

Features

  • Hardware-Optimized: Automatically detects your hardware and uses the appropriate implementation:
    • Jetson Devices (Orin Nano, AGX Orin): GPU-accelerated implementation with NVIDIA PyTorch wheel
    • Standard GPUs: GPU-accelerated implementation
    • CPU-only Systems: Optimized CPU implementation
  • Real-time Progress Indicators: Visual feedback with rotating spinners and timing information
  • Resource Monitoring: Built-in monitoring of system resources (memory usage, GPU utilization)
  • Automatic Dependency Management: Dependencies are installed only when needed

Installation

The CLIP model is included in the Exla SDK. No separate installation is required.

pip install exla-sdk

Basic Usage

from exla.models.clip import clip
import json

# Initialize the model (automatically detects your hardware)
model = clip()

# Run inference
results = model.inference(
    image_paths=["path/to/image1.jpg", "path/to/image2.jpg"],
    text_queries=["a photo of a dog", "a photo of a cat", "a photo of a bird"]
)

# Print results
print(json.dumps(results, indent=2))

Advanced Usage

Processing Multiple Images

# Process a list of images
images = [
    "path/to/image1.jpg",
    "path/to/image2.jpg",
    "path/to/image3.jpg"
]

# Alternatively, pass the path to a text file with one image path per line:
# images = "path/to/image_list.txt"

results = model.inference(
    image_paths=images,
    text_queries=["query1", "query2", "query3"]
)
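
The text-file option expects one image path per line. The snippet below is a minimal sketch (the directory and file names are placeholders, not part of the SDK) showing how such a file can be generated from a folder of images:

from pathlib import Path

# Collect image files from an example directory (hypothetical path)
image_dir = Path("path/to/images")
paths = sorted(
    str(p) for p in image_dir.iterdir()
    if p.suffix.lower() in {".jpg", ".jpeg", ".png"}
)

# Write one path per line, the format expected for the text-file input
Path("path/to/image_list.txt").write_text("\n".join(paths))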

Jetson-Specific Optimizations

For Jetson devices, you can explicitly install the NVIDIA PyTorch wheel for optimal performance:

from exla.models.clip import clip

model = clip()

# Install NVIDIA PyTorch wheel (only needed once)
model.install_nvidia_pytorch()

# Run inference with GPU acceleration
results = model.inference(...)

Performance Considerations

Hardware Comparison

Hardware                 Model Loading    Inference (2 images)    Total Time
Jetson AGX Orin (GPU)    ~4-6s            ~0.8s                   ~5-7s
Standard GPU             ~3-5s            ~0.5s                   ~4-6s
CPU                      ~8-10s           ~2-4s                   ~10-15s

Python Version

For Jetson devices, Python 3.10 is strongly recommended, as the NVIDIA PyTorch wheel is built specifically for that version. With other Python versions the model falls back to CPU-only inference, which is significantly slower.
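
If you want to catch a version mismatch early, a simple interpreter check can warn before inference silently falls back to the CPU. This is a minimal sketch, not part of the SDK:

import sys

# The NVIDIA PyTorch wheel for Jetson targets Python 3.10; other versions
# result in CPU-only inference.
if sys.version_info[:2] != (3, 10):
    print(
        f"Warning: Python {sys.version_info.major}.{sys.version_info.minor} detected; "
        "on Jetson devices CLIP will run CPU-only."
    )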

API Reference

clip()

Factory function that returns the appropriate CLIP model based on the detected hardware.

Returns:

  • A CLIP model instance optimized for the detected hardware

model.inference(image_paths, text_queries=[], timeout=300, debug=False)

Runs CLIP inference on the provided images and text queries.

Parameters:

  • image_paths (str or list): Path to a single image, list of image paths, or path to a text file containing image paths
  • text_queries (list): List of text queries to compare against the images
  • timeout (int): Maximum time in seconds to wait for inference (default: 300)
  • debug (bool): Whether to print detailed debug information (default: False)

Returns:

  • List of dictionaries containing predictions for each text query
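
Putting the documented parameters together, a call with a text-file input, a shorter timeout, and debug output enabled might look like this (paths and queries are placeholders):

results = model.inference(
    image_paths="path/to/image_list.txt",  # single path, list of paths, or text file of paths
    text_queries=["a photo of a dog", "a photo of a cat"],
    timeout=120,                           # abort if inference takes longer than 120 seconds
    debug=True                             # print detailed debug information
)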

model.install_nvidia_pytorch()

Explicitly installs NVIDIA’s PyTorch wheel for optimal performance on Jetson devices.

Returns:

  • bool: True if installation was successful, False otherwise
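
Because the method reports success as a boolean, you can check the result before relying on GPU acceleration. A minimal sketch:

if model.install_nvidia_pytorch():
    print("NVIDIA PyTorch wheel installed; inference will use the GPU.")
else:
    # Assumption: without the wheel, the model falls back to the CPU implementation
    print("Installation failed; expect CPU-only inference.")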

Example Output

[
  {
    "a photo of a dog": [
      {
        "image_path": "data/dog.png",
        "score": "23.1011"
      },
      {
        "image_path": "data/cat.png",
        "score": "17.1396"
      }
    ]
  },
  {
    "a photo of a cat": [
      {
        "image_path": "data/cat.png",
        "score": "25.3045"
      },
      {
        "image_path": "data/dog.png",
        "score": "18.7532"
      }
    ]
  }
]
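
Note that scores are returned as strings, so convert them to float before comparing. A minimal sketch that picks the best-matching image for each query from the structure above:

# Select the top-scoring image for every text query
for entry in results:
    for query, matches in entry.items():
        best = max(matches, key=lambda m: float(m["score"]))
        print(f"{query!r} -> {best['image_path']} (score {best['score']})")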

Visual Feedback

The CLIP model provides rich visual feedback during execution:

✨ EXLA SDK - CLIP Model ✨
🔍 Device Detected: AGX_ORIN

📊 Initial System Resources:
📊 Resource Monitor - NVIDIA Jetson AGX Orin
💻 System Memory: 4.2GB / 15.6GB (27%)

⠏ [0.5s] Initializing Exla Optimized CLIP model for AGX_ORIN [GPU Mode]
✓ [0.6s] Initializing Exla Optimized CLIP model for AGX_ORIN [GPU Mode]

🚀 Running CLIP inference on your images
✓ [0.2s] Processed 2 images
⠋ [1.2s] Loading CLIP model
Using GPU: Orin
✓ [4.1s] Model ready on CUDA
⠋ [0.0s] Running CLIP inference
✓ [0.8s] Inference completed successfully
✓ [0.1s] Processing results

✨ CLIP Inference Summary:
   • Model: openai/clip-vit-large-patch14-336
   • Device: CUDA
   • Images processed: 2
   • Text queries: 3
   • Total time: 5.17s