InternVL2.5 Model

InternVL2.5 is a powerful multimodal vision-language model that connects visual and textual understanding. It can process both images and text, enabling capabilities like visual question answering, image captioning, and multimodal reasoning.

Features

  • Hardware-Optimized: Automatically detects your hardware and selects the appropriate implementation (see the sketch after this list):

    • Jetson Devices (Orin Nano, AGX Orin): GPU-accelerated implementation built on NVIDIA's PyTorch wheel for Jetson
    • Standard GPUs: GPU-accelerated implementation with CUDA
    • Apple Silicon: MPS (Metal Performance Shaders) acceleration for macOS devices
    • CPU-only Systems: Optimized CPU implementation

  • Real-time Progress Indicators: Visual feedback with rotating spinners and timing information

  • Resource Monitoring: Built-in monitoring of system resources (memory usage, GPU utilization)

  • Multiple Tasks: Support for visual question answering, image captioning, and multimodal reasoning
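
The detection itself happens inside the SDK, but as a rough illustration, the standard PyTorch capability checks look like this (a minimal sketch, not the SDK's actual internals; detect_device is a made-up name for this example):

import torch

def detect_device():
    # Illustrative only: the SDK's real detection also distinguishes
    # Jetson boards from standard GPUs, which plain PyTorch does not.
    if torch.cuda.is_available():
        return "cuda"  # Jetson devices and standard NVIDIA GPUs
    if torch.backends.mps.is_available():
        return "mps"   # Apple Silicon (Metal Performance Shaders)
    return "cpu"       # CPU-only fallback

print(detect_device())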

Installation

The InternVL2.5 model ships with the Exla SDK, so it needs no installation of its own; installing the SDK is all that is required:

pip install exla-sdk

Basic Usage

from exla.models.internvl2_5 import internvl2_5
import json

# Initialize the model (automatically detects your hardware)
model = internvl2_5()

# Run visual question answering
results = model.vqa(
    image_path="path/to/image.jpg",
    question="What is shown in this image?"
)

# Print results
print(json.dumps(results, indent=2))
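
Because vqa() returns a dictionary mapping image paths to answers (see the API reference below), the answer for a single image can be read back directly:

# Look up the answer under the same path that was passed in
answer = results.get("path/to/image.jpg")
print(f"Answer: {answer}")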

Advanced Usage

Image Captioning

# Generate captions for images
captions = model.caption(
    image_paths=["path/to/image1.jpg", "path/to/image2.jpg"]
)

# Print captions
for image_path, caption in captions.items():
    print(f"{image_path}: {caption}")

Multimodal Reasoning

# Perform multimodal reasoning
reasoning_result = model.reason(
    image_path="path/to/image.jpg",
    prompt="Describe the scene in detail and explain what might happen next."
)

print(reasoning_result)

Processing Multiple Images

# Process a list of images with the same question
images = [
    "path/to/image1.jpg",
    "path/to/image2.jpg",
    "path/to/image3.jpg"
]

results = model.vqa(
    image_paths=images,
    question="What objects can you see?"
)

# Print results for each image
for image_path, answer in results.items():
    print(f"{image_path}: {answer}")

API Reference

internvl2_5()

Factory function that returns the appropriate InternVL2.5 model based on the detected hardware.

Returns:

  • An InternVL2.5 model instance optimized for the detected hardware

model.vqa(image_path=None, image_paths=None, question=None, timeout=300, debug=False)

Runs visual question answering on the provided image(s) with the given question.

Parameters:

  • image_path (str): Path to a single image
  • image_paths (list): List of image paths for batch processing
  • question (str): Question to ask about the image(s)
  • timeout (int): Maximum time in seconds to wait for inference (default: 300)
  • debug (bool): Whether to print detailed debug information (default: False)

Returns:

  • Dictionary mapping image paths to answers
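
For example, a batch call that tightens the timeout and enables debug output might look like this (hypothetical image paths; the signature suggests passing exactly one of image_path or image_paths):

# Hypothetical image paths, shown only to illustrate the parameters
answers = model.vqa(
    image_paths=["photos/street.jpg", "photos/park.jpg"],
    question="How many people are visible?",
    timeout=120,  # give up after 2 minutes instead of the default 300 s
    debug=True,   # print detailed debug information during inference
)

for path, answer in answers.items():
    print(f"{path}: {answer}")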

model.caption(image_path=None, image_paths=None, timeout=300, debug=False)

Generates captions for the provided image(s).

Parameters:

  • image_path (str): Path to a single image
  • image_paths (list): List of image paths for batch processing
  • timeout (int): Maximum time in seconds to wait for inference (default: 300)
  • debug (bool): Whether to print detailed debug information (default: False)

Returns:

  • Dictionary mapping image paths to captions
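
A single-image call uses image_path instead; the result is still a dictionary keyed by the path that was passed in (hypothetical path):

caption = model.caption(image_path="photos/sunset.jpg")
print(caption["photos/sunset.jpg"])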

model.reason(image_path, prompt, timeout=300, debug=False)

Performs multimodal reasoning on the provided image with the given prompt.

Parameters:

  • image_path (str): Path to an image
  • prompt (str): Reasoning prompt or instruction
  • timeout (int): Maximum time in seconds to wait for inference (default: 300)
  • debug (bool): Whether to print detailed debug information (default: False)

Returns:

  • String containing the reasoning result
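
Since reason() returns a plain string rather than a dictionary, the result can be printed or logged directly (hypothetical path and prompt):

explanation = model.reason(
    image_path="photos/kitchen.jpg",
    prompt="What safety hazards are visible, and how could they be addressed?",
    timeout=180,  # allow up to 3 minutes for the longer generation
)
print(explanation)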

model.install_nvidia_pytorch()

Explicitly installs NVIDIA’s PyTorch wheel for optimal performance on Jetson devices.

Returns:

  • bool: True if installation was successful, False otherwise
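
Because the method returns a bool, callers can branch on the outcome (a minimal sketch; the method targets Jetson devices, and its behavior elsewhere is not documented):

# Intended for Jetson devices; behavior on other hardware is not documented
if model.install_nvidia_pytorch():
    print("NVIDIA PyTorch wheel installed successfully.")
else:
    print("Installation failed; continuing with the current PyTorch build.")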

Example Output

Visual Question Answering

{
  "path/to/image.jpg": "A red car parked next to a blue building."
}

Image Captioning

{
  "path/to/image1.jpg": "A group of people sitting around a table having dinner.",
  "path/to/image2.jpg": "A mountain landscape with snow-capped peaks and a clear blue sky."
}

Multimodal Reasoning

The image shows a bustling city street during rush hour. There are several cars and buses on the road, with pedestrians crossing at a crosswalk. The buildings on either side appear to be commercial, with shops and restaurants on the ground floor. The weather seems clear with good visibility.

Given the context of a busy rush hour, it's likely that traffic will continue to be heavy for some time. The pedestrians crossing the street will reach the other side and continue to their destinations. Some people might enter the shops or restaurants visible in the image. The public transportation vehicles will stop to pick up and drop off passengers. As time progresses, if this is an evening rush hour, the street lights might turn on as it gets darker.

Visual Feedback

The InternVL2.5 model provides rich visual feedback during execution:

✨ EXLA SDK - InternVL2.5 Model ✨
🔍 Device Detected: AGX_ORIN

📊 Initial System Resources:
📊 Resource Monitor - NVIDIA Jetson AGX Orin

⠏ Initializing Exla Optimized InternVL2.5 model for AGX_ORIN [GPU Mode]
✓ Initializing Exla Optimized InternVL2.5 model for AGX_ORIN [GPU Mode]

🚀 Running Visual Question Answering
✓ Processed image
⠋ Loading InternVL2.5 model
Using GPU: Orin
✓ Model ready on CUDA
⠋ Running VQA inference
✓ Inference completed successfully
✓ Processing results

✨ InternVL2.5 Inference Summary:
   • Model: internvl/internvl2-5b
   • Device: CUDA
   • Images processed: 1