InternVL2.5
InternVL2.5 is a multimodal vision-language model optimized for Jetson and other devices.
InternVL2.5 Model
InternVL2.5 is a powerful multimodal vision-language model that connects visual and textual understanding. It can process both images and text, enabling capabilities like visual question answering, image captioning, and multimodal reasoning.
Features
- Hardware-Optimized: Automatically detects your hardware and uses the appropriate implementation:
  - Jetson Devices (Orin Nano, AGX Orin): GPU-accelerated implementation
  - Standard GPUs: GPU-accelerated implementation with CUDA
  - Apple Silicon: MPS (Metal Performance Shaders) acceleration for macOS devices
  - CPU-only Systems: Optimized CPU implementation
- Real-time Progress Indicators: Visual feedback with rotating spinners and timing information
- Resource Monitoring: Built-in monitoring of system resources (memory usage, GPU utilization)
- Multiple Tasks: Support for visual question answering, image captioning, and multimodal reasoning
Installation
The InternVL2.5 model is included in the Exla SDK. No separate installation is required.
Basic Usage
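A minimal sketch of loading the model and running visual question answering. The import path and image file names are assumptions; adjust them to match your Exla SDK installation.

```python
from exla.models.internvl2_5 import internvl2_5  # hypothetical import path

# The factory returns the implementation that matches the detected hardware
# (Jetson, CUDA GPU, Apple Silicon MPS, or CPU-only).
model = internvl2_5()

# Visual question answering on a single image; returns a dict keyed by image path.
answers = model.vqa(image_path="images/dog.jpg", question="What is the dog doing?")
print(answers["images/dog.jpg"])
```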
Advanced Usage
Image Captioning
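A sketch of single-image captioning, assuming the hypothetical import from Basic Usage; pass image_paths=[...] instead to caption several images at once.

```python
from exla.models.internvl2_5 import internvl2_5  # hypothetical import path

model = internvl2_5()

# Generate a caption for one image; returns a dict keyed by image path.
captions = model.caption(image_path="images/sunset.jpg")
print(captions["images/sunset.jpg"])
```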
Multimodal Reasoning
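A sketch of multimodal reasoning, again assuming the hypothetical import from Basic Usage. Unlike vqa and caption, reason takes a single image plus a free-form prompt and returns a string rather than a dictionary.

```python
from exla.models.internvl2_5 import internvl2_5  # hypothetical import path

model = internvl2_5()

# Free-form reasoning over one image; the result is a plain string.
result = model.reason(
    image_path="images/chart.png",
    prompt="Summarize the trend in this chart and suggest a likely cause.",
)
print(result)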
Processing Multiple Images
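For batch processing, pass image_paths instead of image_path; the returned dictionary maps each input path to its result. A sketch with placeholder file names:

```python
from exla.models.internvl2_5 import internvl2_5  # hypothetical import path

model = internvl2_5()

# Ask the same question about several images in one call.
image_paths = ["images/cat.jpg", "images/dog.jpg", "images/bird.jpg"]
answers = model.vqa(image_paths=image_paths, question="What animal is shown?")
for path, answer in answers.items():
    print(f"{path}: {answer}")
```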
API Reference
internvl2_5()
Factory function that returns the appropriate InternVL2.5 model based on the detected hardware.
Returns:
- An InternVL2.5 model instance optimized for the detected hardware
model.vqa(image_path=None, image_paths=None, question=None, timeout=300, debug=False)
Runs visual question answering on the provided image(s) with the given question.
Parameters:
- image_path (str): Path to a single image
- image_paths (list): List of image paths for batch processing
- question (str): Question to ask about the image(s)
- timeout (int): Maximum time in seconds to wait for inference (default: 300)
- debug (bool): Whether to print detailed debug information (default: False)
Returns:
- Dictionary mapping image paths to answers
model.caption(image_path=None, image_paths=None, timeout=300, debug=False)
Generates captions for the provided image(s).
Parameters:
- image_path (str): Path to a single image
- image_paths (list): List of image paths for batch processing
- timeout (int): Maximum time in seconds to wait for inference (default: 300)
- debug (bool): Whether to print detailed debug information (default: False)
Returns:
- Dictionary mapping image paths to captions
model.reason(image_path, prompt, timeout=300, debug=False)
Performs multimodal reasoning on the provided image with the given prompt.
Parameters:
- image_path (str): Path to an image
- prompt (str): Reasoning prompt or instruction
- timeout (int): Maximum time in seconds to wait for inference (default: 300)
- debug (bool): Whether to print detailed debug information (default: False)
Returns:
- String containing the reasoning result
model.install_nvidia_pytorch()
Explicitly installs NVIDIA’s PyTorch wheel for optimal performance on Jetson devices.
Returns:
- bool: True if installation was successful, False otherwise
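A sketch of when you might call this on a Jetson device, assuming the hypothetical import from Basic Usage:

```python
from exla.models.internvl2_5 import internvl2_5  # hypothetical import path

model = internvl2_5()

# Optional one-time setup on Jetson: install NVIDIA's PyTorch wheel
# for optimal performance before running inference.
if model.install_nvidia_pytorch():
    print("NVIDIA PyTorch installed successfully")
else:
    print("Installation failed; continuing with the existing PyTorch build")
```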
Example Output
Visual Question Answering
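Illustratively, vqa returns a dictionary mapping each image path to its answer; the actual text will vary with the model version and image:

```python
{"images/dog.jpg": "The dog is running across a grassy field."}
```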
Image Captioning
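caption returns the same shape, with captions as values (illustrative):

```python
{"images/sunset.jpg": "A vivid orange sunset over a calm ocean."}
```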
Multimodal Reasoning
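reason returns a plain string (illustrative):

```python
"The chart shows a steady upward trend across all quarters, most likely driven by ..."
```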
Visual Feedback
The InternVL2.5 model provides rich visual feedback during execution:
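The exact format depends on the SDK version; a hypothetical sketch of the spinner and timing output described in Features:

```
⠋ Detecting hardware... done (0.3s)
⠙ Loading InternVL2.5 model... done (14.2s)
⠹ Running inference on images/dog.jpg... done (2.8s)
Memory: 6.1 GB used | GPU utilization: 74%
```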