InternVL2.5 Model
InternVL2.5 is a powerful multimodal vision-language model that connects visual and textual understanding. It can process both images and text, enabling capabilities like visual question answering, image captioning, and multimodal reasoning.
Features
- Hardware-Optimized: Automatically detects your hardware and uses the appropriate implementation:
  - Jetson Devices (Orin Nano, AGX Orin): GPU-accelerated implementation
  - Standard GPUs: GPU-accelerated implementation with CUDA
  - Apple Silicon: MPS (Metal Performance Shaders) acceleration for macOS devices
  - CPU-only Systems: Optimized CPU implementation
- Real-time Progress Indicators: Visual feedback with rotating spinners and timing information
- Resource Monitoring: Built-in monitoring of system resources (memory usage, GPU utilization)
- Multiple Tasks: Support for visual question answering, image captioning, and multimodal reasoning
Installation
The InternVL2.5 model is included with InferX:
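No separate package is required; assuming the InferX distribution is named inferx (an assumption, adjust to your setup):

```bash
# Hypothetical package name; InternVL2.5 ships inside InferX itself.
pip install inferx
```

Basic Usage

A minimal sketch of loading the model and running visual question answering. The import path and image file name are assumptions; only the internvl2_5() factory and the vqa() signature come from the API reference below:

```python
# Import path is an assumption; it may differ in your InferX version.
from inferx import internvl2_5

# The factory detects your hardware (Jetson, CUDA GPU, Apple Silicon, or CPU)
# and returns an appropriately optimized InternVL2.5 instance.
model = internvl2_5()

# vqa() returns a dictionary mapping image paths to answers.
answers = model.vqa(
    image_path="images/dog.jpg",  # hypothetical example image
    question="What animal is in this picture?",
)
print(answers["images/dog.jpg"])
```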
Advanced Usage
Image Captioning
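A captioning sketch, continuing from the setup in Basic Usage; caption() mirrors vqa() but takes no question:

```python
# caption() returns a dictionary mapping image paths to captions.
captions = model.caption(image_path="images/dog.jpg")
print(captions["images/dog.jpg"])
```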
Multimodal Reasoning
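A reasoning sketch, continuing from the same setup; unlike vqa() and caption(), reason() takes a single image plus a free-form prompt and returns a plain string:

```python
# The image path and prompt are hypothetical examples.
answer = model.reason(
    image_path="images/chart.png",
    prompt="Describe the trend in this chart and suggest one possible cause.",
)
print(answer)
```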
Processing Multiple Images
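Both vqa() and caption() accept an image_paths list for batch processing; a sketch with hypothetical file names:

```python
paths = ["images/dog.jpg", "images/cat.jpg"]

# Batch VQA: one question applied to every image, answers keyed by path.
answers = model.vqa(image_paths=paths, question="What animal is in this picture?")
for path, answer in answers.items():
    print(f"{path}: {answer}")

# Batch captioning works the same way.
captions = model.caption(image_paths=paths)
```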
API Reference
internvl2_5()
Factory function that returns the appropriate InternVL2.5 model based on the detected hardware.
Returns:
- An InternVL2.5 model instance optimized for the detected hardware
model.vqa(image_path=None, image_paths=None, question=None, timeout=300, debug=False)
Runs visual question answering on the provided image(s) with the given question.
Parameters:
- image_path (str): Path to a single image
- image_paths (list): List of image paths for batch processing
- question (str): Question to ask about the image(s)
- timeout (int): Maximum time in seconds to wait for inference (default: 300)
- debug (bool): Whether to print detailed debug information (default: False)
Returns:
- Dictionary mapping image paths to answers
model.caption(image_path=None, image_paths=None, timeout=300, debug=False)
Generates captions for the provided image(s).
Parameters:
- image_path (str): Path to a single image
- image_paths (list): List of image paths for batch processing
- timeout (int): Maximum time in seconds to wait for inference (default: 300)
- debug (bool): Whether to print detailed debug information (default: False)
Returns:
- Dictionary mapping image paths to captions
model.reason(image_path, prompt, timeout=300, debug=False)
Performs multimodal reasoning on the provided image with the given prompt.
Parameters:
- image_path (str): Path to an image
- prompt (str): Reasoning prompt or instruction
- timeout (int): Maximum time in seconds to wait for inference (default: 300)
- debug (bool): Whether to print detailed debug information (default: False)
Returns:
- String containing the reasoning result
model.install_nvidia_pytorch()
Explicitly installs NVIDIA’s PyTorch wheel for optimal performance on Jetson devices.
Returns:
- bool: True if installation was successful, False otherwise
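A sketch of opting in explicitly on a Jetson device, continuing from the Basic Usage setup:

```python
# Install NVIDIA's PyTorch wheel before running inference on Jetson hardware.
if model.install_nvidia_pytorch():
    print("NVIDIA PyTorch wheel installed.")
else:
    print("Installation failed; continuing with the default PyTorch build.")
```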
Example Output
Visual Question Answering
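Answers vary by model version and hardware; an illustrative result in the documented return shape (a dictionary mapping image paths to answers), with hypothetical content:

```python
{'images/dog.jpg': 'The image shows a golden retriever sitting on the grass.'}
```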
Image Captioning
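Likewise illustrative, in the documented dictionary shape for captions:

```python
{'images/dog.jpg': 'A golden retriever sitting on a sunlit lawn.'}
```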
Multimodal Reasoning
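reason() returns a plain string; an illustrative result for the chart prompt used above:

```python
'The chart shows a steady upward trend from January to June, which may reflect seasonal demand.'
```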
Visual Feedback
The InternVL2.5 model provides rich visual feedback during execution:
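The exact console format depends on the InferX version; an illustrative sketch of the spinner, timing, and resource monitoring described under Features:

```
⠹ Running InternVL2.5 inference... 12.4s
✓ Inference complete (14.1s) | Memory: 6.2 GB | GPU: 74%
```

Next Steps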
- Try the CLIP model for simpler image-text matching
- Explore practical examples
- Learn about custom model optimization