SAM 2
Segment Anything 2 model optimized for Jetson and other devices
The Segment Anything 2 (SAM2) model is a foundation model for promptable visual segmentation in images and videos. This implementation provides an optimized version of SAM2 for NVIDIA Jetson devices.
Features
- TensorRT Optimization: Accelerated inference using NVIDIA TensorRT
- Multiple Input Types: Support for image files, video files, and numpy arrays
- Custom Prompts: Specify points or boxes to guide segmentation
- Camera Stream Support: Process live video from cameras or RTSP streams
Installation
The SAM2 model is included in the EXLA SDK. Before using it, you need to download the model weights.
Basic Usage
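A minimal sketch is shown below. The import path (`exla.models.sam2`), the `sam2()` factory, and the `inference()` method are assumptions used for illustration here and throughout this page; check the EXLA SDK reference for the exact names.

```python
# Minimal sketch -- import path, factory, and method names are assumed,
# not confirmed by the EXLA SDK reference.
from exla.models.sam2 import sam2

model = sam2()  # loads the TensorRT-optimized SAM2 model

# Segment an image file using a single foreground point prompt
# (the prompt format is also an assumption; see the Prompts section below).
result = model.inference(
    "data/truck.jpg",                                  # illustrative path
    prompts={"points": [[500, 375]], "labels": [1]},
)

print(result["status"])           # success or error status
print(result["processing_time"])  # inference time in seconds
```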
Input Types
SAM2 supports multiple input types:
File Paths
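Image and video files can be passed by path. The calls below reuse the assumed API from the Basic Usage sketch; the file names are placeholders.

```python
from exla.models.sam2 import sam2  # assumed import path

model = sam2()

# Image file
image_result = model.inference("data/truck.jpg")

# Video file -- each frame is segmented
video_result = model.inference("data/f1_clip.mp4")
```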
Numpy Arrays
You can directly pass numpy arrays to the model:
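For example, a frame loaded with OpenCV can be passed as-is (the `inference()` call is the same assumed API as above):

```python
import cv2
from exla.models.sam2 import sam2  # assumed import path

model = sam2()

# Load an image into a numpy array (OpenCV returns an HxWx3 BGR array)
frame = cv2.imread("data/truck.jpg")

# Pass the array directly -- no need to write it to disk first
result = model.inference(frame, prompts={"points": [[500, 375]], "labels": [1]})
```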
Prompts
You can guide the segmentation by providing custom prompts:
Point Prompts
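Point prompts are pixel coordinates paired with labels (1 for foreground, 0 for background). The dictionary layout below is an assumption used for illustration:

```python
from exla.models.sam2 import sam2  # assumed import path

model = sam2()

# Two points: one foreground (label 1), one background (label 0).
# The prompt dictionary layout is an assumption.
prompts = {
    "points": [[500, 375], [1125, 625]],
    "labels": [1, 0],
}
result = model.inference("data/truck.jpg", prompts=prompts)
```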
Box Prompts
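Box prompts specify a bounding box around the object of interest. The layout below is likewise an assumption:

```python
from exla.models.sam2 import sam2  # assumed import path

model = sam2()

# One box per object: [x_min, y_min, x_max, y_max] in pixel coordinates.
# The prompt dictionary layout is an assumption.
prompts = {"boxes": [[425, 600, 700, 875]]}
result = model.inference("data/f1_clip.mp4", prompts=prompts)
```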
Processing Results
The model returns a dictionary with the following keys:
- status: Success or error status
- processing_time: Time taken for inference (in seconds)
- masks: List of segmentation masks (when using numpy arrays as input)
When using numpy arrays as input, you can process the masks directly:
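The sketch below post-processes the returned masks with OpenCV and numpy. The `"success"` status string and the 0.5 threshold are assumptions; the 256x256 mask size and the output file names come from this page.

```python
import os

import cv2
import numpy as np

from exla.models.sam2 import sam2  # assumed import path

model = sam2()
frame = cv2.imread("data/truck.jpg")
result = model.inference(frame, prompts={"points": [[500, 375]], "labels": [1]})

os.makedirs("data/output_numpy", exist_ok=True)

if result["status"] == "success":  # exact status value is an assumption
    for i, mask in enumerate(result["masks"]):
        mask = np.asarray(mask, dtype=np.float32)

        # Masks come back at a fixed 256x256 resolution; resize them to the
        # original frame size before using them.
        resized = cv2.resize(
            mask,
            (frame.shape[1], frame.shape[0]),
            interpolation=cv2.INTER_LINEAR,
        )

        # Binarize (the 0.5 threshold assumes mask values in [0, 1]).
        original_bin = (mask > 0.5).astype(np.uint8) * 255
        resized_bin = (resized > 0.5).astype(np.uint8) * 255

        cv2.imwrite(f"data/output_numpy/mask_{i}_original.png", original_bin)
        cv2.imwrite(f"data/output_numpy/mask_{i}_resized.png", resized_bin)
```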
Example: Visualizing Masks
Here’s how to create a visualization by overlaying a mask on the original image:
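One way to do this with OpenCV (a sketch; the `overlay_mask` helper is written for this example and is not part of the SDK, and `result` is the output of the previous sketch):

```python
import cv2
import numpy as np


def overlay_mask(image, mask, color=(0, 0, 255), alpha=0.5):
    """Blend a segmentation mask onto an image (helper for this example)."""
    # Resize the 256x256 mask to the image size and binarize it.
    mask = cv2.resize(
        np.asarray(mask, dtype=np.float32),
        (image.shape[1], image.shape[0]),
        interpolation=cv2.INTER_LINEAR,
    ) > 0.5

    painted = image.copy()
    painted[mask] = color  # paint masked pixels with a solid color
    return cv2.addWeighted(painted, alpha, image, 1 - alpha, 0)


image = cv2.imread("data/truck.jpg")               # illustrative path
blended = overlay_mask(image, result["masks"][0])  # `result` from the previous sketch
cv2.imwrite("data/output_numpy/overlay_0.png", blended)
```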
Complete Example
The EXLA SDK includes a complete example script, examples/sam2/example_sam2.py, that demonstrates:
- Using point prompts with an image file
- Using box prompts with a video file
- Using numpy arrays as input with point prompts
Running the Example
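Run the script from the SDK root (it may accept additional arguments not listed here):

```bash
python examples/sam2/example_sam2.py
```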
Example Output
The example script generates the following outputs:
- For image processing: Segmentation masks and overlays in data/output_truck/
- For video processing: Segmented video and frames in data/output_f1/
- For numpy array processing:
  - Original masks: data/output_numpy/mask_*_original.png
  - Resized masks: data/output_numpy/mask_*_resized.png
  - Overlay images: data/output_numpy/overlay_*.png
Camera Stream Processing
Process live video from a camera or RTSP stream:
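The SDK's camera/RTSP interface is not shown on this page; one approach is to read frames with OpenCV and pass them to the model as numpy arrays (the `inference()` call is the same assumed API as above):

```python
import cv2

from exla.models.sam2 import sam2  # assumed import path

model = sam2()

# Local camera index 0, or an RTSP URL such as "rtsp://<camera-address>/stream"
cap = cv2.VideoCapture(0)

try:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break

        # Each frame is a numpy array, so it can be passed straight to the model
        result = model.inference(frame)

        # ... resize result["masks"] to the frame size and overlay as shown above ...
finally:
    cap.release()
```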
Advanced Usage
For advanced usage, you can get direct access to the underlying model:
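The attribute name below is a guess for illustration only; the SDK may expose the underlying predictor differently.

```python
from exla.models.sam2 import sam2  # assumed import path

model = sam2()

# Hypothetical accessor -- the attribute name is an assumption.
predictor = model.model
print(type(predictor))
```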
Performance Considerations
- The model returns masks at a fixed size (256x256), which need to be resized to match the original image dimensions
- For best performance, use appropriate image sizes (large images may require more memory and processing time)
- When processing videos, consider the frame rate and resolution to balance quality and performance