Whisper Model

Whisper is OpenAI’s robust speech recognition and transcription model. With InferX, you can run Whisper on any device using the same API, from edge devices to powerful servers.

Features

  • Universal Speech Recognition: Transcribe audio in multiple languages
  • Real-time Processing: Optimized for live audio streams
  • Cross-Platform: Same code works on Jetson, GPU, or CPU
  • Multiple Languages: Support for 99+ languages
  • Noise Robust: Works well with noisy audio

Installation

Whisper is included with InferX:

pip install git+https://github.com/exla-ai/InferX.git

Basic Usage

from inferx.models.whisper import whisper

# Initialize the model (automatically detects your hardware)
model = whisper()

# Transcribe audio file
result = model.inference(audio_path="path/to/audio.wav")

# Print transcription
print(f"Transcription: {result['text']}")
print(f"Language: {result['language']}")
print(f"Confidence: {result['confidence']}")

Advanced Usage

Real-time Audio Processing

from inferx.models.whisper import whisper
import pyaudio
import numpy as np

model = whisper()

# Set up audio capture
chunk_size = 1024
sample_rate = 16000
buffer_seconds = 5  # accumulate ~5 s of audio per transcription pass
audio = pyaudio.PyAudio()

stream = audio.open(
    format=pyaudio.paFloat32,
    channels=1,
    rate=sample_rate,
    input=True,
    frames_per_buffer=chunk_size
)

print("Listening... Press Ctrl+C to stop")

try:
    frames = []
    while True:
        # Read one ~64 ms chunk and add it to the buffer
        audio_chunk = stream.read(chunk_size)
        frames.append(np.frombuffer(audio_chunk, dtype=np.float32))

        # Very short chunks give Whisper too little context, so
        # transcribe only once enough audio has accumulated
        if len(frames) * chunk_size >= buffer_seconds * sample_rate:
            audio_data = np.concatenate(frames)
            frames = []

            result = model.inference(audio_data=audio_data)
            if result['text'].strip():
                print(f"Transcription: {result['text']}")
            
except KeyboardInterrupt:
    print("Stopping...")
finally:
    stream.stop_stream()
    stream.close()
    audio.terminate()

Batch Processing

import os

model = whisper()

# Process every audio file in a folder
audio_dir = "audio_folder"
os.makedirs("transcripts", exist_ok=True)  # ensure the output folder exists
audio_files = [f for f in os.listdir(audio_dir) if f.endswith(('.wav', '.mp3', '.m4a'))]

for audio_file in audio_files:
    result = model.inference(audio_path=os.path.join(audio_dir, audio_file))

    # Save the transcription under the original filename
    transcript_file = os.path.splitext(audio_file)[0] + '.txt'
    with open(os.path.join("transcripts", transcript_file), 'w') as f:
        f.write(result['text'])

    print(f"Processed: {audio_file} -> {transcript_file}")

Language-Specific Transcription

# Force specific language
result = model.inference(
    audio_path="spanish_audio.wav",
    language="es"  # Spanish
)

# Auto-detect language
result = model.inference(
    audio_path="unknown_language.wav",
    detect_language=True
)

print(f"Detected language: {result['language']}")

Performance

InferX optimizes Whisper for your hardware:

Hardware           Real-time Factor   Memory Usage
Jetson AGX Orin    ~0.3x              ~2GB
RTX 4090           ~0.1x              ~3GB
Intel i7 CPU       ~0.8x              ~1GB

A real-time factor of 0.3x means that 1 minute of audio is processed in ~18 seconds.
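
The relationship is simple multiplication, so you can estimate throughput for your own workloads directly from the table above. A minimal sketch:

# Estimate processing time from the real-time factor (RTF)
def estimated_processing_seconds(audio_seconds: float, rtf: float) -> float:
    return audio_seconds * rtf

# 1 minute of audio on a Jetson AGX Orin (RTF ~0.3x)
print(estimated_processing_seconds(60, 0.3))  # -> 18.0 seconds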

Response Format

{
    'text': str,           # Transcribed text
    'language': str,       # Detected/specified language code
    'confidence': float,   # Overall confidence score (0-1)
    'segments': [          # Timestamped segments
        {
            'start': float,    # Start time in seconds
            'end': float,      # End time in seconds
            'text': str,       # Text segment
            'confidence': float
        }
    ]
}
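
The segment list makes it easy to post-process a transcript. For example, here is a small sketch that keeps only higher-confidence segments (the 0.8 threshold is an arbitrary choice for illustration):

result = model.inference(audio_path="path/to/audio.wav")

# Keep only segments the model is reasonably confident about
reliable = [s['text'] for s in result['segments'] if s['confidence'] >= 0.8]
print(" ".join(reliable))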

Supported Audio Formats

  • WAV: Uncompressed audio (recommended)
  • MP3: MPEG audio
  • M4A: AAC audio
  • FLAC: Lossless audio
  • OGG: Ogg Vorbis
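
If your recordings are in a container that is not listed, one option is to convert them to WAV first, for example with ffmpeg (assuming it is installed on your system):

import subprocess

# Convert to 16 kHz mono WAV, a safe input format for speech models
subprocess.run(
    ["ffmpeg", "-i", "recording.webm", "-ar", "16000", "-ac", "1", "recording.wav"],
    check=True,
)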

Example Applications

Meeting Transcription

from inferx.models.whisper import whisper

model = whisper()

# Transcribe meeting recording
result = model.inference(
    audio_path="meeting_recording.wav",
    segment_timestamps=True
)

# Generate meeting notes with timestamps
with open("meeting_notes.txt", "w") as f:
    f.write(f"Meeting Transcription\n")
    f.write(f"Language: {result['language']}\n\n")
    
    for segment in result['segments']:
        timestamp = f"[{segment['start']:.1f}s - {segment['end']:.1f}s]"
        f.write(f"{timestamp}: {segment['text']}\n")

Voice Commands

import numpy as np
import speech_recognition as sr

from inferx.models.whisper import whisper

model = whisper()

def process_voice_command(audio_data):
    result = model.inference(audio_data=audio_data)
    command = result['text'].lower().strip()
    
    if 'turn on lights' in command:
        return 'lights_on'
    elif 'play music' in command:
        return 'music_play'
    elif 'what time' in command:
        return 'time_query'
    else:
        return 'unknown'

# Capture a single phrase from the microphone
recognizer = sr.Recognizer()
microphone = sr.Microphone(sample_rate=16000)

with microphone as source:
    audio = recognizer.listen(source)

# get_raw_data() returns 16-bit PCM bytes; convert to float32 samples
samples = np.frombuffer(audio.get_raw_data(), dtype=np.int16)
audio_data = samples.astype(np.float32) / 32768.0

command = process_voice_command(audio_data)
print(f"Command: {command}")

Hardware Detection

When you initialize the model, InferX detects your hardware automatically and prints a short status log:
✨ InferX - Whisper Model ✨
🔍 Device Detected: AGX_ORIN
⠏ [0.8s] Loading Whisper model
✓ [1.0s] Ready for speech recognition

Next Steps
