Whisper Model

Whisper is OpenAI’s robust speech recognition and transcription model. With InferX, you can run Whisper with the same API on any device, from edge hardware to powerful servers.

Features

  • Universal Speech Recognition: Transcribe audio in multiple languages
  • Real-time Processing: Optimized for live audio streams
  • Cross-Platform: Same code works on Jetson, GPU, or CPU
  • Multiple Languages: Support for 99+ languages
  • Noise Robust: Works well with noisy audio

Installation

Whisper is included with InferX:
pip install git+https://github.com/exla-ai/InferX.git

Basic Usage

from inferx.models.whisper import whisper

# Initialize the model (automatically detects your hardware)
model = whisper()

# Transcribe audio file
result = model.inference(audio_path="path/to/audio.wav")

# Print transcription
print(f"Transcription: {result['text']}")
print(f"Language: {result['language']}")
print(f"Confidence: {result['confidence']}")

Advanced Usage

Real-time Audio Processing

from inferx.models.whisper import whisper
import pyaudio
import numpy as np

model = whisper()

# Setup audio capture
chunk_size = 1024
sample_rate = 16000
audio = pyaudio.PyAudio()

stream = audio.open(
    format=pyaudio.paFloat32,
    channels=1,
    rate=sample_rate,
    input=True,
    frames_per_buffer=chunk_size
)

print("Listening... Press Ctrl+C to stop")

try:
    buffer = []
    while True:
        # Read one ~64 ms chunk and append it to the rolling buffer
        audio_chunk = stream.read(chunk_size)
        buffer.append(np.frombuffer(audio_chunk, dtype=np.float32))

        # Transcribe roughly every 5 seconds of audio; a single
        # 1024-sample chunk is too short to yield meaningful text
        if len(buffer) * chunk_size >= sample_rate * 5:
            audio_data = np.concatenate(buffer)
            buffer = []

            result = model.inference(audio_data=audio_data)

            if result['text'].strip():
                print(f"Transcription: {result['text']}")

except KeyboardInterrupt:
    print("Stopping...")
finally:
    stream.stop_stream()
    stream.close()
    audio.terminate()

Batch Processing

import os

from inferx.models.whisper import whisper

model = whisper()

# Make sure the output directory exists before writing transcripts
os.makedirs("transcripts", exist_ok=True)

# Process multiple audio files
audio_files = [f for f in os.listdir("audio_folder/") if f.endswith(('.wav', '.mp3', '.m4a'))]

for audio_file in audio_files:
    result = model.inference(audio_path=f"audio_folder/{audio_file}")
    
    # Save the transcription, swapping the audio extension for .txt
    transcript_file = os.path.splitext(audio_file)[0] + '.txt'
    with open(f"transcripts/{transcript_file}", 'w') as f:
        f.write(result['text'])
    
    print(f"Processed: {audio_file} -> {transcript_file}")

Language-Specific Transcription

# Force specific language
result = model.inference(
    audio_path="spanish_audio.wav",
    language="es"  # Spanish
)

# Auto-detect language
result = model.inference(
    audio_path="unknown_language.wav",
    detect_language=True
)

print(f"Detected language: {result['language']}")

Performance

InferX optimizes Whisper for your hardware:
Hardware          Real-time Factor   Memory Usage
Jetson AGX Orin   ~0.3x              ~2GB
RTX 4090          ~0.1x              ~3GB
Intel i7 CPU      ~0.8x              ~1GB

Real-time factor: 0.3x means 1 minute of audio processes in ~18 seconds.
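The real-time factor maps directly to expected processing time. A minimal sketch of the arithmetic (this helper is illustrative, not part of the InferX API):

```python
def processing_time(audio_seconds: float, real_time_factor: float) -> float:
    """Estimated wall-clock time to transcribe a clip."""
    return audio_seconds * real_time_factor

# 60 s of audio on a Jetson AGX Orin (RTF ~0.3x) takes about 18 s
print(processing_time(60, 0.3))  # 18.0
```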

Response Format

{
    'text': str,           # Transcribed text
    'language': str,       # Detected/specified language code
    'confidence': float,   # Overall confidence score (0-1)
    'segments': [          # Word-level timestamps
        {
            'start': float,    # Start time in seconds
            'end': float,      # End time in seconds
            'text': str,       # Text segment
            'confidence': float
        }
    ]
}
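The segments list lends itself to time-aligned output such as subtitles. A hedged sketch that assumes only the response shape documented above (the `mock` data is made up for illustration):

```python
def to_srt_timestamp(seconds):
    # Convert seconds to the HH:MM:SS,mmm form used by SRT subtitles
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    millis = int(round((seconds - int(seconds)) * 1000))
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def segments_to_srt(segments):
    # Render each segment as one numbered SRT block
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{to_srt_timestamp(seg['start'])} --> "
            f"{to_srt_timestamp(seg['end'])}\n{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Example with a mock result in the format documented above
mock = [{'start': 0.0, 'end': 2.5, 'text': 'Hello world', 'confidence': 0.97}]
print(segments_to_srt(mock))
```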

Supported Audio Formats

  • WAV: Uncompressed audio (recommended)
  • MP3: MPEG audio
  • M4A: AAC audio
  • FLAC: Lossless audio
  • OGG: Ogg Vorbis
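For audio stored in other containers (video files, for example), one common approach is to convert to 16 kHz mono WAV before inference, assuming ffmpeg is installed. This helper only builds the command; the filenames are hypothetical:

```python
def to_wav_command(src, dst):
    # ffmpeg command converting any input to 16 kHz mono WAV:
    # -y overwrites the output, -ar sets sample rate, -ac sets channel count
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst]

cmd = to_wav_command("talk.mp4", "talk.wav")
print(" ".join(cmd))
```

To actually perform the conversion, run the command with `subprocess.run(cmd, check=True)`.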

Example Applications

Meeting Transcription

from inferx.models.whisper import whisper

model = whisper()

# Transcribe meeting recording
result = model.inference(
    audio_path="meeting_recording.wav",
    segment_timestamps=True
)

# Generate meeting notes with timestamps
with open("meeting_notes.txt", "w") as f:
    f.write(f"Meeting Transcription\n")
    f.write(f"Language: {result['language']}\n\n")
    
    for segment in result['segments']:
        timestamp = f"[{segment['start']:.1f}s - {segment['end']:.1f}s]"
        f.write(f"{timestamp}: {segment['text']}\n")

Voice Commands

import numpy as np
import speech_recognition as sr

from inferx.models.whisper import whisper

model = whisper()

def process_voice_command(audio_data):
    result = model.inference(audio_data=audio_data)
    command = result['text'].lower().strip()
    
    if 'turn on lights' in command:
        return 'lights_on'
    elif 'play music' in command:
        return 'music_play'
    elif 'what time' in command:
        return 'time_query'
    else:
        return 'unknown'

# Capture a single utterance from the microphone
recognizer = sr.Recognizer()
microphone = sr.Microphone(sample_rate=16000)

with microphone as source:
    audio = recognizer.listen(source)

# Convert the 16-bit PCM bytes to the float32 array the model expects
samples = np.frombuffer(audio.get_raw_data(), dtype=np.int16)
audio_float = samples.astype(np.float32) / 32768.0

command = process_voice_command(audio_float)
print(f"Command: {command}")

Hardware Detection

When the model is initialized, InferX reports the hardware it detected:

✨ InferX - Whisper Model ✨
🔍 Device Detected: AGX_ORIN
⠏ [0.8s] Loading Whisper model
✓ [1.0s] Ready for speech recognition

Next Steps