Whisper Model
Whisper is OpenAI's robust speech recognition and transcription model. With InferX, you can run Whisper on any device using the same API, from edge devices to powerful servers.
Features
- Universal Speech Recognition: Transcribe audio in multiple languages
- Real-time Processing: Optimized for live audio streams
- Cross-Platform: Same code works on Jetson, GPU, or CPU
- Multiple Languages: Support for 99+ languages
- Noise Robust: Works well with noisy audio
Installation
Whisper is included with InferX:
pip install git+https://github.com/exla-ai/InferX.git
Basic Usage
from inferx.models.whisper import whisper
# Initialize the model (automatically detects your hardware)
model = whisper()
# Transcribe audio file
result = model.inference(audio_path="path/to/audio.wav")
# Print transcription
print(f"Transcription: {result['text']}")
print(f"Language: {result['language']}")
print(f"Confidence: {result['confidence']}")
Advanced Usage
Real-time Audio Processing
from inferx.models.whisper import whisper
import pyaudio
import numpy as np
model = whisper()

# Setup audio capture
chunk_size = 1024
sample_rate = 16000
audio = pyaudio.PyAudio()
stream = audio.open(
    format=pyaudio.paFloat32,
    channels=1,
    rate=sample_rate,
    input=True,
    frames_per_buffer=chunk_size
)

print("Listening... Press Ctrl+C to stop")
try:
    buffer = []
    while True:
        # Read one chunk (~64 ms at 16 kHz) and accumulate it
        audio_chunk = stream.read(chunk_size)
        buffer.append(np.frombuffer(audio_chunk, dtype=np.float32))
        # Transcribe once roughly 3 seconds of audio have accumulated;
        # single 64 ms chunks are too short to transcribe usefully
        if len(buffer) * chunk_size >= sample_rate * 3:
            audio_data = np.concatenate(buffer)
            buffer = []
            result = model.inference(audio_data=audio_data)
            if result['text'].strip():
                print(f"Transcription: {result['text']}")
except KeyboardInterrupt:
    print("Stopping...")
finally:
    stream.stop_stream()
    stream.close()
    audio.terminate()
Batch Processing
import os
from inferx.models.whisper import whisper

model = whisper()

# Process multiple audio files
audio_files = [f for f in os.listdir("audio_folder/") if f.endswith(('.wav', '.mp3', '.m4a'))]
os.makedirs("transcripts", exist_ok=True)
for audio_file in audio_files:
    result = model.inference(audio_path=f"audio_folder/{audio_file}")
    # Save transcription under the same base filename
    transcript_file = os.path.splitext(audio_file)[0] + '.txt'
    with open(f"transcripts/{transcript_file}", 'w') as f:
        f.write(result['text'])
    print(f"Processed: {audio_file} -> {transcript_file}")
Language-Specific Transcription
# Force specific language
result = model.inference(
    audio_path="spanish_audio.wav",
    language="es"  # Spanish
)

# Auto-detect language
result = model.inference(
    audio_path="unknown_language.wav",
    detect_language=True
)
print(f"Detected language: {result['language']}")
Performance
InferX optimizes Whisper for your hardware:
| Hardware | Real-time Factor | Memory Usage |
|---|---|---|
| Jetson AGX Orin | ~0.3x | ~2GB |
| RTX 4090 | ~0.1x | ~3GB |
| Intel i7 CPU | ~0.8x | ~1GB |
Real-time factor: 0.3x means 1 minute of audio processes in ~18 seconds
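The real-time factor converts directly into an estimated processing time. A minimal sketch of that arithmetic (the `processing_time` helper is illustrative, not part of the InferX API):

```python
def processing_time(audio_seconds: float, real_time_factor: float) -> float:
    """Estimated processing time for a clip, given a real-time factor."""
    return audio_seconds * real_time_factor

# 1 minute of audio on a Jetson AGX Orin (RTF ~0.3x) -> ~18 seconds
print(processing_time(60, 0.3))
```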
Output Format
The inference result is a dictionary:
{
    'text': str,            # Transcribed text
    'language': str,        # Detected/specified language code
    'confidence': float,    # Overall confidence score (0-1)
    'segments': [           # Segment-level timestamps
        {
            'start': float,     # Start time in seconds
            'end': float,       # End time in seconds
            'text': str,        # Text segment
            'confidence': float
        }
    ]
}
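The per-segment confidence scores can be used to drop low-confidence passages before further processing. A hypothetical sketch against the result shape above (the `confident_text` helper and the 0.5 threshold are illustrative, not part of the InferX API):

```python
def confident_text(result: dict, threshold: float = 0.5) -> str:
    """Join the text of segments whose confidence meets the threshold."""
    kept = [s['text'].strip() for s in result['segments'] if s['confidence'] >= threshold]
    return ' '.join(kept)

# Example result shaped like the dictionary above
example = {
    'text': 'hello world um', 'language': 'en', 'confidence': 0.8,
    'segments': [
        {'start': 0.0, 'end': 1.0, 'text': 'hello world', 'confidence': 0.9},
        {'start': 1.0, 'end': 1.5, 'text': 'um', 'confidence': 0.2},
    ],
}
print(confident_text(example))  # hello world
```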
Supported Audio Formats
- WAV: Uncompressed audio (recommended)
- MP3: MPEG audio
- M4A: AAC audio
- FLAC: Lossless audio
- OGG: Ogg Vorbis
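Before batch-transcribing a folder, it can help to filter out files that are not in one of the formats listed above. A small sketch (the `SUPPORTED_EXTENSIONS` set mirrors the list; an extension check alone does not guarantee the file contains valid audio):

```python
import os

# Extensions from the supported-formats list above
SUPPORTED_EXTENSIONS = {'.wav', '.mp3', '.m4a', '.flac', '.ogg'}

def is_supported(path: str) -> bool:
    """True if the file extension matches a supported audio format."""
    return os.path.splitext(path)[1].lower() in SUPPORTED_EXTENSIONS

print(is_supported("talk.WAV"))   # True
print(is_supported("notes.txt"))  # False
```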
Example Applications
Meeting Transcription
from inferx.models.whisper import whisper
model = whisper()
# Transcribe meeting recording
result = model.inference(
    audio_path="meeting_recording.wav",
    segment_timestamps=True
)

# Generate meeting notes with timestamps
with open("meeting_notes.txt", "w") as f:
    f.write("Meeting Transcription\n")
    f.write(f"Language: {result['language']}\n\n")
    for segment in result['segments']:
        timestamp = f"[{segment['start']:.1f}s - {segment['end']:.1f}s]"
        f.write(f"{timestamp}: {segment['text']}\n")
Voice Commands
import numpy as np
import speech_recognition as sr
from inferx.models.whisper import whisper

model = whisper()

def process_voice_command(audio_data):
    result = model.inference(audio_data=audio_data)
    command = result['text'].lower().strip()
    if 'turn on lights' in command:
        return 'lights_on'
    elif 'play music' in command:
        return 'music_play'
    elif 'what time' in command:
        return 'time_query'
    else:
        return 'unknown'

# Use with speech recognition
recognizer = sr.Recognizer()
microphone = sr.Microphone()
with microphone as source:
    audio = recognizer.listen(source)
    # Convert the captured audio to 16 kHz float32 samples before inference
    raw = audio.get_raw_data(convert_rate=16000, convert_width=2)
    samples = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
    command = process_voice_command(samples)
    print(f"Command: {command}")
Hardware Detection
When the model loads, InferX reports the detected device:
✨ InferX - Whisper Model ✨
🔍 Device Detected: AGX_ORIN
⠏ [0.8s] Loading Whisper model
✓ [1.0s] Ready for speech recognition
Next Steps