Whisper Model
Whisper is OpenAI’s robust speech recognition and transcription model. With InferX, you can run Whisper on any device using the same API - from edge devices to powerful servers.
Features
- Universal Speech Recognition: Transcribe audio in multiple languages
- Real-time Processing: Optimized for live audio streams
- Cross-Platform: Same code works on Jetson, GPU, or CPU
- Multiple Languages: Support for 99+ languages
- Noise Robust: Works well with noisy audio
Installation
Whisper is included with InferX:
```bash
pip install git+https://github.com/exla-ai/InferX.git
```
Basic Usage
```python
from inferx.models.whisper import whisper

# Initialize the model (automatically detects your hardware)
model = whisper()

# Transcribe an audio file
result = model.inference(audio_path="path/to/audio.wav")

# Print the transcription
print(f"Transcription: {result['text']}")
print(f"Language: {result['language']}")
print(f"Confidence: {result['confidence']}")
```
Advanced Usage
Real-time Audio Processing
```python
from inferx.models.whisper import whisper
import pyaudio
import numpy as np

model = whisper()

# Set up audio capture
chunk_size = 1024
sample_rate = 16000
buffer_seconds = 5  # accumulate ~5 s of audio per transcription call

audio = pyaudio.PyAudio()
stream = audio.open(
    format=pyaudio.paFloat32,
    channels=1,
    rate=sample_rate,
    input=True,
    frames_per_buffer=chunk_size,
)

print("Listening... Press Ctrl+C to stop")

try:
    buffer = []
    while True:
        # Read one audio chunk (~64 ms at 16 kHz); don't raise if the
        # buffer overflows while a transcription is still running
        audio_chunk = stream.read(chunk_size, exception_on_overflow=False)
        buffer.append(np.frombuffer(audio_chunk, dtype=np.float32))

        # Transcribe once enough audio has accumulated
        if len(buffer) * chunk_size >= buffer_seconds * sample_rate:
            audio_data = np.concatenate(buffer)
            buffer = []
            result = model.inference(audio_data=audio_data)
            if result['text'].strip():
                print(f"Transcription: {result['text']}")
except KeyboardInterrupt:
    print("Stopping...")
finally:
    stream.stop_stream()
    stream.close()
    audio.terminate()
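```

A single 1024-sample chunk is only about 64 ms of audio, which is too short to transcribe meaningfully, so the sketch above buffers roughly five seconds before each inference call. The buffer length is a latency/accuracy trade-off: shorter buffers respond faster but are more likely to cut words mid-utterance, and production systems often use overlapping windows or voice-activity detection to pick segment boundaries instead of a fixed interval.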
Batch Processing
```python
from inferx.models.whisper import whisper
import os

model = whisper()

# Process every supported audio file in the folder
audio_files = [f for f in os.listdir("audio_folder/") if f.endswith(('.wav', '.mp3', '.m4a'))]
os.makedirs("transcripts", exist_ok=True)

for audio_file in audio_files:
    result = model.inference(audio_path=f"audio_folder/{audio_file}")

    # Save the transcription under the same name with a .txt extension
    transcript_file = os.path.splitext(audio_file)[0] + '.txt'
    with open(f"transcripts/{transcript_file}", 'w') as f:
        f.write(result['text'])

    print(f"Processed: {audio_file} -> {transcript_file}")
```
Language-Specific Transcription
```python
# Force a specific language
result = model.inference(
    audio_path="spanish_audio.wav",
    language="es"  # Spanish
)

# Auto-detect the language
result = model.inference(
    audio_path="unknown_language.wav",
    detect_language=True
)
print(f"Detected language: {result['language']}")
```
Performance

InferX optimizes Whisper for your hardware:

| Hardware | Real-time Factor | Memory Usage |
|---|---|---|
| Jetson AGX Orin | ~0.3x | ~2GB |
| RTX 4090 | ~0.1x | ~3GB |
| Intel i7 CPU | ~0.8x | ~1GB |

Real-time factor: 0.3x means 1 minute of audio processes in ~18 seconds.
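
Since the real-time factor is processing time divided by audio duration, expected processing time is easy to estimate. A minimal sketch (the function name is illustrative, not part of the InferX API):

```python
def estimated_processing_seconds(audio_seconds: float, rtf: float) -> float:
    # RTF = processing_time / audio_duration, so expected processing
    # time is just the audio duration scaled by the factor.
    return audio_seconds * rtf

# 1 minute of audio at the Jetson AGX Orin's ~0.3x RTF:
print(estimated_processing_seconds(60, 0.3))  # ~18.0 seconds, matching the note above
```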
Response Format

The inference call returns a dictionary with the following structure:

```python
{
    'text': str,          # Transcribed text
    'language': str,      # Detected/specified language code
    'confidence': float,  # Overall confidence score (0-1)
    'segments': [         # Timestamped segments
        {
            'start': float,   # Start time in seconds
            'end': float,     # End time in seconds
            'text': str,      # Text segment
            'confidence': float
        }
    ]
}
```
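
As an example of consuming this structure, the snippet below flags low-confidence segments for manual review. The 0.6 threshold is an arbitrary illustrative choice; the fields follow the format above:

```python
result = model.inference(audio_path="path/to/audio.wav")

for seg in result['segments']:
    # Mark segments whose confidence falls below the chosen threshold
    marker = "" if seg['confidence'] >= 0.6 else "  <-- review"
    print(f"[{seg['start']:6.1f}s - {seg['end']:6.1f}s] {seg['text']}{marker}")
```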
Supported Audio Formats

- WAV: Uncompressed audio (recommended)
- MP3: MPEG audio
- M4A: AAC audio
- FLAC: Lossless audio
- OGG: Ogg Vorbis
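
If a recording arrives in a format outside this list, one option is to convert it to 16 kHz mono WAV before inference. The sketch below shells out to ffmpeg, which is assumed to be installed and on your PATH; it is not part of InferX:

```python
import subprocess

def to_wav(src_path: str, dst_path: str = "converted.wav") -> str:
    # Resample to 16 kHz mono WAV (-ar sets the sample rate, -ac the channel count)
    subprocess.run(
        ["ffmpeg", "-y", "-i", src_path, "-ar", "16000", "-ac", "1", dst_path],
        check=True,
    )
    return dst_path

result = model.inference(audio_path=to_wav("recording.opus"))
```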
Example Applications
Meeting Transcription
```python
from inferx.models.whisper import whisper

model = whisper()

# Transcribe a meeting recording with per-segment timestamps
result = model.inference(
    audio_path="meeting_recording.wav",
    segment_timestamps=True
)

# Generate meeting notes with timestamps
with open("meeting_notes.txt", "w") as f:
    f.write("Meeting Transcription\n")
    f.write(f"Language: {result['language']}\n\n")

    for segment in result['segments']:
        timestamp = f"[{segment['start']:.1f}s - {segment['end']:.1f}s]"
        f.write(f"{timestamp}: {segment['text']}\n")
```
Voice Commands
```python
from inferx.models.whisper import whisper
import speech_recognition as sr
import numpy as np

model = whisper()

def process_voice_command(raw_pcm: bytes) -> str:
    # speech_recognition captures 16-bit PCM; convert to float32 in [-1, 1]
    audio_data = np.frombuffer(raw_pcm, dtype=np.int16).astype(np.float32) / 32768.0
    result = model.inference(audio_data=audio_data)
    command = result['text'].lower().strip()

    if 'turn on lights' in command:
        return 'lights_on'
    elif 'play music' in command:
        return 'music_play'
    elif 'what time' in command:
        return 'time_query'
    else:
        return 'unknown'

# Capture a command from the microphone (16 kHz to match the model input)
recognizer = sr.Recognizer()
microphone = sr.Microphone(sample_rate=16000)

with microphone as source:
    audio = recognizer.listen(source)
    command = process_voice_command(audio.get_raw_data())
    print(f"Command: {command}")
```
Hardware Detection
When a model is created, InferX reports the hardware it detected:

```
✨ InferX - Whisper Model ✨
🔍 Device Detected: AGX_ORIN
⠏ [0.8s] Loading Whisper model
✓ [1.0s] Ready for speech recognition
```