Whisper

Whisper is a versatile automatic speech recognition (ASR) model designed for transcribing and translating spoken language. It’s optimized for edge deployment through the Exla SDK.

Overview

Whisper is a robust speech recognition model that can:

  • Transcribe speech to text in multiple languages
  • Translate spoken language to English
  • Handle various audio qualities and accents
  • Process audio files of different formats
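Whisper models are typically trained on 16 kHz mono audio, and how the Exla SDK preprocesses other formats isn't specified here. As a sketch under that assumption, the snippet below uses only Python's standard library to check a WAV file's sample rate and channel count before handing it to the model (it also writes a short synthetic tone so the example is self-contained; the filename and 16 kHz requirement are illustrative, not part of the Exla API):

```python
import math
import struct
import wave

def wav_info(path):
    """Return (sample_rate, channels, duration_seconds) for a WAV file."""
    with wave.open(path, "rb") as f:
        rate = f.getframerate()
        channels = f.getnchannels()
        duration = f.getnframes() / rate
    return rate, channels, duration

# Write a one-second 16 kHz mono test tone so the example runs on its own.
rate = 16000
with wave.open("speech.wav", "wb") as f:
    f.setnchannels(1)   # mono
    f.setsampwidth(2)   # 16-bit PCM
    f.setframerate(rate)
    samples = (
        int(32767 * 0.3 * math.sin(2 * math.pi * 440 * t / rate))
        for t in range(rate)
    )
    f.writeframes(b"".join(struct.pack("<h", s) for s in samples))

print(wav_info("speech.wav"))  # (16000, 1, 1.0)
```

If the sample rate or channel count is off, resampling with an external tool such as ffmpeg before transcription is a common workaround.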

Usage

Here’s a simple example of how to use Whisper for speech transcription:

from exla.models.whisper import whisper

# Path to the audio file to transcribe
audio_path = "data/speech.wav"

# Load the Whisper model
model = whisper()

# Run transcription; returns a dictionary with the text and metadata
result = model.transcribe(audio_path)

# Print the full transcription
print(result["text"])

Example Output

The model returns a dictionary containing the transcription and additional metadata:

{
    "text": "Hello, this is a test of the Whisper speech recognition model.",
    "segments": [
        {
            "id": 0,
            "start": 0.0,
            "end": 3.5,
            "text": "Hello, this is a test of the",
            "confidence": 0.95
        },
        {
            "id": 1,
            "start": 3.5,
            "end": 5.8,
            "text": "Whisper speech recognition model.",
            "confidence": 0.92
        }
    ],
    "language": "en"
}
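Because the result is a plain dictionary, the segments can be post-processed without any model-specific APIs. As one illustration, the sketch below turns the example output above into timestamped lines (the formatting helpers are hypothetical, not part of the Exla SDK):

```python
def format_timestamp(seconds):
    """Render a time in seconds as M:SS.s."""
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes)}:{secs:04.1f}"

def timestamped_lines(result):
    """Turn a Whisper-style result dict into '[start -> end] text' lines."""
    return [
        f"[{format_timestamp(seg['start'])} -> {format_timestamp(seg['end'])}] {seg['text']}"
        for seg in result["segments"]
    ]

# Mirrors the example output shown above.
result = {
    "text": "Hello, this is a test of the Whisper speech recognition model.",
    "segments": [
        {"id": 0, "start": 0.0, "end": 3.5,
         "text": "Hello, this is a test of the", "confidence": 0.95},
        {"id": 1, "start": 3.5, "end": 5.8,
         "text": "Whisper speech recognition model.", "confidence": 0.92},
    ],
    "language": "en",
}

for line in timestamped_lines(result):
    print(line)
# [0:00.0 -> 0:03.5] Hello, this is a test of the
# [0:03.5 -> 0:05.8] Whisper speech recognition model.
```

The per-segment `confidence` values can be used the same way, for example to flag segments below a chosen threshold for manual review.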