Transcription Glossary

Plain-English definitions of the technical terms used in speech-to-text and AI transcription. Bookmark this page for the next time you encounter "WER" or "VTT" in a vendor pitch.

ASR

Automatic Speech Recognition. The umbrella technical term for any system that converts spoken audio into written text.

Whisper

An open-source ASR model released by OpenAI in 2022, trained on 680,000 hours of multilingual audio. It now underpins many AI transcription tools, including Transcript.you.

WER

Word Error Rate. The standard accuracy metric for ASR: substitutions, insertions, and deletions divided by the number of words in the reference transcript, expressed as a percentage. Lower is better. Modern engines achieve 2-5% WER on clean audio.
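
As a rough sketch, WER is word-level edit distance divided by reference length. A minimal Python version:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions to reach an empty hypothesis
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions from an empty reference
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```

Because insertions count as errors, WER can exceed 100% on very noisy output.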

Diarization

The process of identifying "who said what" in a multi-speaker recording. Often labeled as Speaker 1, Speaker 2, etc. Requires a separate model from base transcription.
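
Illustrative diarized output (the exact layout varies by tool):

```
Speaker 1  00:00:02  Thanks for joining us today.
Speaker 2  00:00:05  Glad to be here.
Speaker 1  00:00:07  Let's start with your background.
```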

SRT

SubRip Subtitle file format. The most widely supported subtitle format, compatible with YouTube, VLC, Premiere, Final Cut, and DaVinci Resolve.
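
A minimal SRT file, for illustration: numbered cues, comma-delimited milliseconds, then the caption text.

```
1
00:00:01,000 --> 00:00:04,200
Welcome to the show.

2
00:00:04,400 --> 00:00:07,900
Today we're talking about transcription.
```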

VTT (WebVTT)

Web Video Text Tracks. The subtitle format native to HTML5 video. Used by Vimeo, browser-based players, and modern web video platforms.
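
The same captions as a minimal WebVTT file, for comparison:

```
WEBVTT

00:00:01.000 --> 00:00:04.200
Welcome to the show.

00:00:04.400 --> 00:00:07.900
Today we're talking about transcription.
```

The WEBVTT header and period-delimited milliseconds are the two differences most likely to break an importer expecting SRT.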

Codec

The algorithm used to compress and decompress audio data. MP3, AAC, FLAC, Opus, and PCM are all codecs. Different codecs trade off file size, quality, and computational cost.

Sample rate

How many times per second an audio signal is measured, in Hz. CD quality is 44,100 Hz; modern recordings often use 48,000 Hz. Higher sample rates capture more detail but produce larger files.
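
The sample rate also sets the floor for uncompressed bitrate. A worked example, assuming 16-bit stereo PCM:

```python
# Uncompressed (PCM) bitrate implied by a sample rate:
# 44,100 samples/s x 16 bits x 2 channels = 1,411.2 kbps (CD audio)
bitrate_kbps = 44_100 * 16 * 2 / 1000
print(bitrate_kbps)  # 1411.2 -- why lossless files are so large
```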

Bitrate

How many bits of data are used per second of audio, in kbps. Typical streaming MP3 is 128 kbps; podcast-grade is 192-256 kbps; archival lossless exceeds 1,000 kbps.
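
Bitrate makes file sizes easy to estimate: bits per second times duration, divided by 8 bits per byte. A quick Python check (the helper name is illustrative):

```python
def approx_size_mb(bitrate_kbps: float, minutes: float) -> float:
    """Approximate file size: bits/second x duration, divided by 8 bits per byte."""
    return bitrate_kbps * 1000 * minutes * 60 / 8 / 1_000_000

print(approx_size_mb(128, 60))  # one hour of 128 kbps MP3 is about 57.6 MB
```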

Lossy vs lossless

Lossy compression (MP3, AAC, OGG) discards audio data to reduce file size. Lossless (FLAC, ALAC, WAV) preserves every bit. For transcription, lossy audio at 128 kbps or above is effectively indistinguishable from lossless.

Chunking

Splitting a long audio file into shorter segments before sending it to the ASR engine. Whisper processes audio in 30-second windows internally; tools typically chunk long files at boundaries such as 10 minutes to stay within upload limits and keep memory use manageable.
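
A minimal chunking sketch in Python using the pydub library (assumes ffmpeg is installed; the filename is a placeholder):

```python
from pydub import AudioSegment

CHUNK_MS = 10 * 60 * 1000  # pydub measures everything in milliseconds

audio = AudioSegment.from_file("episode.mp3")
for i, start in enumerate(range(0, len(audio), CHUNK_MS)):
    chunk = audio[start:start + CHUNK_MS]  # slicing an AudioSegment copies that span
    chunk.export(f"chunk_{i:03d}.mp3", format="mp3")
```

Production tools usually shift each cut to the nearest silence rather than slicing exactly on the boundary, so no word is split across chunks.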

Timestamp

A time marker on each line of a transcript, typically in MM:SS or HH:MM:SS format. Required for subtitles; optional for plain text.
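
A minimal Python sketch of the common HH:MM:SS rendering (the helper name is illustrative):

```python
def format_timestamp(seconds: float) -> str:
    """Render a position in seconds as HH:MM:SS."""
    s = int(seconds)
    return f"{s // 3600:02d}:{(s % 3600) // 60:02d}:{s % 60:02d}"

print(format_timestamp(3725.4))  # -> 01:02:05
```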

Forced alignment

Matching a known transcript to its audio to produce per-word timing. Used to improve subtitle precision after the initial ASR pass.

Acoustic model

The component of an ASR system that maps raw audio waveforms to phoneme probabilities.

Language model

The component that turns phoneme guesses into actual words by predicting likely word sequences. Modern Whisper-style transformers integrate both into one model.

Punctuation prediction

A post-processing step that adds commas, periods, and capitalization to raw ASR output, which is otherwise an unpunctuated stream of words.

Whisper-1 vs Whisper-3

OpenAI's API-served variant (whisper-1) is older and supports verbose JSON output with segment timestamps. The latest open-source checkpoint, large-v3 (often called Whisper-3), is more accurate on noisy and accented audio.
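
For illustration, a sketch of requesting segment timestamps from whisper-1 with the official openai Python SDK (the filename is a placeholder; field names follow the verbose JSON response):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
with open("interview.mp3", "rb") as f:
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json",  # plain "json" omits segments
    )

for segment in result.segments:
    print(f"[{segment.start:.1f}s] {segment.text}")
```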

Transcription vs subtitling

Transcription produces a plain text file with everything spoken. Subtitling produces time-coded short lines optimized for on-screen reading (typically 32-40 characters per line, at most 2 lines on screen).
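
A minimal sketch of the wrapping rule using Python's textwrap (the helper name and widths are illustrative):

```python
import textwrap

def to_cues(text: str, width: int = 40, max_lines: int = 2) -> list[list[str]]:
    """Wrap caption text into cues of at most max_lines lines of width characters."""
    lines = textwrap.wrap(text, width=width)
    return [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]

for cue in to_cues("Subtitling produces time-coded short lines "
                   "optimized for on-screen reading."):
    print("\n".join(cue), end="\n\n")
```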

Closed captions vs subtitles

Closed captions include non-speech audio (music cues, sound effects, speaker labels) for the deaf and hard-of-hearing. Subtitles assume the viewer can hear and carry only the spoken dialogue.

Real-time vs batch transcription

Real-time transcription processes a live audio stream as it's spoken (live captions). Batch transcription processes a complete recorded file. Batch is typically more accurate; real-time has a 1-3 second latency.

Looking for more depth? Read the Complete Guide to AI Transcription for a full walk-through of formats, accuracy, and tools.

Last updated: May 05, 2026