Accuracy Is the Floor. Readability Is the Experience.
A better way to evaluate real-time captions and live speech delivery systems
When we evaluate speech-to-text (STT) tools or automatic speech recognition (ASR) systems, we usually start with one obvious question:
How accurate is the transcription?
That question matters. Accuracy is still the foundation of any speech product. If a system consistently hears the wrong words, nothing else can save the experience.
But when we started building Lanson Live, we found ourselves wrestling with a frustrating puzzle: Why do some highly accurate transcripts still feel terrible to read live?
We realized it's because a live caption stream is not a document people read later. It is information people consume while speech is still happening. The speaker keeps talking. The listener keeps reading. The output keeps changing.
That makes real-time speech delivery a profoundly different problem from ordinary transcription, meeting notes, or voice dictation. In live speech, maybe the real question isn't just:
Did the system eventually get the words right?
Perhaps the better question to ask alongside it is:
Can people actually follow what is being said while it is happening?
Real-time captions are not just speech-to-text
Live transcription systems usually optimize for recognition: converting spoken audio into written words. That is necessary, but is it the whole picture?
For live captions to be truly useful, speech has to become something people can read and follow in real time. That requires more than raw recognition. It requires structure, timing, and stability.
A raw transcript can be technically accurate and still be difficult to read. It can appear too late. It can flicker. It can rewrite itself repeatedly. It can break phrases in unnatural places. In those cases, the system may have recognized the speech perfectly, but has it actually delivered the live context?
Dictation has a review loop. Live speech does not.
Why do live captions often feel so disjointed? We often catch ourselves comparing them to voice dictation, but we realized that might be the wrong benchmark entirely. Dictation follows a forgiving flow:
Speak → text appears → review → edit → send.
The user has a human confirmation loop. They can pause. They can fix mistakes. They can decide when the text is ready.
Live speech delivery doesn't afford us that luxury. A live caption flow looks more like this:
Speaker keeps talking → audience keeps reading → output must remain usable as it appears.
There is no comfortable "fix it later" window for the person reading. If a caption line appears, jumps, rewrites, shifts, and then settles several seconds later, the reader has already paid the cognitive cost. Their attention has been interrupted.
That seems to be the core difference:
Dictation optimizes for input with a review loop. Live captions optimize for delivery without one.
This realization changes the technical standard. The output doesn't only need to be correct eventually—it needs to be readable while it is forming.
Three dimensions that matter for real-time captions
If accuracy is only the floor, how do we measure the rest of the experience? As we tested different models, we found that a complete evaluation usually comes down to three core dimensions:
- Accuracy — are the words right, and is the content complete?
- Latency — when does usable context appear?
- Readability — can people actually follow the output as it forms?
Together, these dimensions describe whether a real-time speech system is just transcribing speech, or actually making it usable while the moment unfolds.
1. Real-Time ASR Accuracy: The Foundation, Not the Full Experience
Accuracy is obviously the baseline. A caption system needs to handle ordinary words, domain terms, accents, and noisy environments without changing the meaning.
But we've noticed it's surprisingly easy to conflate two different layers of accuracy.
The first is word-level accuracy (often measured as Word Error Rate, or WER): did the system recognize the right words? This is what most benchmarks measure.
The second, often overlooked, is coverage accuracy: did the system capture the full message? A system can recognize individual words correctly but still drop a sentence ending, miss a speaker transition, or lose a key clause during fast speech.
In live settings, the audience cannot rewind. A missing clause can break the thread entirely. A caption stream that is mostly correct word-by-word but routinely loses sentence endings isn't really accurate enough for live speech, is it?
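To make the two layers concrete, here is a minimal Python sketch. `word_error_rate` is the standard word-level edit distance divided by the number of reference words. `ending_coverage` is a hypothetical, deliberately crude proxy for coverage that we made up for illustration: it assumes reference and hypothesis sentences are already paired up, and only checks whether the last few reference words survived. Neither function comes from Lanson Live or any specific ASR library, and punctuation is not normalized.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def ending_coverage(pairs, tail_words=3):
    """Hypothetical metric: share of sentences whose ending was captured.

    `pairs` is a list of (reference_sentence, hypothesis_sentence) tuples
    that are assumed to be aligned already.
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    kept = 0
    for reference, hypothesis in pairs:
        tail = reference.lower().split()[-tail_words:]
        hyp = hypothesis.lower().split()
        if tail and hyp[-len(tail):] == tail:
            kept += 1
    return kept / len(pairs)


reference = "we should move the launch because the build is not stable"
hypothesis = "we should move the launch because the build"
wer = word_error_rate(reference, hypothesis)
coverage = ending_coverage([(reference, hypothesis)])
```

In this made-up example every recognized word is correct, yet the sentence ending is gone: the word error rate is 3 deletions out of 11 reference words, while the ending-coverage score for the sentence is zero. That gap is the reason to track both layers.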
Questions we've learned to ask:
- Does the system recognize common words, names, numbers, and domain terms reliably?
- Does it preserve meaning under accents or imperfect audio?
- Does it maintain completeness during fast speech or speaker transitions?
- Does correction improve readability without creating visible instability?
The goal is not simply correct text. The goal is text that is correct, complete, and stable enough to follow.
2. Caption Latency: Measuring First Usable Context
We often hear API providers talk about latency as if it means only one thing: how quickly the first word appears. But is that really what matters to the person reading?
For live captions, we started asking a different question:
When can the reader understand a stable unit of meaning?
For example, an API might deliver the first word in 200 ms. But if it takes 3 seconds to form a stable, readable phrase, the effective latency for the reader is 3 seconds, not 200 ms. If words appear instantly but remain incomplete or keep changing, the experience isn't truly readable.
Useful latency dimensions we look at:
- Time to first visible caption
- Time to first readable phrase
- Time from phrase completion to stable display
- P50 and P95 latency across a real session
The important distinction we've found is this:
Real-time shouldn't mean racing to display unstable text. It should mean delivering context fast enough to support understanding.
Latency should always be evaluated alongside stability. If a system appears fast but forces the reader to chase unfinished output, we haven't really solved the live speech problem yet.
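The latency dimensions listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any vendor's tooling: `PhraseTiming` and `latency_report` are hypothetical names, the four timestamps are assumed to come from your own test harness, and percentiles use the simple nearest-rank method.

```python
from dataclasses import dataclass


@dataclass
class PhraseTiming:
    """Timestamps (seconds from session start) for one spoken phrase."""
    speech_start: float   # speaker begins the phrase
    speech_end: float     # speaker finishes the phrase
    first_visible: float  # first caption text for the phrase is shown
    stable_at: float      # displayed phrase stops changing


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


def latency_report(phrases):
    """P50 and P95 for three latency dimensions across a session."""
    series = {
        # time to first visible caption
        "first_visible": [p.first_visible - p.speech_start for p in phrases],
        # time until the phrase is stable enough to read
        "readable_phrase": [p.stable_at - p.speech_start for p in phrases],
        # time from phrase completion to stable display
        "settle": [p.stable_at - p.speech_end for p in phrases],
    }
    report = {}
    for name, values in series.items():
        report[name + "_p50"] = percentile(values, 50)
        report[name + "_p95"] = percentile(values, 95)
    return report


session = [
    PhraseTiming(0.0, 2.0, 0.25, 2.5),
    PhraseTiming(3.0, 5.0, 3.25, 8.0),   # fast first word, slow to settle
    PhraseTiming(9.0, 10.0, 9.5, 10.25),
]
report = latency_report(session)
```

In this invented session the first text never takes more than half a second to appear, yet the P95 settle time is 3 seconds. Reporting only the first number would hide what the reader actually experiences.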
3. Caption Readability: The Experience Layer That Accuracy Metrics Miss
Readability is where live captioning diverges most clearly from ordinary transcription. Through our testing, we found it comes down to two components that heavily reinforce each other.
Stability: reflow is a comprehension cost
Have you ever found yourself losing track of a live broadcast because the text suddenly shifted? When a caption stream constantly rewrites itself, the problem isn't just cosmetic.
The reader has to relocate the line. They have to decide whether the words they already read still mean the same thing. In a live situation, that effort competes directly with listening and thinking.
In live speech delivery, reflow is a comprehension cost.
To put this into perspective: if a traditional ASR model causes 15 visible rewrites (flickers) per minute, the reader is interrupted roughly once every four seconds. Bringing that down to 1 or 2 necessary revisions per minute changes the entire experience.
Stability metrics worth tracking:
- Visible rewrites per minute
- Line movement events per minute
- Time from initial display to stable caption
Live speech is messy, and real-time systems sometimes do need to revise. The goal isn't absolute perfection; it's controlling unnecessary movement to make the stream feel calm enough to read.
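One way to count visible rewrites is sketched below in Python. It assumes you can log every caption update for a single line as a `(timestamp_seconds, displayed_text)` pair, and it treats an update as a rewrite whenever words that were already on screen change, as opposed to new words simply being appended. Both function names and that definition of a rewrite are our assumptions for illustration; a real system may define the event differently.

```python
def count_visible_rewrites(updates):
    """Count updates that change words the reader has already seen.

    `updates` is a list of (timestamp_seconds, displayed_text) tuples
    for one caption line, in display order.
    """
    rewrites = 0
    previous = []
    for _, text in updates:
        words = text.split()
        # Appending words keeps the old text as a prefix; anything else
        # means already-displayed words were altered or removed.
        if words[:len(previous)] != previous:
            rewrites += 1
        previous = words
    return rewrites


def rewrites_per_minute(updates):
    """Visible rewrites normalized by the time span of the updates."""
    if len(updates) < 2:
        return 0.0
    duration = updates[-1][0] - updates[0][0]
    if duration <= 0:
        return 0.0
    return count_visible_rewrites(updates) * 60.0 / duration


updates = [
    (0.0, "I think"),
    (0.5, "I think we"),
    (1.0, "I sink we should"),            # earlier word changes: rewrite
    (1.5, "I think we should"),           # corrected back: another rewrite
    (2.0, "I think we should probably"),  # pure append: not a rewrite
]
rewrite_count = count_visible_rewrites(updates)
rate = rewrites_per_minute(updates)
```

Note that the final text here is fully correct, so an accuracy benchmark would see nothing wrong. The two rewrites only show up when the intermediate states are measured.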
Segmentation: captions should follow meaning, not just time
Speech doesn't arrive in neat written sentences. People pause, restart, interrupt themselves, and change direction. A caption system has to decide where to break the stream.
Bad segmentation can make even 100% accurate text feel hard to read. For example, a system might split a thought like this:
I think we should
probably move the
launch because the
current build
Even if every word is correct, the reading rhythm is exhausting. Better segmentation follows meaning:
I think we should probably move the launch,
because the current build is not stable enough.
Segmentation questions to consider:
- Does the system break lines at natural semantic boundaries?
- Does it preserve phrases instead of slicing them mechanically?
- Does punctuation help reading rather than arrive too late?
Segmentation isn't just a formatting detail. It is one of the primary ways we translate speech into readable context.
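The contrast between mechanical and meaning-aware line breaks can be sketched in Python. `mechanical_wrap` uses the standard library's `textwrap.wrap`, which fills each line up to a fixed width. `segment_by_meaning` is a hypothetical rule-based stand-in: it breaks before a small, hand-picked set of clause-opening words and after sentence-ending punctuation. The word list and thresholds are assumptions chosen for this example, and a production system would likely need something far more robust than these rules.

```python
import textwrap

# Hand-picked clause openers; an assumption for this sketch only.
BREAK_BEFORE = {"because", "but", "so", "although", "which", "while"}


def mechanical_wrap(text, width=18):
    """Baseline: fill each line to a fixed character width."""
    return textwrap.wrap(text, width=width)


def segment_by_meaning(text, max_chars=50, min_chars=20):
    """Break before clause openers and after sentence-ending punctuation."""
    lines, current = [], []
    for word in text.split():
        so_far = " ".join(current)
        candidate = " ".join(current + [word])
        starts_clause = word.lower().strip(",.?!") in BREAK_BEFORE
        too_long = len(candidate) > max_chars
        # min_chars avoids breaking after only a couple of words.
        if current and (too_long or (starts_clause and len(so_far) >= min_chars)):
            lines.append(so_far)
            current = [word]
        else:
            current.append(word)
        if word.endswith((".", "?", "!")):
            lines.append(" ".join(current))
            current = []
    if current:
        lines.append(" ".join(current))
    return lines


fragment = "I think we should probably move the launch because the current build"
sentence = ("I think we should probably move the launch, "
            "because the current build is not stable enough.")
mechanical_lines = mechanical_wrap(fragment)
meaning_lines = segment_by_meaning(sentence)
```

With an 18-character width, `mechanical_wrap` reproduces the choppy four-line split shown earlier, while `segment_by_meaning` yields the two-line version that follows the clause boundary. The words are identical in both cases; only the break points differ.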
A practical evaluation framework
When comparing real-time caption or live transcription tools, we've realized the question shouldn't just be "which one is more accurate?" A stronger, more holistic evaluation looks like this:
| Dimension | What to ask | Why it matters |
|---|---|---|
| Accuracy | Are the words right and the content complete? | Recognition and coverage together form the foundation. |
| Latency | When does usable context appear? | The reader needs to keep up with speech. |
| Readability | Does the output stay stable and follow meaning? | Reflow and poor segmentation create cognitive cost. |
What this means for product comparison
This framework completely changes how we compare tools. A meeting assistant can be excellent at summaries. A dictation tool can be excellent at helping people enter text faster. A speech-to-text API can be excellent at raw recognition.
But those aren't the same as live speech delivery. Different tools optimize for different moments in time:
- Dictation optimizes input before sending.
- Meeting AI optimizes knowledge after the conversation.
- Speech-to-text APIs optimize recognition for developers.
- Live speech delivery systems optimize the ability to follow speech as it happens.
That is why we stopped asking "which tool transcribes better?" and started asking:
What does the text need to do once it appears?
If the text needs to help people follow speech while it is still happening, then accuracy, latency, and readability become the central standard.
Conclusion: real-time should be ready to read
Real-time captioning isn't only about speed. It's about helping people follow language while the moment is still unfolding. That requires accuracy across both word recognition and content coverage. It requires latency low enough to keep pace. And it requires readability—a caption stream stable enough that people don't have to chase the output.
If a system displays words quickly but forces people to read unstable, poorly segmented text, the problem is only half-solved.
We believe the future of real-time speech tools shouldn't be measured just by how fast they display text. It should be measured by how well they protect attention, preserve context, and make live speech easy to follow.
That's the journey we are on, and it's the standard Lanson Live is designed around.
Ready to evaluate speech delivery beyond basic accuracy?
