When we first deployed drive-thru voice ordering at a QSR location, the ASR accuracy we had achieved on quiet phone calls translated terribly to the drive-thru speaker environment. The 82% complete-order accuracy we were seeing in early testing wasn't a model problem; it was an audio problem we hadn't fully diagnosed. This is the story of what we changed and why it worked.
Drive-thru audio is genuinely hostile. You have a speaker post mounted outdoors, often near a road or parking lot, capturing not just the customer's voice but engine noise, wind, audio from the customer's own car stereo, children in the back seat, and — at high-volume locations — bleedthrough from the car behind them in the queue. Acoustic conditions vary by weather, time of day, and how recently the speaker hardware was serviced. None of this is in the training data of a general-purpose ASR model.
Where the 82% Number Came From
Our initial deployment used a standard noise suppression library applied as a pre-processing step before sending audio to the ASR model. This worked reasonably well for phone ordering — quiet background, close microphone, narrow-band audio that's inherently somewhat filtered. For drive-thru, it was insufficient.
We measured word error rate (WER) on a sample of 500 drive-thru interactions from the first two weeks of deployment. The overall WER was 18%, meaning between 1 in 5 and 1 in 6 words was misrecognized. That corresponded to order accuracy of around 82% at the complete-order level (a single misrecognized word anywhere in an order can flag it for human review or cause a wrong item).
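For readers who want to reproduce this kind of measurement, here is a minimal sketch of corpus-level WER and a complete-order accuracy check. It is not our production scoring code. The function names are hypothetical, the normalization is just lowercasing and whitespace splitting, and treating an order as correct only when every word matches is a simplifying assumption. The sample transcripts are made up for illustration.

```python
from typing import List, Sequence, Tuple


def word_errors(reference: List[str], hypothesis: List[str]) -> int:
    """Word-level Levenshtein distance: the minimum number of
    substitutions, deletions, and insertions turning reference into hypothesis."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Sequence[Tuple[str, str]]) -> float:
    """Total word errors divided by total reference words across all pairs."""
    total_errors = 0
    total_ref_words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.lower().split()
        hyp_words = hypothesis.lower().split()
        total_errors += word_errors(ref_words, hyp_words)
        total_ref_words += len(ref_words)
    return total_errors / total_ref_words


def order_accuracy(pairs: Sequence[Tuple[str, str]]) -> float:
    """Share of orders transcribed with no word errors at all
    (an assumed, strict definition of a correct order)."""
    exact = sum(
        1 for reference, hypothesis in pairs
        if reference.lower().split() == hypothesis.lower().split()
    )
    return exact / len(pairs)


# Made-up (reference, ASR output) pairs, for illustration only.
sample_pairs = [
    ("double cheeseburger no onion", "cheeseburger no onion"),
    ("large fries", "large fries"),
    ("make that a combo", "make that combo"),
]
print(f"WER: {corpus_wer(sample_pairs):.2f}")
print(f"Order accuracy: {order_accuracy(sample_pairs):.2f}")
```

Note that the two metrics are computed separately: WER is normalized by reference words, while order accuracy is normalized by the number of orders.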
When we broke down errors by category, three patterns dominated:
- Engine noise masking low-frequency consonants (particularly "b", "d", and "g" sounds). "Large" would come through correctly, while "double" would be missed or garbled.
- Wind bursts causing clipping artifacts that the noise suppression treated as speech signal rather than noise, injecting phantom phonemes.
- Modifier stacking under noise — when a customer said a long modifier string quickly ("no onion extra jalapeño no cheese"), the latter items degraded sharply as SNR dropped.
What We Tried That Didn't Work
Before getting to what did work, it's worth being honest about the dead ends. We tried increasing the noise suppression aggressiveness on the standard library — this improved wind artifact handling but made the consonant masking problem significantly worse by also suppressing the speech signal in low-SNR moments.
We tried a beamforming approach that required hardware changes to the speaker post — directional microphone arrays that would physically filter noise by direction. It worked in lab conditions. It was impractical at the operator level: it required retrofitting hardware at each location, introduced a new failure mode (microphone alignment drift), and the cost-per-location unit economics didn't justify the accuracy gain for that approach alone.
We tried adding a confidence threshold so that low-confidence transcriptions would immediately escalate to a human rather than attempting fulfillment. This reduced wrong order completions but increased the escalation rate to 35%, which defeated the purpose of automation and overloaded the staff review queue.
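The confidence-threshold trade-off can be sketched in a few lines. This is an illustrative sketch rather than our routing logic: the `Transcription` type, the assumption that the ASR returns a single confidence score between 0 and 1, and the sample scores are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Transcription:
    text: str
    confidence: float  # assumed ASR confidence score in [0, 1]


def should_escalate(transcription: Transcription, threshold: float) -> bool:
    """Route to a human when confidence falls below the threshold."""
    return transcription.confidence < threshold


def escalation_rate(batch: Sequence[Transcription], threshold: float) -> float:
    """Fraction of a batch that would be escalated at a given threshold."""
    if not batch:
        return 0.0
    escalated = sum(1 for t in batch if should_escalate(t, threshold))
    return escalated / len(batch)


# Made-up confidence scores, for illustration only.
scores = [0.95, 0.88, 0.72, 0.60, 0.91, 0.55, 0.83, 0.97, 0.79, 0.68]
batch = [Transcription(text=f"order {i}", confidence=c) for i, c in enumerate(scores)]

for threshold in (0.70, 0.80):
    rate = escalation_rate(batch, threshold)
    print(f"threshold={threshold:.2f} escalation_rate={rate:.2f}")
```

Raising the threshold catches more bad transcriptions but escalates more interactions, which is the tension described above when the upstream audio is noisy and confidence scores are low across the board.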
The Two Changes That Actually Moved the Needle
The first improvement came from switching to a neural noise suppression model purpose-built for outdoor and vehicular audio rather than the general-purpose library we had been using. The key difference was how the model handled non-stationary noise — engine sounds, wind gusts, intermittent tire noise — versus the stationary noise (HVAC hum, consistent background music) that general models are optimized for. Drive-thru noise is almost entirely non-stationary. The neural model reduced our wind clipping artifacts by roughly 60% in controlled testing.
The second improvement was domain-specific acoustic model fine-tuning. We took a base ASR model and fine-tuned it on a dataset of drive-thru audio that we had accumulated from early deployments — transcribed by human annotators, with particular attention to the modifier vocabulary that appears in fast food ordering. "No onion," "extra sauce," "large not medium," "make that a combo" — the phonetic patterns of these phrases in noisy outdoor conditions are different enough from general speech that a fine-tuned model handles them meaningfully better.
The combination of neural noise suppression plus domain fine-tuning brought WER from 18% to approximately 6%, which corresponds to the 94% complete-order accuracy we reported.
What's Still Hard
We want to be transparent that 94% is not 100%, and the remaining 6% is not evenly distributed. The hardest conditions we still encounter:
- Heavy rain on the speaker housing. Rain directly on the microphone grille creates a percussive noise that even neural suppression struggles to separate from speech. Accuracy drops to roughly 87% in heavy precipitation.
- Simultaneous talkers. When a customer and a passenger are both speaking, we don't reliably identify the primary speaker. The ordering AI attempts to resolve ambiguity by prompting for confirmation, but this adds interaction time.
- Severely degraded speaker hardware. Older speaker posts with cracked or corroded microphone elements produce audio that no software fix can fully compensate for. When we audit a new location, speaker hardware condition is now part of our readiness checklist.
What This Means for Operators Evaluating Drive-Thru Voice AI
ASR accuracy numbers in vendor marketing materials are almost always measured under favorable conditions — typically quiet phone audio or controlled lab environments. When you're evaluating drive-thru voice AI, ask specifically for accuracy figures measured on outdoor drive-thru audio, and ask how those numbers break down by noise condition, time of day, and weather.
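If a vendor can share per-interaction scoring data, a breakdown like the one suggested above is straightforward to compute. The sketch below assumes each interaction has already been scored (word errors and reference word count) and tagged with a condition label. The `ScoredInteraction` type, the label names, and the numbers are hypothetical and are not measurements from our deployment.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class ScoredInteraction(NamedTuple):
    condition: str        # assumed label, e.g. "clear", "wind", "rain"
    word_errors: int      # substitutions + deletions + insertions
    reference_words: int  # words in the human transcript


def wer_by_condition(rows: Iterable[ScoredInteraction]) -> Dict[str, float]:
    """Pool errors and reference words per condition, then divide.
    Pooling weights long orders more heavily than averaging per-order WER would."""
    errors: Dict[str, int] = defaultdict(int)
    words: Dict[str, int] = defaultdict(int)
    for row in rows:
        errors[row.condition] += row.word_errors
        words[row.condition] += row.reference_words
    return {c: errors[c] / words[c] for c in words if words[c] > 0}


# Made-up rows, for illustration only.
rows = [
    ScoredInteraction("clear", 1, 20),
    ScoredInteraction("clear", 0, 10),
    ScoredInteraction("wind", 3, 15),
    ScoredInteraction("wind", 0, 5),
    ScoredInteraction("rain", 4, 20),
]
for condition, wer in sorted(wer_by_condition(rows).items()):
    print(f"{condition}: WER {wer:.3f}")
```

The same grouping works for time of day or location; the point is that a single headline number can hide a condition where accuracy is much worse.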
Also ask what the escalation rate is: the percentage of interactions that get flagged for human review. A system claiming 95% accuracy but escalating 25% of interactions isn't actually handling 95% of your volume autonomously. Our target is a sub-10% escalation rate with accuracy above 92% in standard operating conditions, and we measure both continuously.
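To make the arithmetic concrete, here is a rough sketch of how accuracy and escalation rate combine. It assumes the quoted accuracy is measured only on interactions the system handles without escalation, which is an assumption you should confirm with any vendor. The function name is hypothetical.

```python
def autonomous_correct_share(escalation_rate: float, accuracy_on_handled: float) -> float:
    """Share of all interactions completed correctly with no human involved.

    Assumes accuracy_on_handled is measured only on non-escalated interactions.
    """
    for name, value in (("escalation_rate", escalation_rate),
                        ("accuracy_on_handled", accuracy_on_handled)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return (1.0 - escalation_rate) * accuracy_on_handled


# The hypothetical vendor from the paragraph above: 95% accuracy, 25% escalation.
vendor = autonomous_correct_share(escalation_rate=0.25, accuracy_on_handled=0.95)

# The edge of the target stated above: 10% escalation, 92% accuracy.
target = autonomous_correct_share(escalation_rate=0.10, accuracy_on_handled=0.92)

print(f"vendor example: {vendor:.3f}")
print(f"target floor:   {target:.3f}")
```

Under this assumption, the 95%-accurate system with 25% escalation correctly completes only about 71% of total volume on its own, which is lower than a system at the edge of the sub-10% escalation and 92% accuracy target (about 83%).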
The gap between phone ordering ASR and drive-thru ASR is real, and any vendor that doesn't acknowledge it is either being imprecise about its testing conditions or hasn't actually deployed in a real drive-thru environment. The good news is that the gap is closable (we've closed most of it), but it requires specific engineering investment that doesn't come for free.