CATI Quality Control: The Complete Guide for Agency Operations Directors (2026)
How much monitoring is enough, what IQCS and AAPOR require, and the sampling math behind your review rate — a practitioner's guide to CATI quality control.
If you run fieldwork at a research agency, you already do quality control. You monitor calls, you validate a share of each interviewer's work, you spot-check the data file before delivery. This guide is not an argument that you should start — it's a reference for the moments when those programs get stress-tested: a client questions a wave, a new supervisor asks why the monitoring rate is 10%, or volume doubles and the review queue doesn't.
Most of what ranks for "CATI quality control" today is either decades-old academic material or contact-center QA content that has never seen a survey protocol. What's missing is the operational middle: how much monitoring is statistically enough, what the standards bodies actually require, how to score errors consistently across reviewers, and which numbers to track week over week. That's what this guide covers, with the math shown.
Two positions up front, so you know the point of view you're reading:
- Coverage is a means, not the goal. The damage from a quality problem scales with how long it runs uncorrected. A program that catches errors three days faster beats one that reviews twice as many calls a week later.
- Severity scoring beats pass/fail. A skipped screener and a slightly rushed intro are not the same finding, and a QC program that treats them the same teaches reviewers to stop recording the small stuff.
The five layers of a CATI quality control program
A complete telephone survey QC program has five layers. Most agencies run three or four of them; very few run all five deliberately, with owners and rates written down.
| Layer | When it runs | What it catches |
|---|---|---|
| Pre-field controls | Before dialing starts | Bad programming, unclear question wording, interviewers briefed on the wrong protocol |
| Live monitoring | During fieldwork, in real time | Delivery problems while they can still be corrected mid-shift |
| Post-hoc audio review | After the interview, from recordings | Question administration, probing, consent, anything audible — at a reviewable pace |
| Back-checks (re-contact validation) | During or just after fieldwork | Whether the interview really happened, with an eligible respondent, as recorded |
| Data-file checks | Continuously on accumulating data | Statistical anomalies: interview length outliers, straight-lining, implausible completes-per-hour |
Three notes from running these layers in practice:
- Pre-field is the cheapest layer per error prevented. A pilot batch of 20–30 recorded interviews, audited with the same rubric you'll use in production, surfaces protocol ambiguities before they become a thousand inconsistently administered completes. Brief interviewers on what the audit checks: the AAPOR falsification task force found that simply informing interviewers their work will be verified is itself a preventive control.
- The layers are not interchangeable. Audio review cannot confirm a respondent was eligible if the screener was read correctly to the wrong person; back-checks cannot hear a leading probe. Agencies that lean entirely on one layer inherit that layer's blind spots.
- Each layer needs a written rate and an owner. "We monitor calls" is a sentiment; "the floor supervisor live-monitors 5% during fieldwork, QC reviews 10% of recordings within 48 hours, and we back-check 10% with every interviewer covered" is a program you can defend to a client.
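To make the data-file layer concrete, here is a minimal Python sketch of two of the checks named in the table above: interview-length outliers and straight-lining. The function names, the median-absolute-deviation rule, and the thresholds (`k=3.0`, `min_items=5`) are illustrative assumptions, not values taken from any standard; calibrate them against your own studies.

```python
from statistics import median


def length_outliers(durations_min, k=3.0):
    """Return indices of interviews whose length sits far from the median.

    Uses the median absolute deviation (MAD), which is distorted less by
    the outliers themselves than a mean/standard-deviation rule.
    The cut-off k=3.0 is an assumed starting point, not a standard.
    """
    med = median(durations_min)
    mad = median(abs(d - med) for d in durations_min)
    if mad == 0:  # all lengths (nearly) identical: nothing to rank against
        return []
    return [i for i, d in enumerate(durations_min) if abs(d - med) / mad > k]


def is_straight_lined(grid_answers, min_items=5):
    """True if every item in a rating grid received the same answer.

    min_items=5 is an assumption: short grids legitimately repeat.
    """
    return len(grid_answers) >= min_items and len(set(grid_answers)) == 1


# Hypothetical interview lengths in minutes; index 5 is a 4-minute complete.
durations = [13, 14, 12, 15, 13, 4, 14, 13]
flagged = length_outliers(durations)
```

A flag from either check is a reason to pull the recording, not a finding in itself.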
What the standards actually require: IQCS, ISO 20252, AAPOR
When a client's procurement team asks what standard your QC follows, these are the reference points — and the concrete numbers buried inside them.
| Standard / guidance | Scope | The concrete number |
|---|---|---|
| IQCS Standards 2023 (UK) | Validation of fieldwork | Minimum 10% of fieldwork validated; telephone-centre validation carried out during fieldwork may be 5% |
| ISO 20252:2019 | Full research service requirements | No public normative percentage — requires documented validation and monitoring procedures; certified agencies commonly implement around 10% |
| AAPOR/ASA falsification task force (2003) | Falsification prevention and detection | Re-contact or monitor 5–15% of respondents |
| World Bank DIME guidance | Back-checks | 10–20% of observations back-checked |
| J-PAL research protocols | Back-checks | At least 10%, with every interviewer covered at least once |
Three things worth knowing about this table before you cite it back to anyone:
- "Validation" in IQCS terms is broader than listening to calls. It means confirming the interview genuinely took place and was conducted correctly — by re-contacting respondents and/or by listening to recordings or live-monitoring in the telephone centre. A program that only re-contacts, or only listens, can still be compliant; what matters is the documented combination.
- Don't attribute a percentage to ISO 20252 itself. The standard's text is paywalled, and the "10%" figures floating around online come from certified agencies describing their own implementations, not from the standard. If your client contract references ISO 20252, your defensible claim is documented procedures plus your actual rates.
- These numbers are floors, not targets. They were set when validation meant a supervisor with a headset and a callback sheet. They define the minimum a credible agency does — not the point at which quality problems stop slipping through, which is what the next section is about.
Monitoring sample size: how much review is enough
"What percentage of calls should we review?" is the most practical question in CATI quality control, and the standards answer it only with floors. Here is the actual math, so you can pick a rate deliberately.
The quantity that matters is not the project-level percentage — it's reviews per interviewer per week. Quality problems are interviewer-level behaviors: a misread scale, a skipped screener question, a leading probe that has become a habit. The probability that a review program surfaces a behavior is:
P(detected) = 1 − (1 − p)^n
where p is how often the behavior shows up in that interviewer's completes, and n is how many of their completes you review. For an interviewer doing 40 completes a week, here is what different review rates buy you in one week:
| Behavior frequency | 10% reviewed (4 calls) | 20% reviewed (8 calls) | 30% reviewed (12 calls) | 100% reviewed (40 calls) |
|---|---|---|---|---|
| In 5% of their completes | 19% | 34% | 46% | 87% |
| In 15% of their completes | 48% | 73% | 86% | 99.8% |
| In 30% of their completes | 76% | 94% | 99% | ~100% |
Read the first row again, because it's the one that bites: an interviewer mishandling a question in one of every twenty completes has an 81% chance of sailing through a week of 10% monitoring undetected. At that rate, the median time to first detection is three to four weeks — and every undetected week puts another 40 completes in your data file, two of them carrying that error.
This is not an argument that 10% is negligent. It's the honest trade-off behind it: sampled review reliably catches frequent problems and structurally misses infrequent ones. The standards' floors were calibrated to deter and detect gross misconduct, not to catch the 5%-frequency administration error that quietly biases one question.
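The detection formula is easy to run for your own volumes. The sketch below assumes, as the formula itself does, that reviewed completes are drawn at random and that the behavior shows up independently at a constant rate `p`. The function names are ours, chosen for illustration.

```python
import math


def p_detected(p, n):
    """Chance that at least one of n reviewed completes shows the behavior."""
    return 1 - (1 - p) ** n


def weeks_to_even_odds(p, reviews_per_week):
    """Weeks of review until detection is as likely as not (the median)."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    weekly_miss = (1 - p) ** reviews_per_week  # chance one week finds nothing
    return math.log(0.5) / math.log(weekly_miss)


# Reproduce the table for an interviewer doing 40 completes a week.
for p in (0.05, 0.15, 0.30):
    cells = ", ".join(f"n={n}: {p_detected(p, n):.1%}" for n in (4, 8, 12, 40))
    print(f"behavior in {p:.0%} of completes -> {cells}")

print(f"{weeks_to_even_odds(0.05, 4):.1f} weeks to even odds at 4 reviews a week")
```

Swapping in an interviewer's real weekly volume and your actual review rate shows quickly whether a flat percentage gives part-timers any meaningful chance of detection.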
Why not just review more? Capacity. A 12–15 minute CATI interview takes roughly real time to audit properly — even at faster playback, scoring against a rubric and writing up findings brings you back to 12–15 minutes per call, or four to five calls per reviewer-hour. For a room producing 1,000 completes a week:
- 10% review ≈ 100 audits ≈ 22 reviewer-hours — half an FTE.
- 20% review ≈ 200 audits ≈ 44 reviewer-hours — a full-time reviewer's week, and then some.
- 100% review ≈ 1,000 audits ≈ 220 reviewer-hours — five to six FTEs, which is why nobody does it manually. (The one documented exception is LAPOP, which audits 100% of interviews by audio — with a dedicated audit team, for academic surveys where the cost is justified.)
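The capacity arithmetic above can be checked with a few lines of Python. This is a rough sketch: `audits_per_hour=4.5` is the midpoint of the four-to-five range quoted above, and the 40-hour reviewer week is our assumption.

```python
def reviewer_hours(completes, review_rate, audits_per_hour=4.5):
    """Reviewer-hours needed to audit a share of a week's completes."""
    return completes * review_rate / audits_per_hour


def fte(hours, hours_per_week=40):
    """Convert weekly reviewer-hours to full-time equivalents (assumed 40 h week)."""
    return hours / hours_per_week


for rate in (0.10, 0.20, 1.00):
    hours = reviewer_hours(1000, rate)
    print(f"{rate:.0%} review: {hours:.0f} reviewer-hours, {fte(hours):.1f} FTE")
```

If your interviews run longer or shorter than 12–15 minutes, change `audits_per_hour` first; it moves the result more than any other input.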
Practical allocation rules that survive contact with a real phone room:
- Set the unit to the interviewer, not the project. A flat "10% of calls" sampled proportionally gives your highest-volume interviewers plenty of coverage and your part-timers none. Floor it: every active interviewer gets at least 2 reviews per week, whatever their volume.
- Front-load new interviewers. Review 100% of a new hire's first 5–10 completes, then taper. Most administration habits — good and bad — set in the first week.
- Spend the discretionary remainder on signals. After floors and new hires, allocate remaining capacity by paradata: interview-length outliers, unusual completes-per-hour, refusal-rate shifts. Random sampling is for the absence of signals, not instead of them.
Assembled into one worked example — a room of 25 interviewers (two of them new this week) producing 1,000 completes, with one full-time QC reviewer (≈180 audits a week at the pace above):
| Allocation | Audits | Rule applied |
|---|---|---|
| Floor: every active interviewer | 50 | 2 reviews × 25 interviewers, regardless of volume |
| New hires | 30 | 100% of each new interviewer's first ~10 completes, then taper |
| Paradata-triggered | 60 | Length outliers, completes-per-hour anomalies, refusal shifts |
| Random remainder | 40 | Spread across interviewers with no signals |
| Total | 180 | 18% coverage — above the IQCS floor, capacity-bound, defensible |
That table is the answer to "why is our monitoring rate what it is" — the rate falls out of reviewer capacity and allocation rules, not the other way around.
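The allocation in the table can be expressed as a simple priority queue over reviewer capacity. The sketch below is one way to write it, assuming the priority order used in this guide (floor, then new hires, then paradata flags, then random). The function and key names are illustrative.

```python
def allocate_audits(capacity, active_interviewers, new_hire_audits,
                    flagged_audits, floor=2):
    """Spend weekly audit capacity in priority order.

    Order: per-interviewer floor, new hires, paradata-triggered, random.
    Each bucket takes what it asks for or whatever capacity is left.
    """
    plan = {}
    remaining = capacity
    requests = (
        ("floor", floor * active_interviewers),
        ("new_hires", new_hire_audits),
        ("paradata", flagged_audits),
    )
    for name, wanted in requests:
        plan[name] = min(wanted, remaining)
        remaining -= plan[name]
    plan["random"] = remaining  # whatever is left is sampled at random
    return plan


# The worked example: 25 interviewers, 180 audits of capacity, 1,000 completes.
plan = allocate_audits(capacity=180, active_interviewers=25,
                       new_hire_audits=30, flagged_audits=60)
coverage = sum(plan.values()) / 1000
```

Running it with a smaller `capacity` shows which buckets get squeezed first, which is a useful conversation to have before a reviewer goes on leave.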
Back-checking rates: who to re-contact, and when
Back-checks — re-contacting respondents to verify the interview happened as recorded — are the layer most CATI shops under-invest in, partly because recordings feel like sufficient proof. They aren't: a recording proves a conversation happened, not that the respondent was eligible, not that the screener outcome was honest, and not that the incentive was delivered as promised.
The concrete guidance worth anchoring to: World Bank DIME recommends back-checking 10–20% of observations; J-PAL's data-quality guidance points the same direction — at least 10%, with every interviewer back-checked at least once and discrepancies documented and reconciled. Both come from in-person fieldwork, where falsification risk is higher than in a centralized phone room — for supervised CATI with full recording, the lower end of those ranges, targeted well, is a defensible position. For at-home CATI interviewers, treat the risk profile as closer to field interviewing and stay nearer the top end.
What a useful CATI back-check asks, in under three minutes:
- Did someone from our center interview you on or around this date? Roughly how long did it take?
- Two or three re-asked factual items — screener-critical demographics, not opinions (opinions legitimately change; your year of birth doesn't).
- If the study carried an incentive: was it offered and delivered as described?
- One open question about the interviewer's conduct.
Prioritize the same way you allocate audio review: every interviewer covered at least once per study, new interviewers and paradata outliers oversampled. A back-check program that never selects anyone unusual is theater.
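One way to turn those prioritization rules into a selection routine is sketched below. It assumes completes arrive as `(complete_id, interviewer_id)` pairs. The `priority_share` parameter (the share of the remaining slots reserved for new or flagged interviewers) is a hypothetical knob of ours, not a figure from DIME or J-PAL.

```python
import random


def select_backchecks(completes, priority_interviewers=(), rate=0.10,
                      priority_share=0.5, seed=None):
    """Pick completes to back-check.

    1. Coverage: one complete per interviewer, even if that exceeds the target.
    2. Oversample: part of the remaining slots goes to priority interviewers.
    3. Fill: the rest is drawn at random from everyone.
    """
    rng = random.Random(seed)
    target = max(1, round(len(completes) * rate))
    by_interviewer = {}
    for complete_id, interviewer_id in completes:
        by_interviewer.setdefault(interviewer_id, []).append(complete_id)

    selected = {rng.choice(ids) for ids in by_interviewer.values()}

    def draw(pool, k):
        pool = [c for c in pool if c not in selected]
        selected.update(rng.sample(pool, min(k, len(pool))))

    extra = max(0, target - len(selected))
    priority_pool = [c for i in priority_interviewers
                     for c in by_interviewer.get(i, [])]
    draw(priority_pool, round(extra * priority_share))

    everyone = [c for ids in by_interviewer.values() for c in ids]
    draw(everyone, max(0, target - len(selected)))
    return sorted(selected)


# Hypothetical study: 100 completes across 5 interviewers, "int3" is new.
completes = [(f"c{i}", f"int{i % 5}") for i in range(100)]
picks = select_backchecks(completes, priority_interviewers=["int3"], seed=1)
```

Fixing the `seed` per study makes the selection reproducible, which helps when a client asks how the back-check sample was drawn.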
One thing to settle with your DPO before any of this runs, if you field in the EU or UK: the legal basis for both recording and re-contact. Recording for quality control needs to be disclosed in the interview introduction (your consent script is also a QC item — it's in the rubric below for a reason), and re-contacting a respondent for verification is a further use of their phone number that your privacy notice should cover. The clean pattern is to ask at the close of the interview — "a supervisor may call you back briefly to verify the quality of this interview" — and to put retention limits on recordings and back-check records in writing. None of this is exotic; it just has to be written down before a client's DPO asks.
Real-time vs. post-hoc monitoring
The traditional split: live monitoring (a supervisor silently joins calls during fieldwork — the mode IQCS allows at a 5% floor during fieldwork) versus post-hoc review (auditing recordings afterwards, where the 10% validation expectation lives).
The argument for live monitoring has always been latency: a supervisor who hears a misread scale at 10:15 can correct the interviewer at 10:30, before the same error lands in another dozen completes. Peng and Feld made the systematic version of this argument in Survey Practice back in 2011: traditional survey QC resembles industrial acceptance sampling — inspecting finished goods after production — and statistical process control applied to live CATI data means "monitoring is being done live, and not after the fact." Fifteen years later, that paper is still ahead of most practice.
The argument against relying on it alone is everything else: live monitoring cannot be replayed after the fact, samples only the moments a supervisor happens to join, leaves no reviewable evidence for client disputes, and consumes your most expensive floor staff in real time.
The synthesis most operations land on, and the framing we'd defend: the variable that actually matters is not live-versus-recorded; it's the lag between an error occurring and the interviewer hearing about it. Post-hoc review that reaches the interviewer next Tuesday is acceptance sampling, whatever your coverage rate. Post-hoc review of this morning's recordings that reaches the interviewer before tomorrow's shift captures most of live monitoring's corrective value, with evidence attached, at a fraction of the supervision cost. In a two-week field period, a feedback lag of a week means an error that first appears in week one is corrected for, at best, the last few days of dialing. Whatever rates you choose, put a maximum review lag in your QC plan (we'd argue for 48 hours, and same-day where tooling allows), and measure it.
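Measuring the lag takes little more than two timestamps per reviewed interview. The sketch below assumes your QC log can export when the interview ended and when the interviewer received the finding. The function name and the report keys are illustrative.

```python
from datetime import datetime, timedelta
from statistics import median


def feedback_lag_report(events, max_lag_hours=48):
    """Summarize review lag.

    events: list of (interview_end, feedback_delivered) datetime pairs.
    Returns the median lag in hours and the share of findings delivered
    later than the limit written into the QC plan.
    """
    lags = [(delivered - ended).total_seconds() / 3600
            for ended, delivered in events]
    late = sum(1 for lag in lags if lag > max_lag_hours)
    return {"median_hours": median(lags), "share_over_limit": late / len(lags)}


# Hypothetical log: four findings delivered 6, 20, 30 and 72 hours after the call.
start = datetime(2026, 6, 1, 10, 0)
events = [(start, start + timedelta(hours=h)) for h in (6, 20, 30, 72)]
report = feedback_lag_report(events)
```

Tracking the share over the limit alongside the median matters because a healthy median can hide a tail of findings that arrive after the field period has closed.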
An error-severity rubric you can copy
Two reviewers can listen to the same call and file different verdicts — not because either is careless, but because "fail" is doing too much work. The fix is a weighted severity rubric, and the best documented pattern comes from LAPOP at Vanderbilt, whose audit teams log every issue in a Quality Assurance Chapter (QuAC) with points per error type: an interview accumulating 20 or more points is canceled and replaced, and the worst violations carry weights that auto-fail — falsifying an interview scores 100 outright. (Their FALCON-CATI methodological note documents the phone-survey adaptation.)
Here is that pattern adapted for agency CATI work. Calibrate the weights to your protocols; keep the structure.
| Finding | Points | Rationale |
|---|---|---|
| Interview falsified, in whole or part | 100 | Auto-fail; triggers the falsification protocol, not just cancellation |
| Ineligible respondent / screener not properly administered | 25 | The complete is unusable regardless of how well the rest went |
| Mandatory question skipped but an answer recorded | 20 | Fabricated data point — critical by definition |
| Recorded answer contradicts what the respondent said | 20 | The data file no longer reflects the interview |
| Consent or recording disclosure omitted | 20 | Compliance exposure independent of data quality |
| Leading probe / suggesting an answer | 10 | Biases the response, but the interview may survive one instance |
| Scale or answer options not read as scripted | 8 | Comparability damage, accumulates across statements |
| Vague answer accepted without clarification | 6 | Recoverable in coding sometimes, a habit worth catching always |
| Wording deviation that preserves meaning | 4 | Track it; it predicts worse drift |
| Rushed delivery, talking over the respondent | 2 | Coaching material, not cancellation material |
Governance notes that make a rubric like this hold up:
- The 20-point line is the cancel/replace threshold, reached either by one critical finding or by accumulation — two leading probes plus a misread scale plus accepted vagueness is also a failed interview, which is exactly the kind of "death by warnings" call a binary pass/fail lets through.
- Findings below the threshold are not noise; they're the coaching feed. The 2-to-10-point band is where interviewer development lives.
- Reclassification must be legitimate. A reviewer who listens and decides the "mismatch" was actually the respondent self-correcting should be able to downgrade or dismiss the finding, with the decision logged. A rubric nobody can override gets quietly ignored instead.
- Calibrate monthly. Have reviewers score the same two or three calls independently and compare. Inter-reviewer drift is measurable, and reviewing it is the cheapest consistency tool you have.
- Version the rubric with the questionnaire. Mid-wave revisions are routine — a reworded question, a new screener term, a dropped statement from a scale — and every one of them silently invalidates part of the rubric (and any automated flagging criteria built on it). Date-stamp rubric versions against questionnaire versions, and treat findings scored against the wrong version as void, not as arguments.
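For teams that log findings in a spreadsheet or a small tool, the rubric and the threshold logic can be sketched as follows. The point values come from the table above; the finding keys and outcome labels are hypothetical names of ours, and the `dismissed` flag stands in for the logged reviewer override described in the governance notes.

```python
RUBRIC = {
    "falsified": 100,
    "ineligible_or_screener": 25,
    "skipped_but_recorded": 20,
    "answer_mismatch": 20,
    "consent_omitted": 20,
    "leading_probe": 10,
    "scale_not_read": 8,
    "vague_accepted": 6,
    "wording_deviation": 4,
    "rushed_delivery": 2,
}
CANCEL_THRESHOLD = 20


def score_interview(findings):
    """Score one interview.

    findings: list of (finding_key, dismissed) pairs. Dismissed findings
    stay in the log but carry no points.
    Returns (total_points, outcome).
    """
    active = [key for key, dismissed in findings if not dismissed]
    total = sum(RUBRIC[key] for key in active)
    if "falsified" in active:
        outcome = "falsification protocol"
    elif total >= CANCEL_THRESHOLD:
        outcome = "cancel and replace"
    elif total > 0:
        outcome = "coaching"
    else:
        outcome = "pass"
    return total, outcome


# Accumulation case: no single critical finding, still over the line.
accumulated = score_interview([
    ("leading_probe", False),
    ("scale_not_read", False),
    ("vague_accepted", False),
])
```

Keeping the weights in one dictionary makes it straightforward to store a dated copy per questionnaire version, in line with the versioning note above.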
CATI quality KPIs worth tracking
If the QC program produces findings but no trendlines, you can't answer the only questions that matter at the ops level: is quality improving, who needs help, and is the program itself working. The set we'd track weekly, per study and per interviewer:
| KPI | Definition | Why it earns a slot |
|---|---|---|
| Time-to-feedback | Median hours from interview end to the interviewer receiving the finding | The KPI almost nobody tracks, and the one that determines how long errors repeat |
| Critical-error rate | Critical findings per 100 completes, per interviewer | The headline quality number; watch trend, not level |
| Repeat-error rate | Same finding type recurring for the same interviewer after feedback | Distinguishes a feedback problem from a training problem |
| Review coverage and fairness | % of completes reviewed, and minimum reviews per active interviewer | Catches the silent failure of proportional sampling |
| Back-check discrepancy rate | % of back-checks with a material discrepancy | Your falsification early-warning line |
| Reviewer dismissal rate | % of flagged findings dismissed on human review | Measures whether your flagging — human or AI — is worth your reviewers' time |
Per-interviewer views are where this pays off. Aggregate rates look fine right up until you split them and find one interviewer running 11% critical against a room average of 3%:
| Interviewer | Calls | Critical rate | Reviewed |
|---|---|---|---|
| #08 | 142 | 2% | 142 |
| #14 | 138 | 4% | 130 |
| #22 | 96 | 11% | 88 |
| #31 | 124 | 3% | 124 |
One honest caveat on interviewer-level metrics: small samples lie. Four completes reviewed is not a critical rate; require a minimum reviewed-call count before a number turns into a conversation with an interviewer, and treat week-one outliers as a reason to review more of their calls, not as a verdict.
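The minimum-sample rule can be built into the per-interviewer report itself. In the sketch below, `min_reviewed=30` and the flagging rule (a rate more than `ratio` times the room rate) are assumptions to tune, and the finding counts are back-calculated for illustration from the percentages in the table above, with one hypothetical low-volume interviewer added.

```python
def critical_rate_review(rows, min_reviewed=30, ratio=2.0):
    """Per-interviewer critical rates with a small-sample guard.

    rows: (interviewer, reviewed_calls, critical_findings) tuples.
    Returns {interviewer: (rate_or_None, status)}.
    """
    eligible = [r for r in rows if r[1] >= min_reviewed]
    reviewed_total = sum(r[1] for r in eligible)
    room_rate = (sum(r[2] for r in eligible) / reviewed_total
                 if reviewed_total else 0.0)

    result = {}
    for interviewer, reviewed, critical in rows:
        if reviewed < min_reviewed:
            # Too few reviews to call it a rate: review more, judge later.
            result[interviewer] = (None, "review more calls first")
            continue
        rate = critical / reviewed
        flagged = room_rate > 0 and rate > ratio * room_rate
        result[interviewer] = (rate, "discuss" if flagged else "ok")
    return result


rows = [
    ("#08", 142, 3),
    ("#14", 130, 5),
    ("#22", 88, 10),
    ("#31", 124, 4),
    ("#40", 4, 1),  # hypothetical part-timer: 1 of 4 is not a 25% rate
]
report = critical_rate_review(rows)
```

Note that the room rate here includes the outlier, which makes the flag slightly conservative; excluding the interviewer under test from the baseline is a reasonable variation.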
Where AI review fits — and where it doesn't
Recall the capacity math: full manual review of a 1,000-complete week costs five to six reviewer FTEs, which is why every manual program is a sampling program. AI call review changes that one constraint: transcription and first-pass evaluation against the protocol are automated, so the marginal cost of reviewing one more call stops being 15 minutes of a reviewer's time and becomes a per-call processing fee (vendors in this space, ours included, price per processed call; make any vendor quote that number against your volume before you take "affordable" on faith). Reviewing 100% of completes stops being something only an academic program can afford: LAPOP's 100% audit, without LAPOP's audit headcount.
Being specific about what that does and doesn't change, because the vendors in this space (ours included) are better served by precision than by promises:
What it changes. Detection probability stops depending on sampling luck: the 5%-frequency error that can survive a month of 10% sampling can be flagged the first time it occurs, because every complete is evaluated. Answers can be checked against the data file per question, which targets the mismatch class of error that audio-only review surfaces accidentally if at all. And feedback lag collapses: findings from this morning's calls are reviewable the same day, which is the real-time-monitoring benefit without a supervisor on a headset.
What it doesn't change. The verdict stays human. An AI flag is "listen to this," never an automatic accusation — false positives exist (we publish how we measure ours), and a falsification flag in particular must never reach an interviewer's file without a reviewer confirming it. Your severity rubric, your thresholds, and your coaching workflow remain yours; flagging criteria need calibration to your protocols before the flags are trustworthy, plus ongoing tuning from your reviewers' confirm/dismiss decisions — and recalibration when the questionnaire changes mid-wave, exactly like the rubric it implements. Calibration time is a claim every vendor makes and none should be allowed to keep: have them demonstrate it on a batch of your own recorded calls before you commit, not on a demo reel. And the standards still mean what they meant: back-checks verify things no audio analysis can, and your documented procedures — not your tooling — are what an auditor certifies.
The QC stack from the top of this guide doesn't get shorter with AI in it; the post-hoc layer just stops being the bottleneck, and your reviewers' hours move from listening-to-find to deciding-on-findings.
The QC plan template
Everything above, turned into a fill-in document: download the CATI QC plan template (Markdown — paste it into Google Docs, Word, or your wiki and fill in the bracketed fields). No email required. It covers:
- Scope, roles, and named owners for all five QC layers
- Rate tables to fill in: live monitoring, post-hoc review, back-checks — with the floors from this guide as reference points
- The severity rubric above, ready to re-weight
- New-interviewer onboarding review schedule
- The KPI set with a weekly review cadence
- A feedback-loop section: maximum review lag, who delivers feedback, escalation and falsification protocol
- A client-facing documentation checklist for when QC gets audited
If a client dispute, an ISO audit, or a new QC hire is on your horizon, filling this in is an afternoon that pays for itself the first time someone asks "what's your validation procedure?" and the answer is a document instead of a meeting.
Sources and refresh policy
This guide cites primary sources throughout; the load-bearing ones:
- IQCS Standards 2023 — the 10% validation minimum and the 5% during-fieldwork telephone-centre floor
- AAPOR/ASA, Interviewer Falsification in Survey Research (2003) — the 5–15% verification recommendation and prevention findings
- LAPOP Quality Control and Methodological Note 008 — the 100% audio audit and the QuAC weighted-scoring pattern
- Peng & Feld, "Quality Control in Telephone Survey Interviewer Monitoring," Survey Practice 4(2), 2011 — the acceptance-sampling critique and SPC approach
- World Bank DIME Wiki: Back Checks and J-PAL: Data quality checks — back-check rates and coverage rules
- ISO 20252:2019 — service requirements for research fieldwork
Published June 11, 2026. This guide is reviewed annually — standards editions, cited rates, and the AI capability section are re-verified at each revision, and the dateline above reflects the last review.