CATI Quality Control: The Complete Guide for Agency Operations Directors (2026)

How much monitoring is enough, what IQCS and AAPOR require, and the sampling math behind your review rate — a practitioner's guide to CATI quality control.

By the VeriCATI team · June 11, 2026 · 18 min read

If you run fieldwork at a research agency, you already do quality control. You monitor calls, you validate a share of each interviewer's work, you spot-check the data file before delivery. This guide is not an argument that you should start — it's a reference for the moments when those programs get stress-tested: a client questions a wave, a new supervisor asks why the monitoring rate is 10%, or volume doubles and the review queue doesn't.

Most of what ranks for "CATI quality control" today is either decades-old academic material or contact-center QA content that has never seen a survey protocol. What's missing is the operational middle: how much monitoring is statistically enough, what the standards bodies actually require, how to score errors consistently across reviewers, and which numbers to track week over week. That's what this guide covers, with the math shown.

Two positions up front, so you know the point of view you're reading:

Coverage is a means, not the goal. The damage from a quality problem scales with how long it runs uncorrected. A program that catches errors three days faster beats one that reviews twice as many calls a week later.
Severity scoring beats pass/fail. A skipped screener and a slightly rushed intro are not the same finding, and a QC program that treats them the same teaches reviewers to stop recording the small stuff.

The five layers of a CATI quality control program

A complete telephone survey QC program has five layers. Most agencies run three or four of them; very few run all five deliberately, with owners and rates written down.

Layer	When it runs	What it catches
Pre-field controls	Before dialing starts	Bad programming, unclear question wording, interviewers briefed on the wrong protocol
Live monitoring	During fieldwork, in real time	Delivery problems while they can still be corrected mid-shift
Post-hoc audio review	After the interview, from recordings	Question administration, probing, consent, anything audible — at a reviewable pace
Back-checks (re-contact validation)	During or just after fieldwork	Whether the interview really happened, with an eligible respondent, as recorded
Data-file checks	Continuously on accumulating data	Statistical anomalies: interview length outliers, straight-lining, implausible completes-per-hour

Three notes from running these layers in practice:

Pre-field is the cheapest layer per error prevented. A pilot batch of 20–30 recorded interviews, audited with the same rubric you'll use in production, surfaces protocol ambiguities before they become a thousand inconsistently administered completes. Brief interviewers on what the audit checks: the AAPOR falsification task force found that simply informing interviewers their work will be verified is itself a preventive control.
The layers are not interchangeable. Audio review cannot confirm a respondent was eligible if the screener was read correctly to the wrong person; back-checks cannot hear a leading probe. Agencies that lean entirely on one layer inherit that layer's blind spots.
Each layer needs a written rate and an owner. "We monitor calls" is a sentiment; "the floor supervisor live-monitors 5% during fieldwork, QC reviews 10% of recordings within 48 hours, and we back-check 10% with every interviewer covered" is a program you can defend to a client.
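
The data-file layer in the table above is the easiest one to automate. The following is a minimal sketch, not a prescribed method: it flags interview-length outliers with a median-based modified z-score (the 0.6745 constant and 3.5 cutoff are the commonly used Iglewicz-Hoaglin defaults, chosen here as an assumption) and flags a grid answered with one identical value throughout. The function names, the input shapes, and the four-item minimum are all hypothetical.

```python
from statistics import median


def length_outliers(durations, cutoff=3.5):
    """Return interview IDs whose length is far from the median.

    durations: dict of interview_id -> minutes (hypothetical input shape).
    Uses the modified z-score: 0.6745 * |x - median| / MAD.
    """
    values = list(durations.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [
        interview_id
        for interview_id, minutes in durations.items()
        if 0.6745 * abs(minutes - med) / mad > cutoff
    ]


def is_straight_lined(grid_answers, min_items=4):
    """True when every item in a grid received the same answer."""
    return len(grid_answers) >= min_items and len(set(grid_answers)) == 1


durations = {"a": 13, "b": 14, "c": 12, "d": 15, "e": 13, "f": 3}
print(length_outliers(durations))
print(is_straight_lined([4, 4, 4, 4, 4]))
```

A flag from either check is a reason to pull the recording, not a finding in itself: short interviews and uniform answers both have legitimate causes.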

What the standards actually require: IQCS, ISO 20252, AAPOR

When a client's procurement team asks what standard your QC follows, these are the reference points — and the concrete numbers buried inside them.

Standard / guidance	Scope	The concrete number
IQCS Standards 2023 (UK)	Validation of fieldwork	Minimum 10% of fieldwork validated; telephone-centre validation carried out during fieldwork may be 5%
ISO 20252:2019	Full research service requirements	No public normative percentage — requires documented validation and monitoring procedures; certified agencies commonly implement around 10%
AAPOR/ASA falsification task force (2003)	Falsification prevention and detection	Re-contact or monitor 5–15% of respondents
World Bank DIME guidance	Back-checks	10–20% of observations back-checked
J-PAL research protocols	Back-checks	At least 10%, with every interviewer covered at least once

Three things worth knowing about this table before you cite it back to anyone:

"Validation" in IQCS terms is broader than listening to calls. It means confirming the interview genuinely took place and was conducted correctly — by re-contacting respondents and/or by listening to recordings or live-monitoring in the telephone centre. A program that only re-contacts, or only listens, can still be compliant; what matters is the documented combination.
Don't attribute a percentage to ISO 20252 itself. The standard's text is paywalled, and the "10%" figures floating around online come from certified agencies describing their own implementations, not from the standard. If your client contract references ISO 20252, your defensible claim is documented procedures plus your actual rates.
These numbers are floors, not targets. They were set when validation meant a supervisor with a headset and a callback sheet. They define the minimum a credible agency does — not the point at which quality problems stop slipping through, which is what the next section is about.

Monitoring sample size: how much review is enough

"What percentage of calls should we review?" is the most practical question in CATI quality control, and the standards answer it only with floors. Here is the actual math, so you can pick a rate deliberately.

The quantity that matters is not the project-level percentage — it's reviews per interviewer per week. Quality problems are interviewer-level behaviors: a misread scale, a skipped screener question, a leading probe that has become a habit. The probability that a review program surfaces a behavior is:

P(detected) = 1 − (1 − p)^n

where p is how often the behavior shows up in that interviewer's completes, and n is how many of their completes you review. For an interviewer doing 40 completes a week, here is what different review rates buy you in one week:

Behavior frequency	10% reviewed (4 calls)	20% reviewed (8 calls)	30% reviewed (12 calls)	100% reviewed (40 calls)
In 5% of their completes	19%	34%	46%	87%
In 15% of their completes	48%	73%	86%	99.8%
In 30% of their completes	76%	94%	99%	~100%

Read the first row again, because it's the one that bites: an interviewer mishandling a question in one of every twenty completes has an 81% chance of sailing through a week of 10% monitoring undetected. At that rate, the median time to first detection is three to four weeks — and every undetected week puts another 40 completes in your data file, two of them carrying that error.
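
To rerun the table with your own completes-per-week or behavior frequencies, the formula can be sketched in a few lines. This assumes, as the formula does, that reviewed completes are picked at random and that the behavior shows up independently from one complete to the next; the function names are illustrative.

```python
import math


def p_detected(p: float, n: int) -> float:
    """Chance that at least one of n reviewed completes shows the behavior."""
    return 1 - (1 - p) ** n


def median_weeks_to_detection(p: float, reviews_per_week: int) -> float:
    """Weeks until the cumulative chance of detection reaches 50%."""
    weekly_miss = (1 - p) ** reviews_per_week
    return math.log(0.5) / math.log(weekly_miss)


COMPLETES_PER_WEEK = 40  # the interviewer in the example above
REVIEW_RATES = (0.10, 0.20, 0.30, 1.00)

for p in (0.05, 0.15, 0.30):
    cells = [p_detected(p, round(COMPLETES_PER_WEEK * r)) for r in REVIEW_RATES]
    print(f"{p:.0%} frequency: " + "  ".join(f"{c:.1%}" for c in cells))

weeks = median_weeks_to_detection(0.05, 4)
print(f"median weeks to first detection at 10% review, 5% frequency: {weeks:.1f}")
```

The last line works out to roughly 3.4 weeks, which is where the "three to four weeks" figure comes from.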

This is not an argument that 10% is negligent. It's the honest trade-off behind it: sampled review reliably catches frequent problems and structurally misses infrequent ones. The standards' floors were calibrated to deter and detect gross misconduct, not to catch the 5%-frequency administration error that quietly biases one question.

Why not just review more? Capacity. A 12–15 minute CATI interview takes roughly real time to audit properly — even at faster playback, scoring against a rubric and writing up findings brings you back to 12–15 minutes per call, or four to five calls per reviewer-hour. For a room producing 1,000 completes a week:

10% review ≈ 100 audits ≈ 22 reviewer-hours — half an FTE.
20% review ≈ 200 audits ≈ 44 reviewer-hours — a full-time reviewer's week, and then some.
100% review ≈ 1,000 audits ≈ 220 reviewer-hours — five to six FTEs, which is why nobody does it manually. (The one documented exception is LAPOP, which audits 100% of interviews by audio — with a dedicated audit team, for academic surveys where the cost is justified.)
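
The same arithmetic, as a sketch you can point at your own volume. Two inputs are assumptions rather than figures from any standard: 4.5 audits per reviewer-hour (the midpoint of "four to five") and a 40-hour reviewer week.

```python
CALLS_PER_REVIEWER_HOUR = 4.5  # assumed midpoint of "four to five"
HOURS_PER_FTE_WEEK = 40        # assumed full-time week


def reviewer_hours(audits: int) -> float:
    """Reviewer time needed to audit the given number of calls."""
    return audits / CALLS_PER_REVIEWER_HOUR


def ftes(audits: int) -> float:
    """The same workload expressed in full-time reviewers."""
    return reviewer_hours(audits) / HOURS_PER_FTE_WEEK


def weekly_capacity(reviewers: float) -> int:
    """Audits a team can complete in one week at the assumed pace."""
    return int(reviewers * HOURS_PER_FTE_WEEK * CALLS_PER_REVIEWER_HOUR)


weekly_completes = 1_000
for rate in (0.10, 0.20, 1.00):
    audits = round(weekly_completes * rate)
    print(f"{rate:.0%}: {audits} audits, {reviewer_hours(audits):.0f} h, {ftes(audits):.1f} FTE")
```

Running it the other way, `weekly_capacity(1)` gives the audits one full-time reviewer can absorb, which is the number an allocation plan has to fit inside.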

Practical allocation rules that survive contact with a real phone room:

Set the unit to the interviewer, not the project. A flat "10% of calls" sampled proportionally gives your highest-volume interviewers plenty of coverage and your part-timers none. Floor it: every active interviewer gets at least 2 reviews per week, whatever their volume.
Front-load new interviewers. Review 100% of a new hire's first 5–10 completes, then taper. Most administration habits — good and bad — set in the first week.
Spend the discretionary remainder on signals. After floors and new hires, allocate remaining capacity by paradata: interview-length outliers, unusual completes-per-hour, refusal-rate shifts. Random sampling is for the absence of signals, not instead of them.

Assembled into one worked example — a room of 25 interviewers (two of them new this week) producing 1,000 completes, with one full-time QC reviewer (≈180 audits a week at the pace above):

Allocation	Audits	Rule applied
Floor: every active interviewer	50	2 reviews × 25 interviewers, regardless of volume
New hires	30	100% of each new interviewer's first ~10 completes, then taper
Paradata-triggered	60	Length outliers, completes-per-hour anomalies, refusal shifts
Random remainder	40	Spread across interviewers with no signals
Total	180	18% coverage — above the IQCS floor, capacity-bound, defensible

That table is the answer to "why is our monitoring rate what it is" — the rate falls out of reviewer capacity and allocation rules, not the other way around.
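
The allocation rules can be turned into a small planner. What follows is one possible sketch, not the only reasonable implementation: the `Interviewer` fields, the 60/40 split between flagged and unflagged interviewers (`signal_share`), and the round-robin hand-out are all assumptions. The example room is invented too, so its per-bucket counts will not match the table line for line. The plan only sets how many completes to review per interviewer; which recordings get pulled within that count would still be chosen at random.

```python
from dataclasses import dataclass


@dataclass
class Interviewer:
    id: str
    completes: int
    is_new: bool = False
    flagged: bool = False  # hypothetical paradata flag (length, completes/hour, refusals)


def _spread(plan, group, budget):
    """Hand out up to `budget` extra reviews round-robin; return how many were used."""
    used = 0
    while used < budget:
        open_slots = [iv for iv in group if plan[iv.id] < iv.completes]
        if not open_slots:
            break
        for iv in open_slots:
            if used == budget:
                break
            plan[iv.id] += 1
            used += 1
    return used


def allocate_reviews(interviewers, capacity, floor=2, new_hire_first=10, signal_share=0.6):
    """Return {interviewer_id: number of completes to review this week}."""
    # Floors and new-hire front-loading are allocated before anything else.
    plan = {
        iv.id: min(iv.completes, new_hire_first if iv.is_new else floor)
        for iv in interviewers
    }
    remaining = capacity - sum(plan.values())
    if remaining < 0:
        raise ValueError("capacity does not cover floors and new-hire reviews")
    flagged = [iv for iv in interviewers if iv.flagged]
    others = [iv for iv in interviewers if not iv.flagged]
    # A share of the remainder goes to flagged interviewers; the rest is spread evenly.
    remaining -= _spread(plan, flagged, round(remaining * signal_share))
    _spread(plan, others, remaining)
    return plan


# Invented room: 25 interviewers at 40 completes, two new, five flagged by paradata.
room = [
    Interviewer(f"#{i:02d}", 40, is_new=i <= 2, flagged=3 <= i <= 7)
    for i in range(1, 26)
]
plan = allocate_reviews(room, capacity=180)
print(sum(plan.values()), min(plan.values()), plan["#01"], plan["#03"])
```

Raising an error when capacity cannot cover the floors is deliberate in this sketch: a plan that silently drops the floor for part-timers reproduces the proportional-sampling problem the floor exists to fix.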

Back-checking rates: who to re-contact, and when

Back-checks — re-contacting respondents to verify the interview happened as recorded — are the layer most CATI shops under-invest in, partly because recordings feel like sufficient proof. They aren't: a recording proves a conversation happened, not that the respondent was eligible, not that the screener outcome was honest, and not that the incentive was delivered as promised.

The concrete guidance worth anchoring to: World Bank DIME recommends back-checking 10–20% of observations; J-PAL's data-quality guidance points the same direction — at least 10%, with every interviewer back-checked at least once and discrepancies documented and reconciled. Both come from in-person fieldwork, where falsification risk is higher than in a centralized phone room — for supervised CATI with full recording, the lower end of those ranges, targeted well, is a defensible position. For at-home CATI interviewers, treat the risk profile as closer to field interviewing and stay nearer the top end.

What a useful CATI back-check asks, in under three minutes:

Did someone from our center interview you on or around this date? Roughly how long did it take?
Two or three re-asked factual items — screener-critical demographics, not opinions (opinions legitimately change; your year of birth doesn't).
If the study carried an incentive: was it offered and delivered as described?
One open question about the interviewer's conduct.

Prioritize the same way you allocate audio review: every interviewer covered at least once per study, new interviewers and paradata outliers oversampled. A back-check program that never selects anyone unusual is theater.
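
One way to make that prioritization concrete is a two-pass draw: a coverage pass that takes one complete from every interviewer, then a weighted pass that fills the rest of the quota while favoring new interviewers and paradata outliers. The sketch below assumes a hypothetical record shape (`id`, `interviewer`, `priority`) and an arbitrary 3x weight for priority completes; neither comes from DIME or J-PAL.

```python
import random


def select_backchecks(completes, rate=0.10, priority_weight=3.0, seed=None):
    """Pick complete IDs to re-contact.

    completes: list of dicts with hypothetical keys 'id', 'interviewer', 'priority'
    (priority=True for new interviewers and paradata outliers).
    """
    rng = random.Random(seed)
    target = max(1, round(len(completes) * rate))

    by_interviewer = {}
    for c in completes:
        by_interviewer.setdefault(c["interviewer"], []).append(c)

    # Coverage pass: one randomly chosen complete per interviewer.
    chosen = {rng.choice(group)["id"] for group in by_interviewer.values()}

    # Weighted pass: fill the remaining quota, favoring priority completes.
    pool = [c for c in completes if c["id"] not in chosen]
    while len(chosen) < target and pool:
        weights = [priority_weight if c["priority"] else 1.0 for c in pool]
        pick = rng.choices(pool, weights=weights, k=1)[0]
        chosen.add(pick["id"])
        pool.remove(pick)
    return sorted(chosen)


# Invented study: 300 completes across 10 interviewers, one of them prioritized.
completes = [
    {"id": i, "interviewer": f"#{i % 10:02d}", "priority": i % 10 == 0}
    for i in range(300)
]
sample = select_backchecks(completes, rate=0.10, seed=42)
print(len(sample))
```

If there are more interviewers than the rate-based quota allows, the coverage pass wins and the sample comes out larger than the nominal rate, which matches the "every interviewer at least once" rule.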

One thing to settle with your DPO before any of this runs, if you field in the EU or UK: the legal basis for both recording and re-contact. Recording for quality control needs to be disclosed in the interview introduction (your consent script is also a QC item — it's in the rubric below for a reason), and re-contacting a respondent for verification is a further use of their phone number that your privacy notice should cover. The clean pattern is to ask at the close of the interview — "a supervisor may call you back briefly to verify the quality of this interview" — and to put retention limits on recordings and back-check records in writing. None of this is exotic; it just has to be written down before a client's DPO asks.

Real-time vs. post-hoc monitoring

The traditional split: live monitoring (a supervisor silently joins calls during fieldwork — the mode IQCS allows at a 5% floor during fieldwork) versus post-hoc review (auditing recordings afterwards, where the 10% validation expectation lives).

The argument for live monitoring has always been latency: a supervisor who hears a misread scale at 10:15 can correct the interviewer at 10:30, before the same error lands in another dozen completes. Peng and Feld made the systematic version of this argument in Survey Practice back in 2011: traditional survey QC resembles industrial acceptance sampling — inspecting finished goods after production — and statistical process control applied to live CATI data means "monitoring is being done live, and not after the fact." Fifteen years later, that paper is still ahead of most practice.

The argument against relying on it alone is everything else: live monitoring is unrecordable after the fact, samples only the moments a supervisor happens to join, leaves no reviewable evidence for client disputes, and consumes your most expensive floor staff in real time.

The synthesis most operations land on, and the framing we'd defend: the variable that actually matters is not live-versus-recorded; it's the lag between an error occurring and the interviewer hearing about it. Post-hoc review that reaches the interviewer next Tuesday is acceptance sampling, whatever your coverage rate. Post-hoc review of this morning's recordings that reaches the interviewer before tomorrow's shift captures most of live monitoring's corrective value, with evidence attached, at a fraction of the supervision cost. In a two-week field period, a feedback lag of a week means an error made in week one is corrected for, at best, the last few days of dialing. Whatever rates you choose, put a maximum review lag in your QC plan (we'd argue for 48 hours, and same-day where tooling allows), and measure it.
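
Measuring the lag takes two timestamps per finding: when the interview ended and when the interviewer received the feedback. A minimal sketch, assuming findings are available as records with hypothetical `interview_end` and `feedback_at` fields:

```python
from datetime import datetime, timedelta
from statistics import median

MAX_REVIEW_LAG = timedelta(hours=48)  # the ceiling argued for above


def feedback_lag_report(findings, max_lag=MAX_REVIEW_LAG):
    """findings: list of dicts with hypothetical keys 'interview_end' and 'feedback_at'."""
    lags = [f["feedback_at"] - f["interview_end"] for f in findings]
    return {
        "median_hours": median(lag.total_seconds() / 3600 for lag in lags),
        "share_over_max": sum(lag > max_lag for lag in lags) / len(lags),
    }


# Invented data: three findings with lags of 6, 30, and 72 hours.
end = datetime(2026, 6, 8, 10, 0)
findings = [
    {"interview_end": end, "feedback_at": end + timedelta(hours=6)},
    {"interview_end": end, "feedback_at": end + timedelta(hours=30)},
    {"interview_end": end, "feedback_at": end + timedelta(hours=72)},
]
report = feedback_lag_report(findings)
print(report)
```

Reporting the share over the ceiling alongside the median is a design choice: a healthy median can hide a tail of findings that reach interviewers a week late.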

An error-severity rubric you can copy

Two reviewers can listen to the same call and file different verdicts — not because either is careless, but because "fail" is doing too much work. The fix is a weighted severity rubric, and the best documented pattern comes from LAPOP at Vanderbilt, whose audit teams log every issue in a Quality Assurance Chapter (QuAC) with points per error type: an interview accumulating 20 or more points is canceled and replaced, and the worst violations carry weights that auto-fail — falsifying an interview scores 100 outright. (Their FALCON-CATI methodological note documents the phone-survey adaptation.)

Here is that pattern adapted for agency CATI work. Calibrate the weights to your protocols; keep the structure.

Finding	Points	Rationale
Interview falsified, in whole or part	100	Auto-fail; triggers the falsification protocol, not just cancellation
Ineligible respondent / screener not properly administered	25	The complete is unusable regardless of how well the rest went
Mandatory question skipped but an answer recorded	20	Fabricated data point — critical by definition
Recorded answer contradicts what the respondent said	20	The data file no longer reflects the interview
Consent or recording disclosure omitted	20	Compliance exposure independent of data quality
Leading probe / suggesting an answer	10	Biases the response, but the interview may survive one instance
Scale or answer options not read as scripted	8	Comparability damage, accumulates across statements
Vague answer accepted without clarification	6	Recoverable in coding sometimes, a habit worth catching always
Wording deviation that preserves meaning	4	Track it; it predicts worse drift
Rushed delivery, talking over the respondent	2	Coaching material, not cancellation material

Governance notes that make a rubric like this hold up:

The 20-point line is the cancel/replace threshold, reached either by one critical finding or by accumulation — two leading probes plus a misread scale plus accepted vagueness is also a failed interview, which is exactly the kind of "death by warnings" call a binary pass/fail lets through.
Findings below the threshold are not noise; they're the coaching feed. The 2-to-10-point band is where interviewer development lives.
Reclassification must be legitimate. A reviewer who listens and decides the "mismatch" was actually the respondent self-correcting should be able to downgrade or dismiss the finding, with the decision logged. A rubric nobody can override gets quietly ignored instead.
Calibrate monthly. Have reviewers score the same two or three calls independently and compare. Inter-reviewer drift is measurable, and reviewing it is the cheapest consistency tool you have.
Version the rubric with the questionnaire. Mid-wave revisions are routine — a reworded question, a new screener term, a dropped statement from a scale — and every one of them silently invalidates part of the rubric (and any automated flagging criteria built on it). Date-stamp rubric versions against questionnaire versions, and treat findings scored against the wrong version as void, not as arguments.
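
The rubric and its governance rules translate directly into a scoring function. The sketch below copies the points from the table above; the short finding codes, the `(code, dismissed)` input shape, and the output keys are hypothetical. Dismissed findings stay in the reviewer's log but do not count toward the score.

```python
# Points copied from the rubric table above; the short codes are hypothetical labels.
RUBRIC = {
    "falsified": 100,
    "ineligible_or_screener": 25,
    "mandatory_skipped_answer_recorded": 20,
    "answer_contradicts_respondent": 20,
    "consent_or_disclosure_omitted": 20,
    "leading_probe": 10,
    "scale_not_read_as_scripted": 8,
    "vague_answer_accepted": 6,
    "wording_deviation": 4,
    "rushed_delivery": 2,
}
CANCEL_THRESHOLD = 20


def score_interview(findings):
    """findings: list of (code, dismissed) pairs logged by a reviewer."""
    live = [code for code, dismissed in findings if not dismissed]
    points = sum(RUBRIC[code] for code in live)
    return {
        "points": points,
        "cancel_and_replace": points >= CANCEL_THRESHOLD,
        "falsification_protocol": "falsified" in live,
        # The 2-to-10-point band feeds coaching whether or not the interview fails.
        "coaching": [code for code in live if RUBRIC[code] <= 10],
    }


accumulated = score_interview([
    ("leading_probe", False),
    ("leading_probe", False),
    ("scale_not_read_as_scripted", False),
    ("vague_answer_accepted", False),
])
print(accumulated["points"], accumulated["cancel_and_replace"])
```

Versioning the rubric with the questionnaire would amount to keeping one such `RUBRIC` mapping per questionnaire version and scoring each interview against the version it was fielded under.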

CATI quality KPIs worth tracking

If the QC program produces findings but no trendlines, you can't answer the only questions that matter at the ops level: is quality improving, who needs help, and is the program itself working. The set we'd track weekly, per study and per interviewer:

KPI	Definition	Why it earns a slot
Time-to-feedback	Median hours from interview end to the interviewer receiving the finding	The KPI almost nobody tracks, and the one that determines how long errors repeat
Critical-error rate	Critical findings per 100 completes, per interviewer	The headline quality number; watch trend, not level
Repeat-error rate	Same finding type recurring for the same interviewer after feedback	Distinguishes a feedback problem from a training problem
Review coverage and fairness	% of completes reviewed, and minimum reviews per active interviewer	Catches the silent failure of proportional sampling
Back-check discrepancy rate	% of back-checks with a material discrepancy	Your falsification early-warning line
Reviewer dismissal rate	% of flagged findings dismissed on human review	Measures whether your flagging — human or AI — is worth your reviewers' time

Per-interviewer views are where this pays off. Aggregate rates look fine right up until you split them and find one interviewer running 11% critical against a room average of 3%:

VeriCATI · Interviewer stats

Interviewer	Calls	Critical rate	Reviewed
#08	142	2%	142
#14	138	4%	130
#22	96	11%	88
#31	124	3%	124

One honest caveat on interviewer-level metrics: small samples lie. Four completes reviewed is not a critical rate; require a minimum reviewed-call count before a number turns into a conversation with an interviewer, and treat week-one outliers as a reason to review more of their calls, not as a verdict.
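
A confidence interval makes the small-sample caveat visible. The sketch below uses the standard Wilson score interval for a proportion. Everything else is an assumption for illustration: the 30-review minimum, the counts, and the choice to treat the rate as the share of reviewed calls with at least one critical finding.

```python
from math import sqrt

MIN_REVIEWED = 30  # assumed cutoff; pick your own and write it in the QC plan


def wilson_interval(hits: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a proportion (hits out of n)."""
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def critical_rate_report(interviewer, critical_calls, reviewed, min_reviewed=MIN_REVIEWED):
    """Rate of reviewed calls with a critical finding, with its uncertainty."""
    low, high = wilson_interval(critical_calls, reviewed)
    return {
        "interviewer": interviewer,
        "rate": critical_calls / reviewed,
        "interval": (round(low, 3), round(high, 3)),
        "actionable": reviewed >= min_reviewed,
    }


# Invented counts: an established interviewer and a new hire with four reviews.
print(critical_rate_report("#22", critical_calls=10, reviewed=88))
print(critical_rate_report("new hire", critical_calls=1, reviewed=4))
```

With four reviewed calls, the interval around a 25% rate spans most of the range from near zero to about 70%, which is the numerical form of "review more of their calls" rather than "have the conversation".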

Where AI review fits — and where it doesn't

Recall the capacity math: full manual review of a 1,000-complete week costs five to six reviewer FTEs, which is why every manual program is a sampling program. AI call review changes that one constraint — transcription and first-pass evaluation against the protocol are automated, so the marginal cost of reviewing one more call stops being 15 minutes of a reviewer's time and becomes a per-call processing fee (vendors in this space, ours included, price per processed call — make any vendor quote that number against your volume before you take "affordable" on faith). Reviewing 100% of completes stops being an academic option: LAPOP's 100% audit, without LAPOP's audit headcount.

Being specific about what that does and doesn't change, because the vendors in this space (ours included) are better served by precision than by promises:

What it changes. Detection probability stops depending on sampling luck: the 5%-frequency error that can survive a month of 10% sampling shows up in the findings the first day it occurs, because every complete is evaluated. Answers can be checked against the data file per question, which covers the mismatch class of error that audio-only review surfaces accidentally if at all. And feedback lag collapses: findings from this morning's calls are reviewable the same day, which is the real-time-monitoring benefit without a supervisor on a headset.

What it doesn't change. The verdict stays human. An AI flag is "listen to this," never an automatic accusation: false positives exist (we publish how we measure ours), and a falsification flag in particular must never reach an interviewer's file without a reviewer confirming it. Your severity rubric, your thresholds, and your coaching workflow remain yours; flagging criteria need calibration to your protocols before the flags are trustworthy, plus ongoing tuning from your reviewers' confirm/dismiss decisions, and recalibration when the questionnaire changes mid-wave, exactly like the rubric they implement. Calibration time is a claim every vendor makes and none should be allowed to leave unverified: have them demonstrate it on a batch of your own recorded calls before you commit, not on a demo reel. And the standards still mean what they meant: back-checks verify things no audio analysis can, and your documented procedures, not your tooling, are what an auditor certifies.

The QC stack from the top of this guide doesn't get shorter with AI in it; the post-hoc layer just stops being the bottleneck, and your reviewers' hours move from listening-to-find to deciding-on-findings.

The QC plan template

Everything above, turned into a fill-in document: download the CATI QC plan template (Markdown — paste it into Google Docs, Word, or your wiki and fill in the bracketed fields). No email required. It covers:

Scope, roles, and named owners for all five QC layers
Rate tables to fill in: live monitoring, post-hoc review, back-checks — with the floors from this guide as reference points
The severity rubric above, ready to re-weight
New-interviewer onboarding review schedule
The KPI set with a weekly review cadence
A feedback-loop section: maximum review lag, who delivers feedback, escalation and falsification protocol
A client-facing documentation checklist for when QC gets audited

If a client dispute, an ISO audit, or a new QC hire is on your horizon, filling this in is an afternoon that pays for itself the first time someone asks "what's your validation procedure?" and the answer is a document instead of a meeting.

Sources and refresh policy

This guide cites primary sources throughout; the load-bearing ones:

IQCS Standards 2023 — the 10% validation minimum and the 5% during-fieldwork telephone-centre floor
AAPOR/ASA, Interviewer Falsification in Survey Research (2003) — the 5–15% verification recommendation and prevention findings
LAPOP Quality Control and Methodological Note 008 — the 100% audio audit and the QuAC weighted-scoring pattern
Peng & Feld, "Quality Control in Telephone Survey Interviewer Monitoring," Survey Practice 4(2), 2011 — the acceptance-sampling critique and SPC approach
World Bank DIME Wiki: Back Checks and J-PAL: Data quality checks — back-check rates and coverage rules
ISO 20252:2019 — service requirements for research fieldwork

Published June 11, 2026. This guide is reviewed annually — standards editions, cited rates, and the AI capability section are re-verified at each revision, and the dateline above reflects the last review.