The CATI Interviewer Monitoring Checklist (Free Template + Scoring Rubric)
Free CATI interviewer monitoring checklist with 7 sections, weighted severity scoring, cadence by interviewer tenure, and inter-monitor calibration guidance.
Search for "CATI interviewer monitoring checklist" and you will find: a paywalled 2008 Sage handbook chapter, a 1989 Quirks piece, a statistics paper on sample design, and several form-builder templates with items like "Was the interviewer polite?" These are not the same thing as a working monitoring rubric, and they are not what you need when your QC supervisor is about to listen to their first hundred calls.
This page has the checklist itself — in the page, not behind a form. Seven sections covering every auditable dimension of a CATI interview, each item carrying a severity weight you can use for scoring. Below the checklist: how to allocate review coverage by interviewer tenure, how to run an inter-monitor calibration session, and how the full checklist changes when AI analysis covers 100% of calls instead of a sampled fraction.
The checklist is designed to be copied into a spreadsheet, printed as a paper form, or adapted into your CATI platform's QA module. Weights are suggested starting points — recalibrate to your protocol and adjust when the questionnaire changes.
How scoring works
Each checklist item carries a severity weight. A reviewed call accumulates points for each finding; the total determines the outcome:
- 0–7 points: Pass — no required action, findings feed the coaching record
- 8–19 points: Warning — coaching session required; supervisor reviews the call with the interviewer
- 20+ points: Fail — call is excluded and replaced; escalate per your falsification protocol if fabrication is suspected
- 100 points (auto-fail): Confirmed or strongly suspected fabrication — triggers the falsification protocol, not just exclusion
A call can fail by accumulation: a leading probe (10 pts) plus a misread scale (8 pts) plus a vague answer accepted without probing (6 pts) is 24 points, a failed call. A binary pass/fail system would log this as a minor delivery call and move on. Don't run a binary system.
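The accumulation logic above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Finding` class, the `score_call` function, and the `fabrication_suspected` flag are hypothetical names, and the item numbers attached to the three example findings (5.1, 5.3, 4.1) are one plausible mapping onto the checklist below.

```python
from dataclasses import dataclass

WARNING_THRESHOLD = 8   # lower bound of the Warning band
FAIL_THRESHOLD = 20     # lower bound of the Fail band
AUTO_FAIL = 100         # sentinel for suspected fabrication, never summed to


@dataclass
class Finding:
    item: str    # checklist item number, e.g. "5.1" (hypothetical mapping)
    points: int  # severity weight of that item


def score_call(findings, fabrication_suspected=False):
    """Return (total_points, outcome) for one reviewed call."""
    if fabrication_suspected:
        # Fabrication routes to a separate workflow regardless of the sum.
        return AUTO_FAIL, "Fabrication suspected"
    total = sum(f.points for f in findings)
    if total >= FAIL_THRESHOLD:
        return total, "Fail"
    if total >= WARNING_THRESHOLD:
        return total, "Warning"
    return total, "Pass"


# Leading probe + misread scale + vague answer accepted without probing
example = [Finding("5.1", 10), Finding("5.3", 8), Finding("4.1", 6)]
print(score_call(example))  # (24, 'Fail')
```

Keeping the fabrication flag separate from the sum mirrors the point that the 100 is a routing signal rather than a running total.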
This weighted structure adapts the LAPOP QuAC rubric — cancel at 20 accumulated points, falsification scored as an automatic 100 — to agency CATI work.
For the full statistical argument behind these thresholds — including how monitoring rate interacts with detection probability per behavior frequency — see the CATI Quality Control guide.
The complete monitoring checklist
Copy, print, or adapt. Items marked C are critical (20+ points each); W are warnings (8–15 points); M are minor (2–6 points). N/A is a valid score for items not applicable to a given survey design.
Section 1: Introduction and consent
| # | Item | Sev | Pts | Finding / notes |
|---|---|---|---|---|
| 1.1 | Interviewer identified themselves and the research organization as scripted | M | 4 | |
| 1.2 | Study purpose explained as required by the protocol | M | 4 | |
| 1.3 | Recording disclosure given verbatim before any substantive content | C | 20 | Omission is a compliance exposure independent of data quality |
| 1.4 | Consent obtained and confirmed before proceeding | C | 20 | |
| 1.5 | Confidentiality/anonymity assurance given as scripted | M | 4 | |
Section 2: Screener administration
| # | Item | Sev | Pts | Finding / notes |
|---|---|---|---|---|
| 2.1 | All screener questions read as scripted, in order | W | 10 | |
| 2.2 | Eligibility decision correctly reflects the respondent's answers | C | 25 | A wrong pass is a fabricated complete; a wrong fail is a lost eligible |
| 2.3 | Ineligible respondents terminated correctly and professionally | M | 4 | |
| 2.4 | No coaching of respondents toward eligible answers | C | 20 | Treat as partial falsification if observed |
Section 3: Question administration
| # | Item | Sev | Pts | Finding / notes |
|---|---|---|---|---|
| 3.1 | All mandatory questions administered | C | 20 | A mandatory question skipped with an answer recorded is a fabricated data point |
| 3.2 | Questions read verbatim or close equivalent (meaning preserved) | W | 8 | Wording deviations that change meaning score higher; track even minor deviations |
| 3.3 | Skip/routing patterns followed correctly | C | 20 | Wrong routing can render multiple downstream answers meaningless |
| 3.4 | Questions not repeated or summarized before respondent answers | M | 4 | |
| 3.5 | Questions delivered at a pace that allows the respondent to process them before answering | M | 4 | Consistently rushed pace predicts probing failures downstream and is independently scorable; duration outliers belong in paradata review, not here |
Section 4: Probing
| # | Item | Sev | Pts | Finding / notes |
|---|---|---|---|---|
| 4.1 | Vague or inadequate responses probed appropriately (once, then accepted) | M | 6 | Missing a required probe is the item; probing too aggressively scores under 5.1 |
| 4.2 | Probes used are neutral — question repeated, or generic non-directive probe | W | 10 | A leading probe (see 5.1) scores higher than a missing probe |
| 4.3 | Interviewer waited for respondent to answer before continuing | M | 4 | Cutting off is both a delivery issue and a reliability risk |
| 4.4 | Clarification questions from the respondent answered correctly per protocol | M | 4 | |
Section 5: Neutrality
| # | Item | Sev | Pts | Finding / notes |
|---|---|---|---|---|
| 5.1 | No leading probe or suggestion of an acceptable answer | W | 10 | Even a tone shift that signals a preferred answer scores here |
| 5.2 | Reactions to respondent answers neutral — no verbal reinforcement | M | 4 | "Great," "perfect," "exactly" after scale items bias subsequent responses |
| 5.3 | Scale options read in full, in order, without emphasis on any item | W | 8 | Partial reading or emphasis systematically biases scale data |
| 5.4 | Answer options not read aloud for unaided or spontaneous recall items | W | 8 | Reading options that should be spontaneous contaminates the distribution |
| 5.5 | No personal opinions, commentary, or side conversations | W | 8 | |
Section 6: Recording accuracy
| # | Item | Sev | Pts | Finding / notes |
|---|---|---|---|---|
| 6.1 | Recorded answer matches what the respondent said | C | 20 | Confirmed mismatch is a data-file integrity issue; check against the data file if available |
| 6.2 | Open-end verbatim captured fully and accurately | W | 10 | Partial capture may be unavoidable; verify against what is audible |
| 6.3 | No answers recorded for questions the respondent did not answer (or refused) | C | 20 | If a pattern — see falsification protocol |
| 6.4 | Respondent's final answer recorded (not a mid-deliberation value) | W | 8 | |
Note on data-file availability: Items 6.1 and 6.3 require comparison against the data file to score with confidence. When the data file is not yet available at time of review, mark these items "Pending — verify at data close." Do not score Pass without verification: a reflexive pass defeats the purpose of the section. Schedule a second-pass review when the export is ready.
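One way to keep "Pending" from silently turning into "Pass" in a spreadsheet or script is to store pending items as a missing value rather than as zero points. The sketch below assumes a hypothetical `summarize` helper and a per-call dictionary of item scores. It is an illustration of the note above, not a feature of any particular CATI platform.

```python
from typing import Optional


def summarize(scores: dict[str, Optional[int]]):
    """Summarize one call's item scores.

    scores maps checklist item number -> points (0 means no finding).
    None means "Pending: verify at data close".
    """
    pending = sorted(item for item, pts in scores.items() if pts is None)
    total = sum(pts for pts in scores.values() if pts is not None)
    return {
        "points_so_far": total,
        "pending": pending,
        "final": not pending,  # outcome is provisional while anything is pending
    }


# Data file not yet exported: 6.1 and 6.3 cannot be scored with confidence.
first_pass = summarize({"5.2": 4, "6.1": None, "6.3": None, "6.4": 0})
print(first_pass["pending"], first_pass["final"])
```

A call whose summary is not final stays in the second-pass queue until the export is ready.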
Section 7: Closing
| # | Item | Sev | Pts | Finding / notes |
|---|---|---|---|---|
| 7.1 | Interview closed as scripted — completion confirmed | M | 2 | |
| 7.2 | Incentive explained and delivery process described, if applicable | W | 8 | Non-delivery or incorrect description is a separate operations issue |
| 7.3 | Respondent given opportunity to ask questions or raise concerns | M | 2 | |
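If you recalibrate weights for your own protocol, a quick consistency check helps keep each item inside the band its severity letter promises (C at 20 or more, W at 8–15, M at 2–6). The sketch below uses only an excerpt of the items from the tables above. `SEVERITY_BANDS`, `ITEMS`, and `out_of_band` are hypothetical names chosen for illustration.

```python
# (low, high) point bounds per severity letter; None means no upper bound.
SEVERITY_BANDS = {"C": (20, None), "W": (8, 15), "M": (2, 6)}

# Excerpt of checklist items: (item number, severity, points).
ITEMS = [
    ("1.3", "C", 20),
    ("2.2", "C", 25),
    ("3.2", "W", 8),
    ("4.1", "M", 6),
    ("4.2", "W", 10),
    ("7.1", "M", 2),
]


def out_of_band(items):
    """Return item numbers whose points fall outside their severity band."""
    bad = []
    for number, severity, points in items:
        low, high = SEVERITY_BANDS[severity]
        if points < low or (high is not None and points > high):
            bad.append(number)
    return bad


print(out_of_band(ITEMS))  # []
```

Running the check after every weight change catches cases such as a "minor" item drifting up to warning-level points without its label being updated.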
Scoring summary
| Total points | Outcome | Action |
|---|---|---|
| 0–7 | Pass | Add findings to coaching record; no immediate action required |
| 8–19 | Warning | Coaching session with the interviewer within 48 hours; document and track for recurrence |
| 20–99 | Fail | Exclude and replace the call; escalate to supervisor; follow your error escalation protocol |
| 100 (auto) | Fabrication suspected | Falsification protocol — do not inform the interviewer until evidence is assembled; see the falsification guide |
The 100-point auto-fail is a sentinel value, not an accumulation target. Falsification is a qualitative finding — one confirmed fabricated answer is not the same kind of finding as four accumulated probe failures. The 100 exists to separate "this call failed on quality grounds" from "this call requires the falsification protocol." You should never be adding up to 100; you should be asking: does the evidence pattern trigger that separate workflow?
Monitoring cadence by interviewer tenure
A flat "10% of all calls" allocation concentrates reviews on your highest-volume interviewers and leaves your newest hires with almost none, which is the opposite of where monitoring effort pays off.
The more defensible approach: set the unit to the interviewer, not the project, and front-load new hires.
| Interviewer status | Recommended review rate | Rationale |
|---|---|---|
| First 5 completes | 100% | Habits form in the first week; catch deviations before they become habits |
| First month (completes 6–50) | 50% | Taper as baseline confidence builds; maintain high coverage for first full study |
| First quarter | 25% | Continue above the standard floor while tenure is short |
| Established, no signals | 10–15% (floor: 2 calls/week) | At or above the IQCS 10% validation floor, with a minimum regardless of volume — part-timers still need coverage |
| Paradata signal triggered | 25–50% | Duration outliers, completes-per-hour anomalies, refusal-rate shifts — temporarily increase until explained or confirmed |
| Confirmed error pattern | Back to 50–100% | Reset monitoring intensity until the specific issue is resolved across consecutive reviewed calls |
One rule that survives every QC program restructure: no active interviewer goes a full study week with zero reviewed calls, regardless of how low their volume is. A part-timer doing 8 completes a week and a full-timer doing 50 should both have at least 2 reviewed calls per week. The statistical protection of the floor is low; the deterrent value and the coaching signal are not — the AAPOR task force found that visibly verifying interviewers' work deters falsification on its own.
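The cadence table and the weekly floor can be combined into a single per-interviewer quota. The sketch below is one possible reading: it takes the low end of each range in the table, applies the 2-call floor to every status, and caps the quota at the number of completes. The status labels and the `calls_to_review` function are hypothetical.

```python
# Review rate in percent per interviewer status (low end of each table range).
REVIEW_RATE_PCT = {
    "first_5_completes": 100,
    "first_month": 50,
    "first_quarter": 25,
    "established": 10,
    "paradata_signal": 25,
    "confirmed_error": 50,
}
WEEKLY_FLOOR = 2  # minimum reviewed calls per active interviewer per week


def calls_to_review(status, weekly_completes):
    """Number of calls to review this week for one interviewer."""
    pct = REVIEW_RATE_PCT[status]
    target = (pct * weekly_completes + 99) // 100  # integer ceiling
    # Apply the floor, but never ask for more reviews than completes exist.
    return min(weekly_completes, max(target, WEEKLY_FLOOR))


print(calls_to_review("established", 8))   # 2
print(calls_to_review("established", 50))  # 5
```

With these assumptions the part-timer at 8 completes is lifted to the floor of 2, while the full-timer at 50 gets 5 reviews from the percentage alone.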
For the sampling math behind these rates — how behavior frequency interacts with detection probability at different coverage levels — the CATI Quality Control guide has the full P(detected) table. The short version: a 5%-frequency error has an 81% chance of clearing a week of 10% monitoring undetected. Front-loading tenure coverage is the mitigation.
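The 81% figure can be reproduced with a simple independence model. The assumptions here are mine rather than the guide's: the error occurs independently on 5% of calls, and 10% monitoring of roughly 40 completes in a week means 4 reviewed calls. Under those assumptions the chance that none of the reviewed calls contains the error is 0.95 raised to the 4th power.

```python
def p_undetected(error_rate, reviewed_calls):
    """Probability that no reviewed call contains the error.

    Assumes the error occurs independently on each call at error_rate.
    """
    return (1 - error_rate) ** reviewed_calls


# Assumed: 4 reviewed calls, i.e. 10% of about 40 completes in a week.
print(round(p_undetected(0.05, 4), 3))  # 0.815
```

Raising `reviewed_calls` shows how quickly the undetected probability falls as coverage grows, which is the argument for front-loading new hires.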
Inter-monitor calibration
A checklist only produces consistent data if multiple reviewers apply it consistently. Without calibration, your scores measure reviewer differences as much as interviewer behavior — which means the same call passes one desk and fails another, which is the variance problem the checklist was supposed to solve.
Monthly calibration protocol, in four steps:
- Select two or three calls blind. Pull calls from the archive that reviewers have not previously scored. Include at least one borderline call (likely to generate disagreement) and one clear case in each direction.
- Score independently. Each reviewer scores every selected call against the current checklist without seeing other reviewers' scores. No discussion until everyone has filed.
- Compare and discuss disagreements. Focus on items where scores differ by a full severity tier or more. The goal is not consensus on this session's calls; it is documentation of which items are ambiguous and why.
- Update the rubric and the example library. Disagreements that reveal genuine ambiguity in a checklist item get a written clarification added to the rubric. Edge cases that reviewers handled well get added to a shared example library that new reviewers train on. If a questionnaire change is driving disagreement (a reworded question, a new scale format), update the checklist version and date-stamp the revision.
Track inter-monitor agreement rates over time — the percentage of items where independent reviewers give the same severity rating on the same call. A rate below 80% on pass/fail items signals that the checklist or its training needs work. A rate below 70% on specific items signals that those items need clearer definitions.
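A simple way to compute the agreement rate described above is pairwise percent agreement: for every pair of reviewers, count the items both scored and the share where their severity ratings match. The sketch below assumes a hypothetical `ratings` structure for a single call, with "none" standing for "no finding". It is plain percent agreement and does not correct for chance agreement.

```python
from itertools import combinations


def agreement_rate(ratings):
    """Pairwise percent agreement for one call.

    ratings maps reviewer -> {item number: severity rating}.
    Returns a value in [0, 1], or None if no item was scored by two reviewers.
    """
    agree = total = 0
    for a, b in combinations(ratings, 2):
        shared_items = ratings[a].keys() & ratings[b].keys()
        for item in shared_items:
            total += 1
            agree += ratings[a][item] == ratings[b][item]
    return agree / total if total else None


ratings = {
    "reviewer_a": {"4.1": "M", "5.1": "W", "6.1": "none"},
    "reviewer_b": {"4.1": "M", "5.1": "none", "6.1": "none"},
}
print(agreement_rate(ratings))  # 2 of 3 items match
```

Computing the same rate per item across many calls, rather than per call, points at the specific items that need clearer definitions.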
One principle that should be explicit in your calibration protocol: when two senior reviewers persistently disagree on the same item type, the correct response is to update the item definition — not to decide one reviewer is right. Persistent disagreement usually means the checklist item is doing too much work: it covers two different behaviors under one label, or the threshold between "minor" and "warning" is unspecified. The fix is a rewritten item, not a coaching conversation.
Calibration for a team of two to four monitors takes under two hours per month. An annual calibration "when we get around to it" is not calibration; it is documentation that disagreements were acceptable.
Automating the checklist across 100% of calls
Manual monitoring, even with a tight checklist and good cadence, is a sampling program. As the CATI Quality Control guide shows, a 5%-frequency error has an 81% chance of going undetected in a week of 10% sampling. That probability is a property of sampling math, not of reviewer quality — you cannot review your way out of it with the same headcount.
The checklist does not get shorter when AI analysis covers every call. The coverage gets larger. That reframe belongs at the front of this section because it is the thing most often misunderstood: AI-assisted call review does not replace the rubric or the reviewer — it changes what the reviewer is doing, from selecting which calls to listen to, to deciding what the flags mean.
The checklist categories above map directly onto what AI-based call analysis evaluates automatically:
[Analysis panel: Findings · Call 0934 · Analyzed in 7m 41s]
The analysis panel above shows the finding types that surface per call in the review queue: response mismatches against the data file, mandatory questions skipped, probing failures, scale read incorrectly, neutrality violations. Each corresponds to a checklist section. The difference is that AI evaluation runs on every call, so the 5%-frequency error that would survive several weeks of sampled monitoring appears in the first day's findings.
What does not change: the severity weights, the fail threshold, the remediation steps, and the human decision on each finding. An AI flag is "listen to this" — every finding still requires a reviewer to confirm, dismiss, or reclassify before it generates a coaching record or a cancel decision.
Practical transition path for moving from a manual sampling program to AI-assisted review: run both in parallel for the first two to three weeks while AI criteria calibrate to your protocol. Assign a subset of calls to manual scoring using this checklist, then compare manual findings against AI flags on those same calls. The goal in this period is calibration, not coverage: you are checking that AI flags land on the right checklist items at the right severity level, and adjusting the flagging criteria where they don't.
When the manual-versus-AI agreement rate on high-severity items reaches a level your QC lead is comfortable with, taper the parallel manual sample and use it going forward for calibration sessions rather than coverage. The cadence table in this post still applies. The difference is that your manual reviewer hours move from listening-to-find to deciding-on-findings, and the floor rules (every active interviewer reviewed, new hires at 50–100%) are now met by the automated layer rather than by sampling.
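During the parallel run, the manual-versus-AI comparison on high-severity items can be tallied per call. The sketch below is an assumption-laden illustration: it treats the checklist's C items as "high severity", represents each source's findings as a set of item numbers per call, and uses made-up call IDs. The `compare_high_severity` function is hypothetical.

```python
# The critical (C) items from the checklist above.
HIGH_SEVERITY = {"1.3", "1.4", "2.2", "2.4", "3.1", "3.3", "6.1", "6.3"}


def compare_high_severity(manual, ai):
    """Tally high-severity findings on calls scored by both sources.

    manual, ai: {call_id: set of flagged item numbers}.
    """
    tally = {"both": 0, "manual_only": 0, "ai_only": 0}
    for call_id in manual.keys() & ai.keys():
        m = manual[call_id] & HIGH_SEVERITY
        a = ai[call_id] & HIGH_SEVERITY
        tally["both"] += len(m & a)
        tally["manual_only"] += len(m - a)  # AI missed a manual finding
        tally["ai_only"] += len(a - m)      # AI flag with no manual finding
    return tally


manual = {"call_a": {"3.1", "5.2"}, "call_b": {"6.1"}, "call_c": set()}
ai = {"call_a": {"3.1"}, "call_b": set(), "call_c": {"3.3"}}
print(compare_high_severity(manual, ai))
```

The `manual_only` count is the one to drive toward zero before tapering the manual sample. The `ai_only` items each need a listen to decide whether the reviewer or the flag was wrong.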
False-positive rates are the number to watch in this transition. If AI flags are generating too much noise, reviewers learn to dismiss without looking — which defeats the program. Our methodology and measured rates are at /accuracy.
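A rough in-house proxy for the false-positive rate is the share of reviewed AI flags that reviewers dismiss. This is an assumption about how you log decisions (confirm, dismiss, or reclassify per flag) and may not match the definition used in any published methodology. `dismissal_rate` is a hypothetical helper.

```python
from collections import Counter


def dismissal_rate(decisions):
    """Share of reviewed flags that were dismissed.

    decisions: iterable of "confirm", "dismiss", or "reclassify".
    Returns None when no flags have been reviewed yet.
    """
    counts = Counter(decisions)
    reviewed = sum(counts.values())
    return counts["dismiss"] / reviewed if reviewed else None


week_one = ["confirm", "dismiss", "reclassify", "confirm"]
print(dismissal_rate(week_one))  # 0.25
```

Tracking this number per checklist item, not only overall, shows which flagging criteria are generating the noise.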
Download the monitoring form
Everything in this checklist is available as a per-call scoring sheet: the seven sections with their severity weights, a score column and notes field per item, the score-summary bands, and the cadence reference table. Download the CATI interviewer monitoring form (Markdown; paste it into Google Docs, Word, or your wiki and score one call per copy). No email required.
Sources and refresh policy
- AAPOR/ASA, Interviewer Falsification in Survey Research (2003) — prevention findings; the deterrence value of visible verification
- LAPOP Quality Control and Methodological Note 008 — the QuAC weighted-scoring model this checklist adapts; the auto-fail falsification weight
- IQCS Standards 2023 — the 10% validation floor and the 5% during-fieldwork telephone-centre minimum
Published June 11, 2026. Checklist items and severity weights are reviewed annually and recalibrated when questionnaire formats or industry guidance change; the dateline above reflects the last full review.