The CATI Interviewer Monitoring Checklist (Free Template + Scoring Rubric)

Free CATI interviewer monitoring checklist with 7 sections, weighted severity scoring, cadence by interviewer tenure, and inter-monitor calibration guidance.

By the VeriCATI team · 13 min read

Search for "CATI interviewer monitoring checklist" and you will find: a paywalled 2008 Sage handbook chapter, a 1989 Quirks piece, a statistics paper on sample design, and several form-builder templates with items like "Was the interviewer polite?" These are not the same thing as a working monitoring rubric, and they are not what you need when your QC supervisor is about to listen to their first hundred calls.

This page has the checklist itself — in the page, not behind a form. Seven sections covering every auditable dimension of a CATI interview, each item carrying a severity weight you can use for scoring. Below the checklist: how to allocate review coverage by interviewer tenure, how to run an inter-monitor calibration session, and how the full checklist changes when AI analysis covers 100% of calls instead of a sampled fraction.

The checklist is designed to be copied into a spreadsheet, printed as a paper form, or adapted into your CATI platform's QA module. Weights are suggested starting points — recalibrate to your protocol and adjust when the questionnaire changes.

How scoring works

Each checklist item carries a severity weight. A reviewed call accumulates points for each finding; the total determines the outcome:

  • 0–7 points: Pass — no required action, findings feed the coaching record
  • 8–19 points: Warning — coaching session required; supervisor reviews the call with the interviewer
  • 20+ points: Fail — call is excluded and replaced; escalate per your falsification protocol if fabrication is suspected
  • 100 points (auto-fail): Confirmed or strongly suspected fabrication — triggers the falsification protocol, not just exclusion

A call can fail by accumulation: a leading probe (10 pts) plus a misread scale (8 pts) plus a vague answer accepted without probing (6 pts) is 24 points, a failed call. A binary pass/fail system would log this as a minor delivery call and move on. Don't run a binary system.
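The accumulation logic above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Finding` class, the `score_call` function, and the constant names are hypothetical, and the band edges are copied from the list above.

```python
from dataclasses import dataclass

# Band edges copied from the scoring list; the names are illustrative only.
PASS_MAX = 7
WARNING_MAX = 19
AUTO_FAIL = 100  # sentinel for suspected fabrication, never an accumulated sum


@dataclass
class Finding:
    item: str    # checklist item number, e.g. "5.1"
    points: int  # severity weight of that item


def score_call(findings, fabrication_suspected=False):
    """Return (total_points, outcome) for one reviewed call."""
    if fabrication_suspected:
        # Separate workflow: the call is not scored by accumulation at all.
        return AUTO_FAIL, "Fabrication suspected"
    total = sum(f.points for f in findings)
    if total <= PASS_MAX:
        return total, "Pass"
    if total <= WARNING_MAX:
        return total, "Warning"
    return total, "Fail"


# The example from the paragraph above: leading probe + misread scale
# + vague answer accepted without probing.
example = [Finding("5.1", 10), Finding("5.3", 8), Finding("4.1", 6)]
print(score_call(example))  # (24, 'Fail')
```

Keeping the fabrication flag as a separate argument, rather than a 100-point finding in the list, mirrors the idea that the auto-fail is a different kind of outcome and not a large score.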

This weighted structure adapts the LAPOP QuAC rubric — cancel at 20 accumulated points, falsification scored as an automatic 100 — to agency CATI work.

For the full statistical argument behind these thresholds — including how monitoring rate interacts with detection probability per behavior frequency — see the CATI Quality Control guide.

The complete monitoring checklist

Copy, print, or adapt. Items marked C are critical (20+ points each); W are warnings (8–15 points); M are minor (2–6 points). N/A is a valid score for items not applicable to a given survey design.

| # | Item | Sev | Pts | Finding / notes |
|---|---|---|---|---|
| 1.1 | Interviewer identified themselves and the research organization as scripted | M | 4 | |
| 1.2 | Study purpose explained as required by the protocol | M | 4 | |
| 1.3 | Recording disclosure given verbatim before any substantive content | C | 20 | Omission is a compliance exposure independent of data quality |
| 1.4 | Consent obtained and confirmed before proceeding | C | 20 | |
| 1.5 | Confidentiality/anonymity assurance given as scripted | M | 4 | |

Section 2: Screener administration

| # | Item | Sev | Pts | Finding / notes |
|---|---|---|---|---|
| 2.1 | All screener questions read as scripted, in order | W | 10 | |
| 2.2 | Eligibility decision correctly reflects the respondent's answers | C | 25 | A wrong pass is a fabricated complete; a wrong fail is a lost eligible |
| 2.3 | Ineligible respondents terminated correctly and professionally | M | 4 | |
| 2.4 | No coaching of respondents toward eligible answers | C | 20 | Treat as partial falsification if observed |

Section 3: Question administration

| # | Item | Sev | Pts | Finding / notes |
|---|---|---|---|---|
| 3.1 | All mandatory questions administered | C | 20 | A mandatory question skipped with an answer recorded is a fabricated data point |
| 3.2 | Questions read verbatim or close equivalent (meaning preserved) | W | 8 | Wording deviations that change meaning score higher; track even minor deviations |
| 3.3 | Skip/routing patterns followed correctly | C | 20 | Wrong routing can render multiple downstream answers meaningless |
| 3.4 | Questions not repeated or summarized before respondent answers | M | 4 | |
| 3.5 | Questions delivered at a pace that allows the respondent to process them before answering | M | 4 | Consistently rushed pace predicts probing failures downstream and is independently scorable; duration outliers belong in paradata review, not here |

Section 4: Probing

| # | Item | Sev | Pts | Finding / notes |
|---|---|---|---|---|
| 4.1 | Vague or inadequate responses probed appropriately (once, then accepted) | M | 6 | Missing a required probe is the item; probing too aggressively scores under 5.1 |
| 4.2 | Probes used are neutral: question repeated, or generic non-directive probe | W | 10 | A leading probe (see 5.1) scores higher than a missing probe |
| 4.3 | Interviewer waited for respondent to answer before continuing | M | 4 | Cutting off is both a delivery issue and a reliability risk |
| 4.4 | Clarification questions from the respondent answered correctly per protocol | M | 4 | |

Section 5: Neutrality

| # | Item | Sev | Pts | Finding / notes |
|---|---|---|---|---|
| 5.1 | No leading probe or suggestion of an acceptable answer | W | 10 | Even a tone shift that signals a preferred answer scores here |
| 5.2 | Reactions to respondent answers neutral, with no verbal reinforcement | M | 4 | "Great," "perfect," "exactly" after scale items bias subsequent responses |
| 5.3 | Scale options read in full, in order, without emphasis on any item | W | 8 | Partial reading or emphasis systematically biases scale data |
| 5.4 | Answer options not read aloud for unaided or spontaneous recall items | W | 8 | Reading options that should be spontaneous contaminates the distribution |
| 5.5 | No personal opinions, commentary, or side conversations | W | 8 | |

Section 6: Recording accuracy

| # | Item | Sev | Pts | Finding / notes |
|---|---|---|---|---|
| 6.1 | Recorded answer matches what the respondent said | C | 20 | Confirmed mismatch is a data-file integrity issue; check against the data file if available |
| 6.2 | Open-end verbatim captured fully and accurately | W | 10 | Partial capture may be unavoidable; verify against what is audible |
| 6.3 | No answers recorded for questions the respondent did not answer (or refused) | C | 20 | If a pattern, see falsification protocol |
| 6.4 | Respondent's final answer recorded (not a mid-deliberation value) | W | 8 | |

Note on data-file availability: Items 6.1 and 6.3 require comparison against the data file to score with confidence. When the data file is not yet available at time of review, mark these items "Pending — verify at data close." Do not score Pass without verification: a reflexive pass defeats the purpose of the section. Schedule a second-pass review when the export is ready.
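One way to keep a reflexive pass from slipping through is to have the scoring sheet downgrade those two items automatically whenever the export is missing. The sketch below assumes a hypothetical `review_status` helper and a simple per-item status vocabulary ("ok", "finding", "n/a", "pending"); neither is part of the checklist itself.

```python
# Items the note above says need the data file before they can be scored.
DATA_FILE_ITEMS = {"6.1", "6.3"}


def review_status(item_results, data_file_available):
    """item_results maps checklist item number -> "ok", "finding", or "n/a".

    Returns a copy in which unverified data-file items are marked
    "pending", plus a flag saying whether a second-pass review is owed.
    """
    resolved = dict(item_results)
    needs_second_pass = False
    if not data_file_available:
        for item in DATA_FILE_ITEMS:
            # N/A stays N/A; anything else cannot be trusted yet.
            if item in resolved and resolved[item] != "n/a":
                resolved[item] = "pending"
                needs_second_pass = True
    return resolved, needs_second_pass


sheet = {"5.1": "finding", "6.1": "ok", "6.3": "n/a"}
print(review_status(sheet, data_file_available=False))
```

Marking an audible "finding" on 6.1 as pending too is a design choice in this sketch; a team could equally keep audible findings and defer only the passes.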

Section 7: Closing

| # | Item | Sev | Pts | Finding / notes |
|---|---|---|---|---|
| 7.1 | Interview closed as scripted, completion confirmed | M | 2 | |
| 7.2 | Incentive explained and delivery process described, if applicable | W | 8 | Non-delivery or incorrect description is a separate operations issue |
| 7.3 | Respondent given opportunity to ask questions or raise concerns | M | 2 | |

Scoring summary

| Total points | Outcome | Action |
|---|---|---|
| 0–7 | Pass | Add findings to coaching record; no immediate action required |
| 8–19 | Warning | Coaching session with the interviewer within 48 hours; document and track for recurrence |
| 20–99 | Fail | Exclude and replace the call; escalate to supervisor; follow your error escalation protocol |
| 100 (auto) | Fabrication suspected | Falsification protocol: do not inform the interviewer until evidence is assembled; see the falsification guide |

The 100-point auto-fail is a sentinel value, not an accumulation target. Falsification is a qualitative finding — one confirmed fabricated answer is not the same kind of finding as four accumulated probe failures. The 100 exists to separate "this call failed on quality grounds" from "this call requires the falsification protocol." You should never be adding up to 100; you should be asking: does the evidence pattern trigger that separate workflow?

Monitoring cadence by interviewer tenure

A flat "10% of all calls" allocation concentrates reviewed calls on your highest-volume interviewers and leaves your newest hires with almost none. That is the opposite of where monitoring effort pays off.

The more defensible approach: set the unit to the interviewer, not the project, and front-load new hires.

| Interviewer status | Recommended review rate | Rationale |
|---|---|---|
| First 5 completes | 100% | Habits form in the first week; catch deviations before they become habits |
| First month (completes 6–50) | 50% | Taper as baseline confidence builds; maintain high coverage for first full study |
| First quarter | 25% | Continue above the standard floor while tenure is short |
| Established, no signals | 10–15% (floor: 2 calls/week) | At or above the IQCS 10% validation floor, with a minimum regardless of volume; part-timers still need coverage |
| Paradata signal triggered | 25–50% | Duration outliers, completes-per-hour anomalies, refusal-rate shifts; temporarily increase until explained or confirmed |
| Confirmed error pattern | Back to 50–100% | Reset monitoring intensity until the specific issue is resolved across consecutive reviewed calls |

One rule that survives every QC program restructure: no active interviewer goes a full study week with zero reviewed calls, regardless of how low their volume is. A part-timer doing 8 completes a week and a full-timer doing 50 should both have at least 2 reviewed calls per week. The statistical protection of the floor is low; the deterrent value and the coaching signal are not — the AAPOR task force found that visibly verifying interviewers' work deters falsification on its own.
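To see how the tenure rates and the weekly floor combine, here is a rough Python sketch. The status keys, the `weekly_review_quota` function, and the choice to use the lower bound wherever the table gives a range are all assumptions made for illustration.

```python
# Review rates in percent, taken from the cadence table. Where the table
# gives a range, the lower bound is used here as a starting assumption.
REVIEW_RATE_PCT = {
    "first_5_completes": 100,
    "first_month": 50,
    "first_quarter": 25,
    "established": 10,
    "paradata_signal": 25,
    "confirmed_error": 50,
}
WEEKLY_FLOOR = 2  # minimum reviewed calls per active interviewer per week


def weekly_review_quota(status, weekly_completes):
    """Number of calls to review this week for one interviewer."""
    if weekly_completes <= 0:
        return 0
    pct = REVIEW_RATE_PCT[status]
    # Integer ceiling of pct% of the week's completes.
    by_rate = (pct * weekly_completes + 99) // 100
    # Apply the floor, but never ask for more reviews than calls exist.
    return min(weekly_completes, max(WEEKLY_FLOOR, by_rate))


print(weekly_review_quota("established", 8))   # 2  (floor applies)
print(weekly_review_quota("established", 50))  # 5  (10% of 50)
```

With these assumed inputs, the part-timer at 8 completes is lifted to the 2-call floor, while the full-timer at 50 gets 5 reviewed calls from the 10% rate alone.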

For the sampling math behind these rates — how behavior frequency interacts with detection probability at different coverage levels — the CATI Quality Control guide has the full P(detected) table. The short version: a 5%-frequency error has an 81% chance of clearing a week of 10% monitoring undetected. Front-loading tenure coverage is the mitigation.
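The guide's table is not reproduced here, but the arithmetic behind a figure like this can be sketched under two stated assumptions: each reviewed call shows the error independently with the same probability, and the week contains about 40 completes, so 10% monitoring means 4 reviewed calls. The weekly volume and the `p_undetected` function are hypothetical and chosen only to illustrate the formula.

```python
def p_undetected(error_rate, reviewed_calls):
    """Chance that none of the reviewed calls contains the error,
    assuming each call shows it independently with probability error_rate."""
    return (1.0 - error_rate) ** reviewed_calls


# Assumed volume: 40 completes in the week, 10% reviewed -> 4 calls.
print(f"{p_undetected(0.05, 4):.0%}")  # 81%

# Same interviewer at the 50% new-hire rate -> 20 reviewed calls.
print(f"{p_undetected(0.05, 20):.0%}")
```

Under these assumptions, raising coverage is the only lever that moves the number: the reviewed-call count is the exponent, so each extra reviewed call multiplies the miss probability by 0.95.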

Inter-monitor calibration

A checklist only produces consistent data if multiple reviewers apply it consistently. Without calibration, your scores measure reviewer differences as much as interviewer behavior, so the same call passes one desk and fails another. That is the variance problem the checklist was supposed to solve.

Monthly calibration protocol, in four steps:

  1. Select at least three calls blind. Pull calls from the archive that reviewers have not previously scored. Include at least one borderline call (likely to generate disagreement) and one clear case in each direction.

  2. Score independently. Each reviewer scores every selected call against the current checklist without seeing other reviewers' scores. No discussion until everyone has filed.

  3. Compare and discuss disagreements. Focus on items where scores differ by a full severity tier or more. The goal is not consensus on this session's calls — it's documentation of which items are ambiguous and why.

  4. Update the rubric and the example library. Disagreements that reveal genuine ambiguity in a checklist item get a written clarification added to the rubric. Edge cases that reviewers handled well get added to a shared example library that new reviewers train on. If a questionnaire change is driving disagreement (a reworded question, a new scale format), update the checklist version and date-stamp the revision.

Track inter-monitor agreement rates over time — the percentage of items where independent reviewers give the same severity rating on the same call. A rate below 80% on pass/fail items signals that the checklist or its training needs work. A rate below 70% on specific items signals that those items need clearer definitions.
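A simple percent-agreement figure can be computed directly from the calibration sheets. The sketch below is one possible approach, assuming each reviewer's scores are stored as a mapping from (call, item) to a severity label; the `agreement_rate` function and the data layout are hypothetical.

```python
from itertools import combinations


def agreement_rate(scores_by_reviewer):
    """scores_by_reviewer maps reviewer -> {(call_id, item): severity}.

    Returns the share of (reviewer pair, call, item) comparisons in which
    both reviewers gave the same severity rating, or None if no pair of
    reviewers scored any item in common.
    """
    matches = 0
    comparisons = 0
    for a, b in combinations(scores_by_reviewer.values(), 2):
        for key in a.keys() & b.keys():  # items both reviewers scored
            comparisons += 1
            matches += a[key] == b[key]
    return matches / comparisons if comparisons else None


session = {
    "reviewer_a": {("c1", "5.1"): "W", ("c1", "4.1"): "M",
                   ("c2", "6.1"): "C", ("c2", "3.2"): "none"},
    "reviewer_b": {("c1", "5.1"): "W", ("c1", "4.1"): "none",
                   ("c2", "6.1"): "C", ("c2", "3.2"): "none"},
}
print(agreement_rate(session))  # 0.75
```

Filtering the mapping to a single checklist item before calling the function gives the per-item rate, which is what the item-level threshold above would be compared against.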

One principle that should be explicit in your calibration protocol: when two senior reviewers persistently disagree on the same item type, the correct response is to update the item definition — not to decide one reviewer is right. Persistent disagreement usually means the checklist item is doing too much work: it covers two different behaviors under one label, or the threshold between "minor" and "warning" is unspecified. The fix is a rewritten item, not a coaching conversation.

Calibration for a team of two to four monitors takes under two hours per month. An annual calibration "when we get around to it" is not calibration; it is documentation that disagreements were acceptable.

Automating the checklist across 100% of calls

Manual monitoring, even with a tight checklist and good cadence, is a sampling program. As the CATI Quality Control guide shows, a 5%-frequency error has an 81% chance of going undetected in a week of 10% sampling. That probability is a property of sampling math, not of reviewer quality — you cannot review your way out of it with the same headcount.

The checklist does not get shorter when AI analysis covers every call. The coverage gets larger. That reframe belongs at the front of this section because it is the thing most often misunderstood: AI-assisted call review does not replace the rubric or the reviewer — it changes what the reviewer is doing, from selecting which calls to listen to, to deciding what the flags mean.

The checklist categories above map directly onto what AI-based call analysis evaluates automatically:

Example findings panel (VeriCATI · Analysis, call 0934, analyzed in 7m 41s):

  • Q12 · Response mismatch vs. data file
  • Q3 · Mandatory question skipped
  • Q31 · Scale not re-read after 4 statements
  • Q13 · Vague answer not clarified
  • Q28 · Answer choices read when they shouldn’t be

The analysis panel above shows the finding types that surface per call in the review queue: response mismatches against the data file, mandatory questions skipped, probing failures, scale read incorrectly, neutrality violations. Each corresponds to a checklist section. The difference is that AI evaluation runs on every call, so the 5%-frequency error that would survive several weeks of sampled monitoring appears in the first day's findings.

What does not change: the severity weights, the fail threshold, the remediation steps, and the human decision on each finding. An AI flag is "listen to this" — every finding still requires a reviewer to confirm, dismiss, or reclassify before it generates a coaching record or a cancel decision.

Practical transition path for moving from a manual sampling program to AI-assisted review: run both in parallel for the first two to three weeks while AI criteria calibrate to your protocol. Assign a subset of calls to manual scoring using this checklist, then compare manual findings against AI flags on those same calls. The goal in this period is calibration, not coverage: you are checking that AI flags land on the right checklist items at the right severity level, and adjusting the flagging criteria where they don't.

When the manual-versus-AI agreement rate on high-severity items reaches a level your QC lead is comfortable with, taper the parallel manual sample and use it going forward for calibration sessions rather than coverage. The cadence table in this post still applies. The difference is that your manual reviewer hours move from listening-to-find to deciding-on-findings, and the floor rules (every active interviewer reviewed, new hires at 50–100%) are now met by the automated layer rather than by sampling.

False-positive rates are the number to watch in this transition. If AI flags are generating too much noise, reviewers learn to dismiss without looking — which defeats the program. Our methodology and measured rates are at /accuracy.
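During the parallel period, the comparison between manual findings and AI flags can be tallied with plain set arithmetic. This is a sketch under assumptions: both inputs are sets of (call, checklist item) pairs from the same parallel-scored calls, the manual score is treated as the reference, and `compare_flags` is a hypothetical helper. The "false-positive share" here is the fraction of AI flags the manual reviewer did not confirm, which may differ from how any vendor defines its published rate.

```python
def compare_flags(ai_flags, manual_findings):
    """Both arguments are sets of (call_id, checklist_item) pairs drawn
    from the same parallel-scored calls; manual scoring is the reference."""
    confirmed = ai_flags & manual_findings
    unconfirmed = ai_flags - manual_findings
    return {
        # Share of AI flags that the manual reviewer did not confirm.
        "false_positive_share": len(unconfirmed) / len(ai_flags) if ai_flags else 0.0,
        # Share of manual findings that the AI also flagged.
        "manual_findings_caught": len(confirmed) / len(manual_findings) if manual_findings else 1.0,
    }


ai = {("c1", "5.1"), ("c1", "3.1"), ("c2", "6.1"), ("c3", "4.1")}
manual = {("c1", "5.1"), ("c2", "6.1"), ("c4", "3.3")}
print(compare_flags(ai, manual))
```

Restricting both sets to critical items before the comparison gives the high-severity agreement figure that the taper decision above depends on.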

Download the monitoring form

Everything in this checklist turned into a per-call scoring sheet — the seven sections with their severity weights, a score column and notes field per item, the score-summary bands, and the cadence reference table: download the CATI interviewer monitoring form (Markdown — paste it into Google Docs, Word, or your wiki and score one call per copy). No email required.


Sources and refresh policy

Published June 11, 2026. Checklist items and severity weights are reviewed annually and recalibrated when questionnaire formats or industry guidance change; the dateline above reflects the last full review.
