Interviewer Falsification in CATI: Detection Signals, Verification Workflow, and What to Do Next
What CATI falsification sounds like in recorded calls, the paradata and audio signals that detect it, and a tiered workflow for verifying and acting on it.
If you run CATI fieldwork, the falsification literature has a structural problem: nearly all of it is about face-to-face or CAPI surveys, where an interviewer travels alone, no supervisor is present, and cheating means filling in a questionnaire under a tree. Centralized CATI changes the risk profile significantly — a supervisor can monitor, every call goes through the dialer log, and the physical environment is controlled. But it doesn't eliminate the risk. It shifts it.
In a supervised phone room, the falsification modes that survive are subtler: a screener administered too fast to an unengaged respondent who is then recorded as eligible; questions answered on behalf of someone who has already mentally disconnected from the interview; a complete fabricated by a proficient interviewer who knows the questionnaire well enough to navigate it plausibly. None of these look like the cartoon version of cheating. All of them leave traces — in the paradata, and especially in the audio — that structured verification can surface.
This post covers the detection layer that most CATI QC programs have not built: what falsification looks and sounds like in actual recordings, the paradata signals that should trigger a closer look, the tiered workflow for verification once signals appear, and the HR and client implications of confirmed cases.
One position stated upfront: a falsification signal is not an accusation. The detection signals in this post are triggers for human review — not conclusions. The false-positive risk in this domain is an HR liability and a trust problem, and any detection program that skips that chapter is incomplete. It gets its own section near the end.
What counts as falsification in CATI: the AAPOR taxonomy
The AAPOR/ASA Falsification Task Force (2003) defines falsification broadly — fabricating data, miscoding eligibility or call dispositions, misrepresenting the data-collection process. In a centralized phone room those forms resolve into five recognizable modes, and the distinctions matter for both detection and appropriate response:
| Type | What it means in CATI terms |
|---|---|
| Whole-interview fabrication | An interviewer keys through an entire questionnaire with no live respondent — the completion is invented. |
| Partial fabrication | A real interview is started, then the respondent disengages or terminates, and the interviewer completes the remaining questions solo. |
| Screener falsification | The screener is administered, the respondent fails eligibility, but the interviewer records a pass to generate a complete. |
| Key-data falsification | A real interview is completed, but one or more data points are changed before submission — typically to avoid a quota flag or to reduce clarification pressure from a supervisor. |
| Procedural falsification | Mandatory question portions are skipped or compressed, but the interview is recorded as fully compliant. |
For CATI specifically, our operational experience suggests a risk gradient consistent with how these modes map onto a centralized phone room: procedural falsification is the most common, and whole-interview fabrication is the rarest but most damaging. Partial fabrication, an interview started legitimately and completed unilaterally after the respondent drifts, sits in the middle. It is also the mode that audio review surfaces most often, because the acoustic character of the call changes audibly partway through.
Prevalence: what the evidence shows
Putting a single number on CATI falsification rates is not something the literature supports clearly. Most studies were designed to catch falsification rather than measure its base rate, and most were conducted in face-to-face fieldwork where the risk is structurally higher — no supervisor, no recording, payment-per-complete incentives in decentralized assignments.
What the AAPOR task force's synthesis of available research supports:
- In face-to-face fieldwork, falsification by a non-trivial fraction of interviewers has been documented across multiple independent studies. The mechanisms generating that risk are partially or fully absent in centralized CATI — which is the primary argument for the phone-room model.
- In supervised CATI environments, the evidence suggests lower baseline rates than field interviewing, but not zero — and investigations specifically targeting CATI phone rooms have produced confirmed cases even in monitored settings.
- Incentive structures matter. Interviewers paid per complete, or under quota pressure during a short field period, face falsification pressure regardless of their prior history. A phone room with a week of quota to fill and two days left in fieldwork is a higher-risk environment than the same room at mid-field.
- Detection affects incidence. The task force found evidence that informing interviewers their work will be verified reduces falsification. The implication: a visible verification program is itself a prevention program.
The honest operational take: falsification occurs in CATI environments. Its frequency is low enough that most programs will go months without a confirmed case, and that low frequency is the main argument for a systematic detection program rather than against one. A program that only investigates when a problem is obvious will structurally miss the non-obvious cases.
Curbstoning vs. falsification: the vocabulary problem
"Curbstoning" is the term that surfaces in face-to-face survey fraud discussions — referring to an interviewer who fills out questionnaires without contacting any respondent. The name is old: the interviewer is pictured sitting on a curbstone completing surveys rather than knocking on doors. Search for it and you'll land in two very different places: used-car fraud (a completely separate meaning — unlicensed sellers flipping cars curbside to dodge dealer regulations) and market-research academic papers, where the two populations of searcher do not overlap.
In market research, curbstoning is a specific form of whole-interview fabrication — the face-to-face mode where the interviewer never attempts respondent contact. In CATI, the closest equivalent is an interviewer who navigates the dialer interface to generate a call record without ever connecting to a live line, or who connects, immediately ends the call, and keys through the questionnaire solo.
The distinction worth preserving for QC purposes:
- Curbstoning (in market research) = complete fabrication, no respondent contact attempted — detectable primarily via paradata (call duration far too short, no ring time, no dialer handoff) and back-checks (number was never actually called)
- Falsification = the umbrella term — includes curbstoning but also partial fabrication, screener falsification, key-data changes, and procedural compression
Using "falsification" in CATI contexts is more precise and maps better to what audio review actually surfaces. Partial fabrication and procedural falsification are far more detectable via recording than full curbstoning, which leaves little acoustic evidence and shows up mostly in paradata and back-checks.
Detection signals: what the paradata shows
Paradata — the process data generated by the dialer and CATI platform: call timestamps, durations, keystroke logs — is the first detection layer, because it covers every interview automatically without requiring anyone to listen to calls.
Call duration outliers. A questionnaire that averages 15 minutes has a practical floor: every mandatory item still has to be read aloud and answered, so even an aggressive delivery pace cannot compress a full administration far below that length. Systematic shortfalls, meaning interviews substantially shorter than the established survey average, warrant audio review. Calibrate to your questionnaire: a 10% duration shortfall on a long survey may reflect a fast respondent; a 40% shortfall consistently logged by one interviewer during a slow period does not.
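A minimal sketch of the duration check, assuming the dialer export can be reduced to a mapping of interviewer ID to complete durations in minutes. The data shape, the function name, and the 40% threshold are illustrative assumptions, not any platform's API. It compares each interviewer's median complete against the survey-wide median:

```python
from statistics import median

# Illustrative threshold: flag when an interviewer's median complete runs
# at least 40% below the survey-wide median. Calibrate per questionnaire.
SHORTFALL_THRESHOLD = 0.40


def duration_shortfalls(durations_by_interviewer, threshold=SHORTFALL_THRESHOLD):
    """Return {interviewer_id: shortfall} for interviewers whose median
    complete duration is at least `threshold` below the survey median.

    `durations_by_interviewer` maps interviewer ID -> list of complete
    durations in minutes (a hypothetical export shape).
    """
    all_durations = [d for ds in durations_by_interviewer.values() for d in ds]
    survey_median = median(all_durations)
    flagged = {}
    for interviewer, ds in durations_by_interviewer.items():
        # Fractional shortfall of this interviewer's typical complete.
        shortfall = 1 - median(ds) / survey_median
        if shortfall >= threshold:
            flagged[interviewer] = round(shortfall, 2)
    return flagged


durations = {
    "I07": [15, 16, 14, 17],
    "I14": [7, 8, 7, 8],
    "I21": [14, 15, 16],
}
print(duration_shortfalls(durations))  # {'I14': 0.46}
```

Medians are a deliberate choice here: a per-interviewer median reflects the "consistently logged" part of the signal, so one short call does not trip it and a few very long interviews do not mask a pattern.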
Completes-per-hour anomalies. Real CATI fieldwork involves connection time, refusals, no-answers, early terminations, and natural variation in interview length. An interviewer logging three to four times the room's average completes per hour is not faster — they are not administering full interviews. The signal strengthens when it co-occurs with a low refusal rate for that session.
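One way to sketch this check, assuming per-session counts of completes, hours, refusals and contacts are available (the dict keys, function name and 3x threshold are hypothetical). Each session is compared with the rest of the room rather than the full-room average, so a single extreme interviewer cannot inflate the baseline they are measured against:

```python
def cph_outliers(sessions, ratio_threshold=3.0):
    """Flag sessions whose completes per hour are at least `ratio_threshold`
    times the rest of the room's rate.

    Each session is a dict with hypothetical keys: interviewer, completes,
    hours, refusals, contacts. Returns (interviewer, ratio, low_refusal)
    tuples; low_refusal marks the co-occurring low-refusal signal.
    """
    keys = ("completes", "hours", "refusals", "contacts")
    room = {k: sum(s[k] for s in sessions) for k in keys}
    flagged = []
    for s in sessions:
        # Leave-one-out baseline: everyone in the room except this session.
        rest = {k: room[k] - s[k] for k in keys}
        if rest["hours"] == 0 or rest["completes"] == 0 or rest["contacts"] == 0:
            continue  # no usable baseline for this session
        ratio = (s["completes"] / s["hours"]) / (rest["completes"] / rest["hours"])
        if ratio >= ratio_threshold:
            low_refusal = s["refusals"] / s["contacts"] < rest["refusals"] / rest["contacts"]
            flagged.append((s["interviewer"], round(ratio, 1), low_refusal))
    return flagged


sessions = [
    {"interviewer": "I03", "completes": 6, "hours": 6, "refusals": 20, "contacts": 40},
    {"interviewer": "I09", "completes": 5, "hours": 6, "refusals": 18, "contacts": 36},
    {"interviewer": "I14", "completes": 20, "hours": 6, "refusals": 3, "contacts": 25},
]
print(cph_outliers(sessions))  # [('I14', 3.6, True)]
```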
Unusually consistent inter-interview intervals. Legitimate call sessions have randomness: some respondents answer immediately, some require multiple attempts, some interviews run long. An interviewer whose completes are logged at nearly uniform intervals — every N minutes, consistently — across a session is showing a pattern inconsistent with live interviewing.
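Regularity can be quantified with the coefficient of variation (standard deviation divided by mean) of the gaps between consecutive completes. The sketch below assumes complete timestamps in minutes since session start; the function name and the 0.15 cut-off are illustrative assumptions to be tuned against a room's own historical sessions:

```python
from statistics import mean, pstdev


def interval_cv(complete_times):
    """Coefficient of variation of the gaps between consecutive completes.
    Timestamps are minutes since session start. Values near 0 mean the
    completes landed at near-uniform intervals."""
    times = sorted(complete_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if len(gaps) < 2:
        return None  # too few completes to judge regularity
    avg = mean(gaps)
    return pstdev(gaps) / avg if avg else 0.0


# Illustrative cut-off for a "metronomic" session.
METRONOMIC_CV = 0.15

live_session = [0, 9, 31, 38, 66, 80]
suspect_session = [0, 12, 24, 36, 48, 60]
print(round(interval_cv(live_session), 2))  # 0.5
print(interval_cv(suspect_session))         # 0.0
```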
Screener pass rate outliers. Screeners exist to filter ineligible respondents. An interviewer who converts significantly more contacts to completes than the room average — particularly during a tight-quota period — may be passing ineligible respondents. Cross-reference against back-check eligibility confirmation if available.
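"Significantly more" can be made concrete with a standard two-proportion z-test comparing one interviewer's screener pass rate with the rest of the room. This is a sketch under assumptions: contacts are treated as independent, the normal approximation needs reasonably large counts, and comparisons should stay within the same quota segment. The function name and any cut-off (say, z of 3 or more) are illustrative:

```python
from math import sqrt


def screener_pass_z(passes, screened, room_passes, room_screened):
    """Two-proportion z-score: one interviewer's screener pass rate versus
    the rest of the room (room totals include the interviewer).
    A large positive z means the interviewer passes far more contacts."""
    rest_passes = room_passes - passes
    rest_screened = room_screened - screened
    p_interviewer = passes / screened
    p_rest = rest_passes / rest_screened
    pooled = room_passes / room_screened
    se = sqrt(pooled * (1 - pooled) * (1 / screened + 1 / rest_screened))
    return (p_interviewer - p_rest) / se


# Interviewer passed 30 of 60 screened contacts; the room passed 120 of 600.
z = screener_pass_z(30, 60, 120, 600)
print(round(z, 1))  # 6.1
```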
Response distribution anomalies. For scale questions, an interviewer's distribution of responses across their completes should broadly track the survey average, absent a documented demographic skew in their assigned segment. Systematic clustering at particular scale points — especially midpoints, which require no justification — is a flag for possible data fabrication.
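A sketch of the distribution comparison for a single 5-point item, assuming raw response codes are available per interviewer. It reports the share at each scale point (to see midpoint clustering directly) and a Pearson chi-square statistic that uses the survey-wide distribution as the expected shares; the function names are hypothetical. With 4 degrees of freedom the 0.001 critical value is about 18.47:

```python
from collections import Counter

SCALE = (1, 2, 3, 4, 5)


def scale_shares(responses, scale=SCALE):
    """Share of responses at each scale point."""
    counts = Counter(responses)
    return {point: counts[point] / len(responses) for point in scale}


def chi_square_vs_survey(interviewer_responses, survey_responses, scale=SCALE):
    """Pearson chi-square statistic for one interviewer's responses against
    the survey-wide distribution as expected shares. Points the survey
    never used (expected count of zero) are skipped."""
    expected_share = scale_shares(survey_responses, scale)
    observed = Counter(interviewer_responses)
    n = len(interviewer_responses)
    stat = 0.0
    for point in scale:
        expected = expected_share[point] * n
        if expected > 0:
            stat += (observed[point] - expected) ** 2 / expected
    return stat


survey = [1] * 10 + [2] * 20 + [3] * 30 + [4] * 25 + [5] * 15
interviewer = [3] * 16 + [2] * 2 + [4] * 2
print(scale_shares(interviewer)[3])                         # 0.8
print(round(chi_square_vs_survey(interviewer, survey), 1))  # 24.5
```

With only 20 completes several expected counts fall below 5, where the chi-square approximation gets rough, so the statistic is better used to rank interviewers for review than read as a precise p-value.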
None of these signals is conclusive alone. They function as a triage layer: a completes-per-hour outlier triggers audio review of that session; audio review confirms or dismisses. The paradata layer's job is to tell you whose calls to listen to, not to provide findings.
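The triage idea can be sketched as a small queue builder that merges whatever paradata signals fired per session and orders sessions for listening. The `(session_id, signal_name)` input shape and the function name are hypothetical, and ranking by the number of distinct signals is only one reasonable prioritization:

```python
from collections import defaultdict


def build_review_queue(signals):
    """Group paradata signals by session and order sessions for audio
    review: most distinct signals first, ties broken by session ID.

    `signals` is an iterable of (session_id, signal_name) pairs. The result
    says which sessions to listen to; it contains no verdicts.
    """
    by_session = defaultdict(set)
    for session_id, signal in signals:
        by_session[session_id].add(signal)
    return sorted(by_session.items(), key=lambda item: (-len(item[1]), item[0]))


signals = [
    ("S2", "duration"),
    ("S1", "cph"),
    ("S2", "cph"),
    ("S2", "duration"),
    ("S3", "interval"),
]
for session_id, fired in build_review_queue(signals):
    print(session_id, sorted(fired))
```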
Detection signals: what the audio actually sounds like
This is where the CATI falsification literature has its largest gap. The academic detection literature focuses almost entirely on paradata and back-checks — both originally designed for face-to-face fieldwork, both predating widespread call recording — because listening to every call for acoustic characteristics was not feasible at scale until recently.
The signals below are observable acoustic and structural patterns in calls flagged for human review in VeriCATI deployments. To our knowledge, no published source systematically describes what falsification sounds like in recorded CATI audio.
No respondent voice. The most unambiguous signal. An interviewer reads questions and response options aloud; there is no audible voice on the respondent side between keystrokes. In a legitimate interview, a respondent is present: they speak, hesitate, ask for clarification, occasionally sigh. Absence of respondent voice in a call that results in a complete is a strong signal of partial or full fabrication. Distinguish this from brief calls that end in refusal or early termination before a complete is recorded — those have a recognizably different acoustic structure.
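As a sketch, this signal reduces to counting respondent speech on calls dispositioned as completes. It assumes a dual-channel recording or a diarization step that labels segments as `(speaker, start_sec, end_sec)`; that shape, the function names, and the 5-second floor are hypothetical. Single-track, interviewer-only recordings will trip this check, which is one reason a hit is a review trigger and nothing more:

```python
# Hypothetical diarized segments: (speaker, start_sec, end_sec), with speaker
# "interviewer" or "respondent". Interviewer-only recordings will trip this.
MIN_RESPONDENT_SECONDS = 5.0  # illustrative floor for a full complete


def respondent_speech_seconds(segments):
    """Total seconds attributed to the respondent."""
    return sum(end - start for speaker, start, end in segments if speaker == "respondent")


def no_respondent_voice(segments, disposition, min_seconds=MIN_RESPONDENT_SECONDS):
    """True when a call recorded as a complete carries almost no respondent
    speech. Refusals and early terminations are not evaluated here."""
    return disposition == "complete" and respondent_speech_seconds(segments) < min_seconds


silent = [("interviewer", 0, 30), ("interviewer", 32, 61)]
live = silent + [("respondent", 30.5, 31.8), ("respondent", 61.2, 66.0)]
print(no_respondent_voice(silent, "complete"))  # True
print(no_respondent_voice(live, "complete"))    # False
```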
Speed-read screener delivery. Screener questions delivered at machine-gun pace to a silent or unresponsive line, with eligibility recorded immediately without any discernible respondent answer. The interviewer's speech is continuous and fast — more like someone reading aloud to themselves than someone waiting for a response. This is the characteristic audio of screener falsification.
Missing question-and-answer exchange structure. In a legitimate CATI interview, questions and responses follow a recognizable conversational rhythm: question delivered, pause of one to several seconds, respondent answers (some speech, however brief), interviewer acknowledges, next question. Falsified interviews collapse this structure: questions are delivered consecutively with minimal pause, keystrokes come immediately after question completion, and there is no variation in the pause structure that would reflect genuine respondent deliberation on scale items.
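The collapsed rhythm can be approximated from answer latency, the time between the end of a question read-out and the keystroke that records its answer. This sketch assumes audio has already been aligned with the CATI keystroke log into parallel timestamp lists; the function names and both thresholds are illustrative assumptions. A call is flagged only when latencies are both short and nearly uniform:

```python
from statistics import median, pstdev


def answer_latencies(question_ends, keystrokes):
    """Seconds from the end of each question read-out to the keystroke that
    records its answer (parallel lists of timestamps in seconds)."""
    return [key - q_end for q_end, key in zip(question_ends, keystrokes)]


def collapsed_exchange(question_ends, keystrokes, max_median=1.0, max_spread=0.5):
    """True when answers are keyed almost immediately after every question
    with almost no variation, leaving no room for respondent deliberation.
    Both thresholds are illustrative, in seconds."""
    latencies = answer_latencies(question_ends, keystrokes)
    return median(latencies) < max_median and pstdev(latencies) < max_spread


question_ends = [10, 20, 30, 40, 50]
print(collapsed_exchange(question_ends, [10.4, 20.5, 30.4, 40.3, 50.5]))  # True
print(collapsed_exchange(question_ends, [13.1, 21.4, 36.2, 42.0, 54.5]))  # False
```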
Acoustic discontinuity partway through the call. The most common audio signature of partial fabrication — where a real interview begins and the interviewer takes over after the respondent disengages. The first portion has a respondent voice; the second does not. The transition is often identifiable: delivery pace increases, the conversational rhythm disappears, and background audio may shift slightly. This is audibly different from a respondent who simply becomes terse toward the end of a long interview — a terse real respondent still speaks briefly per question.
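A rough way to locate the transition, assuming diarized `(speaker, start_sec, end_sec)` segments (a hypothetical shape): bucket respondent speech into fixed windows, then look for a trailing run of windows with none. A terse but real respondent still produces a little speech in most windows, so requiring several silent windows in a row (the illustrative `min_tail`) helps separate the two cases. Function names, the 60-second window and the tail length are assumptions:

```python
import math


def respondent_seconds_by_window(segments, call_length, window=60):
    """Respondent speech seconds in each fixed window of the call.
    Segments that cross a window boundary are split across both windows."""
    totals = [0.0] * math.ceil(call_length / window)
    for speaker, start, end in segments:
        if speaker != "respondent":
            continue
        t, end = start, min(end, call_length)
        while t < end:
            w = int(t // window)
            boundary = min(end, (w + 1) * window)
            totals[w] += boundary - t
            t = boundary
    return totals


def silent_tail_start(totals, min_tail=3):
    """Index of the first window in a trailing run with no respondent speech,
    if the run is at least `min_tail` windows long and is preceded by a
    window with respondent speech; otherwise None."""
    i = len(totals)
    while i > 0 and totals[i - 1] == 0:
        i -= 1
    if i > 0 and len(totals) - i >= min_tail:
        return i
    return None


segments = [("interviewer", 0, 600), ("respondent", 10, 15), ("respondent", 70, 80),
            ("respondent", 130, 140), ("respondent", 190, 200)]
tail = silent_tail_start(respondent_seconds_by_window(segments, 600))
print(f"respondent speech stops in window {tail} (~{tail * 60} s)")
```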
Consistent background audio across demographically distinct respondents. Legitimate telephone interviews have variable respondent environments: some respondents are in quiet rooms, some have a TV on, some are in kitchens. When multiple completes attributed to demographically distinct respondents share an identical ambient audio signature, it suggests the calls were made from a single environment — not to real respondents in their homes.
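Once each call has some numeric ambient profile, near-identical environments can be found by pairwise comparison. The sketch assumes a hypothetical feature (for example, mean energy per frequency band over non-speech frames); extracting it is out of scope. It uses cosine similarity, which ignores overall loudness, with an illustrative 0.995 threshold. Flagged pairs would then be checked against respondent records for demographic distinctness:

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def near_identical_backgrounds(profiles, threshold=0.995):
    """Pairs of calls whose ambient profiles are near-identical.

    `profiles` maps call ID -> background-noise feature vector
    (hypothetical feature; the threshold is illustrative).
    """
    return [
        (a, b)
        for a, b in combinations(sorted(profiles), 2)
        if cosine(profiles[a], profiles[b]) >= threshold
    ]


profiles = {
    "C1": [0.20, 0.50, 0.10, 0.05],
    "C2": [0.20, 0.50, 0.10, 0.05],
    "C3": [0.60, 0.10, 0.30, 0.40],
}
print(near_identical_backgrounds(profiles))  # [('C1', 'C2')]
```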
Example finding card: Q12 · Monthly household income (Interviewer #14 · Call 0934 · 12:34)
The bracket recorded in the data file doesn't match the income the respondent stated aloud.
- Respondent said: “…about four and a half thousand a month.”
- Recorded in data file: Bracket 2 · 2,000–3,000
- Verdict: your reviewer decides
The finding card above shows what one of these flags looks like in the review queue: the specific finding, the audio segment positioned at its location in the call timeline, and the verdict controls for a reviewer to confirm or dismiss. Falsification signals surface in the same format — the flagged segment brought forward for human decision, never presented as a conclusion.
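The example card can be expressed as data: a consistency check that, on a mismatch, produces a record carrying a "Review:" label and an empty verdict field that only a human fills in. This is a sketch under stated assumptions: only bracket 2 (2,000–3,000) comes from the card, so the other bounds and the half-open [low, high) convention are hypothetical, and the stated amount is assumed to be already converted from the transcript to a number. `Finding`, `bracket_for` and `check_income` are illustrative names:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical bracket scheme. Only bracket 2 (2,000-3,000) appears in the
# example card; the other bounds and the [low, high) convention are assumed.
INCOME_BRACKETS = {1: (0, 2000), 2: (2000, 3000), 3: (3000, 4000), 4: (4000, 5000), 5: (5000, None)}


@dataclass
class Finding:
    question: str
    interviewer: str
    call: str
    timestamp: str
    detail: str
    label: str = "Review: potential data discrepancy"
    verdict: Optional[str] = None  # set only by a human reviewer


def bracket_for(amount):
    """Bracket number whose [low, high) range contains `amount`."""
    for bracket, (low, high) in INCOME_BRACKETS.items():
        if amount >= low and (high is None or amount < high):
            return bracket
    return None


def check_income(stated_amount, recorded_bracket, question, interviewer, call, timestamp):
    """Return a pending Finding when the stated amount maps to a different
    bracket than the one keyed; otherwise None."""
    expected = bracket_for(stated_amount)
    if expected == recorded_bracket:
        return None
    detail = f"stated {stated_amount:,} (bracket {expected}), keyed bracket {recorded_bracket}"
    return Finding(question, interviewer, call, timestamp, detail)


finding = check_income(4500, 2, "Q12", "14", "0934", "12:34")
print(finding.label, "|", finding.detail, "| verdict:", finding.verdict)
```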
The tiered verification workflow
When paradata or audio flags appear, a graduated response reduces both false negatives and the HR exposure that comes with premature accusation.
Tier 1 — Paradata signal only: increase review coverage, take no action. When paradata shows an outlier (duration shortfall, completes-per-hour spike, consistent inter-call intervals), respond by increasing audio review of that interviewer's sessions from that period. Document the signal. No action toward the interviewer until audio review is complete.
Tier 2 — Single audio flag: human review of the flagged call. A QC reviewer listens to the call with the specific flag in context. Likely outcomes: dismissed (explainable), confirmed as a procedural issue (coaching material), or escalated — the flag is substantive and paradata signals are consistent with it.
Tier 3 — Multiple corroborating signals: supervisor-level review of a broader sample. If the audio finding is confirmed and paradata signals are consistent, pull a larger sample — 20–30% of the interviewer's recent completes — for review by a second QC reviewer who was not involved in the initial flag. This is the stage where back-checks become relevant: re-contact verification on a sample of the flagged interviewer's listed respondents (World Bank DIME and J-PAL put baseline back-check coverage at 10–20%, every interviewer covered at least once). The back-check is what transforms an audio finding into a verifiable claim about respondent contact.
Tier 4 — Back-check discrepancy or multiple confirmed fabrication findings: escalate to management with documented evidence. Only at this stage does the investigation become visible to the interviewer or generate a formal record. The documentation package includes: specific audio findings with timestamps and audio segments, paradata anomalies, back-check results, and the reviewer's signed verdict on each finding. Nothing before this stage creates a record the interviewer will see.
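The two sampling steps in tier 3 can be sketched as follows, assuming completes are available as `(complete_id, interviewer_id)` pairs (a hypothetical shape; function names are illustrative). The baseline draw targets a 10–20% rate while guaranteeing every interviewer at least one back-check; the tier-3 draw pulls 20–30% of one interviewer's recent completes for the second reviewer:

```python
import math
import random
from collections import defaultdict


def backcheck_sample(completes, rate=0.10, seed=None):
    """Baseline back-check draw: about `rate` of all completes, with every
    interviewer covered at least once. If there are more interviewers than
    the target size, coverage wins and the sample exceeds the rate."""
    rng = random.Random(seed)
    by_interviewer = defaultdict(list)
    for complete_id, interviewer in completes:
        by_interviewer[interviewer].append(complete_id)
    # Guarantee: one random complete per interviewer.
    sample = {rng.choice(ids) for ids in by_interviewer.values()}
    # Top up at random until the overall target is reached.
    target = math.ceil(rate * len(completes))
    remaining = [cid for cid, _ in completes if cid not in sample]
    top_up = min(max(0, target - len(sample)), len(remaining))
    sample.update(rng.sample(remaining, top_up))
    return sample


def tier3_review_sample(interviewer_completes, share=0.25, seed=None):
    """Tier-3 draw: `share` (20-30% in the workflow) of one interviewer's
    recent completes, for a second, independent reviewer."""
    rng = random.Random(seed)
    k = max(1, math.ceil(share * len(interviewer_completes)))
    return rng.sample(interviewer_completes, k)


completes = [(f"C{i}", f"I{i % 7}") for i in range(100)]
baseline = backcheck_sample(completes, rate=0.10, seed=1)
flagged = [cid for cid, iv in completes if iv == "I3"]
print(len(baseline), len(tier3_review_sample(flagged, share=0.25, seed=1)))  # 10 4
```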
| Tier | Trigger | Action | Output |
|---|---|---|---|
| 1: Paradata signal | Duration shortfall, completes-per-hour spike, or unusually consistent inter-call intervals | Increase audio review coverage of that session; document the signal | No interviewer contact |
| 2: Single audio flag | One audio flag on a call | Human review of the flagged call; classify as dismissed, procedural issue, or escalated | Verdict logged; escalation if substantive and consistent with paradata |
| 3: Multiple corroborating signals | Confirmed audio finding + supporting paradata | Pull 20–30% sample for a second independent reviewer; initiate back-checks on the flagged interviewer's listed respondents | Back-check results + broader sample verdicts |
| 4: Back-check discrepancy or confirmed fabrication | Tier 3 evidence confirmed | Escalate to management with full documentation package | Formal evidence package; interviewer and client decisions follow |
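The escalation rules can be encoded as a small routing function that returns the highest tier the accumulated evidence supports, plus an explicit check for the tier-4 gate. This assumes evidence is tracked per interviewer as a few booleans and a count; the field names are hypothetical, and reading "multiple confirmed fabrication findings" as two or more is also an assumption:

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    NONE = 0
    PARADATA = 1       # paradata signal only
    AUDIO_FLAG = 2     # single audio flag: human review of the call
    CORROBORATED = 3   # confirmed audio finding + supporting paradata
    ESCALATE = 4       # back-check discrepancy or multiple confirmed fabrications


@dataclass
class Evidence:
    paradata_signal: bool = False
    audio_flag: bool = False
    audio_confirmed: bool = False
    backcheck_discrepancy: bool = False
    confirmed_fabrications: int = 0


def route(ev):
    """Highest tier the accumulated evidence supports, checked top-down."""
    if ev.backcheck_discrepancy or ev.confirmed_fabrications >= 2:
        return Tier.ESCALATE
    if ev.audio_confirmed and ev.paradata_signal:
        return Tier.CORROBORATED
    if ev.audio_flag or ev.audio_confirmed:
        return Tier.AUDIO_FLAG
    if ev.paradata_signal:
        return Tier.PARADATA
    return Tier.NONE


def visible_to_interviewer(tier):
    """Irreversibility gate: nothing below tier 4 creates a record the interviewer sees."""
    return tier >= Tier.ESCALATE


print(route(Evidence(paradata_signal=True)).name)  # PARADATA
print(route(Evidence(paradata_signal=True, audio_flag=True, audio_confirmed=True)).name)  # CORROBORATED
print(visible_to_interviewer(Tier.CORROBORATED))   # False
```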
The discipline that makes this workflow defensible is the irreversibility gate at tier 4. Informal conversations with an interviewer about "flags we noticed" before the evidence is complete are well-intentioned, but in many employment frameworks they create legal exposure. Let the evidence accumulate before the meeting.
The false-positive problem
This section deserves direct treatment, because the vendor incentive runs in the opposite direction: a falsification detection system that generates findings looks productive. A system that generates findings loosely creates liabilities.
False positives in falsification detection are an HR problem, not just a QC inaccuracy. A falsification flag that reaches an interviewer's file — or a client — without a confirmed human verdict is a wrongful-accusation problem. Interviewers have jobs; flagging someone incorrectly for fraud has consequences that a miscategorized probe finding does not.
Each audio signal has legitimate alternative explanations:
- No respondent voice can mean a respondent with a very soft voice, a Bluetooth headset on their end, or single-track call recording that captures only the interviewer channel.
- Speed-read screener can mean a fluent respondent who answered briskly and quietly between questions: the answers are on the recording, just hard to hear.
- Acoustic discontinuity can mean the respondent switched from handset to speakerphone, moved to a different room, or had background noise change mid-call.
These alternatives are not reasons to avoid audio signals — they are reasons to use audio signals as triggers for human review, never as findings in themselves. The tiered workflow above exists to interpose human judgment between a detected signal and any formal action.
For managing this in practice:
- Never present an AI falsification flag as a confirmed finding. The output should read "Review: potential falsification signal" — not "Falsification detected." One label invites review; the other invites a grievance.
- Publish your false-positive rate. If a QC program confirms 90% of its falsification flags, either the detection criteria are too narrow or the reviewing is too permissive. If it confirms 5%, the flags are mostly noise. A well-calibrated program keeps false-positive rates low enough that reviewers trust what they are asked to review. VeriCATI publishes how we measure ours at /accuracy.
- Train reviewers on the explainable alternatives. A reviewer who only knows "no respondent voice equals fabrication" will over-confirm. One who understands recording artifacts, soft-voice respondents, and the acoustic difference between partial fabrication and a respondent who went quiet will make better calls.
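A false-positive rate is easiest to publish when it is computed per signal type from reviewer verdicts, since that shows which detector is noisy rather than blending them. A minimal sketch, assuming verdicts are logged as `(signal_type, verdict)` pairs; the names are hypothetical, and the 5% and 90% band edges simply reuse the two warning points in the text:

```python
from collections import Counter


def confirmation_rates(verdicts):
    """Share of reviewed flags confirmed, per signal type.

    `verdicts` is an iterable of (signal_type, verdict) pairs, where verdict
    is the human reviewer's decision, e.g. "confirmed" or "dismissed".
    """
    totals, confirmed = Counter(), Counter()
    for signal, verdict in verdicts:
        totals[signal] += 1
        if verdict == "confirmed":
            confirmed[signal] += 1
    return {signal: confirmed[signal] / totals[signal] for signal in totals}


def outside_band(rates, low=0.05, high=0.90):
    """Signal types whose confirmation rate is at or beyond either warning point."""
    return sorted(s for s, r in rates.items() if r <= low or r >= high)


verdicts = [
    ("no_voice", "confirmed"), ("no_voice", "dismissed"),
    ("no_voice", "dismissed"), ("no_voice", "dismissed"),
    ("discontinuity", "confirmed"), ("discontinuity", "confirmed"),
]
rates = confirmation_rates(verdicts)
print(rates)                # {'no_voice': 0.25, 'discontinuity': 1.0}
print(outside_band(rates))  # ['discontinuity']
```

With small review counts these rates swing widely, so they are more informative accumulated over a field period than read per week.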
After a confirmed case: remediation and client disclosure
If tier-4 evidence is assembled and confirmed, the decisions from that point have a constrained option set. Most QC programs have not answered "what do we do next" before they need it.
Regarding the affected completes. Confirmed fabricated interviews are removed and replaced — they cannot be cleaned. How many to remove depends on verification scope: if back-checks confirm fabrication only in the reviewed sample, you have a decision about how to treat the unreviewed portion of that interviewer's work. That decision should be documented, with reasoning, against the tier-4 evidence package.
Regarding the interviewer. That decision belongs to management and HR, not to the QC team. The QC team's role ends when it delivers a documented finding — both what was confirmed and what was dismissed — to the appropriate decision-maker. Documenting ruled-out findings alongside confirmed ones protects the process against later challenge.
Regarding the client. The obligation here is clearest: if falsified data reached the client, they need to know. Most client contracts require notification of data-integrity issues. The defensible position is proactive disclosure with: the number of completes affected, the nature of the falsification, what replacement or cleaning was done, and the supporting evidence. Agencies that notify proactively, with documentation, manage this as a QC-process story. Those whose clients discover it independently manage it as a trust story.
Regarding future deterrence. A confirmed case is evidence that existing deterrence was insufficient for at least one interviewer. Review the pre-field brief: was the existence of full audio review communicated to interviewers? Was the back-check program described? The AAPOR task force finding is worth repeating: informing interviewers that their work will be verified is itself a preventive control. Making the monitoring program more visible — stated plainly in the pre-field brief, not as a threat but as a matter of practice — is the cheapest response to a confirmed case, and the most durable one.
A note on AI detection and its limits
Audio-based AI analysis can cover 100% of a call batch rather than a reviewed sample — the same argument made in the CATI Quality Control guide for first-pass AI review generally. (LAPOP reaches 100% audio coverage today with a dedicated human audit team, weighting falsification as an automatic fail in its QuAC rubric; AI brings that coverage within reach without the headcount.) For falsification detection specifically, that coverage means a partial-fabrication pattern on a Tuesday afternoon doesn't depend on landing in a 10% manual review sample. The paradata signals above can be monitored automatically; the audio signals can be flagged without a reviewer listening to every call first.
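The coverage argument can be made concrete with the hypergeometric miss probability. If k of N calls in a batch are fabricated and a simple random sample of n calls is reviewed, the chance of catching at least one is 1 − C(N−k, n) / C(N, n). The batch size and fabricated count below are hypothetical numbers chosen for illustration:

```python
from math import comb


def detection_probability(total_calls, fabricated, sample_size):
    """Probability that a simple random sample (without replacement) of
    `sample_size` calls includes at least one of `fabricated` calls:
    1 - C(N - k, n) / C(N, n)."""
    return 1 - comb(total_calls - fabricated, sample_size) / comb(total_calls, sample_size)


# Hypothetical batch: 400 calls, 3 of them partially fabricated.
print(round(detection_probability(400, 3, 40), 3))  # 0.272  (10% manual sample)
print(detection_probability(400, 3, 400))           # 1.0    (full coverage)
```

Under these assumptions a 10% sample misses all three fabricated calls roughly 73% of the time, while full coverage reaches every call by construction; whether a flagged call is then confirmed still depends on human review.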
What AI detection does not change: the need for human verification before any formal action, the back-check as the only definitive confirmation of respondent contact, and the false-positive obligations described above. A program that replaces human judgment with automated verdicts on falsification flags is not tighter — it is faster to generate a liability.
The right role: AI as the layer that surfaces where to look; human reviewers as the layer that decides what was found.
Sources and refresh policy
- AAPOR/ASA, Interviewer Falsification in Survey Research (2003) — the primary falsification taxonomy, deterrence findings, and synthesis of prevalence evidence
- LAPOP Quality Control — 100% audio audit model and falsification weighting in the QuAC rubric (falsification scores 100 points, auto-fail)
- World Bank DIME Wiki: Back Checks — back-check design, coverage rates, and discrepancy reconciliation
- J-PAL: Data quality checks — every-interviewer coverage rule and verification floor
Published June 11, 2026. The audio-signal descriptions in this post reflect patterns observed in VeriCATI deployments and are updated as detection refinements occur. Primary sources are re-verified annually.