Precision & Accuracy Metrics

Understanding how we measure and ensure the accuracy of our AI quality control system when comparing against human QC reviewers.

F2-Score (Target: ≥ 0.85)
Weighted harmonic mean favoring recall over precision

False Positive Rate (Threshold: < 10%)
Keeping false alarms manageable

Human Agreement (Baseline: 85-90%)
Inter-reviewer consistency benchmark

What We Measure (and Why)

We want a system that catches real problems while minimizing unnecessary work.

Recall (R)
“Of all the real problems, how many did the system catch?”

Precision (P)
“Of all the things flagged, how many were truly problems?”

F2-Score (F₂)
Combines precision and recall in a weighted harmonic mean that prioritizes catching problems over avoiding false alarms.
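For example (illustrative numbers, not measured results): if 10 surveys truly contain problems and the system flags 9 of them plus 1 clean survey, recall is 9 ÷ 10 = 0.90 and precision is 9 ÷ 10 = 0.90.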

Why F2-Score for Quality Control?

Precision, recall, and the F2-score are standard metrics in fields where missing a real problem costs more than investigating a false alarm:

🏥 Cancer Screening
Missing a tumor is worse than a false alarm

💳 Fraud Detection
Missing fraud costs more than extra checks

🔧 Quality Control
Missing defects is worse than reviewing good calls

Confusion Matrix

The four outcomes on each survey

What the human said | What the system said | Name we use
Red flag present    | Flagged              | Hit (True Positive, TP)
Red flag present    | Not flagged          | Miss (False Negative, FN)
No red flag         | Flagged              | False alarm (False Positive, FP)
No red flag         | Not flagged          | Correct blank (True Negative, TN)
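To make the four outcomes concrete, here is a minimal sketch of how they could be tallied from paired human/system decisions. It assumes each review reduces to a (human_flagged, system_flagged) pair of booleans; the names are illustrative, not our production code.

```python
# A minimal sketch (not production code): tally the four outcomes from
# paired human/system decisions, assuming each review reduces to a
# (human_flagged, system_flagged) pair of booleans.

def confusion_counts(reviews):
    """Return (TP, FN, FP, TN) counts for a list of (human, system) pairs."""
    tp = fn = fp = tn = 0
    for human_flagged, system_flagged in reviews:
        if human_flagged and system_flagged:
            tp += 1  # Hit: both saw a red flag
        elif human_flagged:
            fn += 1  # Miss: the system overlooked a real problem
        elif system_flagged:
            fp += 1  # False alarm: the system flagged a clean survey
        else:
            tn += 1  # Correct blank: both agreed nothing was wrong
    return tp, fn, fp, tn

# Illustrative run over six surveys:
reviews = [(True, True), (True, False), (False, False),
           (False, True), (True, True), (False, False)]
print(confusion_counts(reviews))  # -> (2, 1, 1, 2)
```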

Mathematical Foundation

How we compute our key performance metrics

Recall (R): Sensitivity to problems

“Of all real problems, how many did we catch?”

Recall = TP ÷ (TP + FN)
       = Hits ÷ (Hits + Misses)

Precision (P): Accuracy of flags

“Of all flags raised, how many were correct?”

Precision = TP ÷ (TP + FP)
          = Hits ÷ (Hits + False Alarms)
F2-Score (F₂): The weighted harmonic mean that prioritizes recall

F₂ = (1 + 2²) × P × R ÷ (2² × P + R)
   = 5 × P × R ÷ (4 × P + R)
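As a worked example with illustrative values: for precision P = 0.80 and recall R = 0.90, F₂ = 5 × 0.80 × 0.90 ÷ (4 × 0.80 + 0.90) = 3.6 ÷ 4.1 ≈ 0.88, higher than the balanced F₁ of about 0.85 because the stronger recall dominates.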
False Positive Rate (FPR): Rate of unnecessary flags

FPR = FP ÷ (FP + TN)
    = False Alarms ÷ (False Alarms + Correct Blanks)

Target: under 10% false positive rate
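The four formulas translate directly into code. The sketch below assumes the counts come from a tally like the one shown earlier; the guard clauses and example counts are illustrative.

```python
# A minimal sketch of the four formulas above. The counts are illustrative;
# guard clauses avoid division by zero when a class is empty.

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def f2_score(p, r):
    # F-beta with beta = 2: recall is weighted four times as heavily as precision
    return 5 * p * r / (4 * p + r) if (4 * p + r) else 0.0

def false_positive_rate(fp, tn):
    return fp / (fp + tn) if (fp + tn) else 0.0

# Illustrative counts: 90 hits, 10 misses, 8 false alarms, 892 correct blanks
tp, fn, fp, tn = 90, 10, 8, 892
r, p = recall(tp, fn), precision(tp, fp)
print(f"Recall={r:.3f} Precision={p:.3f} "
      f"F2={f2_score(p, r):.3f} FPR={false_positive_rate(fp, tn):.4f}")
# -> Recall=0.900 Precision=0.918 F2=0.904 FPR=0.0089
```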

Performance Standards

Our rigorous thresholds for system acceptance

F2-Score Standard

Primary Performance Metric

Target: ≥ 0.85

Why 0.85?

Human reviewers typically agree with each other 85-90% of the time on complex quality control tasks. Our target of 0.85 ensures AI performance matches human-level consistency without demanding impossible perfection.

Aligns with industry “80-90% accuracy” standards

False Positive Rate

Workload Management

Limit: < 10%

Why under 10%?

False positives create extra work for reviewers. Keeping this rate under 10% ensures that fewer than 1 in 10 clean surveys is flagged unnecessarily, maintaining reviewer efficiency and system credibility.

Maintains positive business case and reviewer trust
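Taken together, the two standards form a simple acceptance gate. Here is a minimal sketch, using the documented thresholds and the example values computed earlier; the function name is illustrative.

```python
# A minimal sketch of the acceptance gate implied by the two standards above.
# The thresholds are the documented targets; the function name is illustrative.

F2_TARGET = 0.85
FPR_LIMIT = 0.10

def meets_standards(f2: float, fpr: float) -> bool:
    """Pass only if F2 meets the target AND the false positive rate stays under the limit."""
    return f2 >= F2_TARGET and fpr < FPR_LIMIT

print(meets_standards(0.904, 0.0089))  # True: both standards met
print(meets_standards(0.904, 0.15))    # False: too many false alarms
```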

Common Questions

Understanding our methodology

Q1

“Why not use accuracy (right vs wrong)?”

Because red flags are rare. A system that never flags anything could look “99% accurate” but would miss real problems. F2 focuses on catching problems, which is what matters for QC.

F2 optimizes for problem detection over raw accuracy
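To see why, here is a worked example with illustrative numbers: 1,000 surveys in which only 10 contain real problems.

```python
# Illustrative numbers only: 1,000 surveys, 10 with real problems (1% prevalence).
# A "never flag" system gets every clean survey right and every problem wrong.

tp, fn, fp, tn = 0, 10, 0, 990  # the system never raises a flag

accuracy = (tp + tn) / (tp + fn + fp + tn)
recall = tp / (tp + fn)

print(f"Accuracy: {accuracy:.1%}")  # 99.0% -- looks excellent
print(f"Recall:   {recall:.1%}")    # 0.0%  -- caught none of the real problems
# With zero recall, F2 is also zero: the system provides no QC value.
```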
Q2

“Why weight Recall more than Precision?”

Because missing a real problem (a false negative) costs far more than checking a false alarm (a false positive). A false alarm costs just 20-40 seconds of review time, while a missed problem can have serious consequences.

Cost of missing problems > Cost of false alarms

Ready to See These Metrics in Action?

Our AI consistently achieves F2 scores above 0.85 while maintaining false positive rates under 10%. See how this translates to real results for your quality control process.