Precision & Accuracy Metrics

Understanding how we measure and ensure the accuracy of our AI quality control system when comparing against human QC reviewers.

F2-Score (Target: ≥ 0.85)
Weighted harmonic mean favoring recall over precision

False Positive Rate (Threshold: < 10%)
Keeping false alarms manageable

Human Agreement (Baseline: 85-90%)
Inter-reviewer consistency benchmark

What We Measure (and Why)

We want a system that catches real problems while minimizing unnecessary work.

Recall (R)
“Of all the real problems, how many did the system catch?”

Precision (P)
“Of all the things flagged, how many were truly problems?”

F2-Score (F₂)
Combines precision and recall in a weighted harmonic mean that prioritizes catching problems over avoiding false alarms.
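For example (illustrative numbers, not measured results): if 10 surveys truly contain problems and the system flags 9 of them plus 1 clean survey, recall is 9 ÷ 10 = 0.90 and precision is 9 ÷ 10 = 0.90.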

Why F2-Score for Quality Control?

Precision, recall, and the F2-score are standard metrics in fields where missing a real problem costs more than investigating a false alarm:

🏥 Cancer Screening
Missing a tumor is worse than a false alarm

💳 Fraud Detection
Missing fraud costs more than extra checks

🔧 Quality Control
Missing defects is worse than reviewing good calls

Confusion Matrix

The four outcomes on each survey

What the human said | What the system said | Name we use
Red flag present    | Flagged              | Hit (True Positive, TP)
Red flag present    | Not flagged          | Miss (False Negative, FN)
No red flag         | Flagged              | False alarm (False Positive, FP)
No red flag         | Not flagged          | Correct blank (True Negative, TN)
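To make the four outcomes concrete, here is a minimal sketch of how they could be tallied from paired human/system decisions. It assumes each review reduces to a (human_flagged, system_flagged) pair of booleans; the names are illustrative, not our production code.

```python
# A minimal sketch (not production code): tally the four outcomes from
# paired human/system decisions, assuming each review reduces to a
# (human_flagged, system_flagged) pair of booleans.

def confusion_counts(reviews):
    """Return (TP, FN, FP, TN) counts for a list of (human, system) pairs."""
    tp = fn = fp = tn = 0
    for human_flagged, system_flagged in reviews:
        if human_flagged and system_flagged:
            tp += 1  # Hit: both saw a red flag
        elif human_flagged:
            fn += 1  # Miss: the system overlooked a real problem
        elif system_flagged:
            fp += 1  # False alarm: the system flagged a clean survey
        else:
            tn += 1  # Correct blank: both agreed nothing was wrong
    return tp, fn, fp, tn

# Illustrative run over six surveys:
reviews = [(True, True), (True, False), (False, False),
           (False, True), (True, True), (False, False)]
print(confusion_counts(reviews))  # -> (2, 1, 1, 2)
```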

Mathematical Foundation

How we compute our key performance metrics

Recall (R): Sensitivity to problems

“Of all real problems, how many did we catch?”

Recall = TP ÷ (TP + FN)
       = Hits ÷ (Hits + Misses)

Precision (P): Accuracy of flags

“Of all flags raised, how many were correct?”

Precision = TP ÷ (TP + FP)
          = Hits ÷ (Hits + False Alarms)
F2-Score (F₂): The weighted harmonic mean that prioritizes recall

F₂ = (1 + 2²) × P × R ÷ (2² × P + R)
   = 5 × P × R ÷ (4 × P + R)
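As a worked example with illustrative values: for precision P = 0.80 and recall R = 0.90, F₂ = 5 × 0.80 × 0.90 ÷ (4 × 0.80 + 0.90) = 3.6 ÷ 4.1 ≈ 0.88, higher than the balanced F₁ of about 0.85 because the stronger recall dominates.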
False Positive Rate (FPR): Rate of unnecessary flags

FPR = FP ÷ (FP + TN)
    = False Alarms ÷ (False Alarms + Correct Blanks)

Target: under 10% false positive rate
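The four formulas translate directly into code. The sketch below assumes the counts come from a tally like the one shown earlier; the guard clauses and example counts are illustrative.

```python
# A minimal sketch of the four formulas above. The counts are illustrative;
# guard clauses avoid division by zero when a class is empty.

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def f2_score(p, r):
    # F-beta with beta = 2: recall is weighted four times as heavily as precision
    return 5 * p * r / (4 * p + r) if (4 * p + r) else 0.0

def false_positive_rate(fp, tn):
    return fp / (fp + tn) if (fp + tn) else 0.0

# Illustrative counts: 90 hits, 10 misses, 8 false alarms, 892 correct blanks
tp, fn, fp, tn = 90, 10, 8, 892
r, p = recall(tp, fn), precision(tp, fp)
print(f"Recall={r:.3f} Precision={p:.3f} "
      f"F2={f2_score(p, r):.3f} FPR={false_positive_rate(fp, tn):.4f}")
# -> Recall=0.900 Precision=0.918 F2=0.904 FPR=0.0089
```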

Performance Standards

Our rigorous thresholds for system acceptance

F2-Score Standard

Primary Performance Metric

Target: ≥ 0.85

Why 0.85?

Human reviewers typically agree with each other 85-90% of the time on complex quality control tasks. Our target of 0.85 ensures AI performance matches human-level consistency without demanding impossible perfection.

Aligns with industry “80-90% accuracy” standards

False Positive Rate

Workload Management

Limit: < 10%

Why under 10%?

False positives create extra work for reviewers. Keeping this rate under 10% ensures that fewer than 1 in 10 clean surveys is flagged unnecessarily, maintaining reviewer efficiency and system credibility.

Maintains positive business case and reviewer trust
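Taken together, the two standards form a simple acceptance gate. Here is a minimal sketch, using the documented thresholds and the example values computed earlier; the function name is illustrative.

```python
# A minimal sketch of the acceptance gate implied by the two standards above.
# The thresholds are the documented targets; the function name is illustrative.

F2_TARGET = 0.85
FPR_LIMIT = 0.10

def meets_standards(f2: float, fpr: float) -> bool:
    """Pass only if F2 meets the target AND the false positive rate stays under the limit."""
    return f2 >= F2_TARGET and fpr < FPR_LIMIT

print(meets_standards(0.904, 0.0089))  # True: both standards met
print(meets_standards(0.904, 0.15))    # False: too many false alarms
```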

Common Questions

Understanding our methodology

Q1

“Why not use accuracy (right vs wrong)?”

Because red flags are rare. A system that never flags anything could look “99% accurate” but would miss real problems. F2 focuses on catching problems, which is what matters for QC.

F2 optimizes for problem detection over raw accuracy
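To see why, here is a worked example with illustrative numbers: 1,000 surveys in which only 10 contain real problems.

```python
# Illustrative numbers only: 1,000 surveys, 10 with real problems (1% prevalence).
# A "never flag" system gets every clean survey right and every problem wrong.

tp, fn, fp, tn = 0, 10, 0, 990  # the system never raises a flag

accuracy = (tp + tn) / (tp + fn + fp + tn)
recall = tp / (tp + fn)

print(f"Accuracy: {accuracy:.1%}")  # 99.0% -- looks excellent
print(f"Recall:   {recall:.1%}")    # 0.0%  -- caught none of the real problems
# With zero recall, F2 is also zero: the system provides no QC value.
```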
Q2

“Why weight Recall more than Precision?”

Because missing a real problem (a false negative) costs far more than checking a false alarm (a false positive). A false alarm costs just 20-40 seconds of review time, while a missed problem can have serious consequences.

Cost of missing problems > Cost of false alarms

Ready to See These Metrics in Action?

Our AI consistently achieves F2 scores above 0.85 while maintaining false positive rates under 10%. See how this translates to real results for your quality control process.