Precision & Accuracy Metrics
Understanding how we measure and ensure the accuracy of our AI quality control system when comparing against human QC reviewers.
F2-Score
Weighted harmonic mean of precision and recall, favoring recall
False Positive Rate
Keeping false alarms manageable
Human Agreement
Inter-reviewer consistency benchmark
What We Measure (and Why)
We want a system that catches real problems while minimizing unnecessary work
Recall
“Of all the real problems, how many did the system catch?”
Precision
“Of all the things flagged, how many were truly problems?”
F2-Score
Combines precision and recall in a weighted harmonic mean that prioritizes catching problems over avoiding false alarms
Why F2-Score for Quality Control?
Precision, recall, and F-scores are widely accepted metrics across many industries. We use the F2 variant because it weights recall more heavily than precision: in quality control, missing a real problem costs far more than reviewing a false alarm.
Confusion Matrix
The four possible outcomes for each survey
| What the human said | What the system said | Name we use |
|---|---|---|
| Red flag present | Flagged | Hit (True Positive, TP) |
| Red flag present | Not flagged | Miss (False Negative, FN) |
| No red flag | Flagged | False alarm (False Positive, FP) |
| No red flag | Not flagged | Correct blank (True Negative, TN) |
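As a minimal sketch (the helper and the sample data are illustrative, not our production pipeline), the four outcomes can be tallied from paired human/system labels like this:

```python
from collections import Counter

def tally_outcomes(human_flags, system_flags):
    """Count hits, misses, false alarms, and correct blanks.

    human_flags[i]  -- True if the human reviewer found a red flag
    system_flags[i] -- True if the system flagged the survey
    """
    counts = Counter()
    for human, system in zip(human_flags, system_flags):
        if human and system:
            counts["TP"] += 1   # hit
        elif human:
            counts["FN"] += 1   # miss
        elif system:
            counts["FP"] += 1   # false alarm
        else:
            counts["TN"] += 1   # correct blank
    return counts

# Five surveys reviewed by both the human and the system
human  = [True, True, False, False, False]
system = [True, False, True, False, False]
print(tally_outcomes(human, system))
# Counter({'TN': 2, 'TP': 1, 'FN': 1, 'FP': 1})
```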
Mathematical Foundation
How we compute our key performance metrics
Recall
Sensitivity to problems
“Of all real problems, how many did we catch?”
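In the notation of the confusion matrix above:

$$\text{Recall} = \frac{TP}{TP + FN}$$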
Precision
Accuracy of flags
“Of all flags raised, how many were correct?”
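Using the same counts:

$$\text{Precision} = \frac{TP}{TP + FP}$$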
F2-Score Formula
The weighted harmonic mean that prioritizes recall
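This is the general $F_\beta$ score with $\beta = 2$, which treats recall as twice as important as precision:

$$F_2 = \frac{(1 + 2^2) \cdot \text{Precision} \cdot \text{Recall}}{2^2 \cdot \text{Precision} + \text{Recall}} = \frac{5 \cdot \text{Precision} \cdot \text{Recall}}{4 \cdot \text{Precision} + \text{Recall}}$$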
False Positive Rate
Rate of unnecessary flags
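The fraction of problem-free surveys that get flagged anyway:

$$\text{FPR} = \frac{FP}{FP + TN}$$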
Performance Standards
Our rigorous thresholds for system acceptance
F2-Score ≥ 0.85
Primary Performance Metric
Why 0.85?
Human reviewers typically agree with each other 85-90% of the time on complex quality control tasks. Our 0.85 target holds the AI to human-level consistency without demanding impossible perfection.
False Positive Rate < 10%
Workload Management
Why under 10%?
False positives create extra work for reviewers. Keeping this rate under 10% means fewer than 1 in 10 problem-free surveys gets flagged unnecessarily, maintaining reviewer efficiency and the credibility of the system's flags.
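A minimal sketch of how both standards could be checked together (the function names, thresholds as defaults, and sample counts are illustrative, not our production configuration):

```python
def f2_score(tp: int, fp: int, fn: int) -> float:
    """F2 expressed directly in raw counts: 5*TP / (5*TP + 4*FN + FP)."""
    return 5 * tp / (5 * tp + 4 * fn + fp) if tp else 0.0

def false_positive_rate(fp: int, tn: int) -> float:
    """Share of problem-free surveys that were flagged anyway."""
    return fp / (fp + tn) if (fp + tn) else 0.0

def meets_standards(tp, fp, fn, tn, f2_min=0.85, fpr_max=0.10):
    """Accept the system only if F2 >= 0.85 AND FPR < 10%."""
    return f2_score(tp, fp, fn) >= f2_min and false_positive_rate(fp, tn) < fpr_max

# Example run: 90 hits, 8 false alarms, 10 misses, 892 correct blanks
print(meets_standards(tp=90, fp=8, fn=10, tn=892))
# True  (F2 ~= 0.904, FPR ~= 0.89%)
```

Writing F2 in raw counts (obtained by substituting the precision and recall definitions into the F2 formula) sidesteps a divide-by-zero when the system raises no flags at all.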
Common Questions
Understanding our methodology
“Why not use accuracy (right vs wrong)?”
Because red flags are rare. A system that never flags anything could look “99% accurate” but would miss real problems. F2 focuses on catching problems, which is what matters for QC.
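For example, suppose 10 of 1,000 surveys contain a real red flag and the system flags nothing:

$$\text{Accuracy} = \frac{0 + 990}{1000} = 99\%, \qquad \text{Recall} = \frac{0}{0 + 10} = 0, \qquad F_2 = 0$$

A “99% accurate” system that catches zero problems is worthless for QC, which is exactly what the F2 score exposes.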
“Why weight Recall more than Precision?”
Because missing a real problem (false negative) costs far more than checking a false alarm (false positive). A false alarm is just 20-40 seconds of review time, while missing problems can have serious consequences.
Ready to See These Metrics in Action?
Our AI consistently achieves F2 scores above 0.85 while maintaining false positive rates under 10%. See how this translates to real results for your quality control process.