TL;DR: Reporting “Model A got 72.3% and Model B got 71.8%” tells you almost nothing. Without confidence intervals, you don’t know if that difference is real. Without effect sizes, you don’t know if it matters. rotalabs-eval adds the statistical rigor that LLM evaluation desperately needs.

The Problem with Current Benchmarks

Open a typical LLM paper. You’ll see tables like this:

Model     MMLU   HumanEval   HellaSwag
Model A   72.3   45.2        83.1
Model B   71.8   46.1        82.9

And the conclusion: “Model A performs better on MMLU and HellaSwag, while Model B excels at HumanEval.”

But wait. Is 72.3 vs 71.8 a real difference? Or just noise?

Without confidence intervals, we have no idea. That 0.5% difference could easily be within the margin of error. We might be comparing random fluctuations.

This happens constantly in the field. Papers claim improvements that aren’t statistically significant. Leaderboards rank models by differences smaller than measurement error. Everyone pretends the numbers are more meaningful than they are.


What We Should Be Reporting

Here’s what proper evaluation looks like:

from rotalabs_eval import Evaluator, ComparisonResult

evaluator = Evaluator()

result = evaluator.compare(
    model_a_outputs=predictions_a,
    model_b_outputs=predictions_b,
    references=ground_truth,
    metric="accuracy"
)

print(f"Model A: {result.model_a.mean:.3f} (95% CI: {result.model_a.ci})")
print(f"Model B: {result.model_b.mean:.3f} (95% CI: {result.model_b.ci})")
print(f"Difference: {result.difference.mean:.3f} (95% CI: {result.difference.ci})")
print(f"Significant: {result.is_significant} (p={result.p_value:.4f})")
print(f"Effect size: {result.effect_size:.3f} ({result.effect_size_interpretation})")

Output:

Model A: 0.723 (95% CI: [0.701, 0.745])
Model B: 0.718 (95% CI: [0.696, 0.740])
Difference: 0.005 (95% CI: [-0.025, 0.035])
Significant: False (p=0.7432)
Effect size: 0.03 (negligible)

Now we know: the difference isn’t significant, the 95% CI on the difference spans zero, and the effect size is negligible. On this benchmark, these models are statistically indistinguishable. (Note that overlapping per-model CIs alone don’t prove non-significance; the CI on the paired difference is what matters.)


Confidence Intervals

A confidence interval tells you the range where the true value likely falls.

When you run a model on 1000 test examples and get 72.3% accuracy, that’s a sample estimate. The true accuracy (on all possible examples) might be 70% or 75%. The confidence interval quantifies this uncertainty.

rotalabs-eval computes CIs using bootstrap resampling by default:

from rotalabs_eval import compute_metric_with_ci

score, ci_low, ci_high = compute_metric_with_ci(
    predictions=predictions,
    references=references,
    metric="bleu",
    confidence_level=0.95,
    n_bootstrap=1000
)

print(f"BLEU: {score:.3f} [{ci_low:.3f}, {ci_high:.3f}]")

The bootstrap is non-parametric and works for any metric. For simple metrics like accuracy, we also support exact binomial intervals, which are faster.
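The core of the percentile bootstrap fits in a few lines of plain Python. This is a toy sketch of the idea, not the library’s implementation:

```python
import random

def bootstrap_ci(values, confidence=0.95, n_bootstrap=1000, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(values)
    # Resample the examples with replacement, recompute the mean each time
    means = sorted(
        sum(rng.choices(values, k=n)) / n
        for _ in range(n_bootstrap)
    )
    alpha = 1 - confidence
    lo = means[int(alpha / 2 * n_bootstrap)]
    hi = means[int((1 - alpha / 2) * n_bootstrap) - 1]
    return lo, hi

# Per-example accuracy scores (1 = correct, 0 = wrong): 72.3% on 1000 examples
scores = [1] * 723 + [0] * 277
lo, hi = bootstrap_ci(scores)
```

Resampling the per-example scores with replacement and taking percentiles of the resampled means is all there is to it; everything beyond that is vectorization and edge-case handling.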


Significance Testing

“Is Model A better than Model B?” is a yes/no question. Significance testing answers it by asking whether the observed difference is larger than what chance alone would produce.

We use paired tests when possible (same test set, same examples):

result = evaluator.compare(
    model_a_outputs=predictions_a,
    model_b_outputs=predictions_b,
    references=references,
    metric="rouge_l",
    test="paired_bootstrap"  # or "wilcoxon", "mcnemar"
)

if result.is_significant:
    winner = "A" if result.difference.mean > 0 else "B"
    print(f"Model {winner} is significantly better (p={result.p_value:.4f})")
else:
    print("No significant difference detected")

Paired tests are more powerful because they control for example-to-example variation.
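The logic of a paired bootstrap test is straightforward to sketch in plain Python. This is a toy version operating on per-example scores, not the library’s implementation:

```python
import random

def paired_bootstrap_test(scores_a, scores_b, n_bootstrap=2000, seed=0):
    """Two-sided paired bootstrap p-value for mean(scores_a) - mean(scores_b)."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    # Work on per-example differences so example-to-example variation cancels out
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    # Simple two-sided p-value: fraction of resampled mean differences
    # that land on the other side of zero, doubled
    flips = 0
    for _ in range(n_bootstrap):
        mean = sum(rng.choices(diffs, k=n)) / n
        if (mean <= 0) if observed > 0 else (mean >= 0):
            flips += 1
    p_value = min(1.0, 2 * flips / n_bootstrap)
    return observed, p_value
```

Because whole examples are resampled together, a model that wins consistently on the same examples gets a small p-value even when the marginal score distributions look similar.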


Effect Sizes

Statistical significance isn’t the same as practical significance.

With a large enough test set, even tiny differences become “significant.” But a 0.1% improvement might not matter for your application.

Effect size measures the magnitude of the difference, independent of sample size:

print(f"Cohen's d: {result.cohens_d:.3f}")
print(f"Interpretation: {result.effect_size_interpretation}")

Standard interpretations:

  • |d| < 0.2: negligible
  • 0.2 ≤ |d| < 0.5: small
  • 0.5 ≤ |d| < 0.8: medium
  • |d| ≥ 0.8: large

A “significant” result with negligible effect size means: yes, there’s a difference, but it’s too small to care about.
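For paired evaluations, Cohen’s d is just the mean difference divided by the standard deviation of the per-example differences. A minimal stdlib sketch using the thresholds above (an illustration, not the library’s code):

```python
from statistics import mean, stdev

def cohens_d_paired(scores_a, scores_b):
    """Cohen's d for paired samples: mean difference / SD of differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / stdev(diffs)

def interpret_d(d):
    """Map |d| onto the conventional labels."""
    d = abs(d)
    if d < 0.2:
        return "negligible"
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"
```

Note the sample-size independence: doubling the test set shrinks the p-value but leaves d essentially unchanged, which is exactly why it complements significance testing.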


Power Analysis

Before running an expensive evaluation, you should know: how many examples do I need to detect a meaningful difference?

from rotalabs_eval import power_analysis

required_n = power_analysis(
    expected_effect_size=0.3,  # small-to-medium effect
    significance_level=0.05,
    power=0.80,  # 80% chance of detecting the effect if it exists
    test="paired"
)

print(f"Required sample size: {required_n}")

If you run an evaluation with too few examples, you might miss real differences (low power). If you run with way more than needed, you’re wasting compute.
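Under a normal approximation, the required n for a two-sided paired test has a closed form: n = ((z_{1-α/2} + z_{1-β}) / d)². A stdlib sketch of that standard formula (not necessarily what power_analysis does internally):

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size for a two-sided paired test."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)   # critical value for the significance level
    z_beta = z(power)            # critical value for the desired power
    return ceil(((z_alpha + z_beta) / effect_size) ** 2)

n = required_sample_size(0.3)   # small-to-medium effect, defaults as above
```

The quadratic dependence on 1/d is the important part: detecting an effect half as large needs roughly four times as many examples.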


Multiple Metrics

Evaluating on multiple metrics introduces the multiple comparisons problem. If you test 20 metrics at p<0.05, you expect one false positive by chance.

rotalabs-eval applies corrections automatically:

results = evaluator.compare_multi(
    model_a_outputs=predictions_a,
    model_b_outputs=predictions_b,
    references=references,
    metrics=["bleu", "rouge_1", "rouge_2", "rouge_l", "bertscore", "bleurt"],
    correction="bonferroni"  # or "holm", "fdr"
)

for metric, result in results.items():
    sig = "*" if result.is_significant else ""
    print(f"{metric}: {result.difference.mean:+.3f} (p={result.p_value_corrected:.4f}){sig}")

Bonferroni is conservative but simple. Holm is uniformly more powerful. FDR (Benjamini-Hochberg) controls the expected fraction of false discoveries rather than the chance of any, and is appropriate when you’re testing many metrics and can tolerate a few false positives.
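Holm’s step-down procedure itself is only a few lines: sort the p-values, test the smallest against α/m, the next against α/(m−1), and stop at the first failure. A toy sketch:

```python
def holm_correction(p_values, alpha=0.05):
    """Return a per-test significance flag under Holm's step-down method."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, i in enumerate(order):
        # Smallest p-value is tested at alpha/m, the next at alpha/(m-1), ...
        if p_values[i] <= alpha / (m - rank):
            significant[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return significant
```

Because the thresholds relax as tests pass, Holm rejects everything Bonferroni does and sometimes more, while still controlling the family-wise error rate.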


LLM-as-Judge Evaluation

When using an LLM to judge quality, you need to account for judge variance:

from rotalabs_eval import LLMJudge

judge = LLMJudge(
    model="claude-3-opus",
    criteria="helpfulness, accuracy, and clarity"
)

# Multiple judgments per example to estimate variance
scores = judge.evaluate(
    outputs=predictions,
    references=references,
    n_judgments=3  # Judge each example 3 times
)

print(f"Mean: {scores.mean:.3f}")
print(f"Judge agreement (ICC): {scores.judge_agreement:.3f}")
print(f"95% CI: {scores.ci}")

Inter-rater reliability (ICC) tells you how consistent the judge is. Low agreement means your evaluation is noisy and you need more judgments.
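One common way to compute agreement for this setup is the one-way random-effects form, ICC(1,1) = (MS_between − MS_within) / (MS_between + (k−1)·MS_within), where k is the number of judgments per example. A stdlib sketch (the library may use a different ICC variant):

```python
from statistics import mean

def icc_1(ratings):
    """One-way random-effects ICC(1,1) for n examples x k judgments each."""
    n = len(ratings)
    k = len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    # Between-example and within-example mean squares
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (v - m) ** 2 for row, m in zip(ratings, row_means) for v in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```

A judge that gives identical scores across repeats yields ICC = 1; scores that vary as much within an example as between examples drive ICC toward zero or below, a sign that more judgments per example are needed.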


Putting It All Together

Here’s a complete evaluation workflow:

from rotalabs_eval import Evaluator, EvalConfig

config = EvalConfig(
    metrics=["bleu", "rouge_l", "bertscore"],
    confidence_level=0.95,
    n_bootstrap=1000,
    significance_test="paired_bootstrap",
    multiple_comparison_correction="holm",
    compute_effect_sizes=True
)

evaluator = Evaluator(config)

# Run comparison
results = evaluator.compare(
    model_a_outputs=outputs_a,
    model_b_outputs=outputs_b,
    references=references
)

# Generate report
report = results.to_markdown()
print(report)

Output:

## Model Comparison Results

| Metric | Model A | Model B | Diff | 95% CI | p-value | Effect |
|--------|---------|---------|------|--------|---------|--------|
| BLEU | 0.342 | 0.328 | +0.014 | [0.002, 0.026] | 0.023* | small |
| ROUGE-L | 0.456 | 0.451 | +0.005 | [-0.008, 0.018] | 0.441 | negligible |
| BERTScore | 0.891 | 0.887 | +0.004 | [-0.002, 0.010] | 0.182 | negligible |

*Significant after Holm correction (alpha=0.05)

### Summary
Model A shows a statistically significant improvement on BLEU with a small effect size.
No significant differences on ROUGE-L or BERTScore.

Now you have something defensible. The numbers mean something.


Installation

pip install rotalabs-eval

# With semantic metrics (BERTScore, BLEURT)
pip install rotalabs-eval[semantic]

# With LLM judge
pip install rotalabs-eval[judge]


Further Reading

These papers cover the theory behind proper NLP evaluation. We’ve tried to make the practice easy.


Questions about evaluation methodology? Reach out at research@rotalabs.ai.