Daily Report: Perplexity Evaluation Kicked Off · Experiment Matrix Finalized — Mar 2, 2026


Perplexity Evaluation Kicked Off

After building the test sets yesterday, I finally got the evaluation pipeline running today.

The evaluation script loads each experimental group’s LoRA adapter in sequence and computes the average cross-entropy loss on assistant tokens across three independent test sets. Thanks to history caching, already-completed models are automatically skipped. Currently running: exp4 (TopQ-1000):

[INFO] Model loaded, VRAM usage: 14.19 GB
[INFO] Loaded historical results for: ['baseline', 'exp1', 'exp2', 'exp3']
[INFO] Evaluating model: exp4
[INFO] Mounting LoRA adapter: /workspace/outputs/exp4/final
[exp4/gold]  50/200  avg_loss=0.4233  elapsed=227s
[exp4/gold] 100/200  avg_loss=0.4169  elapsed=451s

Everything looks healthy — model loading is clean and the loss is ticking down steadily.

exp4 perplexity evaluation in progress (terminal log)
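The skip-on-cache and assistant-token-averaging behavior described above can be sketched in plain Python. This is a minimal, hypothetical sketch: `evaluate`, `average_assistant_loss`, and the `eval_fn` callback are stand-ins I made up for the real script, which actually mounts LoRA adapters and runs forward passes.

```python
def average_assistant_loss(token_losses, assistant_mask):
    """Mean cross-entropy over assistant tokens only (mask entries of 1).

    token_losses: per-token CE losses from a forward pass (stubbed here).
    assistant_mask: 1 for assistant tokens, 0 for user/system tokens.
    """
    picked = [loss for loss, m in zip(token_losses, assistant_mask) if m]
    return sum(picked) / len(picked)


def evaluate(groups, history, eval_fn):
    """Evaluate each group, skipping any already present in the history cache.

    history: dict of previously computed results (loaded from disk in the
    real pipeline); eval_fn: callable that runs one group's evaluation.
    """
    results = dict(history)
    for group in groups:
        if group in results:
            continue  # already-completed models are skipped
        results[group] = eval_fn(group)
    return results
```

With a cached `baseline` result, only `exp4` would trigger a fresh evaluation, mirroring the `Loaded historical results` log line above.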


Experiment Matrix Finalized

While the evaluation was running in the background, I took the time to lock down the full experiment design. The matrix consists of 13 experimental groups plus 1 baseline, organized into three research blocks:

Experiment Matrix Overview: 13 Groups + 1 Baseline

Block 1: Data Volume & Strategy Comparison (7 groups)

The heart of the experiment. This block systematically stress-tests the data selection strategy from three angles:

Comparison   | Research Question
exp1 vs exp5 | Does the Gate matter? Full-pool random vs resolved-pool random
exp5 vs exp3 | Does scoring matter? Resolved random vs resolved top-ranked
500 → 1000   | Data scaling effect: how much does doubling the data help across each strategy?
exp3 vs exp7 | Sanity check: best vs worst selection, verifying the scoring system makes sense

exp7 (BottomQ-500) is sampled by reversing the composite score ranking — the “bad data” control group. If the final loss gradient holds up, it’s the most direct validation of the entire scoring system.
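The TopQ/BottomQ sampling is just rank-and-slice over the composite score. A minimal sketch, assuming samples carry a precomputed score (the `select` helper and its signature are my own illustration, not the project's actual code):

```python
def select(samples, scores, k, worst=False):
    """Return the k highest-scored samples (or lowest, for the
    BottomQ 'bad data' control group, when worst=True)."""
    order = sorted(range(len(samples)),
                   key=lambda i: scores[i],
                   reverse=not worst)
    return [samples[i] for i in order[:k]]
```

Under this sketch, exp3 (TopQ) would be `select(pool, scores, 500)` and exp7 (BottomQ) would be `select(pool, scores, 500, worst=True)`.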

Block 2: Major Dimension Ablation (2 groups)

Isolates the independent contribution of the two top-level scoring dimensions:

  • Ablation-NoEfficiency-500 (exp8): selection based on Style score only
  • Ablation-NoStyle-500 (exp9): selection based on Efficiency score only

Compared against exp3 (full composite), this tells us which dimension carries more weight in determining data quality.
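One way to view the ablations: if the composite is a weighted sum of the two major dimensions, exp8 and exp9 amount to zeroing one weight. The weights and the linear form here are assumptions for illustration; the report does not specify the actual formula.

```python
def composite(style, efficiency, w_style=0.5, w_eff=0.5):
    """Hypothetical composite score: weighted sum of the two
    top-level dimensions (real weights/formula not specified)."""
    return w_style * style + w_eff * efficiency

# exp8 (Ablation-NoEfficiency): rank by Style only.
no_efficiency = lambda s, e: composite(s, e, w_style=1.0, w_eff=0.0)
# exp9 (Ablation-NoStyle): rank by Efficiency only.
no_style = lambda s, e: composite(s, e, w_style=0.0, w_eff=1.0)
```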

Block 3: Sub-dimension Ablation (4 groups)

Drills further into each major dimension to isolate sub-metric contributions:

Comparison             | Research Question
exp10 vs exp11 vs exp3 | Within Efficiency: does Error-Retry Cycles or Step Count Ratio matter more?
exp12 vs exp13 vs exp3 | Within Style: does Action Diversity or Observation Utilization matter more?

Expected Results

Across all models, the expected loss gradient is:

Loss(Gold) < Loss(Random) < Loss(Low-Q)

This gradient serves a dual purpose: it validates the scoring system’s ability to distinguish quality tiers, and acts as the primary signal for whether fine-tuning on higher-quality data actually improves model behavior. A group that achieves significantly lower Gold loss is genuinely learning the high-quality patterns.
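Since perplexity is the exponential of the mean cross-entropy loss, the expected loss gradient implies the same ordering in perplexity. A quick check (the three loss values are illustrative, not measured results):

```python
import math

def perplexity(mean_loss):
    """Perplexity = exp(mean cross-entropy loss per token)."""
    return math.exp(mean_loss)

# Illustrative losses only; exp() is monotonic, so the ordering
# Loss(Gold) < Loss(Random) < Loss(Low-Q) carries over to perplexity.
gold, random_, low_q = 0.40, 0.55, 0.70
```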


Reuse Status

Due to a refinement in the composite scoring formula, the sample selections behind some previously trained models are now invalid, so those models need to be retrained:

Experiment               | Reusable?  | Reason
baseline                 | ✅ Yes     | No fine-tuning; independent of the scoring formula
Random-500 / Random-1000 | ✅ Yes     | Random sampling, unaffected by score changes
All others               | ❌ Retrain | Score formula changed, so different samples are selected

Effective new training runs required: 11 groups.


Tomorrow’s Plan

  • Wait for all experimental groups to finish perplexity evaluation (estimated a few more hours)
  • Aggregate and compare per-group loss across Gold / Random / Low-Q test sets
  • Plot a loss comparison heatmap or line chart to draw preliminary conclusions
  • Based on Block 1 results, determine whether to prioritize kicking off the ablation group retraining