Daily Report: Perplexity Evaluation Kicked Off · Experiment Matrix Finalized — Mar 2, 2026


Perplexity Evaluation Kicked Off

After building the test sets yesterday, I finally got the evaluation pipeline running today.

The evaluation script loads each experimental group’s LoRA adapter in sequence and computes the average cross-entropy loss on assistant tokens across three independent test sets. Thanks to history caching, already-completed models are automatically skipped. Currently running: exp4 (TopQ-1000):

[INFO] Model loaded, VRAM usage: 14.19 GB
[INFO] Loaded historical results for: ['baseline', 'exp1', 'exp2', 'exp3']
[INFO] Evaluating model: exp4
[INFO] Mounting LoRA adapter: /workspace/outputs/exp4/final
[exp4/gold]  50/200  avg_loss=0.4233  elapsed=227s
[exp4/gold] 100/200  avg_loss=0.4169  elapsed=451s

Everything looks healthy — model loading is clean and the loss is ticking down steadily.

exp4 perplexity evaluation in progress (terminal log)
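The skip-on-cache and assistant-token-averaging behavior described above can be sketched in plain Python. This is a minimal, hypothetical sketch: `evaluate`, `average_assistant_loss`, and the `eval_fn` callback are stand-ins I made up for the real script, which actually mounts LoRA adapters and runs forward passes.

```python
def average_assistant_loss(token_losses, assistant_mask):
    """Mean cross-entropy over assistant tokens only (mask entries of 1).

    token_losses: per-token CE losses from a forward pass (stubbed here).
    assistant_mask: 1 for assistant tokens, 0 for user/system tokens.
    """
    picked = [loss for loss, m in zip(token_losses, assistant_mask) if m]
    return sum(picked) / len(picked)


def evaluate(groups, history, eval_fn):
    """Evaluate each group, skipping any already present in the history cache.

    history: dict of previously computed results (loaded from disk in the
    real pipeline); eval_fn: callable that runs one group's evaluation.
    """
    results = dict(history)
    for group in groups:
        if group in results:
            continue  # already-completed models are skipped
        results[group] = eval_fn(group)
    return results
```

With a cached `baseline` result, only `exp4` would trigger a fresh evaluation, mirroring the `Loaded historical results` log line above.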


Experiment Matrix Finalized

While the evaluation was running in the background, I took the time to lock down the full experiment design. The matrix consists of 13 experimental groups plus 1 baseline, organized into three research blocks:

Experiment Matrix Overview: 13 Groups + 1 Baseline

Block 1: Data Volume & Strategy Comparison (7 groups)

The heart of the experiment. This block systematically stress-tests the data selection strategy from three angles:

Comparison   | Research Question
exp1 vs exp5 | Does the Gate matter? Full-pool random vs resolved-pool random
exp5 vs exp3 | Does scoring matter? Resolved random vs resolved top-ranked
500 → 1000   | Data scaling effect: how much does doubling the data help across each strategy?
exp3 vs exp7 | Sanity check: best vs worst selection, verifying the scoring system makes sense

exp7 (BottomQ-500) is sampled by reversing the composite score ranking — the “bad data” control group. If the final loss gradient holds up, it’s the most direct validation of the entire scoring system.
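The TopQ/BottomQ sampling is just rank-and-slice over the composite score. A minimal sketch, assuming samples carry a precomputed score (the `select` helper and its signature are my own illustration, not the project's actual code):

```python
def select(samples, scores, k, worst=False):
    """Return the k highest-scored samples (or lowest, for the
    BottomQ 'bad data' control group, when worst=True)."""
    order = sorted(range(len(samples)),
                   key=lambda i: scores[i],
                   reverse=not worst)
    return [samples[i] for i in order[:k]]
```

Under this sketch, exp3 (TopQ) would be `select(pool, scores, 500)` and exp7 (BottomQ) would be `select(pool, scores, 500, worst=True)`.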

Block 2: Major Dimension Ablation (2 groups)

Isolates the independent contribution of the two top-level scoring dimensions:

  • Ablation-NoEfficiency-500 (exp8): selection based on Style score only
  • Ablation-NoStyle-500 (exp9): selection based on Efficiency score only

Compared against exp3 (full composite), this tells us which dimension carries more weight in determining data quality.
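One way to view the ablations: if the composite is a weighted sum of the two major dimensions, exp8 and exp9 amount to zeroing one weight. The weights and the linear form here are assumptions for illustration; the report does not specify the actual formula.

```python
def composite(style, efficiency, w_style=0.5, w_eff=0.5):
    """Hypothetical composite score: weighted sum of the two
    top-level dimensions (real weights/formula not specified)."""
    return w_style * style + w_eff * efficiency

# exp8 (Ablation-NoEfficiency): rank by Style only.
no_efficiency = lambda s, e: composite(s, e, w_style=1.0, w_eff=0.0)
# exp9 (Ablation-NoStyle): rank by Efficiency only.
no_style = lambda s, e: composite(s, e, w_style=0.0, w_eff=1.0)
```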

Block 3: Sub-dimension Ablation (4 groups)

Drills further into each major dimension to isolate sub-metric contributions:

Comparison             | Research Question
exp10 vs exp11 vs exp3 | Within Efficiency: does Error-Retry Cycles or Step Count Ratio matter more?
exp12 vs exp13 vs exp3 | Within Style: does Action Diversity or Observation Utilization matter more?

Expected Results

Across all models, the expected loss gradient is:

Loss(Gold) < Loss(Random) < Loss(Low-Q)

This gradient serves a dual purpose: it validates the scoring system’s ability to distinguish quality tiers, and acts as the primary signal for whether fine-tuning on higher-quality data actually improves model behavior. A group that achieves significantly lower Gold loss is genuinely learning the high-quality patterns.
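Since perplexity is the exponential of the mean cross-entropy loss, the expected loss gradient implies the same ordering in perplexity. A quick check (the three loss values are illustrative, not measured results):

```python
import math

def perplexity(mean_loss):
    """Perplexity = exp(mean cross-entropy loss per token)."""
    return math.exp(mean_loss)

# Illustrative losses only; exp() is monotonic, so the ordering
# Loss(Gold) < Loss(Random) < Loss(Low-Q) carries over to perplexity.
gold, random_, low_q = 0.40, 0.55, 0.70
```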


Reuse Status

Due to a refinement in the composite scoring formula, the sample selections behind some previously trained models are now invalid, so those models need to be retrained:

Experiment               | Reusable?  | Reason
baseline                 | ✅ Yes     | No fine-tuning; independent of the scoring formula
Random-500 / Random-1000 | ✅ Yes     | Random sampling, unaffected by score changes
All others               | ❌ Retrain | Score formula changed, so different samples are selected

Effective new training runs required: 11 groups.


Tomorrow’s Plan

  • Wait for all experimental groups to finish perplexity evaluation (estimated a few more hours)
  • Aggregate and compare per-group loss across Gold / Random / Low-Q test sets
  • Plot a loss comparison heatmap or line chart to draw preliminary conclusions
  • Based on Block 1 results, determine whether to prioritize kicking off the ablation group retraining