Terminal-Lego: Scaling Real-World Terminal Agents via Distilled Verification Paradigms

The Pedagogical Paradox: Claude Opus 4.6 achieves the highest standalone score, yet its trajectories produce weaker students.

Observation Support: DeepSeek-V3.2 emerges as the most effective teacher through Environment-Grounded Supervision, with the highest Targeted Observation Ratio (13.4%).

15K+

Executable Tasks
90+ domains

24.3%

TB 2.0 (32B)
from 3.4% base

3

Key Stages
for constructing tasks

7x

Improvement
over backbone

Abstract

Stronger code agents are commonly assumed to be superior teachers for post-training, yet this assumption remains poorly disentangled from task difficulty, harness design, and student capacity. We investigate this pedagogical link using Terminal-Lego, a scalable pipeline that transforms multi-domain real-world issues into environment-verified agentic tasks.

Surprisingly, standalone performance does not dictate teaching efficacy: while Claude Opus 4.6 achieves higher scores on Terminal-Bench 2.0, students fine-tuned on trajectories from DeepSeek-V3.2, a lower-scoring agent, exhibit significantly stronger generalization. We attribute this "pedagogical paradox" to Environment-Grounded Supervision (EGS): trajectories that explicitly expose inspect-act-verify behaviors allow students to internalize robust problem-solving routines rather than fragile action sequences.

Scaling analysis reveals exceptional data efficiency: with only 15.3k Terminal-Lego trajectories, Qwen3-32B achieves a 24.3% score on Terminal-Bench 2.0, rivaling previous SOTA performance established with over 30x the data volume.

The Pedagogical Paradox

Under a matched-task setting with 8.1K trajectories per teacher, we find a striking result: the strongest standalone agent (Claude Opus 4.6) produces the weakest students, while the weakest agent (DeepSeek-V3.2) produces the strongest students.

Teacher Model	TB 2.0	Avg. Turns	Qwen3-8B	Qwen3-32B
Claude Opus 4.6	69.4	3.8	5.6%	15.5%
Qwen3.5-Plus	52.5	6.7	8.6%	17.2%
GLM-5	56.2	5.4	8.6%	19.5%
DeepSeek-V3.2	39.3	7.3	10.5%	20.6%

All teachers use 8.1K task-aligned trajectories. Student performance reported as average pass@1 on Terminal-Bench 2.0 across three trials.

Environment-Grounded Supervision (EGS)

We propose that teachable trajectories are characterized by an explicit inspect-act-verify loop. We define the Targeted Observation Ratio (TOR) to measure the fraction of actions supported by path-aligned prior observations.

Observation Masking

Masking observation turns in DeepSeek-V3.2 trajectories drops Qwen3-32B from 20.6% to 13.8%, proving observation supervision is essential.

High-TOR vs Low-TOR

High-TOR trajectories produce 14.6% vs 11.8% for low-TOR on Qwen3-32B, confirming TOR predicts data utility.

Failed Trajectories Still Teach

2.5K failed DeepSeek-V3.2 trajectories train a 32B student (16.1%) that outperforms 8.1K passed Claude trajectories (15.5%).

TOR Across Teachers

DeepSeek-V3.2: 13.4% TOR | GLM-5: 7.3% | Qwen3.5-Plus: 6.5% | Claude: 2.5%

Activating Observation Behavior

To further examine observation behavior, we inject two system-level instructions that encourage environment-grounded interaction before any write actions:

- Inspect the environment and relevant files first, as thoroughly as possible.

- Only begin write/create/modify actions after you have fully verified the current file contents and directory structure.

Using the same teacher (Claude Opus 4.6) and the same 8.1K instances, TOR increases from 2.5% to 6.6%, and 32B student performance rises from 15.4% to 19.5%.
At inference time on 15K Terminal-Lego instances, the modified prompt improves Claude Opus 4.6 pass rate from 88.7% to 95.4%, showing gains in both training trajectory quality and teacher task-solving performance.

Takeaways

A teacher's raw capability does not directly determine teaching quality; trajectory length and error-recovery loops are not decisive factors either.
EGS captures a fundamental reasoning behavior that students must acquire to act reliably, benefiting both inference and training.
Students can learn useful behavioral patterns from high targeted observation ratio trajectories even when those trajectories fail the task, decoupling process quality from outcome.

Terminal-Lego Pipeline

Terminal-Lego converts real StackOverflow issues into Docker-verified agentic terminal tasks through three main stages, with an iterative validation sub-loop inside Stage 2.

Stage 1 — Source Collection. We sample StackOverflow questions from 90+ domains, requiring accepted answers and vote-based quality filtering.
Stage 2 — Cascaded Task Construction. Each question is converted into a Terminal-Bench-style task through dependent generation of instruction, environment, solution, Dockerfile, and tests.
Stage 2.5 — Test Validation Loop. We first catch syntax errors, then use an independent LLM to review five common test failure modes. Failed reviews are fed back for regeneration for up to 3 rounds.
Stage 3 — Docker Round-Trip Verification. Each task must pass full containerized checks (build image, run solution, run tests), and we keep only tasks with positive post-solution reward.

Pipeline overview: StackOverflow issues are filtered, converted through cascaded generation with test-validation feedback, and retained after Docker round-trip verification. The dataset spans 90+ technical domains with 15K+ verified tasks.

Domain Coverage

Terminal-Lego spans 90+ technical domains across 13 categories sourced from StackOverflow.

Terminal-Bench 2.0 Results

Using only 15K trajectories, Terminal-Lego achieves competitive performance with models trained on 30x more data.

Model	Size	Post-Training Data	Scaffold	TB 2.0
Official Leaderboard References
GPT-5.5	--	--	Codex CLI	82.0
Claude Opus 4.6	--	--	Terminus-2	62.9
GLM-5	744B	--	Terminus-2	52.4
DeepSeek-V3.2	685B	--	Terminus-2	39.6
GPT-5-Mini	--	--	Terminus-2	24.0
Post-Trained Student Models
Qwen3-8B (backbone)	8B	--	Terminus-2	2.5
TerminalTraj-Qwen2.5-Coder-7B	7B	50.7K	Terminus-2	10.1
Nemotron-Terminal-Qwen3-8B	8B	490.5K	Terminus-2	13.0
Terminal-Lego-Qwen3-8B	8B	15.3K	Terminus-2	11.8 (+9.3)
Qwen3-32B (backbone)	32B	--	Terminus-2	3.4
TerminalTraj-Qwen2.5-Coder-32B	32B	50.7K	Terminus-2	22.0
Nemotron-Terminal-Qwen3-32B	32B	490.5K	Terminus-2	27.4
Terminal-Lego-Qwen3-32B	32B	15.3K	Terminus-2	24.3 (+20.9)

Key Takeaways

Prioritize Paradigm over Performance — When selecting teacher models for distillation, prioritize those with rigorous inspect-act-verify habits (high TOR) rather than those with the highest benchmark scores.

Observation as the Core Reasoning Primitive — Environment-grounded observation is not merely a safety check, but the fundamental bridge between perception and action that students must learn.

The Value of Methodological Failure — Sufficiently capable students can extract reusable interaction patterns from imperfect trajectories, decoupling process quality from outcome.

Citation

@misc{yang2026terminallego,
  title={What Makes Interaction Trajectories Effective for Training Terminal Agents?},
  author={Sidi Yang and Chaofan Tao and Jierun Chen and Tiezheng Yu and Ruoyu Wang and Yuxin Jiang and Yiming Du and Wendong Xu and Jing Xiong and Taiqiang Wu and Lifeng Shang and Xiaohui Li and Ngai Wong and Haoli Bai},
  year={2026},
  eprint={2606.03461},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2606.03461}
}