# Evaluate on Terminal-Bench v2

[Terminal-Bench v2](https://github.com/laude-institute/terminal-bench-2) is an 89-task suite of long-horizon terminal tasks. Each task ships its own Docker image, resource profile, agent / verifier timeout budgets, and `solution` + `tests` directories.

This page shows how to:

1. Preprocess Terminal-Bench v2 into a Uni-Agent parquet.
2. Sanity-check the parquet by running the gold solutions.
3. Run parallel inference with a model and collect rewards.

The runnable scripts live under `examples/data_preprocess/terminal_bench_v2.py` and `examples/agent_interaction/`.

**Reference result:**

| **Model**       | **Inference Config**                      | **Uni-Agent**     |
| --------------- | ----------------------------------------- |:-----------------:|
| Qwen3.6-35B-A3B | temp=1.0, top_p=0.95, tp=8, 200K context  | **42.53** (Avg@1) |

---

## Why a dedicated guide

Unlike SWE-Bench, where every task shares the same image family, tool list, and timeout budget, **every Terminal-Bench v2 task has its own deployment config** — different Docker image, CPU / memory request, `agent.timeout_sec`, `verifier.timeout_sec`. Encoding all of that into one YAML would mean either one YAML per task or a long stack of `--override` flags.

Uni-Agent solves this with **per-sample agent config**: the preprocessing script emits a complete `tools_kwargs` (env + reward + interaction + tools) into each parquet row's `extra_info`, and `UniAgentLoop._init_config` deep-merges those over the agent-loop YAML at run time. For Terminal-Bench v2 the YAML degrades to a thin shell that only carries `_target_`, `name`, `concurrency`, and `log_dir`; everything else comes from the dataset row.

See `examples/data_preprocess/terminal_bench_v2.py` for the exact `tools_kwargs` shape.

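
To make the deep-merge idea concrete, here is a minimal sketch of how per-sample values could be layered over a thin YAML shell. It is not the actual `UniAgentLoop._init_config` code: the helper `deep_merge` and the example field values (`image`, `cpu`, `action_timeout`) are hypothetical, and the real merge may treat lists or missing keys differently.

```python
from copy import deepcopy
from typing import Any, Mapping


def deep_merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> dict[str, Any]:
    """Return a new dict in which `override` wins, recursing into nested mappings."""
    merged = deepcopy(dict(base))
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            # Both sides are mappings: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and new keys: the per-sample value replaces the base.
            merged[key] = deepcopy(value)
    return merged


# Thin YAML shell, using the keys named in the text above.
yaml_config = {
    "name": "swe_agent",
    "_target_": "uni_agent.agent_loop.UniAgentLoop",
    "concurrency": 128,
    "log_dir": "/tmp/terminal_bench_eval",
}

# Hypothetical per-sample tools_kwargs; the values are invented for illustration.
row_tools_kwargs = {
    "env": {"deployment": "modal", "image": "example/image:tag", "cpu": 2},
    "interaction": {"action_timeout": 900},
    "reward": {"name": "terminal_bench_v2"},
}

effective = deep_merge(yaml_config, row_tools_kwargs)
print(sorted(effective))
```

The merged result keeps the YAML's run-level keys and gains the row's `env`, `interaction`, and `reward` sections, while neither input dict is mutated.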

---

## Step 1: Preprocess the dataset

The preprocessor clones [`terminal-bench-2`](https://github.com/laude-institute/terminal-bench-2) at a pinned commit, then for each task packs `solution/` and `tests/` into deterministic tar.gz blobs and writes one parquet row with:

- `extra_info.tools_kwargs.env` — full `AgentEnvConfig` (modal deployment, per-task image, CPU / memory, timeouts, `env_variables`).
- `extra_info.tools_kwargs.reward` — `name="terminal_bench_v2"`, `metadata` (task config + solution / tests archives + workdir), `eval_timeout`.
- `extra_info.tools_kwargs.interaction` — `action_timeout` (= `task.agent.timeout_sec`) and a generous `max_turns` safety net.
- `extra_info.tools_kwargs.tools` — `execute_bash`, `str_replace_editor`, `submit`.

Currently only the Modal deployment backend is supported:

```bash
DEPLOYMENT=modal python examples/data_preprocess/terminal_bench_v2.py \
    --local-save-dir ~/data/swe_agent
```

This writes `~/data/swe_agent/terminal_bench_v2_modal.parquet`. Two tasks (`qemu-alpine-ssh`, `qemu-startup`) are currently skipped because Modal sandbox creation does not work for them; the remaining 87 rows are included.

---

## Step 2: Verify the parquet with gold solutions

Before spending GPU time on inference, run the included gold solutions through the same Modal deployment + reward spec to confirm the parquet is healthy. `parallel_verify_terminal_bench.py` starts each task's sandbox, applies the gold `solve.sh`, runs `test.sh`, and aggregates pass / fail / timeout counts:

```bash
python examples/agent_interaction/parallel_verify_terminal_bench.py \
    --data-path ~/data/swe_agent/terminal_bench_v2_modal.parquet \
    --num-workers 8
```

Useful flags:

- `--limit N` — only verify the first `N` rows (smoke test).
- `--task-ids id1,id2` — verify a specific subset by `task_id`.

A healthy parquet should resolve essentially all tasks.
Anything in the `fail_tle` (verifier did not complete) bucket points to a deployment or timeout config problem rather than a model problem.

---

## Step 3: Run parallel inference

Once the parquet verifies, run the agent loop with `parallel_infer.py`. The matching agent-loop YAML is intentionally minimal because the parquet carries the per-task config:

```yaml
# examples/agent_interaction/agent_config_terminal_bench.yaml
- name: swe_agent
  _target_: uni_agent.agent_loop.UniAgentLoop
  concurrency: 128
  log_dir: /tmp/terminal_bench_eval
  mask_abnormal_exit_traj: false
```

Submit the inference job (Qwen3.6-35B-A3B at 200K context is the reference config above):

```bash
ray job submit --no-wait \
    --runtime-env $RAY_DATA_HOME/data/swe_agent/runtime_env.yaml \
    --working-dir . \
    -- python3 examples/agent_interaction/parallel_infer.py \
    --data-path $RAY_DATA_HOME/data/swe_agent/terminal_bench_v2_modal.parquet \
    --agent-config-path examples/agent_interaction/agent_config_terminal_bench.yaml \
    --model-path $RAY_DATA_HOME/models/Qwen3.6-35B-A3B --tp 8 \
    --prompt-length 8192 \
    --response-length 204800 \
    --temperature 1.0 --top-p 0.95 --n 1 \
    --num-workers 8 --nnodes 1
```

Notes:

- `concurrency` (in the YAML) and `--num-workers` together bound how many Modal sandboxes are alive at once. Modal's per-account sandbox quota is usually the binding constraint — start low and ramp up.
- The dataset's `interaction.action_timeout` is already set to each task's declared `agent.timeout_sec`; do not override it from the CLI unless you intend to truncate task budgets.
- Per-task trajectories, rewards, and logs land under `log_dir//` (one directory per sample).

---

## Where to look next

- `uni_agent/reward/terminal_bench.py` — the `terminal_bench_v2` reward spec (uploads gold / tests archives, runs `test.sh`, parses `reward.json`).
- `uni_agent/agent_loop.py` — `UniAgentLoop._init_config`, the merge between the YAML and per-sample `tools_kwargs`.
- `examples/agent_interaction/agent_config_modal.yaml` vs. `agent_config_terminal_bench.yaml` — contrast a "YAML carries defaults" agent config with a "YAML is a thin shell" one.
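
Step 1 mentions that `solution/` and `tests/` are packed into deterministic tar.gz blobs. As a rough illustration of what "deterministic" can mean here, the sketch below packs a directory so that the output bytes depend only on file names, contents, and permission bits. The function name `pack_dir_deterministic` is hypothetical and the approach (sorted entries, zeroed timestamps and ownership) is an assumption, not the preprocessor's actual implementation.

```python
import gzip
import io
import os
import tarfile
import tempfile
from pathlib import Path


def pack_dir_deterministic(src: Path) -> bytes:
    """Pack `src` into tar.gz bytes that do not depend on mtimes or file owners."""
    buf = io.BytesIO()
    # mtime=0 keeps the gzip header identical across runs.
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=0) as gz:
        with tarfile.open(fileobj=gz, mode="w") as tar:
            # Sorted traversal fixes the member order.
            for path in sorted(src.rglob("*")):
                info = tar.gettarinfo(str(path), arcname=path.relative_to(src).as_posix())
                # Zero out metadata that varies between machines and checkouts.
                info.mtime = 0
                info.uid = info.gid = 0
                info.uname = info.gname = ""
                if info.isreg():
                    with path.open("rb") as fh:
                        tar.addfile(info, fh)
                else:
                    tar.addfile(info)
    return buf.getvalue()


# Small demo: packing twice, with a changed mtime in between, compares the bytes.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tests").mkdir()
    (root / "tests" / "test.sh").write_text("#!/bin/bash\nexit 0\n")
    first = pack_dir_deterministic(root)
    os.utime(root / "tests" / "test.sh", (1_000_000, 1_000_000))
    second = pack_dir_deterministic(root)
    print(first == second)
```

Stable archive bytes make the parquet reproducible across preprocessing runs, which is useful when comparing datasets by hash.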