Evaluate on Terminal-Bench v2
Terminal-Bench v2 is an 89-task suite of long-horizon terminal tasks. Each task ships its own Docker image, resource profile, agent / verifier timeout budgets, and solution + tests directories.
This page shows how to:
Preprocess Terminal-Bench v2 into a Uni-Agent parquet.
Sanity-check the parquet by running the gold solutions.
Run parallel inference with a model and collect rewards.
The runnable scripts are examples/data_preprocess/terminal_bench_v2.py and the scripts under examples/agent_interaction/.
Reference result:
| Model | Inference Config | Uni-Agent |
|---|---|---|
| Qwen3.6-35B-A3B | temp=1.0, top_p=0.95, tp=8, 200K context | 42.53 (Avg@1) |
Why a dedicated guide
Unlike SWE-Bench, where every task shares the same image family, tool list, and timeout budget, every Terminal-Bench v2 task has its own deployment config: a different Docker image, CPU / memory request, agent.timeout_sec, and verifier.timeout_sec. Expressing all of that in static config would mean either one YAML file per task or a long stack of --override flags.
Uni-Agent solves this with per-sample agent config: the preprocessing script emits a complete tools_kwargs (env + reward + interaction + tools) into each parquet row’s extra_info, and UniAgentLoop._init_config deep-merges it over the agent-loop YAML at run time. For Terminal-Bench v2 the YAML shrinks to a thin shell that only carries _target_, name, concurrency, and log_dir; everything else comes from the dataset row. See examples/data_preprocess/terminal_bench_v2.py for the exact tools_kwargs shape.
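To make the merge semantics concrete, here is a minimal sketch of a recursive deep merge in which per-row values win over YAML values. It is not the code of UniAgentLoop._init_config; the function name deep_merge, the example defaults, and the field values (image name, timeouts, max_turns) are all hypothetical and chosen only for illustration.

```python
from copy import deepcopy
from typing import Any, Mapping


def deep_merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> dict[str, Any]:
    """Return a new dict in which values from `override` win over `base`.

    Nested mappings are merged key by key; any other value (scalar, list)
    is replaced wholesale by the override. Neither input is mutated.
    """
    merged = deepcopy(dict(base))
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical YAML-side config that carries some defaults.
yaml_defaults = {
    "name": "swe_agent",
    "concurrency": 128,
    "interaction": {"max_turns": 100, "action_timeout": 300},
}

# Hypothetical per-row tools_kwargs; the values are made up for illustration.
row_tools_kwargs = {
    "env": {"deployment": "modal", "image": "example/task-image:latest"},
    "interaction": {"action_timeout": 900},
}

effective = deep_merge(yaml_defaults, row_tools_kwargs)
print(effective["interaction"])
```

In this sketch the row overrides action_timeout while the YAML's max_turns survives, which is the behaviour that lets a thin YAML coexist with fully specified rows.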
Step 1: Preprocess the dataset
The preprocessor clones terminal-bench-2 at a pinned commit, then for each task packs solution/ and tests/ into deterministic tar.gz blobs and writes one parquet row with:
extra_info.tools_kwargs.env: full AgentEnvConfig (modal deployment, per-task image, CPU / memory, timeouts, env_variables).
extra_info.tools_kwargs.reward: name="terminal_bench_v2", metadata (task config + solution / tests archives + workdir), eval_timeout.
extra_info.tools_kwargs.interaction: action_timeout (= task.agent.timeout_sec) and a generous max_turns safety net.
extra_info.tools_kwargs.tools: execute_bash, str_replace_editor, submit.
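The "deterministic tar.gz" step mentioned above is worth a short illustration, since a naive archive embeds timestamps and owner IDs that change between runs. The following is a sketch of one common way to get byte-identical archives using only the standard library. It is an assumption about the general technique, not the preprocessor's actual code, and pack_deterministic is a hypothetical name.

```python
import gzip
import io
import tarfile
from pathlib import Path


def pack_deterministic(src_dir: Path) -> bytes:
    """Pack regular files under `src_dir` into tar.gz bytes.

    The output depends only on relative file names, file contents, and the
    executable bit: entries are sorted, and timestamps / owners are zeroed.
    """
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as tar:
        for path in sorted(p for p in src_dir.rglob("*") if p.is_file()):
            data = path.read_bytes()
            info = tarfile.TarInfo(name=path.relative_to(src_dir).as_posix())
            info.size = len(data)
            # Keep only "executable or not"; drop other permission bits.
            info.mode = 0o755 if path.stat().st_mode & 0o111 else 0o644
            info.mtime = 0
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))

    out = io.BytesIO()
    # mtime=0 keeps the gzip header free of the current time.
    with gzip.GzipFile(fileobj=out, mode="wb", mtime=0) as gz:
        gz.write(raw.getvalue())
    return out.getvalue()
```

Packing the same directory twice with this function yields identical bytes, so the resulting parquet rows are stable across reruns and easy to diff or hash. Empty directories are not recorded in this sketch.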
Currently only the Modal deployment backend is supported:
DEPLOYMENT=modal python examples/data_preprocess/terminal_bench_v2.py \
--local-save-dir ~/data/swe_agent
This writes ~/data/swe_agent/terminal_bench_v2_modal.parquet. Two tasks (qemu-alpine-ssh, qemu-startup) are currently skipped because Modal sandbox creation does not work for them; the remaining 87 rows are included.
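Before running the gold solutions, a cheap structural check can confirm that every row carries the four tools_kwargs sections listed above. The helper below is a sketch: missing_sections is a hypothetical name, and it assumes each row's extra_info has already been loaded as a plain dict.

```python
from typing import Any, Mapping

# The four sections the preprocessor is described as writing per row.
REQUIRED_SECTIONS = ("env", "reward", "interaction", "tools")


def missing_sections(extra_info: Mapping[str, Any]) -> list[str]:
    """Return the tools_kwargs sections that are absent (or None) in one row."""
    tools_kwargs = extra_info.get("tools_kwargs") or {}
    return [name for name in REQUIRED_SECTIONS if tools_kwargs.get(name) is None]


# Toy rows for illustration only; real rows come from the parquet file.
complete_row = {
    "tools_kwargs": {
        "env": {},
        "reward": {"name": "terminal_bench_v2"},
        "interaction": {},
        "tools": [],
    }
}
broken_row = {"tools_kwargs": {"env": {}}}

print(missing_sections(complete_row))
print(missing_sections(broken_row))
```

One way to apply it, assuming pandas with a parquet engine is installed, is to read the file with pandas.read_parquet, confirm that the frame has 87 rows, and call missing_sections on each value of the extra_info column.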
Step 2: Verify the parquet with gold solutions
Before spending GPU time on inference, run the included gold solutions through the same Modal deployment + reward spec to confirm the parquet is healthy. parallel_verify_terminal_bench.py starts each task’s sandbox, applies the gold solve.sh, runs test.sh, and aggregates pass / fail / timeout counts:
python examples/agent_interaction/parallel_verify_terminal_bench.py \
--data-path ~/data/swe_agent/terminal_bench_v2_modal.parquet \
--num-workers 8
Useful flags:
--limit N: only verify the first N rows (smoke test).
--task-ids id1,id2: verify a specific subset by task_id.
A healthy parquet should resolve essentially all tasks. Anything in the fail_tle (verifier did not complete) bucket points to a deployment or timeout config problem rather than a model problem.
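The triage rule above can be sketched as a small tally over per-task outcomes. Only the fail_tle bucket name comes from this page; the other status strings ("pass", "fail"), the task IDs, and the function name triage are hypothetical, and the verify script's real output format may differ.

```python
from collections import Counter
from typing import Mapping


def triage(results: Mapping[str, str]) -> tuple[Counter, list[str]]:
    """Tally outcomes and list the tasks whose verifier did not complete.

    `results` maps task_id -> status string. Tasks in the "fail_tle" bucket
    are returned separately because they suggest a deployment or timeout
    config problem rather than a bad gold solution.
    """
    counts = Counter(results.values())
    config_suspects = sorted(tid for tid, status in results.items() if status == "fail_tle")
    return counts, config_suspects


# Hypothetical outcomes for three made-up task IDs.
example = {"task-a": "pass", "task-b": "fail_tle", "task-c": "pass"}
counts, config_suspects = triage(example)
print(dict(counts), config_suspects)
```

Re-running only the suspects with --task-ids after adjusting the deployment or timeout settings is usually faster than repeating the full sweep.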
Step 3: Run parallel inference
Once the parquet verifies, run the agent loop with parallel_infer.py. The matching agent-loop YAML is intentionally minimal because the parquet carries the per-task config:
# examples/agent_interaction/agent_config_terminal_bench.yaml
- name: swe_agent
  _target_: uni_agent.agent_loop.UniAgentLoop
  concurrency: 128
  log_dir: /tmp/terminal_bench_eval
  mask_abnormal_exit_traj: false
Submit the inference job (Qwen3.6-35B-A3B at 200K context is the reference config above):
ray job submit --no-wait \
--runtime-env $RAY_DATA_HOME/data/swe_agent/runtime_env.yaml \
--working-dir . \
-- python3 examples/agent_interaction/parallel_infer.py \
--data-path $RAY_DATA_HOME/data/swe_agent/terminal_bench_v2_modal.parquet \
--agent-config-path examples/agent_interaction/agent_config_terminal_bench.yaml \
--model-path $RAY_DATA_HOME/models/Qwen3.6-35B-A3B --tp 8 \
--prompt-length 8192 \
--response-length 204800 \
--temperature 1.0 --top-p 0.95 --n 1 \
--num-workers 8 --nnodes 1
Notes:
concurrency (in the YAML) and --num-workers together bound how many Modal sandboxes are alive at once. Modal’s per-account sandbox quota is usually the binding constraint, so start low and ramp up.
The dataset’s interaction.action_timeout is already set to each task’s declared agent.timeout_sec; do not override it from the CLI unless you intend to truncate task budgets.
Per-task trajectories, rewards, and logs land under log_dir/<run_id>/ (one directory per sample).
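The reference table reports Avg@1 and the command above passes --n 1. As a sketch of how such a score could be computed from collected rewards, the function below assumes Avg@n means the average over tasks of each task's mean reward across its n samples, expressed as a percentage. That definition, the function name avg_at_n, and the in-memory input format are assumptions; this page does not specify how rewards are stored on disk.

```python
from statistics import mean
from typing import Mapping, Sequence


def avg_at_n(rewards: Mapping[str, Sequence[float]]) -> float:
    """Average over tasks of the per-task mean reward, as a percentage.

    `rewards` maps task_id -> rewards of that task's n samples
    (a single-element sequence when n == 1).
    """
    if not rewards:
        raise ValueError("no rewards to aggregate")
    return 100.0 * mean(mean(samples) for samples in rewards.values())


# Toy data with made-up task IDs: two of four tasks solved with n = 1.
toy_rewards = {"t1": [1.0], "t2": [0.0], "t3": [1.0], "t4": [0.0]}
score = avg_at_n(toy_rewards)
print(f"Avg@1 = {score:.2f}")
```

Averaging per task first keeps every task equally weighted even if some tasks end up with fewer completed samples than others.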
Where to look next
uni_agent/reward/terminal_bench.py: the terminal_bench_v2 reward spec (uploads gold / tests archives, runs test.sh, parses reward.json).
uni_agent/agent_loop.py: UniAgentLoop._init_config, the merge between the YAML and per-sample tools_kwargs.
examples/agent_interaction/agent_config_modal.yaml vs. agent_config_terminal_bench.yaml: contrast a “YAML carries defaults” agent config with a “YAML is a thin shell” one.