Prime Radiant

Scraping and analyzing submissions to Terminal Bench 2.0

Terminal Bench 2.0 is an influential benchmark that tests coding agents against a set of 89 different coding problems in order to compare their performance.

A coding agent is an LLM combined with an agent harness - software that prompts the LLM and then executes tools such as Bash commands in a loop to help write and test code in order to achieve a goal.

Terminal Bench maintains a terminal-bench@2.0 Leaderboard showing the most recent top-ranked model+agent combos. Current first place, as of 12th March 2026, is ForgeCode running against either GPT-5.4 or Claude Opus 4.6, achieving an 81.8% score with both of those models.

We wanted to explore the Terminal Bench results in more detail, and keep track of how that leaderboard changes over time.

So we turned to Git scraping, a pattern where you snapshot information into a Git repository and watch the differences to see what has changed. This pattern is an excellent fit for GitHub Actions where the scraper can run on a schedule and commit changes back to its own repo.

Accessing the Terminal Bench data

The Terminal Bench 2.0 leaderboard is managed as a Hugging Face dataset - an enormous repository containing 111GB of data.

The reason it's so large is that each submission to Terminal Bench includes the full output of the agent across at least five runs for every one of the 89 challenges. That's 89 * 5 = 445 runs minimum per submission.

For our purposes we don't need complete details of every run - we just need the high level results. Thankfully these are stored in result.json files, for example this one.

Here's a lightly annotated copy of the key fields:

{
    // Unique identifier for this trial run
    "id": "8706a96d-52db-4347-9d3a-124c84443769",
    // The benchmark task that was attempted
    "task_name": "nginx-request-logging",
    // Trial name = task name + random suffix to distinguish repeat runs
    "trial_name": "nginx-request-logging__iVDUKjC",
    // Local filesystem URI where the trial's working directory lived during execution
    "trial_uri": "file:///Users/aleks.petrov/Projects/startups/pilot/pilot-bench/jobs/pilot-cc-v35-k5/nginx-request-logging__iVDUKjC",
    // Pointer to the exact task definition in the benchmark repo
    "task_id": {
        "git_url": "https://github.com/laude-institute/terminal-bench-2.git",
        "git_commit_id": "69671fbaac6d67a7ef0dfec016cc38a64ef7a77c",  // pinned commit for reproducibility
        "path": "nginx-request-logging"  // subdirectory within the repo containing this task
    },
    // Which benchmark suite this task belongs to
    "source": "terminal-bench",
    // SHA-256 hash of the task definition — detects if the task changed between runs
    "task_checksum": "86a8bd681301002456da831adfae62fa7a538e8187d654e11335e92b81b0c2b3",
    // Full configuration used for this trial run
    "config": {
        "task": {
            "path": "nginx-request-logging",
            "git_url": "https://github.com/laude-institute/terminal-bench-2.git",
            "git_commit_id": "69671fbaac6d67a7ef0dfec016cc38a64ef7a77c",
            "overwrite": false,
            "download_dir": null,
            "source": "terminal-bench"
        },
        "trial_name": "nginx-request-logging__iVDUKjC",
        "trials_dir": "jobs/pilot-cc-v35-k5",
        // Timeout multipliers scale the default timeouts for each phase
        "timeout_multiplier": 1.0,               // global multiplier
        "agent_timeout_multiplier": 9.0,          // agent gets 9x the base timeout
        "verifier_timeout_multiplier": null,       // null = use default
        "agent_setup_timeout_multiplier": 5.0,     // setup phase gets 5x
        "environment_build_timeout_multiplier": null,
        // Agent configuration — which AI system is being evaluated
        "agent": {
            "name": null,
            "import_path": "pilot_agent:PilotAgent",        // Python import path
            "model_name": "anthropic/claude-opus-4-6",
            "override_timeout_sec": null,
            "override_setup_timeout_sec": null,
            "max_timeout_sec": null,
            "kwargs": {},
            "env": {
                // Auth token passed to the agent's environment (redacted in practice)
                "CLAUDE_CODE_OAUTH_TOKEN": "sk-ant-oat01-..."
            }
        },
        // Sandbox environment where the task runs
        "environment": {
            "type": "modal",            // runs on Modal (cloud compute platform)
            "import_path": null,
            "force_build": false,       
            "delete": true,             
            "override_cpus": null,      
            "override_memory_mb": null,
            "override_storage_mb": null,
            "override_gpus": null,
            "suppress_override_warnings": false,
            "kwargs": {}
        },
        // Verifier settings — the automated grader that checks the agent's work
        "verifier": {
            "override_timeout_sec": null,
            "max_timeout_sec": null,
            "disable": false
        },
        "artifacts": [],
        "job_id": "2091aa0a-be8e-40f8-96c1-4b56588bac52"
    },
    // Metadata about the agent being evaluated (used for leaderboard display)
    "agent_info": {
        "name": "pilot-real",           // submission/agent name
        "version": "unknown",
        "model_info": {
            "name": "claude-opus-4-6",  // underlying model
            "provider": "anthropic"
        }
    },
    // Token usage and cost from the agent's LLM calls
    "agent_result": {
        "n_input_tokens": 33,
        "n_cache_tokens": null, // cache hits (not tracked here)
        "n_output_tokens": 4449,
        "cost_usd": null,
        "rollout_details": null,
        "metadata": null
    },
    // The score! reward=1.0 means the agent fully solved the task
    "verifier_result": {
        "rewards": {
            "reward": 1.0  // 1.0 = pass, 0.0 = fail (can be partial)
        }
    },
    // null means no errors — if the trial crashed, exception details appear here
    "exception_info": null,
    // Wall-clock timestamps for the entire trial (start to finish)
    "started_at": "2026-03-28T01:01:34.435892Z",
    "finished_at": "2026-03-28T01:18:41.832461Z",
}

111GB is too much to periodically pull into a GitHub Actions run - but we actually just need those result.json files.

We built a script that fetched the full list of files in the repo from the Hugging Face API, then filtered that list down to just the result.json files and fetched those.
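That listing-and-filtering step can be sketched with the huggingface_hub library. This is not our actual script - the repo id and helper names below are illustrative assumptions:

```python
def is_result_file(path: str) -> bool:
    """Keep only the per-trial result.json files."""
    return path.endswith("result.json")


def fetch_results(repo_id: str) -> list[str]:
    """List every file in the dataset repo, then download just the results."""
    # huggingface_hub is a third-party dependency: pip install huggingface_hub
    from huggingface_hub import HfApi, hf_hub_download

    api = HfApi()
    all_files = api.list_repo_files(repo_id, repo_type="dataset")
    return [
        hf_hub_download(repo_id, filename=f, repo_type="dataset")
        for f in all_files
        if is_result_file(f)
    ]


# Usage (hits the network, so not run here):
# paths = fetch_results("harborframework/terminal-bench-2-leaderboard")
```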

Even this proved expensive to run, so we moved to recording the last commit hash we had seen and then asking for a list of files that had changed since that commit. This was efficient enough for us to run on a scheduled basis in GitHub Actions.
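One way to implement that incremental step is to record the last-seen commit hash in a state file and diff against it. The git-based diffing and function names here are assumptions, sketching the pattern rather than reproducing our script:

```python
import subprocess


def changed_result_files(diff_output: str) -> list[str]:
    """Filter `git diff --name-only` output down to the result.json files."""
    return [
        line.strip()
        for line in diff_output.splitlines()
        if line.strip().endswith("result.json")
    ]


def files_changed_since(repo_dir: str, last_commit: str, new_commit: str) -> list[str]:
    """Ask git which files changed between two commits of a local checkout."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "diff", "--name-only", f"{last_commit}..{new_commit}"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return changed_result_files(out)
```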

Our terminal-bench-analysis GitHub repo now contains a subset of harborframework/terminal-bench-2-leaderboard - just the result.json files. The whole repo is 128MB when checked out - not tiny, but a whole lot more manageable than 111GB!

Loading the results into SQLite

SQLite is my hammer, to which everything else looks like a nail. I showed Claude Code the documentation for my sqlite-utils Python library and challenged it to build a script that would scan all of those result.json files and load them into a set of tables.

The JSON files have a pretty obvious mapping to relational tables and Claude did a great job of setting this up. So what can we learn from the data now that it's in SQLite?
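The loading step can be sketched with sqlite-utils like this. The flattened columns and the trials table name are illustrative guesses, not the schema Claude actually produced:

```python
import json
from pathlib import Path


def flatten_trial(record: dict) -> dict:
    """Pull the fields we query most often into flat columns (illustrative subset)."""
    return {
        "id": record["id"],
        "task_name": record["task_name"],
        "agent_name": record["agent_info"]["name"],
        "model_name": record["agent_info"]["model_info"]["name"],
        "reward": record["verifier_result"]["rewards"]["reward"],
        "started_at": record["started_at"],
        "finished_at": record["finished_at"],
    }


def load_results(db_path: str, results_dir: str):
    """Scan for result.json files and upsert them into a trials table."""
    # sqlite-utils is a third-party dependency: pip install sqlite-utils
    import sqlite_utils

    db = sqlite_utils.Database(db_path)
    rows = (
        flatten_trial(json.loads(p.read_text()))
        for p in Path(results_dir).rglob("result.json")
    )
    # upserting on the trial id keeps repeated scraper runs idempotent
    db["trials"].upsert_all(rows, pk="id")
    return db
```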

I had Claude use my Showboat tool to make detailed notes as it explored the data. Showboat lets a coding agent build a Markdown file that combines notes with the output of terminal commands, providing a detailed linear document showing exactly what the agent did while exploring a problem.

Here's the initial Showboat report Claude Code prepared for us.

I also ran my Datasette web application against the SQLite database so I could explore the data further with my own SQL queries.

Here's one of those queries - this one shows the leaderboard in terms of the agent/model combinations with the highest scores:

select submission, n_trials, n_passed, n_failed, n_errored, avg_reward
from submission_stats
order by avg_reward desc
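The submission_stats table that query reads from is part of the schema Claude built; an aggregate along these lines could produce it. Everything here - the flat trials table, its columns, the grouping - is an illustrative guess, shown with stdlib sqlite3:

```python
import sqlite3

# Illustrative guess at how a submission_stats view could be derived from a
# flat trials table -- the real schema in our database may differ.
CREATE_VIEW = """
create view submission_stats as
select
    agent_name || '__' || model_name as submission,
    count(*) as n_trials,
    sum(reward = 1.0) as n_passed,
    sum(reward = 0.0 and exception is null) as n_failed,
    sum(exception is not null) as n_errored,
    avg(reward) as avg_reward
from trials
group by submission
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table trials (agent_name text, model_name text, reward real, exception text);
insert into trials values
    ('pilot-real', 'claude-opus-4-6', 1.0, null),
    ('pilot-real', 'claude-opus-4-6', 0.0, null),
    ('BashAgent', 'TermiGen-32B', 0.0, 'Timeout');
""")
conn.execute(CREATE_VIEW)
rows = conn.execute(
    "select submission, n_trials, n_passed, n_failed, n_errored, avg_reward "
    "from submission_stats order by avg_reward desc"
).fetchall()
for row in rows:
    print(row)
```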

Building and deploying the database with GitHub Actions

Up to this point I had been working on my own laptop, but now had a proof of concept strong enough to justify automating everything and deploying the results.

GitHub offers free static file hosting in the form of GitHub Pages. This can be deployed to via GitHub Actions, which means we can deploy files that were generated from the contents of our repository without checking those generated files into the repository itself.

The SQLite database was an ideal candidate for that. I configured GitHub Actions to build that and deploy it as a static file, now available at this URL:

https://primeradiant.com/terminal-bench-analysis/terminal-bench.db

A database on its own isn't much use without a tool to query it. Datasette Lite is a version of my Python application that runs entirely in the browser, using Python compiled to WebAssembly (Pyodide). I modified the GitHub Actions build to grab a copy of Datasette Lite and configure it to load the hosted SQLite database, which gave us this interactive UI for running queries:

https://primeradiant.com/terminal-bench-analysis/datasette-lite.html#/terminal-bench

Using Datasette Lite in this way means we can provide links to directly execute queries and view the results by adding ?sql= to the end of the URL.
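Building one of those links is just a matter of URL-encoding the query and appending it. A minimal sketch, assuming the fragment layout shown in the URL above:

```python
from urllib.parse import quote

# Base URL from above; treat the exact fragment layout as an assumption.
BASE = "https://primeradiant.com/terminal-bench-analysis/datasette-lite.html#/terminal-bench"


def query_link(sql: str) -> str:
    """Build a Datasette Lite link that executes the given SQL on page load."""
    return f"{BASE}?sql={quote(sql)}"


print(query_link("select * from submission_stats order by avg_reward desc"))
```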

Here's a link to run that top agent/model scores query listed above.

Automating the README with Cog

The earlier experiment with Showboat had generated dozens of useful queries for exploring the data. I decided it would be neat to include those queries in the repository README, along with Markdown tables showing their results.

I decided to use Cog for this. Cog is a Python tool that lets you embed Python code directly in comments in a Markdown file, then execute that code to regenerate the file with dynamically generated content.

You can also define functions and use them later on. Here's what that looks like in our README:

<!--[[[cog
run_sql("""
    select submission, n_trials, n_passed, n_failed, n_errored, avg_reward
    from submission_stats
    order by avg_reward desc
""")
]]]-->
...
<!--[[[end]]]-->

When Cog executes, the ... between those two comments is replaced by the output of the function. The run_sql(sql) function outputs three things: the SQL itself in a ```sql syntax-highlighted block, a link to execute that SQL statement in Datasette Lite, and a full Markdown table of the results of the query.
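A simplified stand-in for run_sql, built on stdlib sqlite3. The real function lives in the repo and emits its output through Cog's cog.out(); this sketch just returns the same three pieces as a string:

```python
import sqlite3
from urllib.parse import quote

FENCE = "`" * 3  # avoids embedding a literal triple-backtick in this example
DATASETTE_LITE = (
    "https://primeradiant.com/terminal-bench-analysis/"
    "datasette-lite.html#/terminal-bench"
)


def run_sql_markdown(conn: sqlite3.Connection, sql: str) -> str:
    """Render a query as: a sql code block, a Datasette Lite link, a Markdown table."""
    cursor = conn.execute(sql)
    headers = [col[0] for col in cursor.description]
    lines = [f"{FENCE}sql\n{sql.strip()}\n{FENCE}", ""]
    lines.append(f"[Run this query]({DATASETTE_LITE}?sql={quote(sql.strip())})")
    lines.append("")
    lines.append("| " + " | ".join(headers) + " |")
    lines.append("|" + "|".join(" --- " for _ in headers) + "|")
    for row in cursor:
        lines.append("| " + " | ".join(str(v) for v in row) + " |")
    return "\n".join(lines)
```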

At build time GitHub Actions runs this command, which executes all of the SQL and rebuilds the file:

uv run --with cogapp cog -r README.md

Commit messages as an Atom feed

A neat, little-known feature of GitHub is that any timeline of commits on the site can be accessed as an Atom feed by adding .atom to the URL - which means you can subscribe to them using feed reader software such as NetNewsWire.

The feed shows the full commit message, so the more detailed your commit message the better.

For newly submitted benchmark reports there are some obviously interesting questions to answer: which model and harness were benchmarked, what the result was, and how they now fare in the overall rankings.

So I had Claude write a script to generate a useful commit message by analyzing the latest entries, then backfilled a few earlier commit messages (via a git filter-branch operation and a force push) to use the new format. They now look like this:

Add 941 trials for 2 submissions: BashAgent__TermiGen-32B, grok-cli__grok-4.20-0309-reasoning

grok-cli__grok-4.20-0309-reasoning: 62.8% pass, #29/45
BashAgent__TermiGen-32B: 19.7% pass, #44/45
Top 5 unchanged (pilot-real__claude-opus-4-6 leads at 83.1%)

And that's the whole project - entirely run out of GitHub Actions, so there's no additional infrastructure we need to maintain ourselves. I hope you find it as useful as we do.

If you want to explore it in more detail, I suggest pointing your coding agent of choice at the GitHub Actions workflow file and asking it to dig in and figure out how everything works.

Simon Willison
Researcher