How well can AI assemble a machine? The first benchmark results.

We gave three AI workflows the same box of parts and the same picture of a finished machine, and asked each to build it, with no instructions on how. We then graded all three the same way. Each placed all of the parts unaided, but none of the resulting assemblies is buildable yet.

[Figure: Score versus build time for three AI runs; all near the floor, OpenAI Codex fastest at 13 minutes.]
Three AIs, one ~100-part machine each. A clean build scores 100; all three sit well below it. OpenAI Codex was fastest by a wide margin.

1. The task — the goal, not the method

The target is the gantry frame of M3-CRETE, an open-source large-format 3D printer / CNC frame: a ~2-metre machine of roughly 100 parts — aluminium extrusions, gantry plates, V-wheels, belts, pulleys, and seven NEMA-23 motors.

Each AI received three things: the parts kit (the authored CAD files), one reference image of the finished machine, and a short list of design constraints. Critically, we did not give the build sequence. The original human-guided build specified an inside-out order (X axis, then Y, then Z-posts) and detailed steps; here we withheld all of that. The AI had to figure out how to reach the pictured result on its own.

[Figure: The reference image of the finished M3-CRETE frame given to each AI.]
The goal image: the only picture each AI was given. No steps, just this plus the box of parts.

That design choice is the point: ARB tests whether a model understands the mechanical system — here, a V-slot-and-roller gantry, a basic and well-documented pattern — not whether it can follow a recipe. A fairness wall kept each run honest: the driver worked only from the brief and kit, never the reference solution.

2. The grader — one checker, any tool

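Every build was graded by CADCLAW, our open-source engine, as a black box: it loads the exported STEP file (no roles, no metadata), labels each solid by its shape signature, and runs three host-agnostic gates. In short, the gates check that every part is present, that the parts are connected, and that no solids interfere.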

Because it grades the exported file rather than a vendor's feature tree, the same procedure judges Fusion, CadQuery, and any AI agent on equal footing. The result feeds the ARB capability ladder (L0–L7) and a buildability factor: interference scales the score down, because a self-intersecting frame cannot be built. A clean build earns an L1 sub-grade of 100; the ladder then maps onto the TRL / MRL / IRL readiness scales industry already uses.
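The three-gate idea can be illustrated with a toy model. The sketch below is not CADCLAW's implementation: it stands in axis-aligned boxes for real STEP solids, and the names `Box`, `overlap_volume`, `touches`, and `run_gates`, as well as the 0.01 mm contact tolerance, are assumptions made for illustration only.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a solid: an axis-aligned box in mm."""
    name: str
    lo: tuple  # (x, y, z) minimum corner
    hi: tuple  # (x, y, z) maximum corner


def overlap_volume(a, b):
    """Volume shared by two boxes in mm^3; 0.0 if they only touch or are apart."""
    vol = 1.0
    for i in range(3):
        extent = min(a.hi[i], b.hi[i]) - max(a.lo[i], b.lo[i])
        if extent <= 0:
            return 0.0
        vol *= extent
    return vol


def touches(a, b, tol=0.01):
    """True if the boxes overlap or lie within tol of each other on every axis."""
    return all(
        a.lo[i] <= b.hi[i] + tol and b.lo[i] <= a.hi[i] + tol for i in range(3)
    )


def run_gates(parts, expected_count):
    # Gate 1 (inventory): is the expected number of parts present?
    inventory_ok = len(parts) == expected_count

    # Gate 2 (connectivity): flood-fill the "touches" graph from the first part.
    seen, stack = set(), ([0] if parts else [])
    while stack:
        i = stack.pop()
        if i in seen:
            continue
        seen.add(i)
        stack.extend(
            j for j in range(len(parts))
            if j not in seen and touches(parts[i], parts[j])
        )
    connected_ok = len(seen) == len(parts)

    # Gate 3 (interference): list every pair that shares volume.
    clips = []
    for a, b in combinations(parts, 2):
        vol = overlap_volume(a, b)
        if vol > 0:
            clips.append((a.name, b.name, vol))

    return {"inventory": inventory_ok, "connected": connected_ok, "clips": clips}
```

In this toy model, two beams placed end-to-end pass all three gates, while sliding one beam halfway into the other keeps inventory and connectivity green but produces a clip. That is the pattern the three runs showed.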

[Figure: The ARB pipeline from inputs to AI driver to exported STEP to grader to score.]
One task, any tool, one grader. The only variable is the driver.

3. The three drivers

Three drivers took part: Claude-CadQuery, Claude-Fusion, and Codex-CadQuery (OpenAI Codex). Each ran in a fresh, prompt-only session (no project memory, no answer key).

4. Results

| Track | Parts | Interference | L1 / 100 | Full-stack | Time | Attempts |
|---|---|---|---|---|---|---|
| CADCLAW reference (answer key) | 100/100 | ✓ clean | 100 | 15.0 | - | - |
| Claude-CadQuery | 100/100 | ✗ 35 clips · 248 cm³ | 15.2 | 6.52 | 48.8 min | 9 |
| Claude-Fusion | 100/100 | ✗ 44 clips · 323 cm³ | 12.4 | 6.24 | 33.7 min | 12 |
| Codex-CadQuery (OpenAI Codex) | 100/100 | ✗ 50 clips · 288 cm³ | 11.9 | 6.19 | 13.0 min | 1 |

All three got every part in and connected, which is a real result. None cleared interference, so none is buildable. The spread is the story: OpenAI Codex one-shot the build in 13 minutes (one attempt, no retries), roughly 2.6 times faster than Claude-Fusion and 3.8 times faster than Claude-CadQuery, but with the most clips and the lowest score. Claude-CadQuery spent the longest iterating (nine attempts over 48.8 minutes, re-extracting the real plate hole patterns from the STEP files) and made the cleanest build. More self-review buys accuracy at the cost of time and tokens, and the benchmark makes that trade visible. (Claude token use, recovered from the run transcripts: 22M and 34M billed; Codex did not expose its count.)
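The speed gap can be checked directly from the build times in the results table. The variable names below are illustrative; the numbers are the reported times in minutes.

```python
# Build times in minutes, as reported in the results table.
codex_min = 13.0
claude_min = {"Claude-CadQuery": 48.8, "Claude-Fusion": 33.7}

# How many times longer each Claude run took than the Codex run.
speedup = {name: round(t / codex_min, 2) for name, t in claude_min.items()}

for name, ratio in speedup.items():
    print(f"Codex was {ratio}x faster than {name}")
```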

These low scores sit on top of a real result. From a single image and a parts box, with no build sequence, every model produced a coherent ~100-part machine, and one did it in 13 minutes on the first attempt, with no retries. The scores are low because the bar is a buildable assembly, not one that looks about right. The remaining gap is precise, non-overlapping placement, and that is what the benchmark tracks as the models improve.

5. What went wrong — the same failure class

The grader skips accessories (motors, belts, wheels), so every flagged clip is structural metal overlapping structural metal. All three shared the same mistakes: beams placed coincident at the splice joints and post/frame junctions instead of end-to-end, and the centred insert rails overlapping the beams they splice. They got the inventory and the topology right, but not the precise, non-overlapping placement.
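The splice-joint mistake is easy to see in one dimension. The sketch below treats each beam as an interval along its own axis and multiplies any axial overlap by a cross-section area to get a clipped volume. The 400 mm² solid cross-section and the beam lengths are made-up illustration values, not M3-CRETE dimensions, and the function names are hypothetical.

```python
def axial_overlap_mm(a, b):
    """Length in mm over which two collinear beams (start, end) occupy the same space."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def clip_volume_cm3(a, b, section_mm2):
    """Overlap volume in cm^3, assuming a solid cross-section of section_mm2."""
    return axial_overlap_mm(a, b) * section_mm2 / 1000.0  # mm^3 -> cm^3


beam_a = (0.0, 1000.0)

# Correct: the second beam starts where the first one ends.
end_to_end = (1000.0, 2000.0)

# Faulty: the second beam starts 100 mm inside the first one.
overlapping = (900.0, 1900.0)

print(clip_volume_cm3(beam_a, end_to_end, 400.0))   # no shared volume
print(clip_volume_cm3(beam_a, overlapping, 400.0))  # shared volume, flagged as a clip
```

A placement error of a few centimetres per joint, repeated across dozens of joints, adds up to clipped volumes of the order seen in the results.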

There was also an error the current gates do not yet catch: in every run, the Z-posts were turned the wrong way versus the reference. A part that is present, connected, and non-overlapping but mis-oriented passes all three gates today. That is a real defect and a clear next step: an orientation/pose gate, which will make the ranking stricter still.
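One possible shape for such an orientation gate is sketched below. It is an assumption about how a check of this kind could work, not a description of a planned CADCLAW feature: each part is reduced to a single direction vector, and the gate flags any part whose vector deviates from the reference by more than a tolerance. The part names, vectors, and the 5-degree tolerance are hypothetical.

```python
import math


def angle_deg(u, v):
    """Angle in degrees between two non-zero 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos_t = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos_t))


def orientation_failures(built, reference, tol_deg=5.0):
    """Names of parts whose direction differs from the reference by more than tol_deg."""
    return [
        name for name, axis in built.items()
        if angle_deg(axis, reference[name]) > tol_deg
    ]


# Hypothetical directions: the reference post's marked face points along +X,
# while the built post has been turned so the same face points along +Y.
reference = {"z_post_1": (1, 0, 0), "x_beam": (1, 0, 0)}
built = {"z_post_1": (0, 1, 0), "x_beam": (1, 0, 0)}

print(orientation_failures(built, reference))
```

A single vector per part cannot detect rotation about that vector, so a fuller pose check would compare two perpendicular directions or a complete rotation matrix.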

6. Watch them build it

Both CadQuery drivers produced their own review renders as they worked. The Fusion driver inspected via live screenshots that weren't saved — so we recovered its build order afterward by driving the live model through its MCP, revealing the placed parts in sequence under a fixed camera (the design carries no parametric timeline).

[Figure: Claude-Fusion build progression from 10 to 100 parts.]
Claude-Fusion, 10 → 100 parts, recovered from the live model via the Fusion MCP.

[Figure: Grid of in-process review renders generated by the Claude-CadQuery driver.]
Claude-CadQuery's own in-process checks. These orthographic + isometric renders are the human-reviewable output that lets a watcher stop a bad run early to save tokens.

7. Limitations & honesty

8. Why it matters, and what's next

Until recently there was no way to measure this. An AI that can drive CAD, and an automatic checker that confirms whether the result is correct, only recently became usable together. With both in place, the same task can be scored the same way for any tool or model, so improvement is measurable rather than anecdotal.

Next on the curve: a top local model as a low-capability anchor, more frontier drivers, the orientation gate, and additional reference tasks — each plotted as capability vs. ARB score. We are publishing the ladder, the scoring, and the readiness correlation as a candidate standard and inviting labs, CAD vendors, and standards bodies to measure against it.

Reproduce or review: the benchmark scaffold (prompt, scoring, run procedures) and the CADCLAW engine are open. Method overview: cadclaw.io/benchmark. Engine: GitHub. Reviewers welcome — info@sunn3d.com.