How well can AI assemble a machine? The first benchmark results.
We gave three AI workflows the same box of parts and the same picture of a finished machine, and asked each to build it, with no instructions on how. We then graded all three the same way. Each placed every part unaided; none of the resulting assemblies is buildable yet.
1. The task — the goal, not the method
The target is the gantry frame of M3-CRETE, an open-source large-format 3D printer / CNC frame: a ~2-metre machine of roughly 100 parts — aluminium extrusions, gantry plates, V-wheels, belts, pulleys, and seven NEMA-23 motors.
Each AI received only three things: the parts kit (the authored CAD files), one reference image of the finished machine, and a short list of design constraints. Critically, we did not give the build sequence. The original human-guided build specified an inside-out order (X axis, then Y, then Z-posts) along with detailed steps; here we withheld all of that. The AI had to work out how to reach the pictured result on its own.
That design choice is the point: ARB tests whether a model understands the mechanical system — here, a V-slot-and-roller gantry, a basic and well-documented pattern — not whether it can follow a recipe. A fairness wall kept each run honest: the driver worked only from the brief and kit, never the reference solution.
2. The grader — one checker, any tool
Every build was graded by CADCLAW, our open-source engine, as a black box: it loads the exported STEP file — no roles, no metadata — labels each solid by its shape signature, and runs three host-agnostic gates:
- Inventory — are all the right parts present, in the right count?
- Interference — do any two solids overlap (a collision)?
- Floating — is every part connected, or is something loose?
Because it grades the exported file rather than a vendor's feature tree, the same procedure judges Fusion, CadQuery, and any AI agent on equal footing. The result feeds the ARB capability ladder (L0–L7) and a buildability factor: interference scales the score down, because a self-intersecting frame cannot be built. A clean build earns an L1 sub-grade of 100; the ladder then maps onto the TRL / MRL / IRL readiness scales industry already uses.
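The three gates can be illustrated with a toy model. The sketch below is not CADCLAW's implementation: it assumes each solid is reduced to a labelled axis-aligned box, and the names `Box`, `inventory_gate`, `interference_gate`, and `floating_gate` are hypothetical, chosen only for illustration. A real grader works on the exact solids in the STEP file.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a solid: an axis-aligned box, in millimetres."""
    label: str
    lo: tuple  # (x, y, z) minimum corner
    hi: tuple  # (x, y, z) maximum corner


def overlap_volume(a, b):
    """Volume shared by two boxes; 0.0 when they only touch or are apart."""
    vol = 1.0
    for i in range(3):
        extent = min(a.hi[i], b.hi[i]) - max(a.lo[i], b.lo[i])
        if extent <= 0:
            return 0.0
        vol *= extent
    return vol


def touches(a, b, tol=1e-6):
    """True when the boxes touch or overlap on every axis, within tol."""
    return all(min(a.hi[i], b.hi[i]) - max(a.lo[i], b.lo[i]) >= -tol
               for i in range(3))


def inventory_gate(parts, expected):
    """Pass when label counts match the expected bill of materials exactly."""
    return Counter(p.label for p in parts) == Counter(expected)


def interference_gate(parts):
    """Return (clip_count, total_overlap_volume) over all pairs of solids."""
    clips = [v for a, b in combinations(parts, 2)
             if (v := overlap_volume(a, b)) > 0]
    return len(clips), sum(clips)


def floating_gate(parts):
    """Pass when every part is reachable from the first one through contacts."""
    if not parts:
        return True
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(len(parts)):
            if j not in seen and touches(parts[i], parts[j]):
                seen.add(j)
                stack.append(j)
    return len(seen) == len(parts)


# Two 1 m beams joined end-to-end: touching, no shared volume.
butt = [Box("beam", (0, 0, 0), (1000, 40, 40)),
        Box("beam", (1000, 0, 0), (2000, 40, 40))]
# The same beams with a 100 mm overlap: one clip.
clipped = [Box("beam", (0, 0, 0), (1000, 40, 40)),
           Box("beam", (900, 0, 0), (1900, 40, 40))]

print(inventory_gate(butt, {"beam": 2}))  # True
print(interference_gate(butt))            # (0, 0)
print(interference_gate(clipped))         # (1, 160000.0)
print(floating_gate(butt))                # True
```

In this toy model, contact (zero shared volume) satisfies the floating gate while any positive shared volume counts as a clip, which is the distinction between an end-to-end joint and a coincident one.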
3. The three drivers
Each ran in a fresh, prompt-only session (no project memory, no answer key):
- Claude-Fusion — Claude Opus 4.7 driving a live Autodesk Fusion via its MCP connection.
- Claude-CadQuery — Claude Opus 4.7 writing a CadQuery (Python) build script.
- Codex-CadQuery — OpenAI Codex (GPT-5), the coding agent (not the chat app), also writing CadQuery.
4. Results
| Track | Parts | Interference | L1 / 100 | Full-stack | Time | Attempts |
|---|---|---|---|---|---|---|
| CADCLAW reference (answer key) | 100/100 | ✓ clean | 100 | 15.0 | — | — |
| Claude-CadQuery | 100/100 | ✗ 35 clips · 248 cm³ | 15.2 | 6.52 | 48.8 min | 9 |
| Claude-Fusion | 100/100 | ✗ 44 clips · 323 cm³ | 12.4 | 6.24 | 33.7 min | 12 |
| Codex-CadQuery (OpenAI Codex) | 100/100 | ✗ 50 clips · 288 cm³ | 11.9 | 6.19 | 13.0 min | 1 |
All three got every part in and connected, which is a real result. None cleared interference, so none is buildable. The spread is the story: OpenAI Codex one-shot the build in 13 minutes (one attempt, no retries), about 2.6 times faster than Claude-Fusion and about 3.8 times faster than Claude-CadQuery, but with the most clips and the lowest score. Claude-CadQuery ran the longest (nine attempts over 48.8 minutes, re-extracting the real plate hole patterns from the STEP files) and made the cleanest build. Claude-Fusion made the most attempts (12) and landed between the other two on clip count and score. More self-review buys accuracy at the cost of time and tokens, and the benchmark makes that trade visible. (Claude token use, recovered from the run transcripts: 22M and 34M billed; Codex did not expose its count.)
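The time comparison can be checked directly from the results table. The snippet below is only a worked calculation on the published figures; the dictionary layout and variable names are ours.

```python
# Figures copied from the results table (time in minutes).
runs = {
    "Claude-CadQuery": {"minutes": 48.8, "attempts": 9, "clips": 35, "l1": 15.2},
    "Claude-Fusion": {"minutes": 33.7, "attempts": 12, "clips": 44, "l1": 12.4},
    "Codex-CadQuery": {"minutes": 13.0, "attempts": 1, "clips": 50, "l1": 11.9},
}

# Wall-clock time of each run relative to the fastest one.
fastest = min(runs, key=lambda name: runs[name]["minutes"])
slowdown = {name: round(r["minutes"] / runs[fastest]["minutes"], 2)
            for name, r in runs.items()}

for name, ratio in slowdown.items():
    print(f"{name}: {ratio}x the time of {fastest}")
# Claude-CadQuery: 3.75x the time of Codex-CadQuery
# Claude-Fusion: 2.59x the time of Codex-CadQuery
# Codex-CadQuery: 1.0x the time of Codex-CadQuery
```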
These low scores sit on top of a real result. From a single image and a parts box, with no build sequence, every model produced a coherent ~100-part machine, and one did it in 13 minutes on the first attempt, with no retries. The scores are low because the bar is a buildable assembly, not one that looks about right. The remaining gap is precise, non-overlapping placement, and that is what the benchmark tracks as the models improve.
5. What went wrong — the same failure class
The grader skips accessories (motors, belts, wheels), so every flagged clip is structural metal overlapping structural metal. All three shared the same mistakes: beams placed coincident at the splice joints and post/frame junctions instead of end-to-end, and the centred insert rails overlapping the beams they splice. They got the inventory and the topology right, but not the precise, non-overlapping placement.
There was also an error the current gates do not yet catch: in every run, the Z-posts were turned the wrong way relative to the reference. A part that is present, connected, and non-overlapping but mis-oriented passes all three gates today. That is a real defect and a clear next step: an orientation/pose gate, which will grade builds more strictly still.
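One way such a gate could work is to compare each part's rotation against the reference and flag any part whose relative rotation exceeds a tolerance. The sketch below is an assumption about a possible design, not the planned CADCLAW gate: `rotation_angle_deg`, `pose_gate`, and the 1-degree tolerance are hypothetical, and a production check would also need to account for part symmetry (for example, a profile that looks identical after a half turn).

```python
import math


def rotation_angle_deg(r_ref, r_build):
    """Angle of the rotation that takes the reference pose to the built pose.

    Both arguments are 3x3 rotation matrices given as nested lists.
    """
    # trace(R_ref^T @ R_build) equals the sum of elementwise products.
    trace = sum(r_ref[i][j] * r_build[i][j]
                for i in range(3) for j in range(3))
    # Clamp to guard against rounding just outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


def pose_gate(r_ref, r_build, tol_deg=1.0):
    """Pass when the built orientation is within tol_deg of the reference."""
    return rotation_angle_deg(r_ref, r_build) <= tol_deg


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# A quarter turn about the Z axis, e.g. a post rotated about its own length.
quarter_turn_z = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]

print(pose_gate(identity, identity))        # True
print(pose_gate(identity, quarter_turn_z))  # False
```

The quarter-turn example is illustrative only; the source does not say by how much, or about which axis, the Z-posts were mis-oriented.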
6. Watch them build it
Both CadQuery drivers produced their own review renders as they worked. The Fusion driver inspected via live screenshots that weren't saved — so we recovered its build order afterward by driving the live model through its MCP, revealing the placed parts in sequence under a fixed camera (the design carries no parametric timeline).
7. Limitations & honesty
- A tooling-maturity bias to disclose. CADCLAW and its placement resolver were first built around CadQuery, before the Fusion connection (MCP) existed. The CadQuery runs may therefore carry a home-field edge. We flag it so the comparison stays honest; an orientation gate and more reference tasks will tighten it.
- The reference "100" is our own method. The clean baseline is the spec-driven build (our home workflow) shown as the target, not a neutral third party. We publish the grader and the benchmark scaffold openly so anyone can reproduce — or challenge — these scores.
- Verification is not physical certification. ARB grades geometry and buildability, not real-world performance.
- The scorer is early and version-pinned; effort metrics (time, tokens, attempts) are reported separately and never folded into the artifact score.
- One task, three models — a starting line, not a verdict. More drivers and tasks are coming.
- To our knowledge no prior benchmark grades whole-assembly buildability tool-independently; pairwise-joint datasets (JoinABLe, AutoMate) are the nearest neighbours and address a sub-problem.
8. Why it matters, and what's next
Until recently there was no way to measure this. An AI that can drive CAD, and an automatic checker that confirms whether the result is correct, only recently became usable together. With both in place, the same task can be scored the same way for any tool or model, so improvement is measurable rather than anecdotal.
Next on the curve: a top local model as a low-capability anchor, more frontier drivers, the orientation gate, and additional reference tasks — each plotted as capability vs. ARB score. We are publishing the ladder, the scoring, and the readiness correlation as a candidate standard and inviting labs, CAD vendors, and standards bodies to measure against it.
Reproduce or review: the benchmark scaffold (prompt, scoring, run procedures) and the CADCLAW engine are open. Method overview: cadclaw.io/benchmark. Engine: GitHub. Reviewers welcome — info@sunn3d.com.