Assembly Readiness Benchmark
AI can assemble machines in CAD. ARB is the missing yardstick for whether the result is correct and buildable — a tool-independent, automated benchmark graded by CADCLAW and mapped onto the readiness scales industry already trusts: TRL, MRL, IRL.
White paper — review copies on request: info@sunn3d.com
The gap
AI-CAD benchmarking is an active field — there are strong public benchmarks for generating geometry (ABC, DeepCAD, Fusion 360 Gallery, CADBench, Text2CAD-Bench), classifying parts (MCB), and predicting how two parts join (JoinABLe, AutoMate). None grade the thing a builder needs.
No public benchmark asks whether a complete, multi-part machine an AI assembled is correct and buildable — right parts, no collisions, nothing floating, every interface aligned across the whole system — judged automatically, independent of the authoring tool, and tied to a readiness level. That is the niche ARB fills.
The ladder
The benchmark is a ladder. Each rung is defined by the next thing an automated verifier must be able to prove — you can only benchmark what you can verify.
L0 · Component
One part is exactly as specified.
L1 · Assemble the kit (today)
Parts placed, aligned, no collisions, nothing floating. The Model T.
L2 · Constraint-robust
Re-solves correctly when parameters change.
L3 · Mechanically valid
Full-travel kinematics + load-holding, measured.
L4 · Engineering change
Re-design to a new requirement, no regressions.
L5 · Design from intent
Pick & arrange parts from functional goals. Tesla.
L6 · Optimize & invent
Provably beat the human baseline, multi-physics.
L7 · Autonomous loop
Design → build → measure → certify → self-improve.
Today the bar is L1. L6–L7 are not achievable by anyone yet; the distance from L1 to L7 is the road still to be travelled.
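The ladder can be sketched as an ordered type. This is a minimal illustration, not part of ARB or CADCLAW: the names `Rung` and `highest_rung` are hypothetical, and the rule that a rung only counts when every rung below it is also verified is an assumption drawn from the word "ladder", not a documented scoring rule.

```python
from enum import IntEnum
from typing import Mapping, Optional


class Rung(IntEnum):
    """The eight rungs listed above, in order (identifiers are illustrative)."""
    COMPONENT = 0
    ASSEMBLE_KIT = 1
    CONSTRAINT_ROBUST = 2
    MECHANICALLY_VALID = 3
    ENGINEERING_CHANGE = 4
    DESIGN_FROM_INTENT = 5
    OPTIMIZE_AND_INVENT = 6
    AUTONOMOUS_LOOP = 7


def highest_rung(verified: Mapping[Rung, bool]) -> Optional[Rung]:
    """Return the highest rung reached, or None if L0 is not verified.

    Assumption: rungs are cumulative, so climbing stops at the first
    rung the verifier could not prove.
    """
    reached = None
    for rung in Rung:  # IntEnum iterates in definition order, L0 first
        if not verified.get(rung, False):
            break
        reached = rung
    return reached
```

Under that assumption, a run with L0 and L1 verified sits at L1 even if some higher rung happens to pass in isolation, which matches the idea that each rung builds on what the verifier can already prove.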
Capability × Readiness
The most useful artifact ARB offers: every rung mapped onto TRL (Technology Readiness), MRL (Manufacturing Readiness), and the integration-focused IRL — so anyone who already speaks readiness-levels can place an AI result instantly.
Figure: CADCLAW capability ladder ↔ TRL / MRL / IRL. The bands are indicative evidence ranges, not equivalences.
Two axes, one bridge. TRL/MRL/IRL measure a machine's readiness, assigned by an authority. The ladder measures the AI's capability. CADCLAW is the bridge: it auto-generates the integration evidence those gates consume. The tightest native fit is IRL — CADCLAW's checks (interference, alignment, floating) are integration-readiness evidence.
How the benchmark runs
One task, any tool, one grader. Every AI workflow gets the same prompt + kit and is judged on its exported STEP by the same automated, tool-independent gates — the score is the only thing that differs.
Figure: the ARB pipeline. Inputs → AI driver → exported STEP → CADCLAW gates → ARB score + readiness → human review, with a read-fix-rerun loop.
The metrics
Every run yields the same small, defined set of numbers — two kinds, never mixed: artifact quality (is the build correct?) and effort (what did it cost?).
Gates
Black-box, on the exported STEP: Inventory (right parts) · Interference (no overlaps) · Floating (all connected) · Orientation (correct pose; planned as the next gate, not yet active).
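To make the three active gates concrete, here is a toy sketch that treats every part as an axis-aligned box. It is not CADCLAW's implementation: a real grader works on the solids in the exported STEP, and the `Box` type, the function names, and the touch tolerance are all hypothetical simplifications chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Box:
    """Axis-aligned box standing in for one part's solid (toy model)."""
    name: str  # part type, e.g. "beam"
    lo: Vec3   # minimum corner, cm
    hi: Vec3   # maximum corner, cm


def overlap_volume(a: Box, b: Box) -> float:
    """Shared volume in cm^3; 0.0 when the boxes only touch or are apart."""
    vol = 1.0
    for i in range(3):
        extent = min(a.hi[i], b.hi[i]) - max(a.lo[i], b.lo[i])
        if extent <= 0:
            return 0.0
        vol *= extent
    return vol


def touches(a: Box, b: Box, tol: float = 1e-6) -> bool:
    """True when the boxes touch or overlap (gap within tol on every axis)."""
    return all(min(a.hi[i], b.hi[i]) - max(a.lo[i], b.lo[i]) >= -tol
               for i in range(3))


def inventory_gate(parts: Sequence[Box], expected: Sequence[str]) -> bool:
    """Right parts: the multiset of part types matches the kit list."""
    return Counter(p.name for p in parts) == Counter(expected)


def interference_gate(parts: Sequence[Box]) -> List[Tuple[str, str, float]]:
    """Return every interfering pair with its overlap volume; empty means clean."""
    clips = []
    for a, b in combinations(parts, 2):
        vol = overlap_volume(a, b)
        if vol > 0:
            clips.append((a.name, b.name, vol))
    return clips


def floating_gate(parts: Sequence[Box]) -> bool:
    """All connected: every part is reachable from the first through contacts."""
    if not parts:
        return True
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(len(parts)):
            if j not in seen and touches(parts[i], parts[j]):
                seen.add(j)
                stack.append(j)
    return len(seen) == len(parts)
```

Two beams placed end to end pass all three gates: they share a face, so they are connected, but their shared volume is zero. Slide one into the other and the interference gate returns the pair along with the overlap volume.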
L1 sub-grade · 0–100
Assembly quality (how well the kit is put together) multiplied by a buildability factor that falls with interference, measured as clip count plus overlap volume. A self-intersecting frame can't be built, so it can't score high.
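The shape of that product can be sketched as follows. The functional form and both weights are purely hypothetical; the text above only says that the factor falls with clip count and overlap volume, so this sketch will not reproduce any published ARB score.

```python
def buildability(clips: int, overlap_cm3: float,
                 clip_weight: float = 0.1,
                 volume_weight: float = 0.01) -> float:
    """Hypothetical factor in (0, 1]: 1.0 for a clean build, lower with interference.

    The reciprocal form and the default weights are assumptions for
    illustration, not ARB's actual definition.
    """
    if clips < 0 or overlap_cm3 < 0:
        raise ValueError("clip count and overlap volume must be non-negative")
    return 1.0 / (1.0 + clip_weight * clips + volume_weight * overlap_cm3)


def l1_subgrade(assembly_quality: float, clips: int, overlap_cm3: float) -> float:
    """Assembly quality (0-100) scaled by the buildability factor."""
    if not 0.0 <= assembly_quality <= 100.0:
        raise ValueError("assembly_quality must be within 0-100")
    return assembly_quality * buildability(clips, overlap_cm3)
```

Whatever form the real factor takes, the property that matters is the same: a clean build keeps its full assembly score, and any interference can only pull the grade down.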
ARB full-stack · 0–100
The rung reached on the L0–L7 capability ladder, expressed on a 0–100 scale. A clean L1 scores about 15 today; the rest of the scale is the road to autonomous design. Mapped to TRL / MRL / IRL.
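One observation from the head-to-head table later in this document: all four published (L1, full-stack) pairs fit the line full-stack = 5 + L1 / 10. The sketch below checks that fit. It is an observation about the published numbers only, not a documented ARB formula, and the function name is hypothetical.

```python
# (L1 sub-grade, full-stack score) pairs copied from the head-to-head table.
ROWS = {
    "CADCLAW reference": (100.0, 15.0),
    "Claude-CadQuery": (15.2, 6.52),
    "Claude-Fusion": (12.4, 6.24),
    "Codex-CadQuery": (11.9, 6.19),
}


def full_stack_from_l1(l1: float) -> float:
    """Observed fit only: full-stack = 5 + L1 / 10 (not a stated ARB rule)."""
    return 5.0 + l1 / 10.0


for track, (l1, published) in ROWS.items():
    fitted = full_stack_from_l1(l1)
    print(f"{track:18s} L1={l1:5.1f}  published={published:5.2f}  fitted={fitted:5.2f}")
```

If the fit reflects the actual scoring, a run at L1 moves within a band of roughly 5 to 15 on the full-stack scale, which is consistent with a clean L1 scoring about 15.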
Effort (reported separately)
Wall-clock, tokens, attempts, retries, corrections, human interventions — measured from the run log, never folded into the score.
Today — L1, on a real machine
An AI placed roughly 100 parts into a correct, two-metre 3D-printer / CNC frame:
- All geometry gates pass — black-box grade 95/100, zero fails or warnings.
- Independently graded — the same automated checker that will grade every other tool, with no human in the loop.
First head-to-head
Three independent AI workflows: Claude Opus 4.7 (driving Fusion and, separately, CadQuery) and OpenAI Codex / GPT-5 (driving CadQuery; the coding agent, not the chat app). Each ran in a fresh, prompt-only session with the same kit and was graded by the identical black-box gates. All three placed every part with zero human help; none is buildable yet.
Figure: the goal. This is the one reference image each AI was given: no build steps, just this picture plus the box of parts.
How good vs. how fast. All three sit near the floor of the 0–100 L1 scale, where a clean build scores 100. OpenAI Codex finished in 13 minutes, far faster than the other two, with a slightly lower score.
| Track | Parts | Inv / Float | Interference | L1 | Full‑stack | Time | Tokens |
|---|---|---|---|---|---|---|---|
| CADCLAW reference (answer key) | 100/100 | ✓ / ✓ | ✓ clean | 100 | 15.0 | — | — |
| Claude‑CadQuery | 100/100 | ✓ / ✓ | ✗ 35 clips · 248 cm³ | 15.2 | 6.52 | 48.8 min | 22.1M |
| Claude‑Fusion | 100/100 | ✓ / ✓ | ✗ 44 clips · 323 cm³ | 12.4 | 6.24 | 33.7 min | 34.2M |
| Codex‑CadQuery (OpenAI Codex / GPT-5) | 100/100 | ✓ / ✓ | ✗ 50 clips · 288 cm³ | 11.9 | 6.19 | 13.0 min | n/a |
Every flagged clip is structural — beams overlapping at splice joints and post/frame junctions, and the centered 2040 inserts overlapping their beams. Both also placed the Z-posts in the wrong rotation versus the reference — an orientation error the current gates don't yet catch, and the next gate we're adding.
Claude-Fusion, building the frame (10 → 100 parts). The model emits no parametric timeline, so we recovered the build order afterward by driving the live Fusion model through its MCP — revealing the placed parts in order under a fixed isometric camera.
Claude-CadQuery's own in-process renders — the orthographic + isometric checks it generated as it built. This is the human-reviewable output that lets a watcher catch a bad run early and stop it to save tokens.
What we asked — and what we didn't
We gave the goal, not the method. The driver got the target — the pictured assembly plus design constraints — and the kit, but not the build sequence. The original human-guided build specified an inside-out order (X axis → Y → Z-posts) and detailed steps; here we deliberately withheld that and let the AI decide how to reach the pictured result.
So ARB really tests whether a model understands the mechanical system, here a V-slot-and-roller gantry, a basic, well-documented pattern. A run that places all 100 parts yet self-intersects and mis-orients its posts shows a model that recognizes the parts but not yet how they go together.
Human-reviewable throughout. Both drivers emitted orthographic + isometric renders as they built, so a person could watch the assembly take shape and abort a bad run early to save tokens. (Fusion didn't persist its in-session views, so we recovered the progression afterward straight from the live model via its MCP — see above.)
One point on a bigger graph, and the spread is instructive. The three frontier runs took 13–49 minutes for one ~100-part assembly. The model that iterated most (reviewing its own renders, re-extracting real hole patterns) scored highest but ran longest; the one-shot run was nearly 4× faster (13.0 vs 48.8 minutes) and scored lowest. More careful self-review buys accuracy at the cost of time and tokens. Next on the curve: a top local model as a low-capability anchor, given the same task and the same grader, plotted as capability vs ARB score.
Why it matters — in plain English
Checking a design for mistakes is one of the most common, and most costly, jobs in engineering. The longer a mistake hides, the more it costs. A widely used rule of thumb, the 1-10-100 rule, puts it this way: a mistake caught while designing costs about $1 to fix; caught while building, about $10; caught after it ships, $100 or more.
Today that checking is mostly done by hand: an engineer rotating a model on screen, hunting for parts that overlap or don't line up. It's slow, it's easy to miss things, and it doesn't scale. The hours and the escaped mistakes are pure lost value.
Why now, and not a year ago? Two things just arrived at the same time: AI that can actually drive CAD and assemble parts, and an automated checker (CADCLAW) that can prove an assembly is right — instantly, the same way every time. Put them together and you can both build and verify by machine. ARB is the scoreboard that keeps it honest.
For engineers
A pytest-style, tool-independent check that an assembly is buildable — on every change, not just at sign-off.
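A pytest-style check might look like the sketch below. The report structure and `load_report` are hypothetical stand-ins; in practice the values would come from grading the exported STEP, and none of these names is part of CADCLAW's actual interface.

```python
# test_assembly.py: run with `pytest`; plain asserts, no pytest import needed.

# Hypothetical gate report. A real one would be produced by the grader
# from the exported STEP rather than hard-coded here.
REPORT = {
    "parts_expected": 100,
    "parts_found": 100,
    "floating_parts": [],  # names of parts not connected to the rest
    "clips": [],           # interfering part pairs
}


def load_report() -> dict:
    """Stand-in for reading the grader's output for the current build."""
    return REPORT


def test_inventory_complete():
    report = load_report()
    assert report["parts_found"] == report["parts_expected"]


def test_nothing_floating():
    assert load_report()["floating_parts"] == []


def test_no_interference():
    assert load_report()["clips"] == []
```

Wired into continuous integration, a file like this would fail the build on the change that introduces an overlap, rather than at sign-off.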
For programs
Automated, auditable readiness evidence that plugs into the TRL/MRL/IRL gate reviews you already run.
What makes it different
ARB is not "a CAD checker." Five specifics separate it from CAD-vendor tooling and from prior benchmarks:
- Grades the shipped STEP, not a vendor's feature tree. Fusion, SolidWorks, and an AI agent are scored the same way — the only fair cross-tool yardstick.
- Scores the whole machine, not part pairs. System-level integrity across every interface at once, where joint/mate datasets stop at two parts.
- Speaks readiness (TRL/MRL/IRL), not a private number. A score drops straight into the gate reviews programs already run.
- Traces to a real, built machine. The reference target is a physical prototype that exists in metal, not only pixels.
- Audits its own claims. CADCLAW ships honesty tooling and refuses to assert a head-to-head it hasn't run.
Read & review
We are publishing the ladder and the readiness correlation as a candidate standard and inviting the field to measure against it.
Reviewers and collaborators welcome — labs, CAD vendors, and standards bodies especially.