Assembly Readiness Benchmark

AI can assemble machines in CAD. The Assembly Readiness Benchmark (ARB) is the missing yardstick for whether the result is correct and buildable: a tool-independent, automated benchmark graded by CADCLAW and mapped onto the readiness scales industry already trusts (TRL, MRL, IRL).

See the readiness chart · Plain-English summary · CADCLAW on GitHub

White paper — review copies on request: info@sunn3d.com

Capability ladder: L0–L7

Alignment failures (L1): 0

Black-box geometry grade: 95/100

Parts placed & verified: ~100

The gap

AI-CAD benchmarking is an active field — there are strong public benchmarks for generating geometry (ABC, DeepCAD, Fusion 360 Gallery, CADBench, Text2CAD-Bench), classifying parts (MCB), and predicting how two parts join (JoinABLe, AutoMate). None grade the thing a builder needs.

No public benchmark asks whether a complete, multi-part machine an AI assembled is correct and buildable — right parts, no collisions, nothing floating, every interface aligned across the whole system — judged automatically, independent of the authoring tool, and tied to a readiness level. That is the niche ARB fills.

Honest scope. Joint-prediction datasets score part pairs; generation benchmarks score single-part fidelity; none map to TRL/MRL/IRL. Prior-art survey current as of 2026-05.

The ladder

The benchmark is a ladder. Each rung is defined by the next thing an automated verifier must be able to prove — you can only benchmark what you can verify.

L0 · Component

One part is exactly as specified.

L1 · Assemble the kit (today)

Parts placed, aligned, no collisions, nothing floating. The Model T.

L2 · Constraint-robust

Re-solves correctly when parameters change.

L3 · Mechanically valid

Full-travel kinematics + load-holding, measured.

L4 · Engineering change

Re-design to a new requirement, no regressions.

L5 · Design from intent

Pick & arrange parts from functional goals. Tesla.

L6 · Optimize & invent

Provably beat the human baseline, multi-physics.

L7 · Autonomous loop

Design → build → measure → certify → self-improve.

Today the bar is L1. L6–L7 are not achievable by anyone yet — that gap is the measure of the road.

Capability × Readiness

The most useful artifact ARB offers: every rung mapped onto TRL (Technology Readiness), MRL (Manufacturing Readiness), and IRL (Integration Readiness), so anyone who already speaks readiness levels can place an AI result instantly.

Figure: The CADCLAW capability ladder (L0–L7) mapped to TRL / MRL / IRL. The bands are indicative evidence bands, not equivalence.

Two axes, one bridge. TRL/MRL/IRL measure a machine's readiness, assigned by an authority. The ladder measures the AI's capability. CADCLAW is the bridge: it auto-generates the integration evidence those gates consume. The tightest native fit is IRL — CADCLAW's checks (interference, alignment, floating) are integration-readiness evidence.

What it does not claim. CADCLAW supports a readiness assessment; it does not assign a TRL. Upper bands also need operational and production data that design-time verification cannot supply alone.

How the benchmark runs

One task, any tool, one grader. Every AI workflow gets the same prompt + kit and is judged on its exported STEP by the same automated, tool-independent gates — the score is the only thing that differs.

Figure: The ARB pipeline. Inputs → AI driver → exported STEP → CADCLAW gates → ARB score + readiness → human review, with a read-fix-rerun loop.
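The "one task, any tool, one grader" loop can be sketched in a few lines of Python. This is a minimal illustration, not CADCLAW's harness: `Task`, `run_benchmark`, and the driver and grader signatures are hypothetical names, and the stubs only stand in for a real AI workflow and the real gates.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical signatures (assumptions, not CADCLAW's API):
# a driver maps (prompt, kit) to the path of the STEP file it exported,
# and a grader maps that path to a score.
Driver = Callable[[str, Tuple[str, ...]], str]
Grader = Callable[[str], float]


@dataclass(frozen=True)
class Task:
    prompt: str
    kit: Tuple[str, ...]


def run_benchmark(task: Task, drivers: Dict[str, Driver], grader: Grader) -> Dict[str, float]:
    """Give every driver the identical task and grade each export with one grader."""
    scores = {}
    for name, driver in drivers.items():
        step_path = driver(task.prompt, task.kit)  # the only tool-specific step
        scores[name] = grader(step_path)           # same gates for every driver
    return scores


# Stubs so the sketch runs without any CAD tool installed.
def stub_driver_a(prompt, kit):
    return "a.step"


def stub_driver_b(prompt, kit):
    return "b.step"


def stub_grader(step_path):
    return {"a.step": 1.0, "b.step": 0.5}[step_path]


task = Task(prompt="Assemble the pictured frame", kit=("beam", "post"))
scores = run_benchmark(task, {"A": stub_driver_a, "B": stub_driver_b}, stub_grader)
```

Because the grader only ever sees the exported file path, nothing about the authoring tool can leak into the score in this arrangement.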

The metrics

Every run yields the same small, defined set of numbers — two kinds, never mixed: artifact quality (is the build correct?) and effort (what did it cost?).

Gates

Black-box, on the exported STEP: Inventory (right parts) · Interference (no overlaps) · Floating (all connected) · Orientation (correct pose — v-next).
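To make the three current gates concrete, here is a toy sketch that runs them on axis-aligned bounding boxes instead of real STEP geometry. Everything in it is an assumption for illustration: the `Box` type, the function names, and the tolerances are hypothetical, and the actual CADCLAW gate algorithms are not published on this page.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box standing in for one placed part (units: cm)."""
    part: str
    lo: Vec3
    hi: Vec3


def _extents(a: Box, b: Box) -> List[float]:
    # Per-axis length of the shared interval; negative means a gap on that axis.
    return [min(a.hi[i], b.hi[i]) - max(a.lo[i], b.lo[i]) for i in range(3)]


def overlap_volume(a: Box, b: Box) -> float:
    dx, dy, dz = _extents(a, b)
    return dx * dy * dz if min(dx, dy, dz) > 0 else 0.0


def inventory_ok(boxes: Sequence[Box], expected_kit: Sequence[str]) -> bool:
    """Inventory gate: exactly the expected parts, no more and no fewer."""
    return Counter(b.part for b in boxes) == Counter(expected_kit)


def interference_clips(boxes: Sequence[Box], min_volume: float = 1e-9):
    """Interference gate: every pair whose volumes overlap, with the overlap size."""
    clips = []
    for (i, a), (j, b) in combinations(enumerate(boxes), 2):
        volume = overlap_volume(a, b)
        if volume > min_volume:
            clips.append((i, j, volume))
    return clips


def floating_parts(boxes: Sequence[Box], tol: float = 1e-6) -> List[int]:
    """Floating gate: parts not connected, by touching or overlapping, to part 0."""
    if not boxes:
        return []
    reached, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j, other in enumerate(boxes):
            if j not in reached and all(e >= -tol for e in _extents(boxes[i], other)):
                reached.add(j)
                stack.append(j)
    return [k for k in range(len(boxes)) if k not in reached]


frame = [
    Box("beam", (0, 0, 0), (10, 1, 1)),
    Box("beam", (10, 0, 0), (20, 1, 1)),     # touches the first beam face to face
    Box("post", (5, 0.5, 0), (6, 1.5, 5)),   # cuts into the first beam
    Box("nut", (50, 50, 50), (51, 51, 51)),  # connected to nothing
]
```

In this toy frame the two beams touch without overlapping, the post clips the first beam, and the nut floats. Bounding boxes over-report interference for non-box shapes, which is one reason a real gate would work on the exported solids.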

L1 sub-grade · 0–100

How well the kit is assembled, × a buildability factor that falls with interference (clip count + overlap volume). A self-intersecting frame can't be built, so it can't score high.
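The shape of that calculation can be sketched as a quality term multiplied by a factor that equals 1 for a clean build and shrinks as clips and overlap volume grow. The functional form and both scale constants below are invented for illustration; the page does not publish the real formula, and this sketch is not expected to reproduce the published scores.

```python
def buildability_factor(
    clips: int,
    overlap_cm3: float,
    clip_scale: float = 10.0,         # hypothetical constant
    volume_scale_cm3: float = 100.0,  # hypothetical constant
) -> float:
    """Return 1.0 for zero interference and fall toward 0 as interference grows."""
    if clips < 0 or overlap_cm3 < 0:
        raise ValueError("clips and overlap_cm3 must be non-negative")
    return 1.0 / (1.0 + clips / clip_scale + overlap_cm3 / volume_scale_cm3)


def l1_sub_grade(assembly_quality: float, clips: int, overlap_cm3: float) -> float:
    """Scale a 0-100 assembly-quality term by the buildability factor."""
    return assembly_quality * buildability_factor(clips, overlap_cm3)


clean = l1_sub_grade(100.0, clips=0, overlap_cm3=0.0)
near_miss = l1_sub_grade(100.0, clips=2, overlap_cm3=5.0)
mess = l1_sub_grade(100.0, clips=40, overlap_cm3=300.0)
```

Any monotonically decreasing factor gives the same ordering property: a clean build keeps its full quality score, and a near-miss outranks a mess.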

ARB full-stack · 0–100

Where a run sits on the L0–L7 capability ladder, expressed as a 0–100 score. A clean L1 build scores about 15 today; the rest of the scale is the road to autonomous design. The score is mapped to TRL / MRL / IRL.

Effort (reported separately)

Wall-clock, tokens, attempts, retries, corrections, human interventions — measured from the run log, never folded into the score.
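One way to keep the two kinds of numbers from mixing is to store them in separate records and rank on the artifact record alone. The class and field names below are hypothetical; the values are the three AI rows from the head-to-head table on this page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Artifact:
    l1: float          # L1 sub-grade, 0-100
    full_stack: float  # ARB full-stack score, 0-100


@dataclass(frozen=True)
class Effort:
    minutes: float
    tokens_millions: Optional[float]  # None where the table reports n/a


@dataclass(frozen=True)
class Run:
    track: str
    artifact: Artifact
    effort: Effort


def rank_by_quality(runs: List[Run]) -> List[str]:
    """Order tracks by artifact quality only; effort never enters the key."""
    return [r.track for r in sorted(runs, key=lambda r: r.artifact.l1, reverse=True)]


runs = [
    Run("Codex-CadQuery", Artifact(11.9, 6.19), Effort(13.0, None)),
    Run("Claude-CadQuery", Artifact(15.2, 6.52), Effort(48.8, 22.1)),
    Run("Claude-Fusion", Artifact(12.4, 6.24), Effort(33.7, 34.2)),
]
ranking = rank_by_quality(runs)
```

Effort stays available for a separate plot (score vs time, score vs tokens) without ever changing the ranking.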

Same task, same kit, same grader — the only variable is the driver. That makes ARB a clean performance-vs-model-capability axis: every model (frontier, local, or agentic system) is one point on the same graph.

Today — L1, on a real machine

An AI placed roughly 100 parts into a correct, two-metre 3D-printer / CNC frame:

All geometry gates pass — black-box grade 95/100, zero fails or warnings.
Independently graded — the same automated checker that will grade every other tool, with no human in the loop.

The reference above is the resolver-built answer key — a clean L1. Below: the first cross-tool head-to-head, independent AI workflows given the same task and judged by this same grader.

First head-to-head

Three independent AI workflows ran the same task: Claude Opus 4.7 driving Fusion, Claude Opus 4.7 driving CadQuery, and OpenAI Codex / GPT-5 driving CadQuery (the coding agent, not the chat app). Each ran in a fresh, prompt-only session with the same kit and was graded through the identical black-box lens. All three placed every part with zero human help; none is buildable yet.

Figure: The target M3-CRETE machine, the single reference image each AI was given. The goal was this picture plus the box of parts, with no build steps.

Figure: Score vs build time for the three AI runs. How good vs. how fast: all three sit near the floor (a clean build = 100); OpenAI Codex finished in 13 minutes, far faster than the others, with a slightly lower score.

Track	Parts	Inventory / Floating	Interference	L1 (0–100)	Full-stack (0–100)	Time	Tokens
CADCLAW reference (answer key)	100/100	✓ / ✓	✓ clean	100	15.0	-	-
Claude-CadQuery	100/100	✓ / ✓	✗ 35 clips · 248 cm³	15.2	6.52	48.8 min	22.1M
Claude-Fusion	100/100	✓ / ✓	✗ 44 clips · 323 cm³	12.4	6.24	33.7 min	34.2M
Codex-CadQuery (OpenAI Codex / GPT-5)	100/100	✓ / ✓	✗ 50 clips · 288 cm³	11.9	6.19	13.0 min	n/a
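A quick arithmetic pass over the table above shows two things. First, in all four published rows the full-stack score equals 5 + L1/10 to the printed precision; that is an observation about these rows, not a formula stated on this page. Second, the build times span a factor of about 3.75. The variable and function names in this sketch are our own.

```python
import math

# (L1 sub-grade, full-stack score, minutes) copied from the table rows.
rows = {
    "CADCLAW reference": (100.0, 15.0, None),
    "Claude-CadQuery": (15.2, 6.52, 48.8),
    "Claude-Fusion": (12.4, 6.24, 33.7),
    "Codex-CadQuery": (11.9, 6.19, 13.0),
}


def fits_observed_pattern(l1: float, full_stack: float) -> bool:
    """Check the observed relation full_stack = 5 + L1/10 within rounding."""
    return math.isclose(5.0 + l1 / 10.0, full_stack, abs_tol=0.005)


pattern_holds = all(fits_observed_pattern(l1, fs) for l1, fs, _ in rows.values())

# Build time of each AI run relative to the fastest run.
timed = {name: minutes for name, (_, _, minutes) in rows.items() if minutes is not None}
fastest = min(timed.values())
relative_time = {name: round(minutes / fastest, 2) for name, minutes in timed.items()}
```

With only four rows, the linear pattern should be read as a consistency check on the table rather than as the scoring rule.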

Every flagged clip is structural — beams overlapping at splice joints and post/frame junctions, and the centered 2040 inserts overlapping their beams. Both also placed the Z-posts in the wrong rotation versus the reference — an orientation error the current gates don't yet catch, and the next gate we're adding.

Interference is the buildability gate. A self-intersecting frame can't be built, so it scales the L1 score down — a near-miss outranks a mess. Effort (time, tokens, attempts) is reported separately, never folded into the artifact score.

Fairness note. CADCLAW and its placement resolver were first built around CadQuery, before the Fusion connection (MCP) existed — so the CadQuery runs may carry a home-field edge in tooling maturity. We flag it so the comparison stays honest; an orientation gate and more reference tasks will tighten it.

Figure: Claude-Fusion build progression, 10 → 100 parts. The model emits no parametric timeline, so we recovered the build order afterward by driving the live Fusion model through its MCP, revealing the placed parts in order under a fixed isometric camera.

Figure: Grid of Claude-CadQuery's own in-process review renders, the orthographic + isometric checks it generated as it built. This is the human-reviewable output that lets a watcher catch a bad run early and stop it to save tokens.

What we asked — and what we didn't

We gave the goal, not the method. The driver got the target — the pictured assembly plus design constraints — and the kit, but not the build sequence. The original human-guided build specified an inside-out order (X axis → Y → Z-posts) and detailed steps; here we deliberately withheld that and let the AI decide how to reach the pictured result.

So ARB really tests whether a model understands the mechanical system — here a V-slot-and-roller gantry, a basic, well-documented pattern. Placing all 100 parts but self-intersecting, with mis-oriented posts, says a model recognizes the parts but not yet how they go together.

Human-reviewable throughout. Both drivers emitted orthographic + isometric renders as they built, so a person could watch the assembly take shape and abort a bad run early to save tokens. (Fusion didn't persist its in-session views, so we recovered the progression afterward straight from the live model via its MCP — see above.)

One point on a bigger graph, and the spread is instructive. The three frontier runs took 13–49 minutes for one ~100-part assembly. The model that iterated most (reviewing its own renders, re-extracting real hole patterns) scored highest but ran longest, at 48.8 minutes; the one-shot run finished in 13.0 minutes, nearly 4× faster, and scored lowest. More careful self-review buys accuracy at the cost of time and tokens. Next on the curve: a top local model as a low-capability anchor, given the same task and the same grader, plotted as capability vs ARB score.

Why it matters — in plain English

Checking a design for mistakes is one of the most common — and costly — jobs in engineering. The longer a mistake hides, the more it costs. A widely-used rule of thumb, the 1-10-100 rule: a mistake caught while designing costs about $1 to fix; caught while building, about $10; caught after it ships, $100 or more.

Today that checking is mostly done by hand — an engineer rotating a model on screen, hunting for parts that overlap or don't line up. It's slow, easy to miss things, and it doesn't scale. The hours and the escaped mistakes are pure lost value.

Why now, and not a year ago? Two things just arrived at the same time: AI that can actually drive CAD and assemble parts, and an automated checker (CADCLAW) that can prove an assembly is right — instantly, the same way every time. Put them together and you can both build and verify by machine. ARB is the scoreboard that keeps it honest.

For engineers

A pytest-style, tool-independent check that an assembly is buildable — on every change, not just at sign-off.
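A pytest-style check of this kind could look like the sketch below. The report shape is a hypothetical schema made up for this example (CADCLAW's real output format may differ); the failing case uses the clip count and overlap volume reported for the Claude-CadQuery run.

```python
import json

# Hypothetical report shape; the field names are assumptions for this sketch.
CLEAN_REPORT = json.loads("""
{
  "inventory": {"expected": 100, "found": 100},
  "interference": {"clips": 0, "overlap_cm3": 0.0},
  "floating": {"count": 0}
}
""")


def failed_gates(report: dict) -> list:
    """Return the names of the gates a graded export fails."""
    failures = []
    if report["inventory"]["found"] != report["inventory"]["expected"]:
        failures.append("inventory")
    if report["interference"]["clips"] > 0:
        failures.append("interference")
    if report["floating"]["count"] > 0:
        failures.append("floating")
    return failures


def test_assembly_is_buildable():
    # pytest collects any function named test_*; a failing gate fails the build.
    assert failed_gates(CLEAN_REPORT) == []


clipping_report = dict(CLEAN_REPORT, interference={"clips": 35, "overlap_cm3": 248.0})
```

Run under pytest in CI, a test like this turns "is the assembly still buildable?" into a pass/fail signal on every change.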

For programs

Automated, auditable readiness evidence that plugs into the TRL/MRL/IRL gate reviews you already run.

What makes it different

ARB is not "a CAD checker." Five specifics separate it from CAD-vendor tooling and from prior benchmarks:

Grades the shipped STEP, not a vendor's feature tree. Fusion, SolidWorks, and an AI agent are scored the same way — the only fair cross-tool yardstick.
Scores the whole machine, not part pairs. System-level integrity across every interface at once, where joint/mate datasets stop at two parts.
Speaks readiness (TRL/MRL/IRL), not a private number. A score drops straight into the gate reviews programs already run.
Traces to a real, built machine. The reference target is a physical prototype that exists in metal, not only pixels.
Audits its own claims. CADCLAW ships honesty tooling and refuses to assert a head-to-head it hasn't run.

Open standard, proprietary engine. The ladder, rubric, schemas, and chart are open to measure and audit against; the gate algorithms and resolver are the engine. Adopt the standard without the IP, and trust the score because the method is open.

ARB sits on top of every CAD tool; it doesn't compete with where designs are drawn. CADCLAW reads and proposes fixes; it never edits your model. The human stays the engineer.

Read & review

We are publishing the ladder and the readiness correlation as a candidate standard and inviting the field to measure against it.

Request the white paper · CADCLAW on GitHub · How CADCLAW works

Reviewers and collaborators welcome — labs, CAD vendors, and standards bodies especially.