Now in early access

Unstructured input,
structured datasets.

Build a pipeline of LLM and code steps that turns documents, forms, and transcripts into a queryable, provenance-tracked dataset. Then iterate without re-running everything.


Built for the iteration loop

A visual pipeline editor, datasets as the output, and an engine designed so you only ever re-run what changed.

✎

Visual pipeline editor

Build a DAG of LLM, code, branch, map, and transform nodes on a canvas. Schema-typed ports, validated as you connect. The bound input and output datasets appear as fixed terminals.
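To make "validated as you connect" concrete, here is a minimal sketch of a connect-time check. It is not the Substructure API: `Port`, `Node`, `connect`, and the flat field-name-to-type-name schema are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Port:
    """Hypothetical port: a name plus a flat schema (field name -> type name)."""
    name: str
    schema: dict


@dataclass
class Node:
    """Hypothetical node with named input and output ports."""
    name: str
    inputs: dict = field(default_factory=dict)   # port name -> Port
    outputs: dict = field(default_factory=dict)  # port name -> Port


def connect(src: Node, out_port: str, dst: Node, in_port: str) -> tuple:
    """Return an edge only if the source port provides every field the destination requires."""
    produced = src.outputs[out_port].schema
    required = dst.inputs[in_port].schema
    problems = [
        f"{key}: expected {want}, got {produced.get(key, 'missing')}"
        for key, want in required.items()
        if produced.get(key) != want
    ]
    if problems:
        raise TypeError(
            f"cannot connect {src.name}.{out_port} -> {dst.name}.{in_port}: "
            + "; ".join(problems)
        )
    return (src.name, out_port, dst.name, in_port)


extract = Node("extract", outputs={"out": Port("out", {"invoice_id": "str", "total": "float"})})
summarize = Node("summarize", inputs={"in": Port("in", {"invoice_id": "str"})})

# The source provides invoice_id as a str, so the edge is accepted.
edge = connect(extract, "out", summarize, "in")
```

Rejecting a mismatched edge when it is drawn, rather than when the pipeline runs, is the point of the check: a schema error costs nothing to find.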

▦

Datasets as the output

One output row per input, keyed and replace-on-update. Every row carries provenance: source input, pipeline version, and the processing that produced it. Browse, filter, and export.
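A small sketch of what "keyed and replace-on-update" could look like in practice. The `OutputRow` and `Dataset` names and their fields are hypothetical, chosen to mirror the provenance items listed above; they are not taken from the product.

```python
from dataclasses import dataclass


@dataclass
class OutputRow:
    """Hypothetical output row carrying its own provenance."""
    key: str               # stable key derived from the input row
    data: dict             # the structured fields
    source_input: str      # provenance: which input produced this row
    pipeline_version: str  # provenance: which pipeline version ran
    processing: str        # provenance: identifier of the processing run


class Dataset:
    """Hypothetical keyed store: writing an existing key replaces the old row."""

    def __init__(self):
        self._rows = {}

    def upsert(self, row: OutputRow) -> None:
        self._rows[row.key] = row

    def get(self, key: str) -> OutputRow:
        return self._rows[key]

    def __len__(self) -> int:
        return len(self._rows)


dataset = Dataset()
dataset.upsert(OutputRow("form-001", {"total": 10.0}, "form-001.pdf", "v1", "run-a"))
# Re-processing the same input under a new pipeline version replaces the row in place.
dataset.upsert(OutputRow("form-001", {"total": 12.5}, "form-001.pdf", "v2", "run-b"))
```

Because the key comes from the input, re-processing never duplicates a row, and the provenance fields say which version produced the value currently stored.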

⚡

The iteration loop

Change a prompt, fix a bad row, swap a model, then re-run only the affected rows and nodes. A content-addressed cache means you never pay twice for upstream results.
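The idea behind a content-addressed cache can be sketched in a few lines: a node's result is stored under a hash of its configuration and inputs, so an unchanged upstream node resolves to the same key and is served from the cache. This is an illustration under assumed names (`cache_key`, `NodeCache`, `fake_llm`), not a description of how the engine is implemented.

```python
import hashlib
import json


def cache_key(node_config: dict, inputs: dict) -> str:
    """Hash the node's configuration together with its inputs (canonical JSON)."""
    payload = json.dumps({"config": node_config, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class NodeCache:
    """Hypothetical in-memory cache keyed by content hash."""

    def __init__(self):
        self._store = {}
        self.misses = 0  # counts how many times a node really had to run

    def run(self, node_config: dict, inputs: dict, fn):
        key = cache_key(node_config, inputs)
        if key not in self._store:
            self.misses += 1
            self._store[key] = fn(node_config, inputs)
        return self._store[key]


def fake_llm(config: dict, inputs: dict) -> str:
    """Stand-in for a paid model call."""
    return f"{config['prompt']}::{inputs['text']}"


cache = NodeCache()
extract_cfg = {"prompt": "extract fields"}
doc = {"text": "invoice 42"}

# First run: both nodes execute.
extracted = cache.run(extract_cfg, doc, fake_llm)
summary_v1 = cache.run({"prompt": "summarize"}, {"text": extracted}, fake_llm)

# Only the downstream prompt changes; the upstream node hits the cache.
extracted_again = cache.run(extract_cfg, doc, fake_llm)
summary_v2 = cache.run({"prompt": "summarize briefly"}, {"text": extracted_again}, fake_llm)
```

After the second pass, `cache.misses` is 3 rather than 4: the extract step ran once, and only the edited summarize step was paid for again.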

❖

LLM nodes

Gemini, GPT, Claude, and OpenRouter behind one interface. Structured output, response repair, multimodal input, retry / on-error policies, per-node cost tracking.

☷

Explicit reconciliation

Re-process a whole dataset, a single row, or a single node. Always an explicit action, with a row-count and cost preview first. No surprise bills when a prompt changes.
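As a rough sketch of an explicit, previewed reconciliation: compute the row count and estimated cost first, and only process rows once the caller confirms. The function names, the flat per-row cost, and the `confirm` callback are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Preview:
    row_count: int
    estimated_cost_usd: float


def preview_reconciliation(rows, affected, cost_per_row_usd: float) -> Preview:
    """Count the rows that would be re-processed and estimate the cost."""
    targets = [row for row in rows if affected(row)]
    return Preview(len(targets), round(len(targets) * cost_per_row_usd, 4))


def reconcile(rows, affected, cost_per_row_usd: float, process, confirm):
    """Re-process affected rows only after the preview has been confirmed."""
    preview = preview_reconciliation(rows, affected, cost_per_row_usd)
    if not confirm(preview):
        return []  # nothing runs, nothing is billed
    return [process(row) for row in rows if affected(row)]


rows = [
    {"key": "a", "pipeline_version": "v1"},
    {"key": "b", "pipeline_version": "v2"},
    {"key": "c", "pipeline_version": "v1"},
]


def is_stale(row):
    """Rows produced by the old pipeline version need re-processing."""
    return row["pipeline_version"] == "v1"


def upgrade(row):
    return {**row, "pipeline_version": "v2"}


preview = preview_reconciliation(rows, is_stale, cost_per_row_usd=0.01)
declined = reconcile(rows, is_stale, 0.01, upgrade, confirm=lambda p: False)
accepted = reconcile(rows, is_stale, 0.01, upgrade, confirm=lambda p: p.estimated_cost_usd < 1.0)
```

Keeping the preview as a separate step from the run is what makes the action explicit: a prompt edit alone never triggers spend.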

❯

Python SDK

Prefer code? Author the same pipeline in Python. The editor and SDK produce the same document, though the visual editor is the primary surface.

How it works

Corpus in, dataset out, iterate.

Point a pipeline at your data

Bind a pipeline to an input dataset of documents, forms, transcripts, or scraped data. Build the transform as a DAG of LLM and code nodes with schema-typed ports. Rows arrive by upload, API, webhook, file watch, or Google Forms.

Get a structured dataset

Each input row becomes one output row, keyed and replace-on-update, with provenance back to the source row, pipeline version, and processing. Browse, filter, sort, and export it as CSV or JSON.

Iterate without re-paying

Spot a bad row, fix the prompt, swap a model, or tweak a schema, then re-run only the affected rows and nodes. Cached upstream results stay, and you see a cost preview before any bulk reconciliation runs.
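One way to picture "only the affected rows and nodes": when a node changes, the nodes that need to re-run are that node plus everything downstream of it in the DAG. The sketch below assumes a pipeline expressed as a plain list of `(source, destination)` edges with made-up node names; it does not reflect the product's internal representation.

```python
from collections import deque


def affected_nodes(edges, changed):
    """Return the changed node plus every node reachable downstream of it."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)

    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical pipeline: parse -> extract -> {classify, summarize} -> output
edges = [
    ("parse", "extract"),
    ("extract", "classify"),
    ("extract", "summarize"),
    ("classify", "output"),
    ("summarize", "output"),
]

to_rerun = affected_nodes(edges, "classify")
```

Editing the `classify` prompt marks only `classify` and `output` for re-running; `parse`, `extract`, and `summarize` keep their existing results.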

One job, done well

Substructure is built end to end for a single job: unstructured input in, a structured dataset out, and a tight loop to make that dataset good. Every feature serves that one path, which is why Substructure does that job well.

Substructure is not designed to be an agent framework, a chatbot, or a generic workflow tool.

Have documents to turn into data?

The self-hosted version is rolling out first. Get in touch to talk about access or your use case.
