Now in early access
Unstructured input,
structured datasets.
Build a pipeline of LLM and code steps that turns documents, forms, and transcripts into a queryable, provenance-tracked dataset. Then iterate without re-running everything.
Built for the iteration loop
A visual pipeline editor, datasets as the output, and an engine designed so you only ever re-run what changed.
Visual pipeline editor
Build a DAG of LLM, code, branch, map, and transform nodes on a canvas. Schema-typed ports, validated as you connect. The bound input and output datasets appear as fixed terminals.
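As a rough illustration of what validating schema-typed ports on connect can look like, here is a minimal sketch. It assumes a flat schema of field names mapped to type names; `Port` and `can_connect` are hypothetical names for illustration, not the product's API.

```python
from dataclasses import dataclass


@dataclass
class Port:
    """A named port with a flat field-name -> type-name schema (hypothetical)."""
    name: str
    schema: dict


def can_connect(source: Port, target: Port) -> list:
    """Return the problems found; an empty list means the edge is valid."""
    problems = []
    for field_name, type_name in target.schema.items():
        if field_name not in source.schema:
            # The downstream node needs a field the upstream node never emits.
            problems.append(f"missing field: {field_name}")
        elif source.schema[field_name] != type_name:
            problems.append(
                f"type mismatch on {field_name}: "
                f"{source.schema[field_name]} != {type_name}"
            )
    return problems


extract_out = Port("out", {"invoice_id": "string", "total": "number"})
classify_in = Port("in", {"invoice_id": "string"})
problems = can_connect(extract_out, classify_in)  # empty list: edge is valid
```

Returning a list of problems rather than a boolean lets an editor show every mismatch at the moment the edge is drawn.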
Datasets as the output
One output row per input row, keyed so that re-processing replaces the existing row instead of adding a duplicate. Every row carries provenance: source input, pipeline version, and the processing that produced it. Browse, filter, and export.
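A minimal sketch of keyed, replace-on-update rows with provenance, assuming an in-memory dictionary as the dataset. The `OutputRow` field names and the `upsert` helper are assumptions made for illustration, not the actual storage model.

```python
from dataclasses import dataclass


@dataclass
class OutputRow:
    """One structured row plus its provenance (field names are assumptions)."""
    key: str               # stable key derived from the input row
    data: dict             # the structured fields the pipeline produced
    source_input: str      # provenance: which input row this came from
    pipeline_version: str  # provenance: which pipeline version ran
    processing_id: str     # provenance: which processing run produced it


def upsert(dataset: dict, row: OutputRow) -> None:
    """Replace-on-update: a row with an existing key overwrites the old row."""
    dataset[row.key] = row


dataset = {}
upsert(dataset, OutputRow("invoice-17", {"total": 120.0}, "input/17", "v1", "run-a"))
# Re-processing the same input under a new pipeline version replaces the row.
upsert(dataset, OutputRow("invoice-17", {"total": 125.5}, "input/17", "v2", "run-b"))
```

Because the key stays the same across runs, the dataset keeps one row per input while the provenance fields record which version produced the current value.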
The iteration loop
Change a prompt, fix a bad row, swap a model, then re-run only the affected rows and nodes. A content-addressed cache means you never pay twice for upstream results that have not changed.
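One way to picture a content-addressed cache: the cache key is a hash of everything that determines a node's result, so an unchanged node with unchanged input is a hit. The sketch below assumes SHA-256 over a canonical JSON encoding of the node config and input; `cache_key` and `NodeCache` are hypothetical names, and the real engine may key on different things.

```python
import hashlib
import json


def cache_key(node_config: dict, node_input: dict) -> str:
    """Hash the node config and input; sort_keys makes the encoding stable."""
    payload = json.dumps(
        {"config": node_config, "input": node_input}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class NodeCache:
    """In-memory cache keyed by content hash (illustrative only)."""

    def __init__(self):
        self._store = {}
        self.misses = 0  # counts how often the node function actually ran

    def run(self, node_config: dict, node_input: dict, fn):
        key = cache_key(node_config, node_input)
        if key not in self._store:
            self.misses += 1
            self._store[key] = fn(node_config, node_input)
        return self._store[key]
```

Editing a prompt changes that node's key, so it misses and re-runs, while upstream nodes whose config and input are unchanged keep hitting the cache.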
LLM nodes
Gemini, GPT, Claude, and OpenRouter behind one interface. Structured output, response repair, multimodal input, retry / on-error policies, per-node cost tracking.
Explicit reconciliation
Re-process a whole dataset, a single row, or a single node. Always an explicit action, with a row-count and cost preview first. No surprise bills when a prompt changes.
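The preview-then-confirm flow can be sketched as follows. This assumes a flat, made-up cost per node call; `preview`, `reconcile`, and the `confirmed` flag are hypothetical names for illustration, and real cost estimates would depend on model pricing and token counts.

```python
from dataclasses import dataclass


@dataclass
class Preview:
    row_count: int
    estimated_cost_usd: float


def preview(rows: list, nodes: list, cost_per_call_usd: dict) -> Preview:
    """Count the rows that would re-run and estimate the spend, without running."""
    per_row = sum(cost_per_call_usd[node] for node in nodes)
    return Preview(len(rows), round(len(rows) * per_row, 4))


def reconcile(rows: list, nodes: list, confirmed: bool, run_row) -> int:
    """Re-process rows only after the caller has explicitly confirmed."""
    if not confirmed:
        raise PermissionError("reconciliation requires explicit confirmation")
    for row in rows:
        run_row(row, nodes)
    return len(rows)
```

Keeping the estimate in a separate function that never calls a model is what makes the preview free to compute.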
Python SDK
Prefer code? Author the same pipeline in Python. The editor and the SDK produce the same pipeline document, though the visual editor is the primary surface.
How it works
Corpus in, dataset out, iterate.
Point a pipeline at your data
Bind a pipeline to an input dataset of documents, forms, transcripts, or scraped data. Build the transform as a DAG of LLM and code nodes with schema-typed ports. Rows arrive by upload, API, webhook, file watch, or Google Forms.
Get a structured dataset
Each input row becomes one output row, keyed and replace-on-update, with provenance back to the source row, the pipeline version, and the processing that produced it. Browse, filter, sort, and export it as CSV or JSON.
Iterate without re-paying
Spot a bad row, fix the prompt, swap a model, or tweak a schema, then re-run only the affected rows and nodes. Cached upstream results stay, and you see a cost preview before any bulk reconciliation runs.
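To make "only the affected nodes" concrete: when one node changes, the nodes that need to re-run are that node plus everything downstream of it in the DAG. The sketch below assumes the pipeline is available as an adjacency mapping from each node to its direct children; `affected_nodes` and the node names are hypothetical.

```python
from collections import deque


def affected_nodes(edges: dict, changed: str) -> set:
    """Return the changed node and every node reachable downstream of it."""
    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# A small example pipeline: parse -> extract -> (classify, summarize) -> merge
edges = {
    "parse": ["extract"],
    "extract": ["classify", "summarize"],
    "classify": ["merge"],
    "summarize": ["merge"],
}
stale = affected_nodes(edges, "classify")
```

In this example a change to `classify` leaves `parse`, `extract`, and `summarize` untouched, so their cached results can be reused.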
One job, done well
Substructure is built end to end for a single job: unstructured input in, a structured dataset out, and a tight loop to make that dataset good. Every feature serves that one path, and that focus is why Substructure does the job well.
It is not designed to be an agent framework, a chatbot, or a generic workflow tool.
Have documents to turn into data?
The self-hosted version is rolling out first. Get in touch to talk about access or your use case.
Contact us