Introducing Substructure
By the Substructure Team
We’re building Substructure because turning a pile of unstructured input into clean, structured data with LLMs is harder than it should be.
The Problem
Every team doing LLM-powered extraction ends up solving the same problems:
- Iteration is expensive. Change one prompt and you re-run everything. With no caching, you pay for every upstream step again, even on rows that were already fine.
- You can’t fix one bad row. Say the output is 92% right and the other 8% is wrong in three different ways. There’s no way to fix those specific rows without redoing the whole batch.
- LLM output is unreliable. Malformed JSON, broken escape sequences in LaTeX, trailing commas, freeform text wrapping structured output. Every team builds ad-hoc parsing and retry logic.
- There’s no canonical dataset. Results live as a pile of one-shot runs. “Re-extract every row where this field is null” means walking run history in reverse and reconstructing the collection by hand.
What We’re Building
Substructure takes unstructured input (documents, forms, transcripts, scraped data) and runs it through a pipeline of LLM and code steps into a structured, queryable dataset, one row per input. The distinguishing feature is the iteration loop on that dataset.
The key pieces:
Pipelines, authored visually. A pipeline is a DAG of LLM Prompt, Code, Branch, Map, Schema Transform, and HTTP Request nodes, built on a canvas with schema-typed ports validated as you connect them. The bound input and output datasets show up as fixed terminals. A Python SDK is there for engineers who’d rather author in code, but the visual editor is the primary surface.
Datasets with provenance. Output is a real dataset, not an API response. Each row is keyed to its input and is replaced in place on update, and it carries provenance back to the source row, the pipeline version, and the processing that produced it. Browse it, filter it, export it.
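To make the row model concrete, here is a minimal sketch of a dataset keyed by input, with provenance on every row. It is an illustration only, not the Substructure SDK: the names `Dataset`, `Row`, `Provenance`, `upsert`, and `where` are all hypothetical, and an in-memory dict stands in for real storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    # Hypothetical provenance fields, mirroring the three things named above.
    source_row_id: str
    pipeline_version: str
    processing_id: str


@dataclass
class Row:
    input_key: str          # identity of the input this row was derived from
    fields: dict            # the structured output for that input
    provenance: Provenance


class Dataset:
    """In-memory stand-in: one row per input, replaced on update."""

    def __init__(self):
        self._rows = {}

    def upsert(self, row):
        # Same input key overwrites the old row, so there are no duplicates.
        self._rows[row.input_key] = row

    def get(self, input_key):
        return self._rows[input_key]

    def where(self, predicate):
        return [r for r in self._rows.values() if predicate(r)]

    def __len__(self):
        return len(self._rows)


ds = Dataset()
ds.upsert(Row("doc-1", {"total": 120.0}, Provenance("doc-1", "v1", "proc-001")))
ds.upsert(Row("doc-2", {"total": None}, Provenance("doc-2", "v1", "proc-001")))

# "Every row where this field is null" becomes a query, not an archaeology job.
needs_rework = [r.input_key for r in ds.where(lambda r: r.fields["total"] is None)]

# Re-processing doc-2 replaces its row and records the new provenance.
ds.upsert(Row("doc-2", {"total": 87.5}, Provenance("doc-2", "v2", "proc-002")))
```

Because rows are keyed by input, the dataset stays at two rows after the second pass, and the provenance on `doc-2` shows which pipeline version produced the current value.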
The iteration loop. A content-addressed cache keys every node result by its config and input, so changing a prompt only recomputes that node and what’s downstream of it; everything upstream is served from cache. Re-process a whole dataset, a single row, or a single node, always explicitly and with a cost preview first. Fix a row’s raw response and revalidate against the schema without making another model call.
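The caching idea can be sketched in a few lines. This is an assumption-laden illustration rather than Substructure's implementation: `cache_key`, `NodeCache`, `fake_llm`, and the two-node pipeline are hypothetical names, the key is assumed to be a SHA-256 hash of the canonical JSON of config plus input, and a dict stands in for the cache store.

```python
import hashlib
import json


def cache_key(node_config: dict, node_input: dict) -> str:
    """Derive a deterministic key from a node's config and its input."""
    payload = json.dumps(
        {"config": node_config, "input": node_input},
        sort_keys=True,              # key order must not affect the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class NodeCache:
    def __init__(self):
        self._store = {}
        self.misses = 0              # counts how often a node actually ran

    def run(self, node_config, node_input, compute):
        key = cache_key(node_config, node_input)
        if key not in self._store:
            self.misses += 1
            self._store[key] = compute(node_config, node_input)
        return self._store[key]


def fake_llm(config, data):
    # Stand-in for a model call, so the sketch runs offline.
    return {"text": f"{config['prompt']}::{data['text']}"}


def run_pipeline(cache, extract_cfg, summarize_cfg, row):
    extracted = cache.run(extract_cfg, row, fake_llm)
    return cache.run(summarize_cfg, extracted, fake_llm)


cache = NodeCache()
row = {"text": "invoice 42"}
extract_cfg = {"node": "extract", "prompt": "Extract fields"}

run_pipeline(cache, extract_cfg, {"node": "summarize", "prompt": "v1"}, row)
first_run_misses = cache.misses

# Only the downstream prompt changes; the upstream node is served from cache.
run_pipeline(cache, extract_cfg, {"node": "summarize", "prompt": "v2"}, row)
second_run_misses = cache.misses - first_run_misses
```

In this sketch the first run computes both nodes and the second computes only the edited one. A downstream node's input is the upstream node's output, so an upstream change that alters that output gives every dependent node a new key.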
Robust execution. Structured output, response repair, retry and on-error policies (fail, soft-fail, skip, fallback, pause), multimodal input, and per-node cost tracking with budgets. Processings are crash-safe and resume from the last cached checkpoint.
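As a rough picture of what response repair involves, the sketch below handles two of the failure modes listed earlier: freeform text wrapped around the JSON, and trailing commas. The function name `repair_json` and its strategy are assumptions made for illustration, not a description of Substructure's repair logic. It uses only the standard library.

```python
import json
import re


def repair_json(raw: str) -> dict:
    """Best-effort recovery of a JSON object from a model response."""
    # Drop any prose or code-fence markers around the outermost object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    candidate = raw[start:end + 1]

    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        pass

    # Remove trailing commas before a closing brace or bracket.
    # Limitation: this regex would also touch a ",}" inside a string value.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    return json.loads(candidate)  # still raises if the response is beyond repair


raw = (
    "Sure! Here is the data:\n"
    "```json\n"
    '{"name": "Acme", "items": [1, 2,],}\n'
    "```\n"
    "Let me know if you need anything else."
)
repaired = repair_json(raw)
```

A response that cannot be repaired still raises, which is the point where a retry or on-error policy would take over.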
GPT, Claude, Gemini, and OpenRouter are supported behind a common interface, with per-node provider and model selection.
What It Isn’t
Substructure is deliberately narrow. It is not an agent framework, not a chatbot, not RAG-as-a-service, not an AI backend, and not a generic workflow tool. It does one thing: corpus in, dataset out, with a tight loop to make the dataset good.
What’s Next
We’re opening early access for the self-hosted version. If you have a pile of documents to turn into structured data, get in touch.