Features
Build a pipeline that turns unstructured input into a queryable, provenance-tracked dataset, then iterate on prompts and rows without re-running everything. Self-hosted on FastAPI + Postgres.
Visual pipeline editor
The primary way to build. A non-engineer can author, run, and iterate on a whole pipeline on the canvas without ever opening the SDK.
- React Flow canvas: pan, zoom, undo / redo, minimap.
- LLM Prompt, Code, Branch, Map, Schema Transform, and HTTP Request nodes.
- Bound input and output datasets appear as fixed, read-only terminals.
- Edges validated against port schemas as you draw them; inline node config.
Datasets & provenance
The output of a pipeline is a real dataset, not an API response.
Input datasets
Documents, forms, transcripts, images, scraped data. Rows have stable ids and optional content-addressed dedupe, so dropping the same file in twice doesn't reprocess it.
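A minimal sketch of content-addressed dedupe, under the assumption that the row id is simply the SHA-256 of the file bytes. `InputDataset` and `content_id` are hypothetical names for illustration, not part of the product's API, and the real row id scheme may differ.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Return an identifier derived only from the file's bytes."""
    return hashlib.sha256(data).hexdigest()


class InputDataset:
    """Toy in-memory dataset that skips content it has already seen."""

    def __init__(self) -> None:
        self.rows = {}  # content id -> raw bytes

    def add(self, data: bytes):
        """Insert data; return (row id, whether a new row was created)."""
        row_id = content_id(data)
        if row_id in self.rows:
            return row_id, False  # duplicate upload: nothing to reprocess
        self.rows[row_id] = data
        return row_id, True


ds = InputDataset()
first = ds.add(b"invoice #1")
second = ds.add(b"invoice #1")  # the same bytes dropped in twice
```

The second `add` returns the same id with `False`, so a caller can skip processing for it.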
Output rows with provenance
One row per input, keyed on (input row, pipeline) and replace-on-update. Each row links back to its source row, pipeline version, and the processing that produced it; history is kept as versioned snapshots.
Browse, query, export
Sort, filter, and search rows in a table. Filter to stale rows (output at an older pipeline version) or failed fields. Export CSV or JSON; mirror to PostgreSQL.
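The keying and staleness rules in this section can be sketched roughly as follows. This is an in-memory illustration only: `OutputDataset`, `upsert`, and `stale` are hypothetical names, and the real storage layout is not described in this page.

```python
from dataclasses import dataclass, field


@dataclass
class OutputRow:
    input_row_id: str
    pipeline_id: str
    pipeline_version: int
    data: dict
    # Earlier (pipeline_version, data) snapshots, oldest first.
    history: list = field(default_factory=list)


class OutputDataset:
    def __init__(self) -> None:
        self.rows = {}  # (input_row_id, pipeline_id) -> OutputRow

    def upsert(self, input_row_id, pipeline_id, pipeline_version, data):
        """Replace the row for this key, keeping the old value as a snapshot."""
        key = (input_row_id, pipeline_id)
        previous = self.rows.get(key)
        history = []
        if previous is not None:
            snapshot = (previous.pipeline_version, previous.data)
            history = previous.history + [snapshot]
        row = OutputRow(input_row_id, pipeline_id, pipeline_version, data, history)
        self.rows[key] = row
        return row

    def stale(self, pipeline_id, current_version):
        """Rows whose output was produced by an older pipeline version."""
        return [
            row for row in self.rows.values()
            if row.pipeline_id == pipeline_id
            and row.pipeline_version < current_version
        ]


out = OutputDataset()
out.upsert("row-1", "invoices", 1, {"total": 10.0})
out.upsert("row-1", "invoices", 2, {"total": 12.5})  # replaces, keeps snapshot
out.upsert("row-2", "invoices", 1, {"total": 3.0})
stale_ids = [row.input_row_id for row in out.stale("invoices", 2)]
```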
The iteration loop
The differentiator: make a dataset good without re-running everything.
Re-run only what changed
A content-addressed cache keys every node result by config + input. Edit a prompt and only that node and its downstream recompute; upstream results are served from cache, across rows and pipelines.
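One way to picture this, as a sketch rather than the engine's actual implementation: hash the node config together with its input, and let a changed result flow downstream as a new input. `cache_key`, `NodeCache`, and the three toy nodes are assumptions made up for the example.

```python
import hashlib
import json


def cache_key(config: dict, node_input) -> str:
    """Key a node result by its config plus its input, in canonical JSON form."""
    payload = json.dumps({"config": config, "input": node_input}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class NodeCache:
    def __init__(self) -> None:
        self.results = {}
        self.computed = 0  # number of real (non-cached) node executions

    def run(self, config, node_input, fn):
        key = cache_key(config, node_input)
        if key not in self.results:
            self.computed += 1
            self.results[key] = fn(config, node_input)
        return self.results[key]


def run_pipeline(cache, text, prompt):
    """Three-node chain: normalize -> extract (has a prompt) -> format."""
    cleaned = cache.run({"node": "normalize"}, text,
                        lambda cfg, x: x.strip().lower())
    extracted = cache.run({"node": "extract", "prompt": prompt}, cleaned,
                          lambda cfg, x: f"{cfg['prompt']}:{x}")
    return cache.run({"node": "format"}, extracted,
                     lambda cfg, x: x.upper())


cache = NodeCache()
first = run_pipeline(cache, "  Hello ", "v1")
after_first = cache.computed   # all three nodes computed
second = run_pipeline(cache, "  Hello ", "v2")  # only the prompt was edited
after_edit = cache.computed    # normalize was served from cache
```

Editing the prompt changes the key of `extract`, and its new output changes the input (and so the key) of `format`; `normalize` is untouched.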
Reconcile at any scope
Re-process a whole dataset, a single row, or a single node of a single row. Always explicit, always with a row-count and estimated-cost preview before anything runs. No surprise bills.
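As a rough illustration of the preview step, the sketch below computes a row count and a cost estimate for each scope before anything would run. The per-node cost table, the micro-USD unit, and the function name are all hypothetical; the page does not say how the real estimate is computed.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical per-row cost estimates in micro-USD, one entry per LLM node.
NODE_COST_MICRO_USD = {"extract": 2000, "classify": 1000}


@dataclass(frozen=True)
class ReconcilePreview:
    row_count: int
    estimated_cost_micro_usd: int


def preview_reconcile(
    all_row_ids: Sequence[str],
    row_id: Optional[str] = None,
    node: Optional[str] = None,
) -> ReconcilePreview:
    """Estimate the work for a scope without running any of it."""
    rows = [row_id] if row_id is not None else list(all_row_ids)
    nodes = [node] if node is not None else list(NODE_COST_MICRO_USD)
    cost = len(rows) * sum(NODE_COST_MICRO_USD[n] for n in nodes)
    return ReconcilePreview(len(rows), cost)


row_ids = ["row-1", "row-2", "row-3"]
whole_dataset = preview_reconcile(row_ids)
single_row = preview_reconcile(row_ids, row_id="row-2")
single_node = preview_reconcile(row_ids, row_id="row-2", node="extract")
```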
Fix bad rows in place
Open a row's processing detail, edit the raw LLM response, and revalidate against the schema with no new model call. Human-edited rows are marked and never served from cache elsewhere.
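A sketch of the edit-and-revalidate flow, assuming a deliberately tiny stand-in for schema validation (a field-to-types map instead of full JSON Schema). `RowResult`, `validate`, and `edit_and_revalidate` are hypothetical names. Note that nothing here calls a model.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical flat schema: field name -> accepted Python types.
INVOICE_SCHEMA = {"vendor": (str,), "total": (int, float)}


def validate(raw_response: str, schema: dict) -> dict:
    """Parse the raw response and check required fields and their types."""
    data = json.loads(raw_response)
    for name, types in schema.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], types):
            raise ValueError(f"wrong type for field: {name}")
    return data


@dataclass
class RowResult:
    raw_response: str
    parsed: Optional[dict] = None
    human_edited: bool = False


def edit_and_revalidate(row: RowResult, new_raw: str, schema: dict) -> RowResult:
    parsed = validate(new_raw, schema)  # raises before the row is touched
    row.raw_response = new_raw
    row.parsed = parsed
    row.human_edited = True  # marked so it is not reused as a cache entry
    return row


# The model returned "total" as a string; a human fixes the raw response.
row = RowResult(raw_response='{"vendor": "Acme", "total": "12.50"}')
edit_and_revalidate(row, '{"vendor": "Acme", "total": 12.5}', INVOICE_SCHEMA)
```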
LLM nodes
Per-node provider and model selection with structured output.
Per-node model selection
Pick a provider (Gemini, GPT, Claude, OpenRouter) and model for each LLM node. Provider and model are set explicitly; missing config fails loudly rather than defaulting silently.
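"Fails loudly" could look something like the sketch below: a config object that refuses to be built when provider or model is absent, instead of falling back to a default. `LLMNodeConfig`, `from_dict`, and the provider string `"gemini"` are illustrative assumptions, not the product's actual config format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMNodeConfig:
    provider: str
    model: str

    @classmethod
    def from_dict(cls, node_name: str, raw: dict) -> "LLMNodeConfig":
        """Build a config, raising if provider or model is missing or empty."""
        missing = [key for key in ("provider", "model") if not raw.get(key)]
        if missing:
            raise ValueError(
                f"LLM node {node_name!r} is missing required config: "
                + ", ".join(missing)
            )
        return cls(provider=raw["provider"], model=raw["model"])


config = LLMNodeConfig.from_dict(
    "extract", {"provider": "gemini", "model": "gemini-3-flash-preview"}
)
```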
Structured & multimodal
Define an output schema (or a Pydantic model via the SDK) and responses are forced to match, so there is no freeform text to parse. PDF and image inputs adapt to each model's capabilities at call time.
Response repair
Markdown fence stripping today, with escape repair and JSON extraction from reasoning text on the way. The verbatim raw response is always persisted for inspection and edit-revalidate.
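Markdown fence stripping can be approximated with one regular expression. This is a sketch of the general technique, not the product's repair code; `strip_markdown_fence` is a hypothetical name, and it returns a new string so the stored raw response stays verbatim.

```python
import json
import re

FENCE = "`" * 3  # three backticks, the Markdown code-fence marker

# Optional language tag after the opening fence, then the body, then the close.
_FENCED = re.compile(
    r"^\s*" + FENCE + r"[\w-]*[ \t]*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_markdown_fence(raw: str) -> str:
    """Return the body of a fenced response, or the trimmed text if unfenced."""
    match = _FENCED.match(raw)
    return match.group(1).strip() if match else raw.strip()


raw = FENCE + 'json\n{"vendor": "Acme", "total": 12.5}\n' + FENCE
parsed = json.loads(strip_markdown_fence(raw))
```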
Execution engine
Crash-safe execution with caching, retries, and cost control.
Content-addressed cache
Every node result is keyed by config + input. Identical inputs reuse the cached output across rows and pipelines; file inputs are hashed by content so uploads dedupe.
Resumable processings
Processings are crash-safe: if a worker dies, the work returns to the queue and resumes from the last cached checkpoint. Execution state lives in Postgres, not in the worker.
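The resume behaviour can be pictured with a toy model in which a plain dict stands in for the durable state kept in Postgres. Everything here (`run_processing`, `WorkerCrash`, the step names) is a hypothetical illustration of checkpointed resumption, not the engine's code.

```python
class WorkerCrash(Exception):
    """Simulates a worker dying partway through a processing."""


def run_processing(state, steps, crash_before=None):
    """Run steps in order, skipping any that already have a checkpoint."""
    for name, fn in steps:
        if name in state["done"]:
            continue  # result already checkpointed by an earlier worker
        if name == crash_before:
            raise WorkerCrash(name)
        state["value"] = fn(state["value"])
        state["done"].append(name)  # checkpoint lives outside the worker
    return state["value"]


steps = [
    ("parse", lambda v: v + 1),
    ("extract", lambda v: v * 10),
    ("format", lambda v: v - 3),
]
state = {"value": 1, "done": []}

try:
    run_processing(state, steps, crash_before="extract")  # first worker dies
except WorkerCrash:
    pass

result = run_processing(state, steps)  # a second worker picks the work up
```

Because `parse` is not repeated on resume, the final value is (1 + 1) * 10 - 3 rather than the value a full re-run from the checkpointed state would give.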
Retry & on-error policies
Per node: a retry policy (max attempts, backoff, jitter) independent of an on-error action (fail, soft-fail, skip, fallback, pause). Map iterations are independent per item. Per-node and per-run cost tracking with budgets.
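To show why the two settings are separate, here is a sketch in which the retry policy only decides how often and how long to wait, and the on-error action only decides what happens once attempts run out. Names, defaults, and the exponential backoff formula are assumptions; only `fail`, `skip`, and `fallback` are modelled, and `soft-fail` and `pause` are left out.

```python
import random
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int = 3       # must be at least 1
    base_delay_s: float = 0.5
    max_delay_s: float = 8.0
    jitter: float = 0.25        # +/- fraction applied to each delay

    def delay(self, attempt: int, rng: random.Random) -> float:
        """Exponential backoff, capped, with symmetric jitter."""
        backoff = min(self.max_delay_s, self.base_delay_s * 2 ** (attempt - 1))
        return backoff * (1 + rng.uniform(-self.jitter, self.jitter))


def run_node(fn, retry, on_error="fail", fallback=None,
             sleep=time.sleep, rng=None):
    rng = rng or random.Random()
    last_exc = None
    for attempt in range(1, retry.max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            if attempt < retry.max_attempts:
                sleep(retry.delay(attempt, rng))
    # Retries are exhausted; the on-error action takes over.
    if on_error == "fallback":
        return fallback
    if on_error == "skip":
        return None
    raise last_exc


delays = []
attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient")
    return "ok"


def always_fails():
    raise RuntimeError("down")


# Record delays instead of sleeping so the example runs instantly.
result = run_node(flaky, RetryPolicy(), sleep=delays.append,
                  rng=random.Random(0))
fallback_result = run_node(always_fails, RetryPolicy(max_attempts=2),
                           on_error="fallback", fallback={"vendor": None},
                           sleep=lambda seconds: None)
```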
Schemas
JSON Schema on every node port, checked at design time and at runtime.
Schema authoring
Build nested schemas with a form, paste raw JSON Schema, or infer one from a sample. Set field types, required fields, and constraints. The Python SDK accepts plain Pydantic models in the same slots.
Design-time edge validation
The editor flags edges between incompatible ports before you ever run, showing the path that doesn't match. The body of a Map narrows its inputs against the body node's input schema.
Runtime validation
Data is validated against each node's input schema before delivery. Failures route through the edge's on-error action rather than crashing the whole processing.
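A design-time edge check might work along these lines: compare the upstream port's output schema with the downstream port's input schema and report each path that does not match. The sketch covers only a small subset of JSON Schema (`type`, `properties`, `required`), and `incompatibilities` is a hypothetical helper, not the editor's real checker.

```python
from typing import List


def incompatibilities(source: dict, target: dict, path: str = "$") -> List[str]:
    """List the paths where `source` cannot satisfy `target`."""
    expected = target.get("type")
    if expected and source.get("type") != expected:
        return [f"{path}: expected {expected}, got {source.get('type')}"]

    problems: List[str] = []
    if expected == "object":
        source_props = source.get("properties", {})
        for name in target.get("required", []):
            if name not in source_props:
                problems.append(f"{path}.{name}: missing required property")
        for name, sub_schema in target.get("properties", {}).items():
            if name in source_props:
                problems += incompatibilities(
                    source_props[name], sub_schema, f"{path}.{name}"
                )
    return problems


upstream_output = {
    "type": "object",
    "properties": {"vendor": {"type": "string"}, "total": {"type": "string"}},
}
downstream_input = {
    "type": "object",
    "required": ["vendor", "total", "currency"],
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
}
edge_problems = incompatibilities(upstream_output, downstream_input)
```

An empty list means the edge is compatible under these simplified rules; here it reports the missing `currency` and the mistyped `total`.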
Triggers, integrations & hosting
How rows get in, how results get out, where it all runs.
Pipeline triggers
New rows land in an input dataset via webhook, API, file-store watch, or Google Forms, and the pipeline processes them automatically. Ad-hoc invocation runs a pipeline once without persisting.
Processing callbacks
Subscribe a URL and the server POSTs the final result (or failure) when a processing finishes. Useful when an integrator is waiting on a long batch.
External mirrors & self-hosting
Mirror an output dataset to PostgreSQL (more sinks planned). Self-host on FastAPI + Postgres with your own provider keys; files live in local FS or S3, so nothing leaves your network.
Python SDK
For engineers who'd rather author in code, or who are wiring Substructure into an app or CI. The editor and SDK emit the same schema-pinned document, so you can move between them.
- Pass a Pydantic BaseModel as output_schema (Pydantic optional).
- Sync and async clients trigger processings and read datasets and rows.
- Validate locally against the canonical schema before you ship.
from pydantic import BaseModel
from substructure_sdk import Pipeline


# Output schema for the LLM node, declared as a plain Pydantic model.
class Invoice(BaseModel):
    vendor: str
    total: float


pipe = Pipeline("invoices")
src = pipe.input("src")
extract = pipe.llm(
    "extract",
    model="gemini-3-flash-preview",
    user_prompt="Extract fields from: {{input.text}}",
    output_schema=Invoice,
)
out = pipe.output("out")
# Wire input -> LLM extraction -> output.
pipe.connect(src, extract).connect(extract, out)

Have documents to turn into data?
The self-hosted version is rolling out first. Get in touch to talk about access or your use case.