Features

Build a pipeline that turns unstructured input into a queryable, provenance-tracked dataset, then iterate on prompts and rows without re-running everything. Self-hosted on FastAPI + Postgres.

Visual pipeline editor

The primary way to build. A non-engineer can author, run, and iterate on a whole pipeline on the canvas without ever opening the SDK.

• React Flow canvas: pan, zoom, undo / redo, minimap.
• LLM Prompt, Code, Branch, Map, Schema Transform, and HTTP Request nodes.
• Bound input and output datasets appear as fixed, read-only terminals.
• Edges are validated against port schemas as you draw them; node config is edited inline.

Example canvas: Input dataset (Documents) -> LLM Prompt (Extract) -> Code (Validate) -> Output dataset (Records)

Datasets & provenance

The output of a pipeline is a real dataset, not an API response.

Input datasets

Documents, forms, transcripts, images, scraped data. Rows have stable ids and optional content-addressed dedupe, so dropping the same file in twice doesn't reprocess it.
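A minimal sketch of content-addressed dedupe, assuming a row is identified by the SHA-256 digest of its file bytes. The `InputDataset` class and its `add` method are hypothetical stand-ins for illustration, not Substructure's API.

```python
import hashlib


class InputDataset:
    """Hypothetical in-memory dataset that dedupes rows by content hash."""

    def __init__(self) -> None:
        self._row_id_by_digest = {}
        self._next_id = 1

    def add(self, content: bytes) -> tuple:
        """Return (row_id, created). Identical bytes map to the same row id."""
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._row_id_by_digest:
            return self._row_id_by_digest[digest], False
        row_id = self._next_id
        self._next_id += 1
        self._row_id_by_digest[digest] = row_id
        return row_id, True


dataset = InputDataset()
first = dataset.add(b"invoice #1 contents")
again = dataset.add(b"invoice #1 contents")  # same bytes: no new row
other = dataset.add(b"invoice #2 contents")
```

Because the second upload resolves to an existing row, nothing downstream needs to run for it.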

Output rows with provenance

One row per input, keyed on (input row, pipeline) and replace-on-update. Each row links back to its source row, pipeline version, and the processing that produced it; history is kept as versioned snapshots.

Browse, query, export

Sort, filter, and search rows in a table. Filter to stale rows (output at an older pipeline version) or failed fields. Export CSV or JSON; mirror to PostgreSQL.
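The keying and staleness rules above can be sketched in a few lines. This is an in-memory illustration under assumed names (`OutputRow`, `OutputDataset`, `upsert`, `stale_rows` are all hypothetical); the real storage is Postgres and its schema is not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class OutputRow:
    input_row_id: int
    pipeline_id: str
    pipeline_version: int
    data: dict
    history: list = field(default_factory=list)  # earlier (version, data) snapshots


class OutputDataset:
    """Hypothetical store: one row per (input row, pipeline), replaced on update."""

    def __init__(self) -> None:
        self._rows = {}

    def upsert(self, input_row_id, pipeline_id, pipeline_version, data) -> OutputRow:
        key = (input_row_id, pipeline_id)
        row = self._rows.get(key)
        if row is None:
            row = OutputRow(input_row_id, pipeline_id, pipeline_version, data)
            self._rows[key] = row
            return row
        # Replace in place, keeping the previous state as a snapshot.
        row.history.append((row.pipeline_version, row.data))
        row.pipeline_version, row.data = pipeline_version, data
        return row

    def stale_rows(self, pipeline_id, current_version) -> list:
        """Rows whose output was produced by an older pipeline version."""
        return [
            row for row in self._rows.values()
            if row.pipeline_id == pipeline_id and row.pipeline_version < current_version
        ]


records = OutputDataset()
records.upsert(1, "invoices", 1, {"total": 10.0})
records.upsert(2, "invoices", 1, {"total": 5.0})
updated = records.upsert(1, "invoices", 2, {"total": 12.0})  # replaces row 1
stale = records.stale_rows("invoices", current_version=2)
```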

The iteration loop

The differentiator: make a dataset good without re-running everything.

Re-run only what changed

A content-addressed cache keys every node result by config + input. Edit a prompt and only that node and its downstream recompute; upstream results are served from cache, across rows and pipelines.
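One way to picture a config + input cache key, assuming both are JSON-serializable and the key is a SHA-256 over a canonical encoding. `cache_key` and `NodeCache` are illustrative names, and the actual key derivation may differ.

```python
import hashlib
import json


def cache_key(node_config: dict, node_input: dict) -> str:
    """Hash a canonical JSON encoding so key order does not change the key."""
    payload = json.dumps(
        {"config": node_config, "input": node_input},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class NodeCache:
    """Hypothetical cache: compute a node result only on a key miss."""

    def __init__(self) -> None:
        self._store = {}
        self.misses = 0

    def get_or_compute(self, node_config: dict, node_input: dict, compute):
        key = cache_key(node_config, node_input)
        if key not in self._store:
            self.misses += 1
            self._store[key] = compute(node_input)
        return self._store[key]


cache = NodeCache()
doc = {"text": "ACME, 42.00"}
prompt_v1 = {"prompt": "Extract fields"}
prompt_v2 = {"prompt": "Extract fields and currency"}

cache.get_or_compute(prompt_v1, doc, lambda row: row["text"].upper())  # miss
cache.get_or_compute(prompt_v1, doc, lambda row: row["text"].upper())  # hit
cache.get_or_compute(prompt_v2, doc, lambda row: row["text"].upper())  # edited prompt: miss
```

Editing the prompt changes the node's key, and since its output feeds the next node's input, downstream keys change too while upstream keys stay the same.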

Reconcile at any scope

Re-process a whole dataset, a single row, or a single node of a single row. Always explicit, always with a row-count and estimated-cost preview before anything runs. No surprise bills.
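A rough sketch of what a preview could compute at each scope. The per-node cost figures and the `preview_reconcile` function are assumptions for illustration; how costs are actually estimated is not specified above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReconcilePreview:
    row_count: int
    node_runs: int
    estimated_cost_usd: float


def preview_reconcile(row_ids: list, node_costs_usd: dict) -> ReconcilePreview:
    """Summarize the work a reconcile would do, without running anything."""
    cost_per_row = sum(node_costs_usd.values())
    return ReconcilePreview(
        row_count=len(row_ids),
        node_runs=len(row_ids) * len(node_costs_usd),
        estimated_cost_usd=round(cost_per_row * len(row_ids), 6),
    )


# Assumed per-run cost estimates for each node, in USD.
all_nodes = {"extract": 0.002, "validate": 0.0}

whole_dataset = preview_reconcile([1, 2, 3], all_nodes)
single_row = preview_reconcile([2], all_nodes)
single_node = preview_reconcile([2], {"extract": 0.002})
```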

Fix bad rows in place

Open a row's processing detail, edit the raw LLM response, and revalidate against the schema with no new model call. Human-edited rows are marked and never served from cache elsewhere.
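As an illustration of edit-then-revalidate, the sketch below re-checks an edited raw response locally, with no model call. The `REQUIRED_FIELDS` table is a deliberately tiny stand-in for a real JSON Schema validator, and `revalidate` is a hypothetical name.

```python
import json

# Stand-in for an output schema: field name -> accepted Python types.
REQUIRED_FIELDS = {"vendor": (str,), "total": (int, float)}


def revalidate(raw_response: str) -> tuple:
    """Return (row, errors). The row is None when validation fails."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["expected a JSON object"]
    errors = []
    for name, accepted in REQUIRED_FIELDS.items():
        if name not in data:
            errors.append(f"{name}: missing")
        elif isinstance(data[name], bool) or not isinstance(data[name], accepted):
            errors.append(f"{name}: wrong type")
    return (data if not errors else None), errors


original = '{"vendor": "ACME", "total": "42,00"}'  # what the model returned
edited = '{"vendor": "ACME", "total": 42.0}'       # after a human fix

_, errors_before = revalidate(original)
row, errors_after = revalidate(edited)
```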

LLM nodes

Per-node provider and model selection with structured output.

Per-node model selection

Pick a provider (Gemini, GPT, Claude, OpenRouter) and model for each LLM node. Provider and model are set explicitly; missing config fails loudly rather than defaulting silently.
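A small sketch of "fail loudly": resolve the provider and model or raise, never fall back to a default. The provider identifiers and the `resolve_llm_config` helper are assumptions, not the product's actual config keys.

```python
SUPPORTED_PROVIDERS = {"gemini", "gpt", "claude", "openrouter"}  # assumed identifiers


def resolve_llm_config(node_config: dict) -> tuple:
    """Return (provider, model) or raise; never substitute a default."""
    missing = [key for key in ("provider", "model") if not node_config.get(key)]
    if missing:
        raise ValueError(f"LLM node config is missing: {', '.join(missing)}")
    provider = node_config["provider"]
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    return provider, node_config["model"]


resolved = resolve_llm_config({"provider": "gemini", "model": "gemini-3-flash-preview"})

try:
    resolve_llm_config({"provider": "claude"})  # no model set
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```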

Structured & multimodal

Define an output schema (or a Pydantic model via the SDK) and responses are constrained to match it, so there is no freeform text to parse. PDF and image inputs adapt to each model's capabilities at call time.

Response repair

Markdown fence stripping today, with escape repair and JSON extraction from reasoning text on the way. The verbatim raw response is always persisted for inspection and edit-revalidate.
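Fence stripping can be sketched with a single regular expression. This is an illustrative version under assumed names (`strip_markdown_fence`, `parse_response`), not the shipped repair code; note that the raw text is returned untouched alongside the parsed value.

```python
import json
import re

FENCE = "`" * 3  # built this way so the example can sit inside a fenced block

_FENCED = re.compile(
    rf"^\s*{FENCE}[\w-]*[ \t]*\n(.*?)\n?[ \t]*{FENCE}\s*$",
    re.DOTALL,
)


def strip_markdown_fence(raw: str) -> str:
    """Return the body of a fenced response, or the trimmed text if unfenced."""
    match = _FENCED.match(raw)
    return match.group(1) if match else raw.strip()


def parse_response(raw: str) -> tuple:
    """Return (raw, parsed): the verbatim text is kept next to the parsed JSON."""
    return raw, json.loads(strip_markdown_fence(raw))


raw_response = FENCE + "json\n" + '{"vendor": "ACME", "total": 42.0}' + "\n" + FENCE
kept_raw, parsed = parse_response(raw_response)
```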

Execution engine

Crash-safe execution with caching, retries, and cost control.

Content-addressed cache

Every node result is keyed by config + input. Identical inputs reuse the cached output across rows and pipelines; file inputs are hashed by content so uploads dedupe.

Resumable processings

Processings are crash-safe: if a worker dies, the work returns to the queue and resumes from the last cached checkpoint. Execution state lives in Postgres, not in the worker.

Retry & on-error policies

Per node: a retry policy (max attempts, backoff, jitter) independent of an on-error action (fail, soft-fail, skip, fallback, pause). Map iterations are independent per item. Per-node and per-run cost tracking with budgets.
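To show why the two settings are separate, here is a sketch in which the retry policy only decides how often to try again, and the on-error action only decides what happens once retries run out. All names (`RetryPolicy`, `OnError`, `run_node`) and the exponential backoff formula are assumptions; what each action does downstream is not modeled.

```python
import random
import time
from dataclasses import dataclass
from enum import Enum


class OnError(Enum):
    FAIL = "fail"
    SOFT_FAIL = "soft-fail"
    SKIP = "skip"
    FALLBACK = "fallback"
    PAUSE = "pause"


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int = 3
    base_delay_s: float = 0.5
    multiplier: float = 2.0
    jitter_s: float = 0.1

    def delay(self, attempt: int, rng: random.Random) -> float:
        """Exponential backoff plus jitter after the given 1-based failed attempt."""
        backoff = self.base_delay_s * self.multiplier ** (attempt - 1)
        return backoff + rng.uniform(0, self.jitter_s)


def run_node(fn, policy, on_error, fallback=None, sleep=time.sleep, rng=None):
    rng = rng or random.Random()
    last_error = None
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return {"status": "ok", "value": fn(), "attempts": attempt}
        except Exception as exc:
            last_error = exc
            if attempt < policy.max_attempts:
                sleep(policy.delay(attempt, rng))
    # Retries are exhausted: only now does the on-error action apply.
    if on_error is OnError.FAIL:
        raise last_error
    value = fallback if on_error is OnError.FALLBACK else None
    return {"status": on_error.value, "value": value,
            "attempts": policy.max_attempts, "error": str(last_error)}


calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient")
    return "done"


def always_fails():
    raise RuntimeError("boom")


no_sleep = lambda seconds: None  # skip real waiting in the example
recovered = run_node(flaky, RetryPolicy(max_attempts=3), OnError.FAIL, sleep=no_sleep)
skipped = run_node(always_fails, RetryPolicy(max_attempts=2), OnError.SKIP, sleep=no_sleep)
```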

Schemas

JSON Schema on every node port, checked at design time and at runtime.

Schema authoring

Build nested schemas with a form, paste raw JSON Schema, or infer one from a sample. Set field types, required fields, and constraints. The Python SDK accepts plain Pydantic models in the same slots.

Design-time edge validation

The editor flags edges between incompatible ports before you ever run, showing the path that doesn't match. The body of a Map narrows its inputs against the body node's input schema.
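A simplified sketch of a design-time compatibility check between two ports. It covers only `type`, `properties`, `required`, and `items`, which is a small subset of JSON Schema; `find_mismatches` is a hypothetical helper, and the editor's real rules are likely richer.

```python
def find_mismatches(source: dict, target: dict, path: str = "$") -> list:
    """Return the paths where a source port's schema cannot satisfy the target's."""
    source_type = source.get("type")
    target_type = target.get("type")
    if target_type is not None and source_type != target_type:
        # An integer is acceptable wherever a number is expected.
        if not (source_type == "integer" and target_type == "number"):
            return [f"{path}: expected {target_type}, got {source_type}"]

    problems = []
    if target_type == "object":
        source_props = source.get("properties", {})
        guaranteed = set(source.get("required", [])) & set(source_props)
        for name in target.get("required", []):
            if name not in guaranteed:
                problems.append(
                    f"{path}.{name}: required by target, not guaranteed by source"
                )
        for name, sub_target in target.get("properties", {}).items():
            if name in source_props:
                problems.extend(
                    find_mismatches(source_props[name], sub_target, f"{path}.{name}")
                )
    elif target_type == "array" and "items" in target:
        problems.extend(
            find_mismatches(source.get("items", {}), target["items"], f"{path}[]")
        )
    return problems


extract_output = {
    "type": "object",
    "properties": {"vendor": {"type": "string"}, "total": {"type": "string"}},
    "required": ["vendor", "total"],
}
validate_input = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["vendor", "total", "currency"],
}

mismatches = find_mismatches(extract_output, validate_input)
```

Each entry names the path that fails, which is the information an editor needs to highlight the offending field on the edge.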

Runtime validation

Data is validated against each node's input schema before delivery. Failures route through the edge's on-error action rather than crashing the whole processing.

Triggers, integrations & hosting

How rows get in, how results get out, where it all runs.

Pipeline triggers

New rows land in an input dataset via webhook, API, file-store watch, or Google Forms, and the pipeline processes them automatically. Ad-hoc invocation runs a pipeline once without persisting.

Processing callbacks

Subscribe a URL and the server POSTs the final result (or failure) when a processing finishes. Useful when an integrator is waiting on a long batch.
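On the receiving side, a callback endpoint can be as small as the sketch below, built on the standard library's `http.server`. The payload fields (`processing_id`, `status`, `error`) are assumptions for illustration; the actual callback body is not documented above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_callback(body: bytes) -> str:
    """Turn a callback payload into a log line (payload shape is assumed)."""
    event = json.loads(body)
    if event.get("status") == "failed":
        return f"processing {event['processing_id']} failed: {event.get('error')}"
    return f"processing {event['processing_id']} finished"


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        message = handle_callback(self.rfile.read(length))
        self.log_message("%s", message)
        self.send_response(204)  # acknowledge with no body
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Run the receiver until interrupted; call this from your own entry point."""
    HTTPServer(("127.0.0.1", port), CallbackHandler).serve_forever()


ok_message = handle_callback(
    b'{"processing_id": "p_1", "status": "succeeded", "result": {"rows": 3}}'
)
failed_message = handle_callback(
    b'{"processing_id": "p_2", "status": "failed", "error": "budget exceeded"}'
)
```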

External mirrors & self-hosting

Mirror an output dataset to PostgreSQL (more sinks planned). Self-host on FastAPI + Postgres with your own provider keys; files live on the local filesystem or in S3, so stored data stays in infrastructure you control.

Python SDK

For engineers who'd rather author in code, or who are wiring Substructure into an app or CI. The editor and SDK emit the same schema-pinned document, so you can move between them.

• Pass a Pydantic BaseModel as output_schema (Pydantic is optional).
• Sync and async clients trigger processings and read datasets and rows.
• Validate locally against the canonical schema before you ship.

from pydantic import BaseModel
from substructure_sdk import Pipeline

# Output schema for the LLM node, declared as a plain Pydantic model.
class Invoice(BaseModel):
    vendor: str
    total: float

pipe = Pipeline("invoices")
src = pipe.input("src")
extract = pipe.llm(
    "extract",
    model="gemini-3-flash-preview",
    user_prompt="Extract fields from: {{input.text}}",
    output_schema=Invoice,
)
out = pipe.output("out")
# Wire input -> LLM node -> output.
pipe.connect(src, extract).connect(extract, out)

Have documents to turn into data?

The self-hosted version is rolling out first. Get in touch to talk about access or your use case.