Technical Reference

How LiteSeed works.

A complete technical overview of the Blueprint engine, distribution system, constraint types, export pipeline, and data masking — for teams that want to understand the system before they build on it.

Generation Modes

One engine. Two distinct modes.

Both modes share the same deterministic Blueprint engine, constraint system, and export pipeline. The difference lies in variance policy, noise injection, and labelling behaviour.

Mode A

Sandbox Mode

Test data for engineering teams

Variance level: Low — production-close distributions
Edge case rate: 1% — minimal outlier injection
Constraint mode: Hard-enforced — zero invalid rows
Noise: Disabled — clean, predictable output
Reproducibility: Full — same seed → same dataset

Used by software teams, QA engineers, and DevOps for staging environments, integration tests, and demo data — without exposing real customer records.

Mode B

Training Mode

Labelled data for ML pipelines

Variance level: High — diverse distribution coverage
Edge case rate: 5% — controlled outlier injection
Constraint mode: Soft-tracked — up to 10% violation rate
Noise: OCR mutation (O↔0, l↔1, swap) at 2%
Labelling: Ground-truth via blueprint label_fields

Used by ML engineers and AI researchers for LLM fine-tuning, classifier training, and NLP pipelines — with built-in label distribution reports per run.
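The OCR noise pass can be illustrated with a short sketch. The function below is hypothetical (the real engine's API is not shown here); it applies the O↔0 and l↔1 confusions plus adjacent-character swaps at a configurable per-character rate, driven by a caller-supplied seeded RNG so runs stay reproducible.

```typescript
// Hypothetical sketch of the Training Mode OCR-style noise pass.
type Rng = () => number; // returns a float in [0, 1)

// Character confusions applied in both directions.
const CONFUSIONS: Record<string, string> = {
  O: "0", "0": "O",
  l: "1", "1": "l",
};

function applyOcrNoise(value: string, rate: number, rng: Rng): string {
  const chars = value.split("");
  for (let i = 0; i < chars.length; i++) {
    if (rng() >= rate) continue; // leave this character untouched
    const swapped = CONFUSIONS[chars[i]];
    if (swapped !== undefined) {
      chars[i] = swapped; // O↔0 / l↔1 confusion
    } else if (i + 1 < chars.length) {
      [chars[i], chars[i + 1]] = [chars[i + 1], chars[i]]; // adjacent swap
    }
  }
  return chars.join("");
}
```

With `rate` at 0 the string passes through unchanged; at the documented 2%, roughly one character in fifty mutates.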

Distribution Engine

Eight distribution types.

Every numeric and categorical field is backed by a fully parameterised distribution. All distributions are deterministic under the same seed and implemented in pure TypeScript with a Mulberry32 seeded RNG — no external library dependency.
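Mulberry32 is a well-known 32-bit seeded PRNG; a minimal TypeScript version looks like this (an illustrative implementation, not LiteSeed's source):

```typescript
// Mulberry32: a compact 32-bit seeded PRNG. The same seed always
// produces the same stream of floats in [0, 1), with no dependencies.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0; // force unsigned 32-bit state
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
```

Two generators built from the same seed emit identical streams, which is the property the distribution layer builds on.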

normal (μ, σ) — Gaussian — income, scores, measurements
lognormal (μ, σ) — Right-skewed — prices, latency, revenue
gamma (shape, scale) — Positive continuous — wait times, sizes
uniform (min, max) — Flat — IDs, percentages, random picks
poisson (λ) — Count data — events per interval
categorical (values, weights) — Enum fields — status, country, category
rare_event (p, base_value, rare_value) — Anomaly injection — fraud, outliers
mixture (components[]) — Multi-modal — bimodal prices, clusters

Blueprint Schema v1

Machine-readable.
Version-controlled.

Every dataset is defined by a Blueprint — a JSON document (Blueprint Schema v1) that specifies fields, generators, distributions, constraints, and mode policies. Blueprints are signed with a SHA-256 hash and versioned with a parent chain.
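Hash signing can be sketched as follows. The canonicalisation step is an assumption made for this example (object keys are sorted so that key order cannot change the digest); the real scheme may differ.

```typescript
import { createHash } from "crypto";

// Illustrative only: derive a stable hash for a blueprint document.
// Sorting keys makes the digest independent of JSON key order.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const entries = Object.keys(value as object)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalize((value as Record<string, unknown>)[k])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

function blueprintHash(blueprint: object): string {
  return createHash("sha256").update(canonicalize(blueprint)).digest("hex");
}
```

Two blueprints that differ only in key order then hash identically, which is what makes the hash usable as a version identity.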

Constraint Types

formula — gross = net + vat — Arithmetic consistency across fields
range — amount > 0 — Numeric bounds, hard or soft enforced
regex — iban ~ /DE[0-9]{20}/ — Pattern validation for string fields
date_order — due_date >= invoice_date — Temporal ordering constraints
not_null — customer_id != null — Mandatory field presence
enum_only — status ∈ {paid, pending} — Enum membership validation
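A minimal checker makes the hard/soft distinction concrete: a hard violation rejects the row, while a soft violation is only counted, matching the "soft tracked" policy of Training Mode. Everything below (the Constraint shape, checkRow) is hypothetical and covers only range and not_null.

```typescript
// Hypothetical mini-checker for two constraint kinds.
type Row = Record<string, unknown>;

interface Constraint {
  kind: "range" | "not_null";
  field: string;
  min?: number; // lower bound for "range" (exclusive)
  severity: "hard" | "soft";
}

function checkRow(row: Row, constraints: Constraint[]) {
  const violations: Constraint[] = [];
  for (const c of constraints) {
    const v = row[c.field];
    const ok =
      c.kind === "not_null"
        ? v !== null && v !== undefined
        : typeof v === "number" && v > (c.min ?? -Infinity);
    if (!ok) violations.push(c);
  }
  return {
    // Any hard violation rejects the row outright.
    accepted: !violations.some((c) => c.severity === "hard"),
    // Soft violations are tracked, not enforced.
    softViolations: violations.filter((c) => c.severity === "soft").length,
  };
}
```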

blueprint.schema.v1.json (excerpt)
{
  "schema_version": "1",
  "name": "Invoices",
  "fields": [
    {
      "name": "net_amount",
      "type": "float",
      "generator": { "kind": "from_distribution" },
      "distribution": {
        "kind": "lognormal",
        "params": { "mu": 4.7, "sigma": 0.7 }
      }
    },
    {
      "name": "status",
      "type": "enum",
      "generator": { "kind": "from_distribution" },
      "distribution": {
        "kind": "categorical",
        "params": {
          "values": ["paid","pending","failed"],
          "weights": [0.7, 0.2, 0.1]
        }
      }
    }
  ],
  "constraints": [
    {
      "kind": "formula",
      "expression": "gross = net_amount + vat",
      "severity": "hard"
    },
    {
      "kind": "range",
      "expression": "net_amount > 0",
      "severity": "hard"
    }
  ]
}

Export Pipeline

Seven export formats.
One generation run.

Every run produces all requested formats. CSV and JSONL stream directly via createWriteStream. Parquet writes row-by-row via appendRow(). No full dataset is held in memory at any point.

CSV — Streaming via createWriteStream, no RAM accumulation
JSONL — Streaming, one JSON object per line
Parquet — Columnar, appendRow() per chunk, S3 upload
JSON — Full array, suitable for small datasets
SQLite — Post-processing, queryable local database
SQL Dump — INSERT statements, portable schema + data
ZIP — All formats bundled, single download

Cloud-Assisted Mode

PII masking before
any LLM exposure.

When using Cloud-Assisted Blueprint Extraction, the seed sample is masked deterministically before being sent to the LLM. The assisted mode is opt-in and requires explicit confirmation in the UI.

Email → user{n}@example.com
Person name → Person_{n}
Company → Company_{n}
Phone → +499999000{n}
IBAN → DE00MASKED{n:014d}
Address → Street_{n} 1, 12345 City_{n}
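Deterministic masking can be sketched as a counter per distinct raw value, so the same input always yields the same placeholder within a run. The Masker class and its indexing scheme are assumptions made for illustration; only the email and IBAN patterns from the table above are shown.

```typescript
// Hypothetical deterministic masker: each distinct raw value gets a
// stable index n, so identical inputs map to identical placeholders.
class Masker {
  private index = new Map<string, number>();

  private n(kind: string, raw: string): number {
    const key = `${kind}:${raw}`;
    if (!this.index.has(key)) this.index.set(key, this.index.size + 1);
    return this.index.get(key)!;
  }

  email(raw: string): string {
    return `user${this.n("email", raw)}@example.com`;
  }

  iban(raw: string): string {
    // {n:014d}: the index zero-padded to 14 digits, per the table above.
    return `DE00MASKED${String(this.n("iban", raw)).padStart(14, "0")}`;
  }
}
```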

Note: LiteSeed does not claim GDPR compliance for the Assisted Mode. Only non-sensitive sample data should be used. The UI enforces an explicit opt-in before any data leaves your environment.

Reproducibility

Identical seed.
Identical dataset.

Every run is signed with a randomSeed and a blueprintHash. Re-run with the same parameters and get a bit-for-bit identical dataset.

Mulberry32 seeded RNG — no external library dependency

Reproduce any run in one click — copies seed, blueprint, and mode

randomSeed + blueprintHash stored per dataset version

Determinism verified by automated test suite (seed 42 == seed 42)
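The seed 42 == seed 42 property can be demonstrated end to end with the same Mulberry32 PRNG named above; generate below is a stand-in for the real engine, not its actual interface.

```typescript
// Mulberry32, as used throughout this document: seed -> uniform stream.
function mulberry32(seed: number): () => number {
  let s = seed >>> 0;
  return () => {
    s = (s + 0x6d2b79f5) >>> 0;
    let t = s;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Stand-in generator: derives every row from the seeded stream only,
// so the same (seed, rows) pair yields a byte-identical output.
function generate(seed: number, rows: number): string {
  const rng = mulberry32(seed);
  let out = "";
  for (let i = 0; i < rows; i++) out += `${i},${rng().toFixed(6)}\n`;
  return out;
}
```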

run_report.json
{
  "run_id": 1042,
  "random_seed": 1837291,
  "blueprint_hash": "a3f9c2...",
  "rows_generated": 100000,
  "distribution_summary": {
    "amount": {
      "type": "numeric",
      "min": 10.50,
      "max": 9987.20,
      "mean": 523.10,
      "stddev": 412.30
    },
    "status": {
      "type": "categorical",
      "counts": {
        "paid": 48210,
        "pending": 31020,
        "failed": 10770
      }
    }
  }
}

Ready to build?

Start free. No credit card required. Full platform access from day one.