Technical Reference
A complete technical overview of the Blueprint engine, distribution system, constraint types, export pipeline, and data masking — for teams that want to understand the system before they build on it.
Generation Modes
Both modes share the same deterministic Blueprint engine, constraint system, and export pipeline. They differ in variance policy, noise injection, and labelling behaviour.
Mode A
Test data for engineering teams
Used by software teams, QA engineers, and DevOps for staging environments, integration tests, and demo data — without exposing real customer records.
Mode B
Labelled data for ML pipelines
Used by ML engineers and AI researchers for LLM fine-tuning, classifier training, and NLP pipelines — with built-in label distribution reports per run.
Distribution Engine
Every numeric and categorical field is backed by a fully parameterised distribution. Sampling is deterministic under the same seed and is implemented in pure TypeScript on top of a Mulberry32 seeded RNG, with no external library dependency.
normal (μ, σ): Gaussian, e.g. income, scores, measurements
lognormal (μ, σ): right-skewed, e.g. prices, latency, revenue
gamma (shape, scale): positive continuous, e.g. wait times, sizes
uniform (min, max): flat, e.g. IDs, percentages, random picks
poisson (λ): count data, e.g. events per interval
categorical (values, weights): enum fields, e.g. status, country, category
rare_event (p, base_value, rare_value): anomaly injection, e.g. fraud, outliers
mixture (components[ ]): multi-modal, e.g. bimodal prices, clusters
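As a rough illustration of how several of the distributions above can be sampled from a single uniform source, here is a minimal sketch. It is not the engine's actual code: the function names, the `Rng` type, and the choice of Box-Muller and Knuth's method are assumptions made for this example.

```typescript
// Any function returning a uniform float in [0, 1). A seeded generator makes runs repeatable.
type Rng = () => number;

// uniform(min, max): flat over [min, max).
function sampleUniform(rng: Rng, min: number, max: number): number {
  return min + (max - min) * rng();
}

// normal(mu, sigma) via the Box-Muller transform.
function sampleNormal(rng: Rng, mu: number, sigma: number): number {
  const u1 = 1 - rng(); // shift to (0, 1] so Math.log never receives 0
  const u2 = rng();
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return mu + sigma * z;
}

// lognormal(mu, sigma): exponentiate a normal draw, so the result is always positive.
function sampleLognormal(rng: Rng, mu: number, sigma: number): number {
  return Math.exp(sampleNormal(rng, mu, sigma));
}

// poisson(lambda) using Knuth's multiplication method (adequate for small lambda).
function samplePoisson(rng: Rng, lambda: number): number {
  const limit = Math.exp(-lambda);
  let count = 0;
  let product = rng();
  while (product > limit) {
    count += 1;
    product *= rng();
  }
  return count;
}

// categorical(values, weights): walk the cumulative weights until the draw is covered.
function sampleCategorical<T>(rng: Rng, values: T[], weights: number[]): T {
  const total = weights.reduce((sum, w) => sum + w, 0);
  let threshold = rng() * total;
  for (let i = 0; i < values.length; i++) {
    threshold -= weights[i];
    if (threshold < 0) return values[i];
  }
  return values[values.length - 1]; // guard against floating-point leftovers
}

// rare_event(p, base_value, rare_value): emit rare_value with probability p.
function sampleRareEvent<T>(rng: Rng, p: number, baseValue: T, rareValue: T): T {
  return rng() < p ? rareValue : baseValue;
}
```

Because every sampler takes the random source as a parameter, passing the same seeded generator in the same call order yields the same values, which is the property the engine relies on.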
Blueprint Schema v1
Every dataset is defined by a Blueprint: a JSON document (schema version 1) that specifies fields, generators, distributions, constraints, and mode policies. Each Blueprint version is signed with a SHA-256 hash and linked to its parent version, forming a version chain.
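To make the idea of a hash-signed, parent-chained Blueprint concrete, the sketch below hashes a canonical serialisation of the Blueprint together with its parent's hash. The canonicalisation rule (sorted keys), the payload layout, and the names `canonicalJson` and `hashBlueprint` are assumptions for illustration; the source does not specify how the hash input is built.

```typescript
import { createHash } from "node:crypto";

// Serialise with object keys sorted so the same content always yields the same string.
// Assumes plain JSON input (no undefined values, functions, or cycles).
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map((item) => canonicalJson(item)).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    const entries = Object.keys(record)
      .sort()
      .map((key) => JSON.stringify(key) + ":" + canonicalJson(record[key]));
    return "{" + entries.join(",") + "}";
  }
  return JSON.stringify(value);
}

// Hypothetical scheme: the hash covers the Blueprint body plus the parent's hash
// (null for the first version), so each version commits to its whole history.
function hashBlueprint(blueprint: object, parentHash: string | null): string {
  const payload = canonicalJson({ blueprint, parent: parentHash });
  return createHash("sha256").update(payload).digest("hex");
}

const v1 = { schema_version: "1", name: "Invoices", fields: [] };
const v1Hash = hashBlueprint(v1, null);
const v2Hash = hashBlueprint({ ...v1, name: "Invoices EU" }, v1Hash);
console.log(v1Hash, v2Hash);
```

Sorting keys before hashing means two Blueprints that differ only in key order get the same hash, while any change to content or ancestry produces a different one.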
Constraint Types
formula (e.g. gross = net + vat): arithmetic consistency across fields
range (e.g. amount > 0): numeric bounds, enforced as hard or soft
regex (e.g. iban ~ /DE[0-9]{20}/): pattern validation for string fields
date_order (e.g. due_date >= invoice_date): temporal ordering constraints
not_null (e.g. customer_id != null): mandatory field presence
enum_only (e.g. status ∈ {paid, pending}): enum membership validation
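The constraint types above can be pictured as per-row checks grouped by severity. The following is a minimal sketch under assumed shapes: the real Blueprint stores constraints as expression strings, whereas this example uses pre-parsed objects, and `Constraint`, `holds`, and `validateRow` are hypothetical names.

```typescript
type Row = Record<string, unknown>;
type Severity = "hard" | "soft";

// Illustrative, pre-parsed constraint shapes (not the Blueprint's on-disk format).
type Constraint =
  | { kind: "formula"; check: (row: Row) => boolean; severity: Severity }
  | { kind: "range"; field: string; min: number; exclusive: boolean; severity: Severity }
  | { kind: "regex"; field: string; pattern: RegExp; severity: Severity }
  | { kind: "date_order"; earlier: string; later: string; severity: Severity }
  | { kind: "not_null"; field: string; severity: Severity }
  | { kind: "enum_only"; field: string; allowed: string[]; severity: Severity };

function holds(row: Row, c: Constraint): boolean {
  switch (c.kind) {
    case "formula":
      return c.check(row);
    case "range": {
      const v = row[c.field];
      return typeof v === "number" && (c.exclusive ? v > c.min : v >= c.min);
    }
    case "regex": {
      const v = row[c.field];
      return typeof v === "string" && c.pattern.test(v);
    }
    case "date_order": {
      // ISO 8601 dates (YYYY-MM-DD) parse to comparable timestamps.
      const a = Date.parse(String(row[c.earlier]));
      const b = Date.parse(String(row[c.later]));
      return !Number.isNaN(a) && !Number.isNaN(b) && b >= a;
    }
    case "not_null":
      return row[c.field] !== null && row[c.field] !== undefined;
    case "enum_only": {
      const v = row[c.field];
      return typeof v === "string" && c.allowed.includes(v);
    }
  }
}

// Returns the violated constraints, split by severity.
function validateRow(row: Row, constraints: Constraint[]) {
  const failed = constraints.filter((c) => !holds(row, c));
  return {
    hard: failed.filter((c) => c.severity === "hard"),
    soft: failed.filter((c) => c.severity === "soft"),
  };
}

const constraints: Constraint[] = [
  { kind: "formula", severity: "hard",
    check: (r) => Math.abs(Number(r.gross) - (Number(r.net) + Number(r.vat))) < 0.005 },
  { kind: "range", field: "net", min: 0, exclusive: true, severity: "hard" },
  { kind: "regex", field: "iban", pattern: /^DE[0-9]{20}$/, severity: "soft" },
  { kind: "date_order", earlier: "invoice_date", later: "due_date", severity: "hard" },
  { kind: "not_null", field: "customer_id", severity: "hard" },
  { kind: "enum_only", field: "status", allowed: ["paid", "pending"], severity: "hard" },
];

const result = validateRow(
  { gross: 119, net: 100, vat: 19, iban: "DE123", invoice_date: "2024-03-01",
    due_date: "2024-03-31", customer_id: 7, status: "paid" },
  constraints,
);
console.log(result.hard.length, result.soft.length);
```

How the engine reacts to a violation (for example regenerating a row on a hard failure versus only reporting a soft one) is not described in the source, so the sketch stops at classification.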
{
  "schema_version": "1",
  "name": "Invoices",
  "fields": [
    {
      "name": "net_amount",
      "type": "float",
      "generator": { "kind": "from_distribution" },
      "distribution": {
        "kind": "lognormal",
        "params": { "mu": 4.7, "sigma": 0.7 }
      }
    },
    {
      "name": "status",
      "type": "enum",
      "generator": { "kind": "from_distribution" },
      "distribution": {
        "kind": "categorical",
        "params": {
          "values": ["paid", "pending", "failed"],
          "weights": [0.7, 0.2, 0.1]
        }
      }
    }
  ],
  "constraints": [
    {
      "kind": "formula",
      "expression": "gross = net_amount + vat",
      "severity": "hard"
    },
    {
      "kind": "range",
      "expression": "net_amount > 0",
      "severity": "hard"
    }
  ]
}
Export Pipeline
Every run produces all requested formats. CSV and JSONL stream directly via createWriteStream. Parquet writes row-by-row via appendRow(). No full dataset is held in memory at any point.
CSV: streaming via createWriteStream, no RAM accumulation
JSONL: streaming, one JSON object per line
Parquet: columnar, appendRow() per chunk, S3 upload
JSON: full array, suitable for small datasets
SQLite: post-processing, queryable local database
SQL Dump: INSERT statements, portable schema + data
ZIP: all formats bundled, single download
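The streaming CSV path can be sketched as follows: rows are turned into lines lazily and written through Node's `createWriteStream`, waiting for `drain` whenever the stream's buffer fills up. The helper names (`csvCell`, `csvLines`, `writeCsv`) and the quoting rules are assumptions for this example, not the product's exporter.

```typescript
import { createWriteStream } from "node:fs";
import { once } from "node:events";

type Cell = string | number | boolean | null;
type Row = Record<string, Cell>;

// Quote a cell only when it contains a comma, a double quote, or a line break.
function csvCell(value: Cell): string {
  const text = value === null ? "" : String(value);
  return /[",\r\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
}

// Lazily turn rows into CSV lines: header first, then one line per row.
function* csvLines(columns: string[], rows: Iterable<Row>): Generator<string> {
  yield columns.map(csvCell).join(",") + "\n";
  for (const row of rows) {
    yield columns.map((c) => csvCell(row[c] ?? null)).join(",") + "\n";
  }
}

// Write lines as they are produced, pausing whenever the stream's buffer is full,
// so memory use stays flat regardless of how many rows the iterable yields.
async function writeCsv(path: string, columns: string[], rows: Iterable<Row>): Promise<void> {
  const out = createWriteStream(path);
  for (const line of csvLines(columns, rows)) {
    if (!out.write(line)) await once(out, "drain");
  }
  out.end();
  await once(out, "finish");
}
```

Passing a generator as `rows` keeps the whole path lazy: a row exists in memory only between being generated and being handed to the stream.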
Cloud-Assisted Mode
When using Cloud-Assisted Blueprint Extraction, the seed sample is masked deterministically before being sent to the LLM. Masking is opt-in and requires explicit confirmation in the UI.
user{n}@example.com
Person_{n}
Company_{n}
+499999000{n}
DE00MASKED{n:014d}
Street_{n} 1, 12345 City_{n}
Note: LiteSeed does not claim GDPR compliance for the Assisted Mode. Only non-sensitive sample data should be used. The UI enforces an explicit opt-in before any data leaves your environment.
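One way to read "masked deterministically" is that each distinct original value is assigned a stable index n, and the same value always maps to the same placeholder. The sketch below follows that reading for four of the patterns listed above. The `Masker` class, the kind names, counters starting at 1, and the interpretation of `{n:014d}` as zero-padding to 14 digits are all assumptions.

```typescript
type MaskKind = "email" | "person" | "company" | "iban";

// Placeholder builders following the listed patterns; n is a per-kind counter.
const placeholders: Record<MaskKind, (n: number) => string> = {
  email: (n) => `user${n}@example.com`,
  person: (n) => `Person_${n}`,
  company: (n) => `Company_${n}`,
  iban: (n) => `DE00MASKED${String(n).padStart(14, "0")}`,
};

class Masker {
  private seen = new Map<string, string>();
  private counters = new Map<MaskKind, number>();

  // The first occurrence of a value gets the next index for its kind;
  // repeats reuse the stored placeholder, so relationships between rows survive.
  mask(kind: MaskKind, original: string): string {
    const key = `${kind}:${original}`;
    const known = this.seen.get(key);
    if (known !== undefined) return known;
    const n = (this.counters.get(kind) ?? 0) + 1;
    this.counters.set(kind, n);
    const masked = placeholders[kind](n);
    this.seen.set(key, masked);
    return masked;
  }
}

const masker = new Masker();
const sample = ["anna@firma.de", "ben@firma.de", "anna@firma.de"];
const maskedEmails = sample.map((email) => masker.mask("email", email));
console.log(maskedEmails);
```

Because indices are assigned in order of first appearance, masking the same sample twice with a fresh `Masker` gives the same output, and no original value is needed to interpret the result.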
Reproducibility
Every run is signed with a randomSeed and a blueprintHash. Re-run with the same parameters and get a bit-for-bit identical dataset.
Mulberry32 seeded RNG — no external library dependency
Reproduce any run in one click — copies seed, blueprint, and mode
generationSeed + blueprintHash stored per dataset version
Determinism verified by automated test suite (seed 42 == seed 42)
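For reference, the widely circulated Mulberry32 algorithm fits in a few lines. The sketch below shows it together with the kind of "same seed, same output" check the list above mentions. The `generate` function is a hypothetical stand-in for a generation run, not the product's generator.

```typescript
// Mulberry32: a small 32-bit seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical stand-in for a run: n pseudo-random amounts derived from one seed.
function generate(seed: number, n: number): number[] {
  const rng = mulberry32(seed);
  return Array.from({ length: n }, () => Math.round(rng() * 100000) / 100);
}

// Serialise both runs and compare the strings, mirroring a byte-level comparison.
const first = JSON.stringify(generate(42, 1000));
const second = JSON.stringify(generate(42, 1000));
const other = JSON.stringify(generate(43, 1000));
console.log(first === second, first === other);
```

All state lives in a single 32-bit integer derived from the seed, so storing the seed alongside the Blueprint hash is enough to replay the random stream.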
{
  "run_id": 1042,
  "random_seed": 1837291,
  "blueprint_hash": "a3f9c2...",
  "rows_generated": 100000,
  "distribution_summary": {
    "amount": {
      "type": "numeric",
      "min": 10.50,
      "max": 9987.20,
      "mean": 523.10,
      "stddev": 412.30
    },
    "status": {
      "type": "categorical",
      "counts": {
        "paid": 48210,
        "pending": 31020,
        "failed": 10770
      }
    }
  }
}
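A distribution summary like the one in the run record above can be computed in a single pass without keeping rows in memory. The sketch below uses Welford's algorithm for the numeric fields and a plain counter for categorical ones. The class names are hypothetical, and whether the report uses the sample (n - 1) or population (n) standard deviation is not stated in the source, so the sample form is an assumption.

```typescript
// Running numeric summary using Welford's algorithm: one pass, constant memory.
class NumericSummary {
  private n = 0;
  private mean = 0;
  private m2 = 0; // running sum of squared deviations from the mean
  private min = Infinity;
  private max = -Infinity;

  add(x: number): void {
    this.n += 1;
    const delta = x - this.mean;
    this.mean += delta / this.n;
    this.m2 += delta * (x - this.mean);
    if (x < this.min) this.min = x;
    if (x > this.max) this.max = x;
  }

  result() {
    // Assumption: sample standard deviation (divides by n - 1).
    const stddev = this.n > 1 ? Math.sqrt(this.m2 / (this.n - 1)) : 0;
    return { type: "numeric" as const, min: this.min, max: this.max, mean: this.mean, stddev };
  }
}

// Running value counts for a categorical field.
class CategoricalSummary {
  private counts: Record<string, number> = {};

  add(value: string): void {
    this.counts[value] = (this.counts[value] ?? 0) + 1;
  }

  result() {
    return { type: "categorical" as const, counts: { ...this.counts } };
  }
}

const amount = new NumericSummary();
const status = new CategoricalSummary();
const rows = [
  { amount: 2, status: "paid" }, { amount: 4, status: "paid" },
  { amount: 4, status: "pending" }, { amount: 4, status: "paid" },
  { amount: 5, status: "failed" }, { amount: 5, status: "paid" },
  { amount: 7, status: "pending" }, { amount: 9, status: "paid" },
];
for (const row of rows) {
  amount.add(row.amount);
  status.add(row.status);
}
console.log({ amount: amount.result(), status: status.result() });
```

Updating the summaries as each row is emitted fits the streaming export model, since the statistics never require a second pass over the data.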