Product
Generate datasets on demand
Create structured datasets for training, testing and evaluation without being limited to what the real world already provides.
Why it matters
Beyond real-world data collection
Real-world datasets are expensive to collect, difficult to share and often missing the scenarios models actually need. LiteSeed lets teams define the dataset they need and generate it on demand — at any scale, with full control over structure, distributions and edge cases.
No data collection bottleneck
Generate training data without waiting for real-world collection pipelines.
Full schema control
Define field types, distributions, constraints and relationships in a Blueprint.
Any scale
Generate thousands or millions of rows using a chunk-based streaming architecture.
Core capabilities
Blueprint-driven generation
Define the schema of your dataset in a Blueprint — a versioned JSON document that specifies field types, distributions, constraints and generation policies.
- 10+ field types: numeric, categorical, string, boolean, date, UUID, computed
- 8 statistical distributions: normal, lognormal, gamma, uniform, poisson, categorical, rare_event, mixture
- Computed fields with dependency resolution
- Versioned Blueprints with parent-child lineage
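To make the Blueprint concept concrete, here is a minimal sketch written as a TypeScript object literal. The key names (version, parent, fields, distribution, expr) are illustrative assumptions, not LiteSeed's exact schema:

```ts
// Hypothetical Blueprint sketch; key names are assumptions, not the exact schema.
const blueprint = {
  version: "1.2.0",
  parent: "bp_9f3a", // parent Blueprint id, giving parent-child lineage
  fields: [
    { name: "order_id", type: "uuid" },
    {
      name: "amount",
      type: "numeric",
      distribution: { kind: "lognormal", mu: 3.2, sigma: 0.8 },
    },
    {
      name: "status",
      type: "categorical",
      values: ["paid", "refunded", "failed"],
      weights: [0.9, 0.07, 0.03],
    },
    { name: "created_at", type: "date", min: "2024-01-01", max: "2024-12-31" },
    // Computed field: resolved after its dependency `amount` is generated.
    { name: "amount_with_tax", type: "computed", expr: "amount * 1.2" },
  ],
};
```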
Constraint system
Enforce business rules and data validity through a two-tier constraint system.
- Hard constraints: reject and resample rows that violate rules (up to 50 retries)
- Soft constraints: track violations without blocking generation
- Constraint types: formula, range, regex, date_order, not_null, enum_only
- Configurable violation rate thresholds per mode
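The two tiers behave differently at generation time: hard constraints gate each row, while soft constraints only count. A minimal sketch of that loop, with assumed names (Constraint, generateRow) rather than the actual API:

```ts
type Row = Record<string, unknown>;
type Constraint = { name: string; check: (row: Row) => boolean };

const MAX_RETRIES = 50; // matches the hard-constraint retry cap above

// Sketch of the two-tier check; names are assumptions, not the actual API.
function generateRow(
  sample: () => Row,
  hard: Constraint[],
  soft: Constraint[],
  softViolations: Map<string, number>
): Row {
  for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
    const row = sample();
    // Hard tier: any failed rule rejects the row and triggers a resample.
    if (!hard.every((c) => c.check(row))) continue;
    // Soft tier: failures are tallied but never block the row.
    for (const c of soft) {
      if (!c.check(row)) {
        softViolations.set(c.name, (softViolations.get(c.name) ?? 0) + 1);
      }
    }
    return row;
  }
  throw new Error(`Row still violates hard constraints after ${MAX_RETRIES} retries`);
}
```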
Two generation modes
Sandbox mode generates test data for software development. Training mode generates ML/AI training data at scale — higher volume, edge-case injection and rare-event distributions.
- Sandbox: up to 10,000 rows, max 1% soft violation rate, fast iteration
- Training: up to 1,000,000 rows, edge-case injection, OCR noise mutation
- Rare event and mixture distributions for underrepresented scenarios
- Mode selection does not modify the Blueprint
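As a rough picture of how the two modes could be expressed as presets (option names here are assumptions; the limits mirror the figures above):

```ts
// Illustrative mode presets; option names are assumptions, the limits
// mirror the figures listed above.
const MODES = {
  sandbox: {
    maxRows: 10_000,
    maxSoftViolationRate: 0.01,
    edgeCaseInjection: false,
  },
  training: {
    maxRows: 1_000_000,
    maxSoftViolationRate: 0.05, // assumed value; thresholds are configurable per mode
    edgeCaseInjection: true,
    ocrNoiseMutation: true,
  },
} as const;

// Mode is a parameter of the run, not of the schema, so switching modes
// never modifies the Blueprint itself.
const run = { blueprintId: "bp_9f3a", mode: "training" as const, rows: 250_000 };
```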
Streaming architecture
Generate large datasets without memory constraints using a chunk-based streaming pipeline.
- CHUNK_SIZE = 10,000 rows; no full dataset held in RAM
- CSV and JSONL written via streaming file writers
- Parquet written row-by-row via appendRow()
- Peak RAM ≤ 256 MB for 5M+ row runs
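A simplified sketch of the chunked CSV path (the generator and writer names are assumptions; CSV escaping and backpressure handling are omitted):

```ts
import { createWriteStream } from "node:fs";

const CHUNK_SIZE = 10_000;

// Stream rows to CSV one chunk at a time; only CHUNK_SIZE rows are ever
// in memory, so RAM stays flat as totalRows grows.
async function streamCsv(
  path: string,
  totalRows: number,
  generateChunk: (n: number) => Record<string, string>[] // stand-in for the row generator
): Promise<void> {
  const out = createWriteStream(path);
  let header: string[] | null = null;
  for (let written = 0; written < totalRows; written += CHUNK_SIZE) {
    const rows = generateChunk(Math.min(CHUNK_SIZE, totalRows - written));
    if (!header) {
      header = Object.keys(rows[0]);
      out.write(header.join(",") + "\n");
    }
    // Naive CSV join; a real writer would quote and escape values.
    out.write(rows.map((r) => header!.map((h) => r[h]).join(",")).join("\n") + "\n");
  }
  await new Promise<void>((resolve) => out.end(resolve));
}
```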