Product
Generate datasets on demand
Create structured datasets for training, testing and evaluation without being limited to what the real world already provides.
Why it matters
Beyond real-world data collection
Real-world datasets are expensive to collect, difficult to share and often missing the scenarios models actually need. LiteSeed lets teams define the dataset they need and generate it on demand — at any scale, with full control over structure, distributions and edge cases.
No data collection bottleneck
Generate training data without waiting for real-world collection pipelines.
Full schema control
Define field types, distributions, constraints and relationships in a Blueprint.
Any scale
Generate thousands or millions of rows using a chunk-based streaming architecture.
Core capabilities
Blueprint-driven generation
Define the schema of your dataset in a Blueprint — a versioned JSON document that specifies field types, distributions, constraints and generation policies.
- 10+ field types: numeric, categorical, string, boolean, date, UUID, computed
- 8 statistical distributions: normal, lognormal, gamma, uniform, poisson, categorical, rare_event, mixture
- Computed fields with dependency resolution
- Versioned Blueprints with parent-child lineage
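To make the shape concrete, a Blueprint along these lines might look like the sketch below. The key names and layout here are illustrative assumptions, not the actual Blueprint format:

```json
{
  "version": "1.0",
  "parent": null,
  "fields": [
    {"name": "order_id", "type": "uuid"},
    {"name": "amount", "type": "numeric",
     "distribution": {"kind": "lognormal", "mean": 3.2, "sigma": 0.8}},
    {"name": "status", "type": "categorical",
     "values": ["pending", "shipped", "returned"],
     "weights": [0.6, 0.35, 0.05]},
    {"name": "total_with_tax", "type": "computed", "formula": "amount * 1.08"}
  ]
}
```

A `parent` pointer of this kind is one way the parent-child lineage mentioned above could be recorded: each new version references the Blueprint it was derived from.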
Constraint system
Enforce business rules and data validity through a two-tier constraint system.
- Hard constraints: reject and resample rows that violate rules (up to 50 retries)
- Soft constraints: track violations without blocking generation
- Constraint types: formula, range, regex, date_order, not_null, enum_only
- Configurable violation rate thresholds per mode
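The hard-constraint path is essentially reject-and-resample. A minimal sketch of that loop, assuming a hypothetical row generator and a formula-style constraint (the names and fields are illustrative, not the product's API):

```python
import random

MAX_RETRIES = 50  # per the hard-constraint retry cap above

def generate_row(rng):
    # Hypothetical generator: a price and a discounted price.
    price = round(rng.uniform(1, 100), 2)
    discounted = round(rng.uniform(0, 120), 2)  # may violate the rule below
    return {"price": price, "discounted": discounted}

def satisfies_hard_constraints(row):
    # Example formula constraint: discounted <= price.
    return row["discounted"] <= row["price"]

def sample_valid_row(rng, max_retries=MAX_RETRIES):
    # Reject rows that violate hard constraints and resample,
    # up to the retry cap; then fail loudly rather than emit bad data.
    for _ in range(max_retries):
        row = generate_row(rng)
        if satisfies_hard_constraints(row):
            return row
    raise RuntimeError("hard constraint unsatisfied after max retries")

rng = random.Random(42)
row = sample_valid_row(rng)
```

Soft constraints would follow the same check but only increment a violation counter instead of resampling.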
Two generation modes
Sandbox mode generates test data for software development: realistic, schema-valid datasets for seeding databases, testing APIs and running QA. Training mode generates ML/AI training data at scale, with higher volume, edge-case injection and rare-event distributions.
- Sandbox: up to 10,000 rows, max 1% soft violation rate, fast iteration
- Training: up to 1,000,000 rows, edge-case injection, OCR noise mutation
- Rare event and mixture distributions for underrepresented scenarios
- Mode selection does not modify the Blueprint
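One plausible way to picture mode selection is as a preset applied at run time, never written back to the Blueprint. The structure and names below are assumptions for illustration, with the caps taken from the figures above:

```python
# Illustrative mode presets; values come from the product description,
# but the dict structure and key names are assumptions, not the real API.
MODES = {
    "sandbox": {"max_rows": 10_000, "max_soft_violation_rate": 0.01},
    "training": {"max_rows": 1_000_000, "edge_case_injection": True},
}

def validate_run(mode, requested_rows):
    # The preset caps the run; the Blueprint itself is never modified.
    preset = MODES[mode]
    if requested_rows > preset["max_rows"]:
        raise ValueError(f"{mode} mode caps runs at {preset['max_rows']} rows")
    return preset
```

Keeping the mode out of the Blueprint means the same schema can back both a quick sandbox fill and a million-row training run.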
Streaming architecture
Generate large datasets without memory constraints using a chunk-based streaming pipeline.
- CHUNK_SIZE = 10,000 rows — no full dataset held in RAM
- CSV and JSONL written via streaming file writers
- Parquet written row-by-row via appendRow()
- Peak RAM ≤ 256 MB for 5M+ row runs
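The chunking idea can be sketched with a lazy row generator and a streaming CSV writer: only one chunk is materialized at a time, so memory stays bounded regardless of total row count. This is a simplified illustration, not the product's pipeline:

```python
import csv
import io
import itertools

CHUNK_SIZE = 10_000  # rows per chunk, matching the figure above

def row_stream(total_rows):
    # Lazily yield rows one at a time; nothing is built up front.
    for i in range(total_rows):
        yield {"id": i, "value": i * 2}

def write_csv_streaming(rows, fh, fieldnames=("id", "value")):
    # Pull CHUNK_SIZE rows at a time from the stream and write them out,
    # so peak memory is bounded by one chunk, not the full dataset.
    writer = csv.DictWriter(fh, fieldnames=fieldnames)
    writer.writeheader()
    it = iter(rows)
    while True:
        chunk = list(itertools.islice(it, CHUNK_SIZE))
        if not chunk:
            break
        writer.writerows(chunk)

buf = io.StringIO()
write_csv_streaming(row_stream(25_000), buf)
```

JSONL streams the same way, one serialized line per row; a row-by-row Parquet append would replace the CSV writer in the loop body.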
Integration with evaluation and reproducibility
Dataset generation is the first step in a larger workflow. Every generated dataset can be linked to a Dataset version, evaluated for quality and reproduced exactly using the same Blueprint and seed.
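Exact reproduction follows from making generation a pure function of the Blueprint and a seed. A minimal sketch of that property, with a toy blueprint and generator standing in for the real ones:

```python
import random

def generate_dataset(blueprint, seed, n_rows):
    # Deterministic generation: the same blueprint and seed
    # always produce the identical sequence of rows.
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        row = {f["name"]: round(rng.uniform(*f["range"]), 4)
               for f in blueprint["fields"]}
        rows.append(row)
    return rows

blueprint = {"fields": [{"name": "amount", "range": [0, 100]}]}
run_a = generate_dataset(blueprint, seed=7, n_rows=100)
run_b = generate_dataset(blueprint, seed=7, n_rows=100)
# run_a == run_b: two runs with the same inputs are byte-identical.
```

Recording the Blueprint version and seed alongside each Dataset version is what makes an evaluated dataset auditable and regenerable later.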
