LiteSeed


Generate datasets on demand

Create structured datasets for training, testing and evaluation without being limited to what the real world already provides.

Why it matters

Beyond real-world data collection

Real-world datasets are expensive to collect, difficult to share and often missing the scenarios models actually need. LiteSeed lets teams define the dataset they need and generate it on demand — at any scale, with full control over structure, distributions and edge cases.

No data collection bottleneck

Generate training data without waiting for real-world collection pipelines.

Full schema control

Define field types, distributions, constraints and relationships in a Blueprint.

Any scale

Generate thousands or millions of rows using a chunk-based streaming architecture.

Core capabilities

Blueprint-driven generation

Define the schema of your dataset in a Blueprint — a versioned JSON document that specifies field types, distributions, constraints and generation policies.

  • 10+ field types: numeric, categorical, string, boolean, date, UUID, computed
  • 8 statistical distributions: normal, lognormal, gamma, uniform, poisson, categorical, rare_event, mixture
  • Computed fields with dependency resolution
  • Versioned Blueprints with parent-child lineage
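To make this concrete, here is a hedged sketch of what a Blueprint could look like. The key names and structure below are illustrative assumptions, not LiteSeed's actual schema; they only show how field types, distributions, lineage and a computed field with a dependency might fit into one versioned JSON document:

```json
{
  "version": "1.2.0",
  "parent": "1.1.0",
  "fields": [
    { "name": "order_id", "type": "uuid" },
    { "name": "amount", "type": "numeric",
      "distribution": { "kind": "lognormal", "mu": 3.0, "sigma": 0.8 } },
    { "name": "status", "type": "categorical",
      "distribution": { "kind": "categorical",
                        "values": ["paid", "pending", "refunded"],
                        "weights": [0.85, 0.12, 0.03] } },
    { "name": "created_at", "type": "date",
      "min": "2023-01-01", "max": "2024-12-31" },
    { "name": "net_amount", "type": "computed",
      "expr": "amount * 0.97", "depends_on": ["amount"] }
  ]
}
```

Because `net_amount` declares `depends_on`, a generator can resolve computed fields in dependency order before emitting each row.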

Constraint system

Enforce business rules and data validity through a two-tier constraint system.

  • Hard constraints: reject and resample rows that violate rules (up to 50 retries)
  • Soft constraints: track violations without blocking generation
  • Constraint types: formula, range, regex, date_order, not_null, enum_only
  • Configurable violation rate thresholds per mode
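The reject-and-resample behavior of hard constraints, and the track-but-don't-block behavior of soft constraints, can be sketched in a few lines of Python. The generator and constraint functions here are hypothetical stand-ins; only the 50-retry cap comes from the description above:

```python
import random

MAX_RETRIES = 50  # hard-constraint retry cap from the spec above

def generate_row(rng):
    # Hypothetical row generator with two numeric fields.
    return {"amount": rng.lognormvariate(3.0, 0.8),
            "fee": rng.uniform(0, 10)}

def hard_constraint(row):
    # Example "formula" constraint: a fee may not exceed the amount.
    return row["fee"] <= row["amount"]

def soft_constraint(row):
    # Example "range" constraint: tracked, never blocking.
    return row["amount"] < 500

def sample_row(rng, stats):
    """Resample until the hard constraint holds; count soft violations."""
    for _ in range(MAX_RETRIES):
        row = generate_row(rng)
        if hard_constraint(row):
            if not soft_constraint(row):
                stats["soft_violations"] += 1
            return row
    raise RuntimeError("hard constraint unsatisfied after 50 retries")

rng = random.Random(42)
stats = {"soft_violations": 0}
rows = [sample_row(rng, stats) for _ in range(1000)]
```

A configurable threshold on `stats["soft_violations"] / len(rows)` is then all that is needed to enforce a per-mode violation-rate limit such as Sandbox's 1%.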

Two generation modes

Sandbox mode generates test data for software development — realistic, schema-valid datasets for seeding databases, testing APIs and supporting QA. Training mode generates ML/AI training data at scale, with higher volume, edge-case injection and rare-event distributions.

  • Sandbox: up to 10,000 rows, max 1% soft violation rate, fast iteration
  • Training: up to 1,000,000 rows, edge-case injection, OCR noise mutation
  • Rare event and mixture distributions for underrepresented scenarios
  • Mode selection does not modify the Blueprint
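The rare-event and mixture distributions mentioned above can be illustrated with a small Python sketch. The component parameters and weights are made up for illustration; the point is only the sampling pattern — pick a component by weight, then draw from it:

```python
import random

def sample_rare_event(rng, base_value, rare_value, p_rare=0.001):
    """Rare-event distribution: mostly base, occasionally the rare value."""
    return rare_value if rng.random() < p_rare else base_value

def sample_mixture(rng, components, weights):
    """Mixture distribution: choose one component by weight, then sample it."""
    (component,) = rng.choices(components, weights=weights, k=1)
    return component(rng)

rng = random.Random(7)
components = [
    lambda r: r.gauss(100, 10),    # typical transactions
    lambda r: r.gauss(5000, 500),  # underrepresented high-value transactions
]
values = [sample_mixture(rng, components, [0.98, 0.02]) for _ in range(10_000)]
high_value = sum(v > 2500 for v in values)
```

With a 2% mixture weight, roughly 200 of the 10,000 samples land in the high-value component — enough to make an otherwise rare scenario learnable without distorting the bulk of the data.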

Streaming architecture

Generate large datasets without memory constraints using a chunk-based streaming pipeline.

  • CHUNK_SIZE = 10,000 rows — no full dataset held in RAM
  • CSV and JSONL written via streaming file writers
  • Parquet written row-by-row via appendRow()
  • Peak RAM ≤ 256 MB for 5M+ row runs
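The chunked pipeline can be sketched in Python: a generator yields one chunk of rows at a time, and a streaming writer flushes each chunk before the next is produced, so memory use is bounded by the chunk size rather than the dataset size. The row schema here is hypothetical; the 10,000-row chunk size matches the figure above:

```python
import csv
import io

CHUNK_SIZE = 10_000  # rows held in memory at once

def generate_chunk(start, size):
    """Yield rows for one chunk; earlier chunks are already written and freed."""
    for i in range(start, start + size):
        yield {"id": i, "value": i * 2}

def stream_csv(out, total_rows, chunk_size=CHUNK_SIZE):
    writer = csv.DictWriter(out, fieldnames=["id", "value"])
    writer.writeheader()
    for start in range(0, total_rows, chunk_size):
        size = min(chunk_size, total_rows - start)
        writer.writerows(generate_chunk(start, size))

buf = io.StringIO()
stream_csv(buf, 25_000)
line_count = len(buf.getvalue().splitlines())  # header + 25,000 rows
```

JSONL streaming follows the same shape with a line-per-row writer; formats like Parquet need an append-style writer since they cannot simply concatenate text lines.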

Integration with evaluation and reproducibility

Dataset generation is the first step in a larger workflow. Every generated dataset can be linked to a Dataset version, evaluated for quality and reproduced exactly using the same Blueprint and seed.
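Exact reproduction follows from making generation a pure function of the Blueprint and the seed. A minimal sketch, assuming a single seeded RNG drives all sampling (the distribution here is a placeholder, not LiteSeed's generator):

```python
import random

def generate_dataset(seed, n=5):
    """Same Blueprint + same seed => identical output, run after run."""
    rng = random.Random(seed)
    return [round(rng.lognormvariate(3.0, 0.8), 6) for _ in range(n)]

run_a = generate_dataset(seed=1234)
run_b = generate_dataset(seed=1234)   # exact reproduction
run_c = generate_dataset(seed=5678)   # a different seed diverges
```

Recording the Blueprint version and seed alongside each Dataset version is therefore enough to regenerate it bit-for-bit later, e.g. to re-run an evaluation.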