LiteSeed

Generate datasets on demand

Create structured datasets for training, testing and evaluation without being limited to what the real world already provides.


Why it matters

Beyond real-world data collection

Real-world datasets are expensive to collect, difficult to share and often missing the scenarios models actually need. LiteSeed lets teams define the dataset they need and generate it on demand — at any scale, with full control over structure, distributions and edge cases.

No data collection bottleneck

Generate training data without waiting for real-world collection pipelines.

Full schema control

Define field types, distributions, constraints and relationships in a Blueprint.

Any scale

Generate thousands or millions of rows using a chunk-based streaming architecture.

Core capabilities

Blueprint-driven generation

Define the schema of your dataset in a Blueprint — a versioned JSON document that specifies field types, distributions, constraints and generation policies.

  • 10+ field types: numeric, categorical, string, boolean, date, UUID, computed
  • 8 statistical distributions: normal, lognormal, gamma, uniform, poisson, categorical, rare_event, mixture
  • Computed fields with dependency resolution
  • Versioned Blueprints with parent-child lineage
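To make the Blueprint structure concrete, here is a minimal sketch of what such a versioned JSON document might look like, expressed as a Python dict. The key names (`version`, `parent_version`, `fields`, `distribution`, `depends_on`, etc.) are illustrative assumptions, not LiteSeed's actual schema.

```python
import json

# Hypothetical Blueprint shape: key names are assumptions for illustration.
blueprint = {
    "name": "orders",
    "version": 2,
    "parent_version": 1,  # parent-child lineage between Blueprint versions
    "fields": [
        {"name": "order_id", "type": "uuid"},
        {"name": "amount", "type": "numeric",
         "distribution": {"kind": "lognormal", "mu": 3.0, "sigma": 0.5}},
        {"name": "status", "type": "categorical",
         "values": ["placed", "shipped", "returned"],
         "weights": [0.7, 0.25, 0.05]},
        # A computed field references other fields; the generator resolves
        # the dependency order before sampling.
        {"name": "amount_with_tax", "type": "computed",
         "formula": "amount * 1.2", "depends_on": ["amount"]},
    ],
}

print(json.dumps(blueprint, indent=2))
```

Because the Blueprint is plain JSON, it can be diffed, versioned, and reviewed like any other configuration artifact.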

Constraint system

Enforce business rules and data validity through a two-tier constraint system.

  • Hard constraints: reject and resample rows that violate rules (up to 50 retries)
  • Soft constraints: track violations without blocking generation
  • Constraint types: formula, range, regex, date_order, not_null, enum_only
  • Configurable violation rate thresholds per mode
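The two-tier behaviour above can be sketched in a few lines: hard constraints trigger reject-and-resample with a retry cap, while soft constraints are merely counted. The row sampler and the specific rules here are hypothetical stand-ins, not LiteSeed's API.

```python
import random

MAX_RETRIES = 50  # matches the documented hard-constraint retry cap

def generate_row():
    # Hypothetical sampler standing in for the Blueprint-driven generators.
    return {"qty": random.randint(0, 10), "price": random.uniform(0, 100)}

def hard_ok(row):
    return row["qty"] > 0          # e.g. a range / not_null rule

def soft_ok(row):
    return row["price"] <= 90.0    # tracked, but never blocks generation

def generate(n):
    rows, soft_violations = [], 0
    for _ in range(n):
        for _ in range(MAX_RETRIES):
            row = generate_row()
            if hard_ok(row):
                break              # hard constraints: reject and resample
        else:
            raise RuntimeError("hard constraint unsatisfied after 50 retries")
        if not soft_ok(row):
            soft_violations += 1   # counted against the mode's rate threshold
        rows.append(row)
    return rows, soft_violations / n

rows, soft_rate = generate(1000)
```

Every emitted row satisfies the hard rules, while the returned soft-violation rate can be checked against the per-mode threshold.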

Two generation modes

Sandbox mode generates test data for software development. Training mode generates ML/AI training data at scale, with higher volume, edge-case injection and rare-event distributions.

  • Sandbox: up to 10,000 rows, max 1% soft violation rate, fast iteration
  • Training: up to 1,000,000 rows, edge-case injection, OCR noise mutation
  • Rare event and mixture distributions for underrepresented scenarios
  • Mode selection does not modify the Blueprint
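A minimal sketch of how the two modes' limits could be enforced, using the figures documented above. The config keys and the `validate_run` helper are hypothetical, not LiteSeed's actual interface.

```python
# Hypothetical per-mode limits mirroring the documented behaviour.
MODES = {
    "sandbox": {"max_rows": 10_000, "max_soft_violation_rate": 0.01,
                "edge_case_injection": False},
    "training": {"max_rows": 1_000_000, "max_soft_violation_rate": None,
                 "edge_case_injection": True, "ocr_noise_mutation": True},
}

def validate_run(mode, requested_rows):
    cfg = MODES[mode]
    if requested_rows > cfg["max_rows"]:
        raise ValueError(f"{mode} mode caps runs at {cfg['max_rows']} rows")
    # Note: mode selection configures the run only; the Blueprint itself
    # is never modified.
    return cfg
```

Keeping mode limits outside the Blueprint means the same schema can drive both fast sandbox iteration and large training runs.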

Streaming architecture

Generate large datasets without memory constraints using a chunk-based streaming pipeline.

  • CHUNK_SIZE = 10,000 rows — no full dataset held in RAM
  • CSV and JSONL written via streaming file writers
  • Parquet written row-by-row via appendRow()
  • Peak RAM ≤ 256 MB for 5M+ row runs
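The chunked pipeline can be sketched as follows: rows are produced lazily, grouped into chunks of 10,000, and streamed to disk so only one chunk is ever resident in memory. The row generator and file layout here are illustrative assumptions.

```python
import csv
import itertools

CHUNK_SIZE = 10_000  # matches the documented chunk size

def row_stream(total):
    # Hypothetical lazy row generator; yields one row at a time.
    for i in range(total):
        yield {"id": i, "value": i * 2}

def chunks(iterable, size):
    # Group a lazy iterable into lists of at most `size` items.
    it = iter(iterable)
    while chunk := list(itertools.islice(it, size)):
        yield chunk

def write_csv_streaming(path, total):
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "value"])
        writer.writeheader()
        for chunk in chunks(row_stream(total), CHUNK_SIZE):
            writer.writerows(chunk)  # only one chunk held in RAM at a time

write_csv_streaming("out.csv", 25_000)
```

The same pattern extends to JSONL (write one serialized line per row) and to row-wise Parquet appends; in all cases peak memory is bounded by the chunk size rather than the dataset size.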
