LiteSeed

Technology

Performance targets and architecture

LiteSeed is designed to generate large datasets within a bounded memory footprint, using a chunk-based streaming architecture with explicit performance targets.

Performance targets

  • Generation throughput: ≥ 50,000 rows/sec
  • Peak RAM: ≤ 256 MB on 5M+ row runs
  • Chunk size: 10,000 rows/chunk
  • Single run capacity: 5M+ rows

Streaming architecture

Chunk-based generation loop

The generation engine processes datasets in chunks of 10,000 rows. After each chunk is handed off, the chunk array reference is dropped so normal garbage collection can reclaim it; no manual GC hooks are needed.

  • CHUNK_SIZE = 10,000 rows
  • Chunk array cleared after each chunk
  • No --expose-gc flag required
  • Progress updated every CHUNK_SIZE rows
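The loop above can be sketched as a generator that yields one chunk at a time and then drops the reference. This is an illustrative sketch, not LiteSeed's actual code; `generateChunks`, `Row`, and the row factory are assumed names.

```typescript
// Hypothetical sketch of the chunk-based generation loop.
const CHUNK_SIZE = 10_000;

type Row = Record<string, unknown>;

function* generateChunks(
  totalRows: number,
  generateRow: (i: number) => Row,
): Generator<Row[]> {
  let chunk: Row[] = [];
  for (let i = 0; i < totalRows; i++) {
    chunk.push(generateRow(i));
    if (chunk.length === CHUNK_SIZE) {
      yield chunk; // hand the full chunk to the writer
      chunk = [];  // drop the reference; normal GC reclaims it
    }
  }
  if (chunk.length > 0) yield chunk; // final partial chunk
}
```

Because only one chunk is alive at a time, peak memory stays proportional to CHUNK_SIZE rather than to the total row count.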

Streaming file writers

CSV and JSONL exports are written via Node.js createWriteStream — rows are written as they are generated, not after generation completes.

  • CSV: header written once, rows streamed
  • JSONL: one JSON object per line, streamed
  • Parquet: appendRow() via @dsnp/parquetjs
  • Temp files uploaded to S3 after generation
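A minimal CSV variant of this pattern, assuming a simplified writer with no quoting or backpressure handling (the function and parameter names are illustrative, not LiteSeed's API):

```typescript
import { createWriteStream } from "node:fs";

// Hypothetical sketch of a streaming CSV writer: the header is written once,
// then each chunk of rows is appended as it is generated. Escaping and
// backpressure handling are omitted for brevity.
function writeCsvChunks(
  path: string,
  columns: string[],
  chunks: Iterable<Record<string, unknown>[]>,
): Promise<void> {
  return new Promise((resolve, reject) => {
    const out = createWriteStream(path);
    out.on("error", reject);
    out.on("finish", resolve);
    out.write(columns.join(",") + "\n"); // header written once
    for (const chunk of chunks) {
      for (const row of chunk) {
        out.write(columns.map((c) => String(row[c] ?? "")).join(",") + "\n");
      }
    }
    out.end(); // flush and close; the full dataset was never held in memory
  });
}
```

A JSONL writer follows the same shape, emitting `JSON.stringify(row) + "\n"` per row instead of a comma-joined line.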

Incremental metrics

Quality metrics are computed incrementally during generation using the metricsAccumulator — no post-generation dataset scan.

  • row_count, soft_constraint_violations
  • Numeric: min, max, sum, sumSquares, count
  • Categorical: value counts per field
  • mean, variance, stddev derived at completion
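An accumulator of this kind can be sketched as follows. The class and field names here are assumptions for illustration; the key point is that each row updates running aggregates, so mean, variance, and stddev can be derived at completion without a second pass.

```typescript
// Hypothetical sketch of an incremental metrics accumulator.
interface NumericStats {
  min: number; max: number; sum: number; sumSquares: number; count: number;
}

class MetricsAccumulator {
  rowCount = 0;
  numeric = new Map<string, NumericStats>();
  categorical = new Map<string, Map<string, number>>();

  addRow(row: Record<string, unknown>): void {
    this.rowCount++;
    for (const [field, value] of Object.entries(row)) {
      if (typeof value === "number") {
        const s = this.numeric.get(field) ??
          { min: Infinity, max: -Infinity, sum: 0, sumSquares: 0, count: 0 };
        s.min = Math.min(s.min, value);
        s.max = Math.max(s.max, value);
        s.sum += value;
        s.sumSquares += value * value;
        s.count++;
        this.numeric.set(field, s);
      } else {
        // categorical field: track value counts
        const counts = this.categorical.get(field) ?? new Map<string, number>();
        counts.set(String(value), (counts.get(String(value)) ?? 0) + 1);
        this.categorical.set(field, counts);
      }
    }
  }

  // Derived at completion: mean and variance from sum / sumSquares.
  finalize(field: string): { mean: number; variance: number; stddev: number } {
    const s = this.numeric.get(field)!;
    const mean = s.sum / s.count;
    const variance = s.sumSquares / s.count - mean * mean;
    return { mean, variance, stddev: Math.sqrt(variance) };
  }
}
```

Note that the sum-of-squares formula can lose precision for values with a large mean relative to their spread; a streaming variant such as Welford's algorithm avoids this at the cost of slightly more work per row.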

Hard constraint resampling

When a hard constraint is violated, the row is resampled up to 50 times before being skipped. Only the single candidate row is held at any point, so resampling requires no buffering of prior rows.

  • Max 50 resampling attempts per row
  • Skipped rows tracked in run report
  • No impact on streaming throughput
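The resampling step can be sketched as a bounded retry loop; `sampleRow`, `MAX_RESAMPLES`, and the callback names are illustrative assumptions, not LiteSeed's actual API.

```typescript
// Hypothetical sketch of hard-constraint resampling: regenerate a failing row
// up to MAX_RESAMPLES times, then skip it. Only one candidate row exists at a
// time, so no buffering is needed.
const MAX_RESAMPLES = 50;

function sampleRow<T>(
  generate: () => T,
  satisfiesConstraints: (row: T) => boolean,
): { row: T | null; attempts: number } {
  for (let attempts = 1; attempts <= MAX_RESAMPLES; attempts++) {
    const row = generate();
    if (satisfiesConstraints(row)) return { row, attempts };
  }
  // Row skipped; callers would record this in the run report.
  return { row: null, attempts: MAX_RESAMPLES };
}
```

Because each retry replaces the previous candidate in place, worst-case memory per row is constant regardless of how many attempts are made.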