Technology
Performance targets and architecture
LiteSeed is designed to generate multi-million-row datasets under a fixed memory budget, using a chunk-based streaming architecture with explicit performance targets.
Performance targets
- Generation throughput: ≥ 50,000 rows/sec
- Peak RAM: ≤ 256 MB on runs of 5M+ rows
- Chunk size: 10,000 rows per chunk
- Single run capacity: 5M+ rows
Streaming architecture
Chunk-based generation loop
The generation engine processes datasets in chunks of 10,000 rows. After each chunk is written out, the chunk array is cleared so its memory can be reclaimed by normal garbage collection.
- CHUNK_SIZE = 10,000 rows
- Chunk array cleared after each chunk
- No --expose-gc flag required
- Progress updated every CHUNK_SIZE rows
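The loop above can be sketched as follows. This is an illustrative shape, not LiteSeed's actual internals; the function and callback names (`generateInChunks`, `onChunk`, `onProgress`) are assumed for the example.

```javascript
const CHUNK_SIZE = 10_000;

// Hypothetical chunk-based generation loop: build a chunk, hand it off
// (e.g. to a streaming writer), clear the array, and report progress.
function generateInChunks(totalRows, generateRow, onChunk, onProgress) {
  let produced = 0;
  while (produced < totalRows) {
    const size = Math.min(CHUNK_SIZE, totalRows - produced);
    const chunk = new Array(size);
    for (let i = 0; i < size; i++) chunk[i] = generateRow(produced + i);
    onChunk(chunk);       // e.g. stream rows to disk
    chunk.length = 0;     // clear the array; ordinary GC reclaims it,
                          // so no --expose-gc flag is needed
    produced += size;
    onProgress(produced); // progress updated once per chunk
  }
}
```

Because only one chunk is live at a time, peak memory stays proportional to `CHUNK_SIZE` rather than to the total row count.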
Streaming file writers
CSV and JSONL exports are written via Node.js createWriteStream — rows are written as they are generated, not after generation completes.
- CSV: header written once, rows streamed
- JSONL: one JSON object per line, streamed
- Parquet: appendRow() via @dsnp/parquetjs
- Temp files uploaded to S3 after generation
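A minimal sketch of the streamed CSV path, writing to any Node.js writable. The helper name `writeCsvRows` is assumed for illustration; it does not handle CSV quoting or escaping.

```javascript
// Hypothetical streaming CSV writer: the header is written once,
// then each row is written as soon as it is generated.
function writeCsvRows(out, fields, rows) {
  out.write(fields.join(',') + '\n'); // header written once
  for (const row of rows) {
    out.write(fields.map((f) => String(row[f])).join(',') + '\n');
  }
}
```

In a real run, `out` would be `fs.createWriteStream(path)` and `rows` a generator, so rows never accumulate in memory before reaching disk; for JSONL the same loop would write `JSON.stringify(row) + '\n'` per row.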
Incremental metrics
Quality metrics are computed incrementally during generation using the metricsAccumulator — no post-generation dataset scan.
- row_count, soft_constraint_violations
- Numeric: min, max, sum, sumSquares, count
- Categorical: value counts per field
- mean, variance, stddev derived at completion
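The accumulator pattern above can be sketched like this. The factory name `createMetricsAccumulator` and the return shape are assumptions for the example; the key property is that each row is folded in once and discarded, with mean, variance, and stddev derived only at completion.

```javascript
// Hypothetical incremental metrics accumulator: per-field running
// aggregates, no post-generation scan of the dataset.
function createMetricsAccumulator() {
  const numeric = {};     // field -> { min, max, sum, sumSquares, count }
  const categorical = {}; // field -> Map(value -> count)
  let rowCount = 0;
  return {
    addRow(row) {
      rowCount++;
      for (const [field, value] of Object.entries(row)) {
        if (typeof value === 'number') {
          const s = (numeric[field] ??= {
            min: Infinity, max: -Infinity, sum: 0, sumSquares: 0, count: 0,
          });
          s.min = Math.min(s.min, value);
          s.max = Math.max(s.max, value);
          s.sum += value;
          s.sumSquares += value * value;
          s.count++;
        } else {
          const m = (categorical[field] ??= new Map());
          m.set(value, (m.get(value) ?? 0) + 1);
        }
      }
    },
    finish() {
      const stats = {};
      for (const [field, s] of Object.entries(numeric)) {
        const mean = s.sum / s.count;
        // Population variance from running sums: E[x^2] - E[x]^2
        const variance = s.sumSquares / s.count - mean * mean;
        stats[field] = { ...s, mean, variance, stddev: Math.sqrt(Math.max(variance, 0)) };
      }
      return { rowCount, numeric: stats, categorical };
    },
  };
}
```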
Hard constraint resampling
When a hard constraint is violated, the row is resampled up to 50 times before being skipped. Resampling does not require buffering.
- Max 50 resampling attempts per row
- Skipped rows tracked in run report
- No impact on streaming throughput
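The resampling rule can be sketched as a small retry loop. The names `generateValidRow` and `report.skippedRows` are assumptions; the point is that each row is retried in place, so no buffering is needed and the streaming loop is unaffected.

```javascript
const MAX_RESAMPLE_ATTEMPTS = 50;

// Hypothetical hard-constraint resampling: regenerate a row until it
// passes all hard constraints, skipping it after 50 failed attempts
// and recording the skip in the run report.
function generateValidRow(generateRow, satisfiesHardConstraints, report) {
  for (let attempt = 0; attempt < MAX_RESAMPLE_ATTEMPTS; attempt++) {
    const row = generateRow();
    if (satisfiesHardConstraints(row)) return row;
  }
  report.skippedRows++; // skipped rows surface in the run report
  return null;          // caller drops the row and moves on
}
```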
