FastPersist: Accelerating Model Checkpointing in Deep Learning

Published in arXiv Preprint, 2024

FastPersist eliminates the I/O bottleneck of model checkpointing in large-scale deep learning training through three techniques: NVMe-optimized writes, parallel persistence across multiple SSDs, and overlapping checkpoint I/O with independent training computation. It achieves up to 116x faster checkpoint creation than the baseline and enables per-iteration checkpointing with negligible training overhead.
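The overlap idea — moving blocking checkpoint writes off the training loop's critical path — can be illustrated with a minimal sketch. This is an assumption-laden illustration, not FastPersist's actual implementation: the class name `AsyncCheckpointer` and the hand-off via a background writer thread are hypothetical simplifications of the paper's pipelined persistence.

```python
import threading
import queue

class AsyncCheckpointer:
    """Sketch of overlapping checkpoint persistence with training:
    a snapshot of model state is handed to a background writer thread
    so the training loop does not block on disk I/O."""

    def __init__(self, write_fn):
        self._write_fn = write_fn  # callable that persists (step, state)
        self._queue = queue.Queue(maxsize=1)
        self._thread = threading.Thread(target=self._worker, daemon=True)
        self._thread.start()

    def _worker(self):
        while True:
            item = self._queue.get()
            if item is None:  # sentinel: shut down the writer
                break
            step, state = item
            self._write_fn(step, state)  # blocking write, off the critical path

    def save(self, step, state):
        # Copy the state so training can mutate parameters immediately;
        # a real system would snapshot GPU tensors to pinned host memory first.
        self._queue.put((step, dict(state)))

    def close(self):
        self._queue.put(None)
        self._thread.join()
```

In a training loop, `save` returns as soon as the snapshot is enqueued, so the next iteration's computation overlaps with the write; a real system would additionally stripe the write across multiple NVMe SSDs in parallel.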