FSDP now has an express auto-wrapper for Transformer models. This allows FSDP to create a 'model aware' sharding plan for how it breaks up the model across the GPUs and can result in some significant ...
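A minimal sketch of that transformer-aware auto-wrapping, using the transformer_auto_wrap_policy from torch.distributed.fsdp.wrap; the block class name MyTransformerBlock and the commented model construction are illustrative assumptions, not from the source:

import functools

import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# Assumed stand-in: the repeated layer class your Transformer is built from.
class MyTransformerBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn_and_mlp = nn.Linear(512, 512)  # placeholder for attention + MLP

    def forward(self, x):
        return self.attn_and_mlp(x)

# Tell FSDP which layer class marks one shardable unit, so each Transformer
# block becomes its own FSDP-wrapped module ("model aware" sharding).
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={MyTransformerBlock},
)

# Requires an initialized process group (e.g. launched via torchrun):
# model = nn.Sequential(*[MyTransformerBlock() for _ in range(12)]).cuda()
# sharded_model = FSDP(model, auto_wrap_policy=auto_wrap_policy)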
This runs the most basic distributed training setup using DistributedDataParallel (DDP). It is a good starting point for understanding distributed training. It would produce logs like ... {'stage': 'after ...
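A minimal sketch of that basic DDP setup (the toy model, data, and hyperparameters are illustrative assumptions; the elided log format above is not reproduced here). It is intended to be launched with torchrun, e.g. torchrun --nproc_per_node=2 train_ddp.py:

import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 10).to(local_rank)  # stand-in for a real model
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for _ in range(10):
        inputs = torch.randn(20, 10, device=local_rank)
        targets = torch.randn(20, 10, device=local_rank)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), targets)
        loss.backward()  # gradients are all-reduced across ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()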
PyTorch today announced a new series of 10 video tutorials on Fully Sharded Data Parallel (FSDP). The tutorials are led by Less Wright, an AI/PyTorch Partner Engineer who also presented at ...