Using Pytorch Lightning with DeepSpeed
To use DeepSpeed with Pytorch Lightning, you need to add the following to your training script:
model = SampleModel()
dataset = SampleDataset()
from pytorch_lightning import Trainer
trainer = Trainer(**Trainer Parameters,
callbacks=callbacks,
logger=logger,
accelerator='gpu',
strategy='deepspeed_stage_1',
num_nodes=4,
log_every_n_steps=5,
)
trainer.fit(model, dataset)
Trainer Parameters
The following are the parameters that you can pass to the Trainer if you were working on 100-200 M paramter model of 100 GB of dataset :
max_epochs = 50
min_epochs = 20
#accelerator = gpu
benchmark = True
weights_summary = full
precision = 16
auto_lr_find = True
auto_scale_batch_size = True
auto_select_gpus = True
check_val_every_n_epoch = 1
fast_dev_run = False
enable_progress_bar = True
accumulate_grad_batches=16
sync_batchnorm=True
limit_train_batches=0.1
limit_val_batches=0.1
num_sanity_val_steps=0
- Balance the accumulate_grad_batches and batch_size parmater such that it can fit the model & data into the GPU memory. Increasing the accumulate_grad_batch speeds up the training, without increasing the memory usage.
Distributed Run
export node_rank=<number>
starting from 0 to number of nodes - 1 assign the rank to each nodes. And run the following command in all the nodes :
python -m torch.distributed.run --nnodes=<no-of-nodes> --nproc_per_node=<no-of-gpus-per-node> -node_rank=$node_rank --master_addr=<master-node-ip> model.py
Example : -
python -m torch.distributed.run --nnodes=4 --nproc_per_node=1 --node_rank=$node_rank --master_addr=172.16.96.60 models/EfficientNetv2.py