Single-machine multi-GPU and multi-machine multi-GPU training in PyTorch

References:

- Multi-GPU distributed training with DistributedDataParallel in PyTorch (PyTorch handbook, Chinese): https://handbook.pytorch.wiki/chapter4/distributeddataparallel/readme.html
- 罗西的思考, author homepage on 墨天轮: https://www.modb.pro/u/447077
- PyTorch's nn.DataParallel (Zhihu): https://zhuanlan.zhihu.com/p/102697821
- The right way to do single-machine multi-GPU training in PyTorch, plus common problems and fixes (CSDN): https://blog.csdn.net/qq_42255269/article/details/123427094
- Parallel training with DistributedDataParallel, code samples and explanations (Zhihu): https://zhuanlan.zhihu.com/p/350301395

There are many distributed training libraries, such as DeepSpeed and Horovod. This post focuses on the two options PyTorch itself provides, nn.DataParallel and DistributedDataParallel. With GPU resources becoming more and more plentiful, multi-GPU training is now a must.

1. Single-machine multi-GPU. For most workloads a single machine with several GPUs is enough; the first reference above includes a performance comparison between single-machine multi-GPU and multi-machine multi-GPU training.

1. The simplest form, wrapping the model in a single call:
net = torch.nn.DataParallel(net).cuda()
loss = torch.nn.CrossEntropyLoss().cuda()

2. Explicitly specifying the device ids and the output device:
gpus = [0,1,2,3]
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net = torch.nn.DataParallel(net.to(device), device_ids=gpus, output_device=gpus[0])

torch.nn.DataParallel is very simple: you only add this one call around net. Some people also wrap the optimizer for multi-GPU use, but that step is not required.

The forward pass: 1. the mini-batch is split across the GPUs; 2. the model is replicated onto every GPU; 3. each GPU runs the forward computation on its slice of the mini-batch; 4. the outputs are gathered back onto the first GPU. The backward pass: 1. the loss is computed on the first GPU from the gathered outputs, and the gradients with respect to those outputs are scattered back to each GPU; 2. each GPU backpropagates through its own model replica to obtain parameter gradients; 3. the parameter gradients are reduced (summed) onto the first GPU; 4. the optimizer updates the parameters on the first GPU only. Because only that copy is updated, every forward pass re-broadcasts the model from the first GPU to all the others.

Also note that after wrapping, the original model lives under .module, so to access an attribute called name on the model you write model.module.name. This approach ran fine on my single-machine multi-GPU 1080 Ti box, but ran into problems on a multi-GPU K80 machine.
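To make the .module point concrete, here is a minimal sketch (using a toy nn.Linear purely as a placeholder) of reaching the wrapped model and saving its weights without the "module." prefix in the keys:

import torch
import torch.nn as nn

net = nn.Linear(10, 2)  # placeholder model
if torch.cuda.device_count() > 1:
    net = nn.DataParallel(net).cuda()

# Once wrapped, attribute access has to go through .module.
real_model = net.module if isinstance(net, nn.DataParallel) else net
print(real_model.weight.shape)

# Saving real_model.state_dict() keeps the keys free of the "module." prefix,
# so the checkpoint can later be loaded into an unwrapped model.
torch.save(real_model.state_dict(), './dp_checkpoint.pth')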

[Figure 1]

2. DistributedDataParallel works for both single-machine multi-GPU and multi-machine multi-GPU, and it is the distributed approach officially recommended by PyTorch. DistributedDataParallel also needs no special treatment of the loss. There are two ways to launch it: one is to start the script from the terminal with torch.distributed.launch, the other is to drive it directly with Python's multiprocessing module. The latter is less common and not the focus here; the first way is the main one.

Unlike nn.DataParallel, which controls multiple GPUs from a single process, nn.DistributedDataParallel creates one process per GPU. The GPUs may sit on the same node (single-machine multi-GPU) or be spread across several nodes (multi-machine multi-GPU). Every process runs the same code and communicates with the other processes.

Another difference is that only gradients travel between the processes (GPUs). Taking single-machine multi-GPU as an example, with three GPUs training in parallel, in each epoch the dataset is split into three shards, one per GPU. Each GPU runs the forward pass on its own mini-batch, and then the gradients are all-reduced across the GPUs. By the end of backpropagation every GPU holds the same averaged gradients, which keeps the model weights synchronized.
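For reference, the multiprocessing-style launch mentioned above (the less common alternative to torch.distributed.launch) looks roughly like the sketch below. It assumes a single machine, the gloo backend, and a placeholder train(rank, world_size) body:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def train(rank, world_size):
    # Each spawned process sets up its own process group.
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '22222')
    dist.init_process_group('gloo', rank=rank, world_size=world_size)
    # ... build the model, wrap it in DistributedDataParallel(model, device_ids=[rank]), train ...
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    # mp.spawn starts world_size processes and passes the process index as the first argument.
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)

The full example below uses the torch.distributed.launch route instead.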

import torch
import torchvision
import numpy as np
import os
from torch.utils.data import Dataset, DataLoader
import PIL
import torch.nn as nn
from torch.optim import Adam
# from torch.utils.tensorboard import SummaryWriter
from tensorboardX import SummaryWriter

# NEW
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data.distributed import DistributedSampler


# VOCSegmentation returns a raw dataset: images are non-resized and in the PIL format. To transform
# them into something suitable for input to PyTorch, we need to wrap the output in our own dataset class.
class PascalVOCSegmentationDataset(Dataset):
    def __init__(self, raw):
        super().__init__()
        self._dataset = raw
        self.resize_img = torchvision.transforms.Resize((256, 256), interpolation=PIL.Image.BILINEAR)
        self.resize_segmap = torchvision.transforms.Resize((256, 256), interpolation=PIL.Image.NEAREST)

    def __len__(self):
        return len(self._dataset)

    def __getitem__(self, idx):
        img, segmap = self._dataset[idx]
        img, segmap = self.resize_img(img), self.resize_segmap(segmap)
        img, segmap = np.array(img), np.array(segmap)
        img, segmap = (img / 255).astype('float32'), segmap.astype('int32')
        img = np.transpose(img, (-1, 0, 1))

        # The PASCAL VOC dataset PyTorch provides labels the edges surrounding classes in 255-valued
        # pixels in the segmentation map. However, PyTorch requires class values to be contiguous
        # in range 0 through n_classes, so we must relabel these pixels to 21.
        segmap[segmap == 255] = 21

        return img, segmap


def get_dataloader(rank, world_size):
    _PascalVOCSegmentationDataset = torchvision.datasets.VOCSegmentation(
        '/home/imcs/local_disk/', year='2007', image_set='train', download=False,
        transform=None, target_transform=None, transforms=None
    )
    dataset = PascalVOCSegmentationDataset(_PascalVOCSegmentationDataset)

    # NEW
    sampler = DistributedSampler(dataset, rank=rank, num_replicas=world_size)
    dataloader = DataLoader(dataset, batch_size=8, shuffle=False, sampler=sampler)

    return dataloader


# num_classes is 22. PASCAL VOC includes 20 classes of interest, 1 background class, and the 1
# special border class mentioned in the previous comment. 20 + 1 + 1 = 22.
def get_model():
    return torchvision.models.segmentation.deeplabv3_resnet101(
        pretrained=False, progress=True, num_classes=22, aux_loss=None, pretrained_backbone=False
    )


def train(rank, num_epochs, world_size):
    # NEW
    # init_process(rank, world_size)
    dist.init_process_group(backend='gloo', init_method='env://')
    print(f"Rank {rank}/{world_size} training process initialized.\n")

    model = get_model()
    model.cuda(rank)
    model.train()

    # NEW
    model = DistributedDataParallel(model, device_ids=[rank])

    dataloader = get_dataloader(rank, world_size)

    # Since the background class doesn't matter nearly as much as the classes of interest,
    # a more selective loss would be more appropriate; however, this training script is
    # merely a benchmark, so we just use plain cross-entropy loss.
    criterion = nn.CrossEntropyLoss()

    # NEW
    # Since we are computing the average of several batches at once (an effective batch size of
    # world_size * batch_size) we scale the learning rate to match.
    optimizer = Adam(model.parameters(), lr=1e-3 * world_size)

    writer = SummaryWriter(f'./model_2')

    for epoch in range(1, num_epochs + 1):
        # Reshuffle differently each epoch; without set_epoch the DistributedSampler
        # yields the same ordering every epoch.
        dataloader.sampler.set_epoch(epoch)
        losses = []

        for i, (batch, segmap) in enumerate(dataloader):
            optimizer.zero_grad()

            batch = batch.cuda(rank)
            segmap = segmap.cuda(rank)

            output = model(batch)['out']
            loss = criterion(output, segmap.type(torch.int64))
            loss.backward()
            optimizer.step()

            curr_loss = loss.item()
            if i % 10 == 0:
                print(
                    f'Epoch {epoch}, rank {rank}/{world_size}, batch {i}. '
                    f'Loss: {curr_loss:.3f}.\n'
                )
            if rank == 0:
                # Log with an explicit global step so TensorBoard draws a proper curve.
                writer.add_scalar('training loss', curr_loss,
                                  (epoch - 1) * len(dataloader) + i)
            losses.append(curr_loss)

        print(
            f'Finished epoch {epoch}, rank {rank}/{world_size}. '
            f'Avg Loss: {np.mean(losses)}; Median Loss: {np.median(losses)}.\n'
        )

        if rank == 0 and epoch % 5 == 0:
            torch.save(model.state_dict(), f'./model_{epoch}.pth')

    # Only rank 0 writes the final checkpoint, so multiple processes don't write the same file.
    if rank == 0:
        torch.save(model.state_dict(), f'./model_final.pth')


# NEW
# NUM_EPOCHS = 20
WORLD_SIZE = torch.cuda.device_count()

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()


def main():
    train(args.local_rank, 20, WORLD_SIZE)


if __name__ == "__main__":
    main()


python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 --node_rank=0 --master_addr=localhost --master_port=22222 train.py

There are several backend choices: nccl, gloo, and mpi. On my K80s nccl does not seem to behave well and tends to hang, so gloo is used here. Also note that no reduce_mean is applied to the loss above: once the model is wrapped in DDP, training runs as multiple processes and the gradient averaging across GPUs happens automatically; an explicit reduce on the loss is normally only done to get a single value for printing or metric computation.
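If you do want a single loss value averaged over all ranks, e.g. so that rank 0 logs the global mean instead of its own shard's loss, one common pattern is sketched below; the reduce_mean helper is illustrative and not part of the script above:

import torch.distributed as dist

def reduce_mean(tensor, world_size):
    # Sum the tensor over all processes, then divide: every rank ends up with the mean.
    rt = tensor.clone().detach()
    dist.all_reduce(rt, op=dist.ReduceOp.SUM)
    return rt / world_size

# Inside the training loop, assuming `loss` is this rank's loss tensor:
# loss_for_log = reduce_mean(loss, dist.get_world_size())
# if rank == 0:
#     writer.add_scalar('training loss (all ranks)', loss_for_log.item())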

The example above still has one gap: there is no evaluation step. For evaluation you do not need a sampler; it is easiest to test on a single GPU, because once a DistributedSampler is attached the data is sharded across the processes.

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

1. Initialize the process group:
dist.init_process_group(backend='gloo', init_method='env://')
print(f"Rank {rank}/{world_size} training process initialized.\n")

2. Convert BatchNorm layers to SyncBatchNorm and wrap the model:
net = torch.nn.SyncBatchNorm.convert_sync_batchnorm(net)
net = torch.nn.parallel.DistributedDataParallel(net.cuda(rank), device_ids=[rank])
loss = torch.nn.CrossEntropyLoss().cuda(rank)
optimizer = optim.Adam(net.parameters(), lr=lr)  # most examples do not scale lr by world_size


3. Keep the test samples on a single GPU rather than sharding them; otherwise you would have to gather the per-shard results yourself (a sketch of such a gather follows the code below, though single-GPU testing is simpler):
train_sampler = torch.utils.data.distributed.DistributedSampler(torch_dataset_train)
train_iter = Data.DataLoader(
    dataset=torch_dataset_train,  # torch TensorDataset format
    batch_size=batch_size,        # mini batch size
    sampler=train_sampler,
    num_workers=0,
    # pin_memory=True,
)

valid_sampler = torch.utils.data.distributed.DistributedSampler(torch_dataset_valida)
valiada_iter = Data.DataLoader(
    dataset=torch_dataset_valida,  # torch TensorDataset format
    batch_size=batch_size,         # mini batch size
    sampler=valid_sampler,
    num_workers=0,
    # pin_memory=True,
)

# No DistributedSampler for the test/full sets: they stay on a single GPU.
# test_sampler = torch.utils.data.distributed.DistributedSampler(torch_dataset_test)
test_iter = Data.DataLoader(
    dataset=torch_dataset_test,  # torch TensorDataset format
    batch_size=batch_size,       # mini batch size
    # sampler=test_sampler,
    shuffle=False,
    num_workers=0,
    # pin_memory=True,
)

# all_sampler = torch.utils.data.distributed.DistributedSampler(torch_dataset_all)
all_iter = Data.DataLoader(
    dataset=torch_dataset_all,  # torch TensorDataset format
    batch_size=batch_size,      # mini batch size
    # sampler=all_sampler,
    shuffle=False,
    num_workers=0,
    # pin_memory=True,
)

4. Move data onto the GPU that belongs to this rank:
X = X.cuda(rank)
y = y.cuda(rank)
valida_acc, valida_loss = evaluate_accuracy(valida_iter, net, loss, device, rank)

5. Guard printed output with a rank check:
if rank == 0:
    print('epoch %d, train loss %.6f, train acc %.3f, valida loss %.6f, valida acc %.3f, time %.1f sec'
          % (epoch + 1, train_l_sum / batch_count, train_acc_sum / n,
             valida_loss, valida_acc, time.time() - time_epoch))

6. Run the entire test pass on a single GPU:
net.eval()  # evaluation mode, which turns off dropout
with torch.no_grad():
    for X, y in test_iter:
        X = X.to(device)
        y_hat = net(X)
        pred_test.extend(np.array(y_hat.cpu().argmax(axis=1)))

7. Save the checkpoint from rank 0 only:
if rank == 0:
    torch.save(net.state_dict(), saved + "/" + str(round(overall_acc, 3)) + '.pth')
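One more note on step 7: the state_dict of a DDP-wrapped net carries a "module." prefix on every parameter key. A common alternative, sketched here under the same rank-0 guard, is to save net.module.state_dict() so the checkpoint loads directly into a plain, unwrapped model later (build_model and checkpoint_path below are placeholders):

if rank == 0:
    # net.module is the underlying, unwrapped model; its keys carry no "module." prefix.
    torch.save(net.module.state_dict(), saved + "/" + str(round(overall_acc, 3)) + '.pth')

# Later, on a single GPU or the CPU:
# model = build_model()
# model.load_state_dict(torch.load(checkpoint_path, map_location='cpu'))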
