Distributed TensorFlow: distributed TensorFlow training based on Keras and Kubernetes

(TensorFlow @ O’Reilly AI Conference, San Francisco '18)
Compiled from the talk "Distributed TensorFlow training using Keras and Kubernetes" given by the Google TensorFlow team at the TensorFlow @ O'Reilly AI Conference in September 2018; the content and slide images come from the YouTube video.

Topic of the talk

The Distribution Strategy API provided by TensorFlow.
[Slide 1]

Let's begin with the obvious question: why should one care about distributed training? Training complex neural networks with large amounts of data can often take a long time. In the graph here, you can see that training the ResNet model on a single but powerful GPU can take up to four days. If you have some experience running complex machine learning models, this may sound rather familiar. Bringing your training time down from days to hours can have a significant effect on your productivity, because you can try out new ideas faster.

In this talk, we're going to cover distributed training, that is, running training in parallel on multiple devices such as CPUs, GPUs, or TPUs to bring down your training time. With the techniques we'll talk about, you can bring your training time down from weeks or days to hours with just a few lines of code changed and some powerful hardware. To achieve these goals, we're pleased to introduce the new Distribution Strategy API. This is an easy way to distribute your TensorFlow training with very little modification to your code. With the Distribution Strategy API, you no longer need to place ops or parameters on specific devices, and you don't need to restructure your model so that the losses and gradients get aggregated correctly across devices. Distribution Strategy takes care of all of that for you. So let's go through the key goals of Distribution Strategy.

Goals

[Slide 2]
  1. The first one is ease of use. We want you to make minimal code changes in order to distribute your training.
  2. The second is to give great performance out of the box. Ideally, the user shouldn't have to change or configure any settings to get the most performance out of their hardware.
  3. And third, we want Distribution Strategy to work in a variety of different situations: whether you want to scale your training on different hardware like GPUs or TPUs, use different APIs like Keras or Estimator, or run different distribution architectures like synchronous or asynchronous training, we want Distribution Strategy to be useful for you in all these situations.

Training with multiple GPUs on a single machine

[Slide 3]

So if you’re just beginning with machine learning, you might start your training with a multi-core CPU on your desktop. TensorFlow takes care of scaling onto a multi-core CPU automatically. Next, you may add a GPU to your desktop to scale up your training. As long as you build your program with the right CUDA libraries, TensorFlow will automatically run your training on the GPU and give you a nice performance boost. But what if you have multiple GPUs on your machine, and you want to use all of them for your training? This is where distribution strategy comes in.

[Slide 4]

In the next section, we’re going to talk about how you can use distribution strategy to scale your training to multiple GPUs.

[Slide 5]

First, we'll look at some code to train the ResNet model without any distribution. We'll use the Keras API, which is the recommended TensorFlow high-level API. We begin by creating some datasets for training and validation using the tf.data API. For the model, we'll simply reuse the ResNet that's prepackaged with Keras in TensorFlow. Then we create an optimizer that we'll be using in our training. Once we have these pieces, we can compile the model, providing the loss and optimizer and maybe a few other things like metrics, which I've omitted in the slide here. Once the model is compiled, you can begin training by calling model.fit, providing the training dataset that you created earlier along with how many epochs you want to run the training for. fit will train your model and update the model's variables. Then you can call evaluate with the validation dataset to see how well your training did.
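As a concrete reference, here is a minimal sketch of that undistributed flow, written against the TF 1.x-era Keras API used in the talk; the synthetic dataset and the hyperparameters are placeholders rather than a real input pipeline.

```python
import tensorflow as tf

# Synthetic stand-in for a real tf.data input pipeline.
def make_dataset(num_examples):
    images = tf.random_uniform([num_examples, 224, 224, 3])
    labels = tf.one_hot(
        tf.random_uniform([num_examples], maxval=1000, dtype=tf.int32), 1000)
    return tf.data.Dataset.from_tensor_slices((images, labels)).repeat().batch(8)

train_dataset = make_dataset(64)
eval_dataset = make_dataset(16)

# Reuse the ResNet50 that ships with Keras, trained from scratch.
model = tf.keras.applications.resnet50.ResNet50(weights=None)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.2)

model.compile(loss='categorical_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])
model.fit(train_dataset, epochs=2, steps_per_epoch=8)   # train
model.evaluate(eval_dataset, steps=2)                    # validate
```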
[Slide 6]
So given this code to run your training on a single machine or a single GPU, let's see how we can use Distribution Strategy to run it on multiple GPUs. It's actually very simple. You need to make only two changes (see the sketch after this list):

  1. First, create an instance of something called MirroredStrategy, and
  2. second, pass the strategy instance to the compile call with the distribute argument. That's it. That's all the code change you need to run this code on multiple GPUs using Distribution Strategy.
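A sketch of that two-line change, using the TF 1.x-era API shown in the talk (where compile accepted a distribute argument; newer TensorFlow versions instead build the model under strategy.scope()). The dataset and hyperparameters are placeholders.

```python
import tensorflow as tf

distribution = tf.contrib.distribute.MirroredStrategy()       # change 1: create the strategy

images = tf.random_uniform([32, 224, 224, 3])
labels = tf.one_hot(tf.random_uniform([32], maxval=1000, dtype=tf.int32), 1000)
train_dataset = tf.data.Dataset.from_tensor_slices((images, labels)).repeat().batch(8)

model = tf.keras.applications.resnet50.ResNet50(weights=None)
model.compile(loss='categorical_crossentropy',
              optimizer=tf.train.GradientDescentOptimizer(0.2),
              distribute=distribution)                         # change 2: pass it to compile
model.fit(train_dataset, epochs=2, steps_per_epoch=8)
```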

MirroredStrategy

MirroredStrategy is one type of the Distribution Strategy API that we introduced earlier. This API is available in the upcoming TensorFlow point release, which will be out very shortly. At the bottom of the slide, we've linked to a complete example of training MNIST with Keras and multiple GPUs that you can try out. With MirroredStrategy, you don't need to make any changes to your model code or your training loop, so it is very easy to use. This is because we've changed many underlying components of TensorFlow to be distribution-aware: the optimizer, batch norm layers, metrics, and summaries are all now distribution-aware. You don't need to make any changes to your input pipeline either, as long as you're using the recommended tf.data APIs. And finally, saving and checkpointing work seamlessly as well, so you can save with no distribution strategy, or with one strategy, and restore with another seamlessly.

[Slide 7]

Data parallelism and AllReduce

Now that you've seen some code on how to use MirroredStrategy to scale to multiple GPUs, let's look under the hood a little bit and see what MirroredStrategy does. In a nutshell, MirroredStrategy implements a data parallelism architecture. It mirrors the variables on each GPU device, hence the name, and it uses AllReduce to keep these variables in sync. Using these techniques, it implements synchronous training. That's a lot of terminology, so let's unpack each of these a bit.

[Slides 8-9]

What is data parallelism? Let's say you have N workers or N devices. In data parallelism, each device runs the same model and computation, but on a different subset of the input data. Each device computes the loss and gradients based on the training samples that it sees. We then combine these gradients and update the model's parameters. The updated model is then used in the next round of computation. As I mentioned before, MirroredStrategy mirrors the variables across the different devices. So if you have a variable A in your model, it will be replicated as A0, A1, A2, and A3 across the four different devices, and together these four copies form a single conceptual variable called a mirrored variable. These variables are kept in sync by applying identical updates. A class of algorithms called AllReduce can be used to keep the variables in sync by applying identical gradient updates: AllReduce aggregates the gradients across the different devices, for example by adding them up, and makes the result available on each device. These algorithms can be fused and made very efficient, which reduces the overhead of synchronization by quite a bit.

[Slides 10-11]

There are many versions of AllReduce algorithms, depending on the kind of communication available between the different devices. One common algorithm is known as ring AllReduce. In ring AllReduce, each device sends a chunk of its gradients to its successor on the ring and receives another chunk from its predecessor. There are a few more such rounds of gradient exchanges, and at the end of these exchanges, each device has received a combined copy of all the gradients. Ring AllReduce also uses network bandwidth optimally, because it ensures that both the upload and download bandwidth at each host is fully utilized. We have a team working on fast implementations of AllReduce for various network topologies. Some hardware vendors, such as NVIDIA, provide specialized implementations of AllReduce for their hardware, for example NVIDIA's NCCL library. The bottom line is that AllReduce can be fast when you have multiple devices on a single machine, or a small number of machines with strong connectivity. Putting all these pieces together, MirroredStrategy uses mirrored variables and AllReduce to implement synchronous training.
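To make the communication pattern concrete, here is a small NumPy simulation of ring AllReduce (a scatter-reduce phase followed by an all-gather phase) over simulated workers. This is a toy illustration of the algorithm described above, not the implementation TensorFlow or NCCL actually uses.

```python
import numpy as np

def ring_allreduce(grads):
    """grads: one equal-length 1-D gradient array per simulated worker.
    Returns a list in which every worker holds the element-wise sum."""
    n = len(grads)
    # Each worker splits its gradient into n chunks.
    chunks = [np.array_split(g.astype(float), n) for g in grads]

    # Phase 1: scatter-reduce. After n-1 rounds of passing chunks to the
    # successor, worker i holds the fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [chunks[i][(i - step) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i - step) % n] += sends[i]

    # Phase 2: all-gather. Each worker passes its fully reduced chunk around
    # the ring until every worker has every chunk.
    for step in range(n - 1):
        sends = [chunks[i][(i + 1 - step) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i + 1 - step) % n] = sends[i]

    return [np.concatenate(c) for c in chunks]

grads = [(w + 1) * np.arange(8.0) for w in range(4)]   # 4 simulated workers
result = ring_allreduce(grads)
# Every worker now holds (1+2+3+4) * [0, 1, ..., 7].
```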

How AllReduce works

[Slide 12]

So let's see how that works. Say you have two devices, device 0 and device 1, and your model has two layers, A and B. Each layer has a single variable, and as you can see, the variables are replicated across the two devices. Each device receives one subset of the input data and computes the forward pass using its local copy of the variables. It then runs the backward pass and computes the gradients. Once gradients are computed on each device, the devices communicate with each other using AllReduce to aggregate the gradients. Once the gradients are aggregated, each device updates its local copy of the variables, so the devices are always kept in sync. The next forward pass doesn't begin until each device has received a copy of the combined gradients and updated its variables. AllReduce can further optimize things and bring down your training time by overlapping the computation of gradients at the lower layers of the network with the transmission of gradients for the higher layers. In this case, you can compute the gradients of layer A while you're transmitting the gradients of layer B, and this can further reduce your training time.
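Here is a tiny NumPy simulation of that synchronous loop, with two simulated devices, mirrored parameters, and averaged ("all-reduced") gradients applied identically on both copies. The toy regression model and learning rate are made up for illustration only.

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(8)
y = 3.0 * x + 1.0                                   # toy regression data
shards = [(x[:4], y[:4]), (x[4:], y[4:])]           # one input shard per device

params = [np.array([0.0, 0.0]) for _ in range(2)]   # mirrored copies of (a, b)
lr = 0.1

for step in range(200):
    grads = []
    for dev, (xd, yd) in enumerate(shards):
        a, b = params[dev]
        err = (a * xd + b) - yd                     # forward pass on the local shard
        grads.append(np.array([np.mean(2 * err * xd),   # dL/da
                               np.mean(2 * err)]))      # dL/db
    g = np.mean(grads, axis=0)                      # "all-reduce": aggregate the gradients
    for dev in range(2):
        params[dev] -= lr * g                       # identical update keeps copies in sync

print(params[0], params[1])                         # both copies approach a=3, b=1
```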

(TPU section omitted)

Multi-node distributed training

[Slide 13]

What about multiple nodes, that is, multiple computers? The fact is that even though you can cram a lot of GPU cards into a single computer, sooner or later, if you do massive amounts of training, you will need to consider an architecture where you can scale out to multiple nodes as well. This is an example where we have four worker nodes with four GPU cards in each of them. In terms of multi-node support, we currently have support for premade Estimators in [INAUDIBLE], which is due to be released shortly. And we are working very hard with some awesome developers to get this support into Keras as well, so Keras support will be there as soon as possible.

[Slides 14-15]

Converting Keras to an Estimator

However, if you do want to use Keras with a multi-node distribution strategy, you can achieve that using a little trick available in Keras: a function called tf.keras.estimator.model_to_estimator that takes a Keras model as an argument and returns an Estimator that you can use for multi-node training.
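A minimal sketch of that conversion; the model and optimizer here are placeholders.

```python
import tensorflow as tf

model = tf.keras.applications.resnet50.ResNet50(weights=None)
model.compile(loss='categorical_crossentropy',
              optimizer=tf.train.GradientDescentOptimizer(0.2))

# The "trick": wrap the compiled Keras model in an Estimator for multi-node training.
estimator = tf.keras.estimator.model_to_estimator(keras_model=model)
# estimator.train(input_fn=...) / estimator.evaluate(input_fn=...) from here on.
```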

[Slide 16]

Setting up the multi-node environment with Kubernetes

[Slide 17]

So how do we set up a multi-node training environment in the first place? This was a really difficult problem until the now open-source technology called Kubernetes was released. Even though you can set up multi-node training with TensorFlow without Kubernetes, it certainly helps to use Kubernetes as the orchestration platform to fire up multiple nodes. And Kubernetes is available on most clouds: GCP, and I think AWS and others as well.

[Slide 18]

So how does that work? Well, a Kubernetes cluster contains a set of nodes. In this particular picture, you can see three nodes, each of them a worker node. What TensorFlow requires in order for this to work is that each of these nodes has an environment variable called TF_CONFIG defined. Every single node in your cluster needs to have this variable defined. TF_CONFIG has two parts: first, the cluster part, which lists all of the hosts that participate in the distributed training, that is, all the nodes in your cluster; and second, the task part, which answers "who am I, what is my identity within this cluster?" The task points at one of the worker entries in the cluster spec, identifying which host and port this particular process is. That's how you need to configure your cluster in order to do this.
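For illustration, here is what setting TF_CONFIG might look like on one node; the host names, port, and index are placeholders for your own cluster.

```python
import json
import os

os.environ["TF_CONFIG"] = json.dumps({
    # "cluster": every host:port participating in training, grouped by job name.
    "cluster": {
        "worker": ["worker0.example.com:5000",
                   "worker1.example.com:5000",
                   "worker2.example.com:5000"],
    },
    # "task": who am I? This process is worker index 1, i.e. the second
    # host:port entry listed in the cluster spec above.
    "task": {"type": "worker", "index": 1},
})
```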

[Slide 19]

It is really cumbersome to go around to all of the nodes and provide this specific configuration by hand, and Kubernetes provides an excellent way of doing it through its deployment configuration, the yaml file, so you can distribute the configuration, that is, the environment variables to set, to the respective nodes. So how do we integrate that with TensorFlow? It's part of the initial support, and this is just one way of doing it; there are multiple ways, but this is one way that we've tested. You can use a template engine called Jinja. You create a Jinja file, and there is actually such a file available in the tensorflow/ecosystem repository (note: the ecosystem repository, not the main TensorFlow repository). There is a directory in that repository called distribution_strategy that contains useful functions to use with distribution strategies. You can use this file as a template to automatically generate the deployment.yaml for the Kubernetes cluster.

[Slide 20]

So what would that look like for a configuration like this, where we have three nodes? It's really simple. The only thing you need to change in this file, the Jinja file, is the highlighted configuration up here: you set the worker replicas to three. The rest is just code that you keep the same for all of the runs you set up. This is actually a macro that populates TF_CONFIG based on that parameter.

[Slide 21]

So that's very simple, but what about the code? We've now configured the Kubernetes cluster to be able to do this distributed training with TensorFlow, but there is also some work to do in the code, just as there was for the single node. It's approximately the same as the single-node, multi-GPU configuration, but in Estimator lingo. I provide a config here; you see the RunConfig? It's just a standard Estimator construct. I set the train_distribute parameter to tf.contrib.distribute.CollectiveAllReduceStrategy, so not MirroredStrategy for the multi-node configuration but CollectiveAllReduceStrategy, and I specify the number of GPUs I have available on each of the workers in my cluster. And that's it. Given that config object, I can just pass it as the config parameter when I do the conversion from Keras to an Estimator, and I now have multi-node training, with multiple GPUs in each of the nodes, configured for TensorFlow. A sketch of this configuration follows below.
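A hedged sketch of that Estimator configuration, using the TF 1.x contrib API named in the talk; the model and the num_gpus_per_worker value are placeholders.

```python
import tensorflow as tf

# Multi-node strategy: CollectiveAllReduceStrategy instead of MirroredStrategy.
distribution = tf.contrib.distribute.CollectiveAllReduceStrategy(
    num_gpus_per_worker=4)
config = tf.estimator.RunConfig(train_distribute=distribution)

model = tf.keras.applications.resnet50.ResNet50(weights=None)
model.compile(loss='categorical_crossentropy',
              optimizer=tf.train.GradientDescentOptimizer(0.2))

# Convert Keras to an Estimator, passing the distributed RunConfig.
estimator = tf.keras.estimator.model_to_estimator(keras_model=model,
                                                  config=config)
```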

[Slide 22]

CollectiveAllReduceStrategy

So let's look at this CollectiveAllReduceStrategy, because it's something different from the MirroredStrategy we talked about previously. What is it? Well, it is specifically designed for multiple worker nodes. It's essentially based on MirroredStrategy, but it adds functionality to deal with multiple hosts, or multiple workers, in the cluster. And the good thing about it is that it automatically selects the best algorithm for performing the AllReduce across the cluster. So what kinds of algorithms do we have for doing AllReduce in a multi-node configuration? One of them is very similar to what we have for a single node: ring AllReduce, in which case the gradients simply travel across the nodes, performing one overall ring reduction across multiple hosts and GPUs. So it's essentially the same as for a single node, except that the ring traverses hosts, with all of the penalties that of course brings, depending on the interconnect between those hosts.

[Slides 23-24]

Another algorithm is hierarchical AllReduce. What happens here is that, within each host, all of the GPUs first send their gradients to a single GPU card on that host. Then those lead GPUs perform an AllReduce across the nodes, and each GPU performing that operation propagates the result back to the individual GPUs within its own node.
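A toy NumPy sketch of hierarchical AllReduce, assuming two hosts with two GPUs each; real implementations overlap these steps and use the actual interconnects, but the structure is the same reduce-within-host, AllReduce-across-hosts, broadcast-back pattern described above.

```python
import numpy as np

def hierarchical_allreduce(grads_per_host):
    """grads_per_host: for each host, a list of per-GPU gradient arrays."""
    # Step 1: within each host, reduce all local GPU gradients onto one lead GPU.
    host_sums = [np.sum(gpu_grads, axis=0) for gpu_grads in grads_per_host]
    # Step 2: all-reduce across hosts between the lead GPUs (a plain sum here;
    # a real system would run ring all-reduce over the network).
    total = np.sum(host_sums, axis=0)
    # Step 3: each lead GPU broadcasts the result back to the GPUs on its host.
    return [[total.copy() for _ in gpu_grads] for gpu_grads in grads_per_host]

grads = [[1.0 * np.ones(4), 2.0 * np.ones(4)],   # host 0: GPU 0, GPU 1
         [3.0 * np.ones(4), 4.0 * np.ones(4)]]   # host 1: GPU 0, GPU 1
result = hierarchical_allreduce(grads)
# Every GPU now holds [10., 10., 10., 10.].
```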

Depending on the network and other characteristics of your setup and hardware, one of these approaches will work better, and the thing with CollectiveAllReduceStrategy is that it will automatically detect the best algorithm to use in your distributed cluster. So that was multi-node training with multiple accelerator cards within the nodes.

Other ways to scale to multiple nodes

[Slide 25]

There are also other ways to scale to multiple nodes with TensorFlow. How many of you are familiar with parameter servers? This is the classical way of doing distributed TensorFlow training. Eventually you should not continue to do it the classical way: once we roll out distribution strategies, that's the way to go. So what I'm describing here is essentially the parameter server strategy, but instead of describing it in the old, classical TensorFlow way, I'm going to describe how to do it with distribution strategies. Does that make sense? If it didn't, and you haven't used the classical approach, just don't worry about it.

[Slides 26-27]

Just listen to what I have to say here. To recap what the parameter server strategy is: it's a strategy where we have shared storage and a number of worker nodes working on batches from that shared storage. They work independently (well, not completely independently, as we'll see shortly), calculating gradients based on their batches. Then we have a number of parameter servers. When the workers finish a batch, they send their gradients up to the parameter servers. The parameter servers collect the updates from all the workers, combine the gradients, and then pass the updated variables back down to the workers. So it's not synchronous: the workers get updates to the variables in an asynchronous fashion, which has good sides and bad sides. The good side is that one worker can go down and the other workers can still continue as normal.

So how can we set this up in a distribution strategy cluster? It's really easy. In the Jinja file, instead of specifying only the worker replicas, we also specify ps_replicas, the number of parameter servers in our Kubernetes cluster. That is the Kubernetes setup. Now what about the code? That's also really easy. You saw the RunConfig, the config parameter, previously. Instead of using CollectiveAllReduceStrategy (I got that right this time), you use ParameterServerStrategy. So it's just another strategy type there. You still specify the number of GPUs per worker, you pass the config object to the Keras model_to_estimator call, and you're done (a sketch of this swap follows below). So very few lines of code need changing, even though we're talking about a massively different way of doing distributed TensorFlow training.

There is one more configuration that we are working on; I think we will have a release of it that you can at least try out. It's a really cool setup where you actually run distributed training from your laptop, and in this particular case, you have all of your model training code locally. So forget about the parameter servers for now.
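Before turning to that last configuration, here is a hedged sketch of the ParameterServerStrategy swap just described, again using the TF 1.x contrib API named in the talk; everything else (the model, the model_to_estimator call) stays the same as in the CollectiveAllReduceStrategy sketch earlier.

```python
import tensorflow as tf

# Only the strategy type changes: parameter server training instead of
# collective all-reduce. num_gpus_per_worker is a placeholder for your setup.
distribution = tf.contrib.distribute.ParameterServerStrategy(
    num_gpus_per_worker=4)
config = tf.estimator.RunConfig(train_distribute=distribution)
```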

[Slides 28-30]

Now we're back to multiple workers and AllReduce here. The only thing you fire up on these workers is tf_std_server.py, or whatever variant of it you want to use, because this code is also available in the tensorflow/ecosystem repository; you can go check out how we did it for this standard setup and change it any way you want. The thing is that this script, installed on the workers, doesn't contain the model program at all. When we fire up the model training from our laptop or workstation, it distributes the model over to those workers. So if you have any changes to your model code, you can just make them locally, and they will automatically be distributed out to all of the workers. Now you may say, oh, that's a hassle, because now I've got to install this script on all the workers. But you do not have to do that, because the only thing you do is specify the script parameter in the Jinja file that you've seen a couple of times now (we have the same number of workers here), and that means the script will be started on all of these nodes. So what we're talking about here is the capability to fire up a Kubernetes cluster with an arbitrary number of nodes and, without installing any code on them, use a local laptop that automatically distributes the model and the training to all of these worker nodes, just by having these two lines here.

[Slide 31]

What about the code? So again, we have the RunConfig here. This time, we're going to set a parameter called experimental_distribute to a DistributeConfig. As part of the DistributeConfig, we embed a CollectiveAllReduceStrategy with, as we saw before, the number of GPUs we have per worker. But the DistributeConfig requires one more parameter, and that is the remote cluster: the master node here needs to know the cluster to which it should send all the model code, the workers that are waiting there for the model code to be shared. So you've got to specify that parameter. Then you finish up your config object and pass it to model_to_estimator as the config argument (sketched below). As you've seen before, it's just a couple of lines of difference between these different configurations. That's really it for TensorFlow multi-node training.
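A hedged sketch of that standalone-client configuration, assuming the TF 1.x-era contrib API described in the talk (DistributeConfig and the experimental_distribute argument to RunConfig); the worker addresses are placeholders for the tf_std_server endpoints in your Kubernetes cluster.

```python
import tensorflow as tf

distribution = tf.contrib.distribute.CollectiveAllReduceStrategy(
    num_gpus_per_worker=4)

# The remote cluster tells the local "master" where to ship the model code.
dist_config = tf.contrib.distribute.DistributeConfig(
    train_distribute=distribution,
    remote_cluster={"worker": ["worker0.example.com:5000",
                               "worker1.example.com:5000",
                               "worker2.example.com:5000"]})
config = tf.estimator.RunConfig(experimental_distribute=dist_config)

model = tf.keras.applications.resnet50.ResNet50(weights=None)
model.compile(loss='categorical_crossentropy',
              optimizer=tf.train.GradientDescentOptimizer(0.2))
estimator = tf.keras.estimator.model_to_estimator(keras_model=model,
                                                  config=config)
```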

[Slide 32]
