Distributed Machine Learning with MPI
This feature is currently only available to our Gradient Private Cloud customers. Contact Sales to learn more.
MPI (Message Passing Interface) is the de facto standard distributed communications framework for scientific and commercial parallel distributed computing.
Paperspace Gradient supports both Open MPI and Intel MPI implementations.
How does MPI work?
The MPI command is executed only on the master worker; the master worker then connects to the other workers to spin up their processes.
For this to work, the master worker requires password-less SSH access to all of the workers. There are many resources that describe how to set this up manually; it is not difficult, but it does take time.
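To make this concrete, here is what launching processes across nodes looks like on any working MPI setup (not specific to Gradient). The host file path is a placeholder, and hostname simply prints the name of the node each process lands on:

```bash
# Launch 4 processes across the nodes listed in the host file.
# Each process runs "hostname", printing the node it was placed on.
mpirun -np 4 --hostfile /path/to/hostfile hostname
```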
On Gradient, all of this setup is taken care of for you: all you need to do is run an MPI command. Continue reading to learn how!
Prerequisites
To launch an MPI experiment, all you need is:
A Docker image with an MPI library installed
At least 2 machines (1 master, 1 worker)
The Gradient CLI
A Gradient Enterprise cluster (contact Sales to get started)
That's it!
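To see how these pieces fit together, here is a sketch of launching a multinode MPI experiment with the Gradient CLI. It is illustrative only: the name, project ID, container images, machine types, and commands are all placeholders, and flag names can differ between CLI versions, so consult gradient experiments run multinode --help on your cluster.

```bash
# Illustrative sketch only: every value below is a placeholder,
# and flag names may differ between Gradient CLI versions.
gradient experiments run multinode \
  --name mpi-example \
  --projectId <your-project-id> \
  --experimentType MPI \
  --masterContainer horovod/horovod:latest \
  --masterMachineType p3.2xlarge \
  --masterCommand "mpirun ..." \
  --masterCount 1 \
  --workerContainer horovod/horovod:latest \
  --workerMachineType p3.2xlarge \
  --workerCommand "sleep infinity" \
  --workerCount 1
```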
Security
By default, all inter-node communication happens over SSH. Before launching your workload, Gradient automatically generates new SSH keys and distributes them across all nodes that will be used in the experiment.
Host File
Gradient will generate a host file with a list of the available nodes at:
Note: when using mpirun, be sure to specify the host file.
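For reference, an Open MPI host file is plain text with one node per line and an optional slot count, which is the number of processes to start on that node. The hostnames below are placeholders:

```
# One node per line; "slots" is how many processes to launch there.
gradient-master slots=1
gradient-worker-1 slots=1
```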
mpirun
With Gradient, you have full control over the mpirun command.
Example mpirun
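As a sketch, a Horovod-style invocation might look like the following. The process count, host file path, and training script (train.py) are placeholders to adapt to your setup:

```bash
# Placeholders: adjust -np, the host file path, and the script.
mpirun --allow-run-as-root \
  -np 2 --hostfile /path/to/hostfile \
  -bind-to none -map-by slot \
  -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
  -mca pml ob1 -mca btl ^openib \
  python train.py
```

The -x flags forward environment variables to the launched processes, and -mca pml ob1 -mca btl ^openib is a common Horovod recommendation to keep Open MPI off RDMA transports when NCCL handles GPU communication.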
Show Me Some Examples!
Now that we have a good foundation in how distributed training and inter-node communication work, let's look at two examples.
For simplicity's sake, we present two examples (Horovod and ChainerMN) with relatively simple code, but they (especially Horovod) should give you a good idea of how to run any MPI job on Gradient.