Distributed Machine Learning with MPI

This feature is currently only available to our Gradient Private Cloud customers. Contact Sales to learn more.

MPI (Message Passing Interface) is the de facto standard distributed communications framework for scientific and commercial parallel distributed computing.

Paperspace Gradient supports both the Open MPI and Intel MPI implementations.

$ gradient experiments create multinode \
--name mpi-test \
--experimentType MPI \
--workerContainer horovod/horovod:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 \
--workerMachineType p2.xlarge \
--workerCount 2 \
--masterContainer horovod/horovod:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 \
--masterMachineType p2.xlarge \
--masterCommand "mpirun --allow-run-as-root -np 1 --hostfile /generated/hostfile  -bind-to none -map-by slot  -x NCCL_DEBUG=INFO -mca pml ob1 -mca btl ^openib python examples/keras_mnist.py"  \
--masterCount 1 \
--workspace https://github.com/horovod/horovod.git \
--apiKey XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

How does MPI work?

The MPI command is executed only on the master node. The master then connects to the other workers to spin up their processes.

For this to work, the master requires password-less SSH access to all of the workers. There are many resources that describe how to set this up; a quick Google search will turn up step-by-step guides. It's not difficult to do, but it takes time to set up.

On Gradient, all of this setup is taken care of for you: all you need to do is run an MPI command. Continue reading to learn how!
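
To see this in action without any training code, you can launch a trivial program such as hostname through mpirun; each process prints the name of the node it ran on, confirming that the master spawned processes across the cluster. A minimal sketch, assuming a two-node experiment and the generated host file described below:

# Launch 2 processes, one per node listed in the generated host file.
# Each process runs hostname, so the output shows where it landed.
mpirun --allow-run-as-root -np 2 --hostfile /generated/hostfile hostname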

Prerequisites

To launch an MPI experiment, all you need is:

  • A Docker image with an MPI library installed (see the quick check below)

  • At least 2 machines (1 master, 1 worker)

  • Gradient CLI

That's it!
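
Not sure whether your Docker image includes MPI? A quick sanity check is to run mpirun --version inside the container. A minimal sketch, using the Horovod image from the example above:

# Prints the bundled Open MPI version; fails if no MPI is installed.
docker run --rm horovod/horovod:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 mpirun --version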

Security

By default, all inter-node communication happens over SSH. Before launching your workload, Gradient automatically generates new SSH keys and distributes them to every node used in the experiment.

Host File

Gradient generates a host file with the list of available nodes at:

--hostfile /generated/hostfile

Note: when using mpirun, be sure to specify the host file.
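
The exact contents depend on your experiment, but for a run with two workers the generated file would look something like the sketch below (the hostnames are hypothetical; the format is Open MPI's standard hostfile syntax, one node per line with an optional slot count):

# /generated/hostfile -- one line per node; slots = processes to place there
worker-1 slots=1
worker-2 slots=1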

mpirun

With Gradient, you have full control over the mpirun command.

Example mpirun command:

mpirun --allow-run-as-root -np 2 --hostfile /generated/hostfile python main.py 
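
For a fuller picture, here is a sketch of the same command with the extra flags used in the multinode example at the top of this page. These are standard Open MPI options commonly recommended for Horovod-style workloads; adjust them to suit your own:

# --allow-run-as-root            Open MPI refuses to run as root without this
# -np 2                          total number of processes to launch
# --hostfile /generated/hostfile the host file Gradient generates for you
# -bind-to none -map-by slot     don't pin processes to cores; fill slots in order
# -x NCCL_DEBUG=INFO             export an environment variable to every process
# -mca pml ob1 -mca btl ^openib  prefer TCP transports; skip InfiniBand verbs
mpirun --allow-run-as-root -np 2 --hostfile /generated/hostfile \
  -bind-to none -map-by slot -x NCCL_DEBUG=INFO \
  -mca pml ob1 -mca btl ^openib python main.py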

Show Me Some Examples!

Now that we have a good foundation in how distributed training and inter-node communication work, let's look at two examples.

You'll need a Gradient Enterprise cluster to run them (contact Sales to get started).

For simplicity's sake, we present two examples with relatively simple code, Horovod and ChainerMN, but these (especially Horovod) should give you a good idea of how to run any MPI job on Gradient.
