
Distributed Training using Horovod

Horovod is an open source framework for distributed deep learning. It is available for use with TensorFlow and several other deep learning frameworks.



Horovod Architecture

Horovod's cluster architecture differs from the parameter server architecture. Horovod uses Ring-AllReduce, in which the amount of data each node sends is proportional to the size of the model and nearly independent of the number of cluster nodes. This can be more efficient when training on a cluster where each node has multiple GPUs (and thus multiple worker processes).

Additionally, whereas the parameter server update process described on the previous page is asynchronous, in Horovod updates are synchronous. After all processes have completed their calculations for the current batch, the gradients calculated by each process circulate around the ring until every process has a complete set of gradients for the batch.

At that point, each process updates its local model weights, so every process has the same model weights before starting work on the next batch. For a diagram of how Ring-AllReduce works, see the Meet Horovod post linked under More info below.
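
To make the data flow concrete, here is a small single-process simulation of Ring-AllReduce summation. It is illustrative only: Horovod performs these exchanges across real worker processes via MPI/NCCL, and the worker count, gradient values, and ring_allreduce helper below are made up for this sketch.

import numpy as np

def ring_allreduce(grads):
    """Simulate Ring-AllReduce over a list of equal-length gradient vectors,
    one per simulated worker. Returns the summed vector as seen by each worker."""
    n = len(grads)
    # Each worker splits its gradient vector into n chunks
    chunks = [np.array_split(g.astype(float), n) for g in grads]

    # Reduce-scatter: after n-1 steps, worker i holds the full sum of chunk (i+1) % n
    for step in range(n - 1):
        outgoing = [chunks[i][(i - step) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i - step) % n] += outgoing[i]

    # All-gather: circulate the completed chunks so every worker ends up with all of them
    for step in range(n - 1):
        outgoing = [chunks[i][(i + 1 - step) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i + 1 - step) % n] = outgoing[i]

    return [np.concatenate(c) for c in chunks]

# Four simulated workers, each contributing a different gradient vector
workers = [np.arange(8) * (rank + 1) for rank in range(4)]
reduced = ring_allreduce(workers)
assert all(np.array_equal(r, sum(workers)) for r in reduced)

Each worker sends and receives only one chunk per step, which is why the per-node communication volume stays roughly constant as the ring grows.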

MPI

Horovod employs Message Passing Interface (MPI), a popular standard for managing communication between nodes in a high-performance cluster, and uses NVIDIA’s NCCL library for GPU-level communication.

Usage

The Horovod framework eliminates many of the difficulties of Ring-AllReduce cluster setup and works with several popular deep learning frameworks and APIs. For example, if you are using the popular Keras API, you can use either the reference Keras implementation or tf.keras directly with Horovod without converting to an intermediate API such as tf.Estimator.
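
As a minimal sketch of that pattern (not a complete training script; it assumes TensorFlow 2.x and the horovod[tensorflow] package are installed in your container, and the MNIST model and dataset are placeholders), a tf.keras script typically initializes Horovod, pins each process to one GPU, scales the learning rate by the number of workers, wraps the optimizer, and broadcasts the initial weights from rank 0:

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each worker process to a single GPU
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
dataset = (
    tf.data.Dataset.from_tensor_slices((x_train[..., tf.newaxis] / 255.0, y_train))
    .shard(hvd.size(), hvd.rank())   # each worker trains on its own slice of the data
    .shuffle(10000)
    .batch(128)
)

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Scale the learning rate by the number of workers and wrap the optimizer
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

callbacks = [
    # Broadcast initial weights from rank 0 so all workers start identically
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Only rank 0 prints progress, to avoid duplicate output from every worker
model.fit(dataset, epochs=3, callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)

A script following this pattern can be used as the main.py launched by mpirun in the examples below.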

mpirun

With Gradient, you have full control over the mpirun command.

Example mpirun command:

mpirun --allow-run-as-root -np 2 --hostfile /generated/hostfile python main.py 

Multi-node distributed training:

The only additional requirement for running the workload across multiple nodes is to pass the generated hostfile:

--hostfile /generated/hostfile
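
For reference, an MPI hostfile is a plain-text file that lists the hosts (and optionally the number of slots, i.e. processes, each host can run), one per line. The actual /generated/hostfile is produced by Gradient for your experiment; the hostnames and slot counts below are hypothetical and only illustrate the format:

# hypothetical hostfile for a 2-node experiment
worker-0 slots=1
worker-1 slots=1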

Creating a multi-node experiment using the CLI

The following command creates and starts a multi-node, MPI-based experiment within the Gradient Project specified with the --projectId option.

gradient experiments create multinode \
--name distributed-horovod \
--projectId xxxxxxx \
--clusterId xxxxxxx \
--experimentType MPI \
--workerContainer horovod/horovod:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 \
--workerMachineType p3.2xlarge \
--workerCommand "sleep infinity" \
--workerCount 2 \
--masterContainer horovod/horovod:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 \
--masterMachineType p3.2xlarge \
--masterCommand "mpirun --allow-run-as-root --hostfile /generated/hostfile -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib python main.py"  \
--masterCount 1 \
--workspace https://github.com/Paperspace/horovod-distributed-example.git

More info

Horovod's GitHub repo has a good explanation of the -mca parameters, and it also contains a good set of examples. For more background, see Meet Horovod: Uber's Open Source Distributed Deep Learning Framework for TensorFlow.