
Advanced: Distributed training sample project


Last updated 4 years ago

One of the areas we focus on with Gradient is distributed training, which can be extremely valuable for decreasing training time but is notoriously difficult to orchestrate. We put together a sample project that provides example code for both single-node and multi-node (distributed) training, to showcase how easy it is to take a basic training job and scale it up across multiple instances on Gradient.

The sample project is an object detection demo based on Detectron2, using PyTorch and the COCO dataset. It also includes a step at the end to take your trained model and deploy it as an API endpoint. The project is available on GitHub at https://github.com/Paperspace/object-detection-segmentation.

Training & Evaluation

We provide an example script in "training/train_net.py" that trains your model. You can use it as a reference for writing your own training script.

Setup Dataset

The datasets are assumed to exist in a directory of the form /data/DATASET. Under this directory, the script looks for datasets in the structure described below.

/data/coco/
# Example code
import os
dataset_dir = os.path.join(os.getenv("DETECTRON2_DATASETS", "/data"), "coco")

Expected dataset structure for COCO instance/keypoint detection:

coco/
  annotations/
    instances_{train,val}2017.json
    person_keypoints_{train,val}2017.json
  {train,val}2017/
    # image files that are mentioned in the corresponding json

You can download a tiny version of the COCO dataset with training/download_coco.sh.
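The expected layout above can be sanity-checked before submitting an experiment. The helper below is a sketch (check_coco_layout is not part of the repo); it assumes the same DETECTRON2_DATASETS lookup shown in the example code above:

```python
import os

# Hypothetical helper: verify the expected COCO layout before launching
# training. It resolves the dataset root the same way the training script
# does: the DETECTRON2_DATASETS environment variable, defaulting to /data.
def check_coco_layout(default_root="/data"):
    dataset_dir = os.path.join(os.getenv("DETECTRON2_DATASETS", default_root), "coco")
    expected = [
        "annotations/instances_train2017.json",
        "annotations/instances_val2017.json",
        "train2017",
        "val2017",
    ]
    # Collect any expected files/directories that are absent.
    missing = [p for p in expected
               if not os.path.exists(os.path.join(dataset_dir, p))]
    return dataset_dir, missing
```

Running this on the machine (or container) that will host the data gives a quick yes/no before a long training run starts.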

COCO Dataset

Probably the most widely used dataset today for object localization is COCO: Common Objects in Context. Provided here are all the files from the 2017 version, along with an additional subset dataset created by fast.ai. Details of each COCO dataset are available from the COCO dataset page. The fast.ai subset contains all images that contain at least one of six selected categories, restricting annotated objects to just those categories: chair, couch, tv, remote, book, and vase.

Run Training on Gradient

Gradient CLI Installation

Install the Gradient CLI, then make sure to obtain an API key and register it:

pip install gradient
gradient apiKey XXXXXXXXXXXXXXXXXXX

Train on a single GPU

Note: training on a single GPU will take a long time, so be prepared to wait!

gradient experiments run singlenode \
  --name mask_rcnn \
  --projectId <some project> \
  --container devopsbay/detectron2-cuda:v0 \
  --machineType p2.xlarge \
  --command "sudo python training/train_net.py --config-file training/configs/mask_rcnn_R_50_FPN_1x.yaml --num-gpus 1 SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0025" \
  --workspace https://github.com/Paperspace/object-detection-segmentation.git \
  --datasetName coco \
  --datasetUri s3://fast-ai-coco/train2017.zip \
  --clusterId <cluster id>

The COCO dataset is downloaded to the ./data/coco/train2017 directory. Model results are stored in the ./models directory.

Running distributed multi-node on a Gradient Enterprise private cloud cluster

To run an experiment on a Gradient private cluster, we need to add a few additional parameters. Note that the seven workers plus one parameter server add up to the --num-machines 8 value passed to the training script:

gradient experiments run multinode \
  --name mask_rcnn_multinode \
  --projectId <some project> \
  --workerContainer devopsbay/detectron2-cuda:v0 \
  --workerMachineType p2.xlarge \
  --workerCount 7 \
  --parameterServerContainer devopsbay/detectron2-cuda:v0 \
  --parameterServerMachineType p2.xlarge \
  --parameterServerCount 1 \
  --experimentType GRPC \
  --workerCommand "sudo python training/train_net.py --config-file training/configs/mask_rcnn_R_50_FPN_1x.yaml --num-machines 8" \
  --parameterServerCommand "sudo python training/train_net.py --config-file training/configs/mask_rcnn_R_50_FPN_1x.yaml --num-machines 8" \
  --workspace https://github.com/Paperspace/object-detection-segmentation.git \
  --datasetName coco \
  --datasetUri s3://fast-ai-coco/train2017.zip \
  --clusterId <cluster id>

Deploying your model on Gradient

This example will load the previously trained model and launch a web application with a simple interface for making predictions.

gradient deployments create \
  --name mask_rcnn4 --instanceCount 1 \
  --imageUrl devopsbay/detectron2-cuda:v0 \
  --machineType p2.xlarge \
  --command "sudo python demo/app.py" \
  --workspace https://github.com/Paperspace/object-detection-segmentation.git \
  --deploymentType Custom \
  --clusterId <cluster id> \
  --modelId <model id>
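Once the deployment is running, you can exercise the endpoint programmatically. The client below is a sketch: the endpoint URL, route, and payload format are assumptions, so check the demo app (demo/app.py) for the actual route and request shape:

```python
import json
import urllib.request

# Hypothetical client sketch: POST raw image bytes to a deployed model's
# endpoint and parse the JSON response. The URL and content type are
# assumptions; adapt them to the routes defined in demo/app.py.
def predict(endpoint_url, image_bytes):
    req = urllib.request.Request(
        endpoint_url,
        data=image_bytes,  # a request with a body is sent as POST
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```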

This demo has built-in support for a few datasets; see the docs on using Datasets with Gradient for details.
