Install on bare metal / VMs

Last updated 4 years ago

This installation method works for any Ubuntu-based hosts, whether they run on bare metal, as VMs in a public cloud other than AWS, or on some other infrastructure. Terraform provisions a Kubernetes cluster on the hosts and requires SSH access to every host to do so. Note that autoscaling is not available in this scenario: all nodes run at all times. You must follow the pre-installation steps before continuing.

Cluster Node Requirements

Each node in your Gradient cluster must have:

  • Ubuntu 18.04

  • NFS server available to all nodes

  • Docker installed on all nodes (or set "setup_docker = true" in your main.tf Terraform config below to have gradient-installer set up Docker via Terraform)

  • Default Docker runtime set to nvidia in /etc/docker/daemon.json (or set "setup_nvidia = true" in your main.tf Terraform config below to have gradient-installer configure this via Terraform)

The following is an example of how the added line for configuring nvidia as your default Docker runtime will appear in `/etc/docker/daemon.json`. Do not remove any pre-existing content when making this change.
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
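
After editing the file, a small shell function can confirm that the default runtime line is present. This is a minimal sketch: the function name has_nvidia_default_runtime is made up for illustration, and it only pattern-matches the text rather than parsing the JSON.

```shell
# Hypothetical helper (not part of gradient-installer): succeeds if the given
# daemon.json sets "default-runtime" to "nvidia". It matches text only and
# does not check that the file is valid JSON.
has_nvidia_default_runtime() {
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$1"
}

# Example use on a node:
#   has_nvidia_default_runtime /etc/docker/daemon.json && echo "nvidia is the default runtime"
```

Docker reads this file at startup, so the daemon has to be restarted before a change takes effect. After that, docker info also reports the default runtime.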

For each node, you must also:

  • Ensure your SSH user is a member of the docker group in /etc/group. The entry should look similar to this (the group ID may differ on your system):

docker:x:999:your-user

  • Ensure your SSH public key is installed on each host

  • Ensure sudo is enabled for the account you're logging into

  • Ensure /etc/ssh/sshd_config contains the following setting, then reload the SSH service by running service ssh reload:

AllowTcpForwarding yes
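
Two of the per-node checks above can be scripted. The sketch below assumes you run it on the node itself; the function names are illustrative, and both checks are simple text matches (for example, they do not account for a user whose primary group is docker, or for Match blocks in sshd_config).

```shell
# Hypothetical helpers for checking one node; the names are illustrative only.

# Succeeds if user $1 is listed as a member of the docker group in the
# group file $2 (defaults to /etc/group).
in_docker_group() {
    awk -F: -v u="$1" '
        $1 == "docker" {
            n = split($4, members, ",")
            for (i = 1; i <= n; i++) if (members[i] == u) found = 1
        }
        END { exit !found }
    ' "${2:-/etc/group}"
}

# Succeeds if the sshd config file $1 (defaults to /etc/ssh/sshd_config)
# contains an uncommented "AllowTcpForwarding yes" line.
tcp_forwarding_enabled() {
    grep -Eiq '^[[:space:]]*AllowTcpForwarding[[:space:]]+yes[[:space:]]*$' \
        "${1:-/etc/ssh/sshd_config}"
}

# Example use:
#   in_docker_group ubuntu || echo "ubuntu is not in the docker group"
#   tcp_forwarding_enabled || echo "AllowTcpForwarding is not set to yes"
```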

Configuration

Next, create a main.tf file in the local gradient-cluster directory that you created earlier. It will sit alongside the backend.tf file, if you have already created one. Note: this file must be named main.tf since Terraform looks for this configuration file by name.

Copy and paste the Terraform configuration below into main.tf. Be sure to also follow the value replacement instructions that come after it.

SSL Configuration

If you don't want to use automatic SSL, use the tls_cert and tls_key entries instead, and make sure the SSL certificate files exist at the paths specified there (or change the paths in main.tf to match your files).

You can use either the Let's Encrypt block OR the manual certificate block, but not both.

module "gradient_metal" {
    source = "github.com/paperspace/gradient-installer?ref=master/gradient-metal"

    // name must only have letters, numbers, and dashes
    name = "cluster-name"
    artifacts_access_key_id = "artifacts-access-key-id"
    artifacts_path = "s3://artifacts-bucket"
    artifacts_secret_access_key = "artifacts-secret-access-key"

    cluster_apikey = "cluster-apikey-from-paperspace-com"
    cluster_handle = "cluster-handle-from-paperspace-com"
    domain = "gradient.mycompany.com"

    k8s_master_node = {
        ip = "master_node_ip1"
        // internal-address = "private_master_node_ip1"
        pool-type = "cpu"
        pool-name = "metal-cpu"
    }
    k8s_workers = [
        {
            ip = "worker_ip1"
            // internal-address = "private_worker_ip1"
            pool-type = "gpu"
            pool-name = "metal-gpu"
        },
        {
            ip = "worker_ip2"
            // internal-address = "private_worker_ip2"
            pool-type = "cpu"
            pool-name = "metal-cpu"
        }
    ]
    
    // Additional hostnames or IPs used to access kubernetes
    // k8s_sans = [
    //   "lb.kubernetes.com",
    //   "8.8.8.8" 
    // ]

    // Uncomment to set up docker
    // setup_docker = true 

    // Uncomment to set up nvidia drivers
    // setup_nvidia = true
    // reboot_gpu_nodes = true

    shared_storage_path = "/srv/gradient"
    shared_storage_server = "shared-nfs-storage.com"
    ssh_key_path = "~/.ssh/gradient_rsa"
    ssh_user = "ubuntu"

    // insert an SSL block below - the first example is for Cloudflare DNS

    /*
    // Example using cloudflare, check docs for list of supported DNS providers
    letsencrypt_dns_name = "cloudflare"
    letsencrypt_dns_settings = {
        CF_API_KEY = "[Global cloudflare key]"
        CF_API_EMAIL = "[Cloudflare email address]"
    }
    */

    // or disable automatic SSL by specifying cert files below
    // tls_cert = file("./certs/ssl-bundle.crt")
    // tls_key = file("./certs/ssl.key")
}

Replace the following fields in the configuration above with the appropriate values:

  • name (the same name used when registering the new cluster in the Paperspace web console)

  • artifacts_access_key_id (the key for the bucket that was set up for artifacts storage)

  • artifacts_path (the full S3 path to the bucket)

  • artifacts_secret_access_key

  • cpu_selector (node selector to run CPU workloads, defaults to "metal-cpu")

  • cluster_apikey (provided during registration of the new cluster)

  • cluster_handle (provided during registration of the new cluster)

  • domain (same as what was entered during cluster registration)

  • gpu_selector (node selector to run GPU workloads, defaults to "metal-gpu")

  • master_node_ip1, worker_ip1, worker_ip2 (see below for IP networking info)

  • shared_storage_path and shared_storage_server (see below for NFS info)

  • ssh_key_path (for the key whose public key is on the nodes being configured)

  • ssh_user (an SSH user who has the above public key in its authorized_keys file)

  • Also, either use automatic SSL or be sure the SSL certificate files are located in your gradient-cluster directory, and replace the filenames in your main.tf configuration to match them as needed.
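
Before running Terraform, it can help to check that the replacements were actually made. The sketch below is not part of gradient-installer: the helper names and the PLACEHOLDERS variable are made up for illustration, and the placeholder list is copied from the sample configuration above. It checks the naming rule from the comment in that configuration (letters, numbers, and dashes only) and looks for sample values that were left in place.

```shell
# Hypothetical pre-flight helpers for main.tf; none of this is part of
# gradient-installer.

# Succeeds only if the cluster name is non-empty and contains nothing but
# letters, digits, and dashes.
valid_cluster_name() {
    case "$1" in
        "" | *[!A-Za-z0-9-]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Placeholder values copied from the sample configuration.
PLACEHOLDERS='cluster-name|artifacts-access-key-id|s3://artifacts-bucket'
PLACEHOLDERS="$PLACEHOLDERS|artifacts-secret-access-key"
PLACEHOLDERS="$PLACEHOLDERS|cluster-apikey-from-paperspace-com"
PLACEHOLDERS="$PLACEHOLDERS|cluster-handle-from-paperspace-com"
PLACEHOLDERS="$PLACEHOLDERS|master_node_ip1|worker_ip[0-9]+"

# Prints every line of file $1 that still holds one of those placeholders
# (with its line number) and fails if any are found.
find_placeholders() {
    if grep -nE "\"($PLACEHOLDERS)\"" "$1"; then
        return 1
    fi
}

# Example use from the gradient-cluster directory:
#   valid_cluster_name "my-cluster" || echo "invalid cluster name"
#   find_placeholders main.tf || echo "replace the values listed above"
```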

IP networking

Each node should have an IP address that's accessible from the computer where the Gradient installer is being run. There must be one master node and at least two workers. All worker nodes must be able to access the master node, and the master node must be accessible from the internet.

All nodes must be able to access various hosts on the internet, including Paperspace's hub sites, logging sites, and Docker Hub.
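
The first requirement, that each node is reachable from the computer running the installer, can be checked with a non-interactive SSH login. This is a rough sketch under stated assumptions: check_nodes and the SSH_BIN override are hypothetical names introduced here, and the check only covers SSH reachability from your machine, not node-to-node or outbound internet access.

```shell
# Hypothetical helper: tries a non-interactive SSH login to each node and
# reports which ones fail. Arguments: key path, SSH user, then node IPs.
# SSH_BIN is an assumption added so the ssh command can be swapped out.
check_nodes() {
    key=$1
    user=$2
    shift 2
    failed=0
    for ip in "$@"; do
        if "${SSH_BIN:-ssh}" -i "$key" -o BatchMode=yes -o ConnectTimeout=5 \
            "$user@$ip" true; then
            echo "ok: $ip"
        else
            echo "unreachable: $ip"
            failed=1
        fi
    done
    return "$failed"
}

# Example use, with the same key, user, and IPs as in main.tf:
#   check_nodes "$HOME/.ssh/gradient_rsa" ubuntu master_node_ip1 worker_ip1 worker_ip2
```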

NFS setup

The Gradient installer requires an NFS host for runtime file storage. This server should have a high-bandwidth, low-latency connection to the cluster, ideally within the same datacenter or cloud region.
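
One way to confirm that a node can see the NFS export is to query the server with showmount and look for the shared path. The sketch below assumes the NFS client tools are installed on the node and that the server answers showmount queries. The function name export_listed is made up for illustration, and it only matches an export whose path is exactly the one given.

```shell
# Hypothetical helper: reads `showmount -e` output on stdin and succeeds if
# the given path appears as an exported directory.
export_listed() {
    awk -v p="$1" '$1 == p { found = 1 } END { exit !found }'
}

# Example use on a node, with the values from main.tf:
#   showmount -e shared-nfs-storage.com | export_listed /srv/gradient
```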

Installation

Next, install and configure the nodes using Terraform:

terraform init
terraform apply

The init step should take less than a minute, and the apply step may take 15 minutes or more.

If you chose to install the NVIDIA CUDA drivers, a reboot of all GPU workers is required.

Gradient requires two DNS A records to make external services accessible. Use the IP address of the master node as the target for these records, as shown below.

Example:

*.gradient.mycompany.com [master node ip address]

gradient.mycompany.com [master node ip address]
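
Once the records are created, a lookup can confirm that both names resolve to the master node. This is a minimal sketch assuming dig is installed; resolves_to and the DIG_BIN override are hypothetical names, and 203.0.113.10 is a stand-in for your master node's IP address.

```shell
# Hypothetical helper: succeeds if hostname $1 has an A record equal to IP $2.
# DIG_BIN is an assumption added so the lookup command can be swapped out.
resolves_to() {
    "${DIG_BIN:-dig}" +short "$1" A | grep -Fxq -- "$2"
}

# Example use (any label under the domain exercises the wildcard record):
#   resolves_to gradient.mycompany.com 203.0.113.10
#   resolves_to check.gradient.mycompany.com 203.0.113.10
```

New DNS records can take a while to propagate, so a failed lookup right after creating them is not necessarily a configuration error.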

Managing the Kubernetes cluster with KUBECONFIG

For those familiar with Kubernetes, the installation generates a kubeconfig file in the gradient-cluster folder. To use the generated KUBECONFIG, the computer running kubectl needs network access to the cluster's master node.

Managing the Kubernetes cluster manually is not required to use Gradient.
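
If you do want to inspect the cluster, kubectl reads the path of its configuration from the KUBECONFIG environment variable. The sketch below wraps that in a hypothetical helper, use_kubeconfig. The name of the generated file is not stated here, so the path in the example is a placeholder.

```shell
# Hypothetical helper: points kubectl at a kubeconfig file by exporting
# KUBECONFIG, after checking that the file exists.
use_kubeconfig() {
    if [ ! -f "$1" ]; then
        echo "kubeconfig not found: $1" >&2
        return 1
    fi
    KUBECONFIG=$1
    export KUBECONFIG
}

# Example use (replace the path with the file generated in your
# gradient-cluster folder):
#   use_kubeconfig /path/to/gradient-cluster/<generated-kubeconfig-file>
#   kubectl get nodes
```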

Updating the Gradient cluster

If you created a Terraform provider file in S3 during the pre-installation steps, updating to the latest version of Gradient is simple: run terraform apply from the gradient-cluster folder.

Uninstalling Gradient

To uninstall Gradient, let Terraform handle it by running terraform destroy from the gradient-cluster folder.

The Gradient installer can use Let's Encrypt to create an SSL certificate, verify it by making entries with your DNS provider, and install the certificate on your cluster to secure access to notebooks, model deployments, etc. For this to work, your domain's DNS provider must be on the supported list. To use this functionality, create a block in your main.tf file similar to the commented-out Cloudflare example in the configuration above. Use the letsencrypt_dns_name that matches your provider in the supported list, and provide the required authentication field(s) as specified in the letsencrypt_dns_settings column of that list.