
Install on AWS


Last updated 4 years ago

For AWS, the Gradient installer uses Terraform to provision an Elastic Kubernetes Service (EKS) cluster. You must complete the pre-installation steps before continuing.

Requirements

There are many ways to pass in your credentials so that Terraform can authenticate with your cloud provider. Most likely, you already have your credentials loaded through the AWS CLI, in which case Terraform detects them automatically during initialization. See the guide to configuring the AWS CLI for more information on setting up credentials and user profiles. The AWS user that installs Gradient must have broad read/write privileges across services, ideally administrative privileges in the account.

Do not remove the user later or you will lose access to the cluster.

You will also need aws-iam-authenticator (https://docs.aws.amazon.com/eks/latest/userguide/install-aws-iam-authenticator.html) installed on the computer or instance where you plan to run the installer.
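
Before moving on, it can help to confirm that the required tools are on your PATH. The following is a minimal sketch, not part of the Gradient installer: check_tools is a hypothetical helper name, and the list of tools you pass to it is an assumption based on the steps on this page.

```shell
#!/bin/sh
# Hypothetical helper: report whether each named tool is on PATH.
# Returns 0 if every tool was found, 1 if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_tools terraform aws aws-iam-authenticator reports on each tool in turn. Running aws sts get-caller-identity afterwards shows which AWS identity your current credentials resolve to.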

Configuration

Next, create a main.tf file in the local gradient-cluster directory you created earlier; main.tf will be a sibling of the backend.tf file, if you have already created one. Note: name the file main.tf to match the rest of this guide. Terraform loads every .tf file in the directory, so main.tf is read together with backend.tf.

Copy the Terraform configuration below into main.tf. Be sure to follow the value replacement instructions further below as well.

SSL Configuration

The Gradient installer can use Let's Encrypt to create an SSL certificate, verify it by making entries with your DNS provider, and install the certificate on your cluster to secure access to notebooks, model deployments, etc. For this to work, your domain's DNS provider must be on the supported list. To use this functionality, create a block in your main.tf file similar to the one in the example below. Use the letsencrypt_dns_name that matches your provider in the list, and provide the required authentication field(s) as specified in the letsencrypt_dns_settings column.

If you don't want to use automatic SSL, set the tls_cert and tls_key entries instead, and make sure the SSL certificate files exist at the paths specified (or change the paths in the main.tf file).

You can use either the Let's Encrypt block OR the manual certificate block, but not both.

module "gradient_aws" {
    source = "github.com/paperspace/gradient-installer?ref=master/gradient-aws"

    // name should only have letters, numbers, and dashes
    name = "cluster-name"
    aws_region = "us-east-1"

    artifacts_access_key_id = "artifacts-access-key-id"
    artifacts_path = "s3://artifacts-bucket"
    artifacts_secret_access_key = "artifacts-secret-access-key"
    
    cluster_apikey = "cluster-apikey-from-paperspace-com"
    cluster_handle = "cluster-handle-from-paperspace-com"
    domain = "gradient.mycompany.com"

    // insert an SSL block below - the first example is for Cloudflare DNS
    
    /*
    letsencrypt_dns_name = "cloudflare"
    letsencrypt_dns_settings = {
        CF_API_KEY = "[Global cloudflare key]"
        CF_API_EMAIL = "[Cloudflare email address]"
    }
    */

    // or disable automatic SSL by specifying cert files below
    // tls_cert = file("./certs/ssl-bundle.crt")
    // tls_key = file("./certs/ssl.key")
}

output "ELB_HOSTNAME" {
    value = module.gradient_aws.elb_hostname
}

Replace the following fields in the configuration above with the appropriate values:

  • name (the same name used when registering the new cluster in the Paperspace web console)

  • aws_region (your preferred AWS region)

  • artifacts_access_key_id (the access key ID for the bucket that was set up for artifacts storage)

  • artifacts_path (the full S3 path to the bucket)

  • artifacts_secret_access_key (the secret access key that pairs with artifacts_access_key_id)

  • cluster_apikey (provided during registration of the new cluster)

  • cluster_handle (provided during registration of the new cluster)

  • domain (same as what was entered during cluster registration)

  • The SSL settings: either configure automatic SSL, or place the SSL certificate files in your gradient-cluster directory and update the tls_cert and tls_key paths in main.tf to match.

Installation

Next, install Gradient using Terraform:

terraform init
terraform apply

The init step should take less than a minute, and the apply step may take 15 minutes or more. At the end of the apply step, the installer will return the AWS hostname of the load balancer in your new cluster.
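
Because the apply step is long-running, you may prefer to save and review a plan before applying it. The following is a minimal sketch under that assumption: plan_gradient and the plan file name gradient.tfplan are hypothetical names, not part of the installer, and the directory argument is assumed to contain your main.tf.

```shell
#!/bin/sh
# Hypothetical helper: initialize Terraform and save a plan for review.
# $1 is the directory containing main.tf (defaults to the current directory).
plan_gradient() {
    dir=${1:-.}
    (
        # Run in a subshell so the caller's working directory is unchanged.
        cd "$dir" || exit 1
        terraform init &&
        terraform plan -out=gradient.tfplan
    )
}
```

After reviewing the plan output, terraform apply gradient.tfplan applies exactly the saved plan. Note that applying a saved plan does not prompt for confirmation.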

Gradient requires two DNS CNAME records to make external services accessible. Use the hostname of the load balancer as the target for these records, as shown below.

Example:

*.gradient.mycompany.com [ELB_HOSTNAME]

gradient.mycompany.com [ELB_HOSTNAME]
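
The two records above can be written out from the domain and the load balancer hostname. The following is a minimal sketch: cname_records is a hypothetical helper that only prints the records in a generic "name CNAME target" form. You still create the records with your DNS provider.

```shell
#!/bin/sh
# Hypothetical helper: print the wildcard and base CNAME records.
# $1 is the cluster domain, $2 is the load balancer hostname.
cname_records() {
    domain=$1
    elb_hostname=$2
    printf '*.%s CNAME %s\n' "$domain" "$elb_hostname"
    printf '%s CNAME %s\n' "$domain" "$elb_hostname"
}
```

For example, cname_records gradient.mycompany.com "$(terraform output ELB_HOSTNAME)" fills in the hostname from the ELB_HOSTNAME output defined in main.tf. Depending on your Terraform version, the output value may be printed with surrounding quotes, which you would need to strip.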

Hot nodes

By default, hot nodes are set up for experiments, model deployments, notebooks, and tensorboards on one c5.xlarge instance each.

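Hot nodes can be reconfigured by setting k8s_node_asg_min_sizes inside the module block in the main.tf file, similar to the example below. Each entry sets the minimum number of instances kept running for that node group.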

    k8s_node_asg_min_sizes = {
        "experiment-cpu-small" = 1,
        "experiment-cpu-medium" = 0,
        "experiment-gpu-small" = 0,
        "experiment-gpu-medium" = 0,
        "experiment-gpu-large" = 0,

        "model-deployment-cpu-small" = 1,
        "model-deployment-cpu-medium" = 0,
        "model-deployment-gpu-small" = 0,
        "model-deployment-gpu-medium" = 0,
        "model-deployment-gpu-large" = 0,

        "notebook-cpu-small" = 1,
        "notebook-cpu-medium" = 0,
        "notebook-gpu-small" = 0,
        "notebook-gpu-medium" = 0,
        "notebook-gpu-large" = 0,

        "tensorboard-cpu-small" = 1,
        "tensorboard-cpu-medium" = 0,
        "tensorboard-gpu-small" = 0,
        "tensorboard-gpu-medium" = 0,
        "tensorboard-gpu-large" = 0
    }

Updating the Gradient cluster

To update Gradient, run terraform apply from the gradient-cluster folder.

Uninstalling Gradient

To uninstall Gradient, run terraform destroy from the gradient-cluster folder.

Managing the Kubernetes cluster with KUBECONFIG

Managing the Kubernetes cluster manually is not required to use Gradient.

For those familiar with Kubernetes, the installer generates a kubeconfig file in the gradient-cluster folder. To use the generated kubeconfig, AWS requires aws-iam-authenticator to be installed: https://docs.aws.amazon.com/eks/latest/userguide/install-aws-iam-authenticator.html
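
Once aws-iam-authenticator is installed, the generated kubeconfig can be used with kubectl. The following is a minimal sketch: use_kubeconfig is a hypothetical helper, and because this page does not give the name of the generated file, the path is passed in as an argument.

```shell
#!/bin/sh
# Hypothetical helper: point kubectl at a kubeconfig file and list the nodes.
# $1 is the path to the kubeconfig generated in the gradient-cluster folder.
use_kubeconfig() {
    kubeconfig_path=$1
    if [ ! -f "$kubeconfig_path" ]; then
        echo "kubeconfig not found: $kubeconfig_path" >&2
        return 1
    fi
    KUBECONFIG=$kubeconfig_path
    export KUBECONFIG
    kubectl get nodes
}
```

If the nodes are listed, kubectl is authenticating against the cluster. An authentication error here usually points back to the AWS credentials or to aws-iam-authenticator not being on your PATH.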