Gradient Docs
Gradient HomeHelp DeskCommunitySign up free
1.0.0
1.0.0
  • About Paperspace Gradient
  • Get Started
    • Quick Start
    • Core Concepts
    • Install the Gradient CLI
    • Common Errors
  • Tutorials
    • Tutorials List
      • Getting Started with Notebooks
      • Train a Model with the Web UI
      • Train a Model with the CLI
      • Advanced: Distributed training sample project
      • Registering Models in Gradient
      • Using Gradient Deployments
      • Using Custom Containers
  • Notebooks
    • Overview
    • Using Notebooks
      • The Notebook interface
      • Notebook metrics
      • Share a Notebook
      • Fork a Notebook
      • Notebook Directories
      • Notebook Containers
        • Building a Custom Container
      • Notebook Workspace Include Files
      • Community (Public) Notebooks
    • ML Showcase
    • Run on Gradient (GitHub badge)
  • Projects
    • Overview
    • Managing Projects
    • GradientCI
      • GradientCI V1 (Deprecated)
  • Workflows
    • Overview
      • Getting Started with Workflows
      • Workflow Spec
      • Gradient Actions
  • Experiments
    • Overview
    • Using Experiments
      • Containers
      • Single-node & multi-node CLI options
      • Experiment options
      • Gradient Config File
      • Environment variables
      • Experiment datasets
      • Git Commit Tracking
      • Experiment metrics
        • System Metrics
        • Custom Metrics
      • Experiment Logs
      • Experiment Ports
      • GradientCI Experiments
      • Diff Viewer
      • Hyperparameter Tuning
    • Distributed Training
      • Distributed Machine Learning with Tensorflow
      • Distributed Machine Learning with MPI
        • Distributed Training using Horovod
        • Distributed Training Using ChainerMN
  • Jobs
    • Overview
    • Using Jobs
      • Stop a Job
      • Delete a Job
      • List Jobs
      • Job Logs
      • Job Metrics
        • System Metrics
        • Custom Metrics
      • Job Artifacts
      • Public Jobs
      • Building Docker Containers with Jobs
  • Models
    • Overview
    • Managing Models
      • Example: Prepare a TensorFlow Model for Deployments
      • Model Path, Parameters, & Metadata
    • Public Models
  • Deployments
    • Overview
    • Managing Deployments
      • Deployment Containers
        • Custom Deployment Containers
      • Deployment States
      • Deployment Logs
      • Deployment Metrics
      • A Deployed Model's API Endpoint
        • Gradient + TensorFlow Serving
      • Deployment Autoscaling
      • Optimize Models for Inference
  • Data
    • Types of Storage
      • Managing Data in Gradient
        • Managing Persistent Storage with VMs
    • Storage Providers
    • Versioned Datasets
    • Public Datasets Repository
  • TensorBoards
    • Overview
    • Using Tensorboards
      • TensorBoards getting started with Tensorflow
  • Metrics
    • Metrics Overview
    • View and Query Metrics
    • Push Metrics
  • Secrets
    • Overview
    • Using Secrets
  • Gradient SDK
    • Gradient SDK Overview
      • Projects Client
      • Experiments Client
      • Models Client
      • Deployments Client
      • Jobs Client
    • End to end tutorial
    • Full SDK Reference
  • Instances
    • Instance Types
      • Free Instances (Free Tier)
      • Instance Tiers
  • Gradient Cluster
    • Overview
    • Setup
      • Managed Private Clusters
      • Self-Hosted Clusters
        • Pre-installation steps
        • Gradient Installer CLI
        • Terraform
          • Pre-installation steps
          • Install on AWS
          • Install on bare metal / VMs
          • Install on NVIDIA DGX
        • Let's Encrypt DNS Providers
        • Updating your cluster
    • Usage
  • Tags
    • Overview
    • Using Tags
  • Machines (Paperspace CORE)
    • Overview
    • Using Machines
      • Start a Machine
      • Stop a Machine
      • Restart a Machine
      • Update a Machine
      • Destroy a Machine
      • List Machines
      • Show a Machine
      • Wait For a Machine
      • Check a Machine's utilization
      • Check availability
  • Paperspace Account
    • Overview
    • Public Profiles
    • Billing & Subscriptions
    • Hotkeys
    • Teams
      • Creating a Team
      • Upgrading to a Team Plan
  • Release Notes
    • Product release notes
    • CLI/SDK Release notes
Powered by GitBook
On this page
  • Overview
  • Versions, Tags, and Messages
  • Committed state
  • Creating a Dataset and Dataset Version
  • Using Datasets
  • Jobs
  • Viewing Datasets
  • Viewing Dataset files
  1. Data

Versioned Datasets

PreviousStorage ProvidersNextPublic Datasets Repository

Last updated 4 years ago

Overview

Versioned Datasets are used to manage the flow of data with your machine learning workloads. Datasets have immutable versions that can be used to track your data as it changes. Dataset version can be used as input to Gradient workloads as well as outputs. Data is stored at a and will be cached on a Gradient cluster's shared storage for a period of time so that data will be available readily on repeated usage.

Gradient datasets require a private .

Versions, Tags, and Messages

Datasets have multiple versions that can be referenced. You can specify a message with a new dataset version to provide info around a newly created dataset version. In addition, you can tag a specific dataset version with a custom name as well. Here are the available ways to reference a dataset:

  • [dataset-id]:latest, this will use the latest version of your dataset

  • [dataset-id]:[dataset-version], this will the use the specified dataset-version

  • [dataset-id]:[dataset-tag], this will use the specified dataset version that the dataset-tag points to

Committed state

Dataset versions have a uncommittted and committed state. When a Dataset is uncommitted, you can modify or add files freely. When a Dataset is committed it will be immutable (will not allow any modifications). This allows workloads to be repeatable and deterministic with the provided Datasets.

Creating a Dataset and Dataset Version

$ gradient datasets versions create --id=dst364npcw6ccok --source-path=./some-data/
Created dataset version: dst364npcw6ccok:fo5rp4m
Committed dataset version: dst364npcw6ccok:fo5rp4m

Using Datasets

Jobs

You can use existing Datasets or create new ones with Gradient jobs. In the below scenario the following datasets actions are specified:

  • dst364npcw6ccok:fo5rp4m will be mounted to: /datasets/input-a

  • dst364npcw6ccok:fo34ram will be mounted to: /datasets/input-b

  • A dataset will be mounted to: /datasets/output-a which will create dataset: dst364npcw6ccok:latest

gradient jobs create \
  --clusterId=$CLUSTER \
  --machineType=$MACHINE_TYPE \
  --projectId=$PROJECT \
  --container=bash \
  --command='cat /datasets/input/hello.txt > /datasets/output/hello2.txt && date >> /datasets/output/hello2.txt' \
  --dataset=output-a@dst364npcw6ccok \
  --dataset=input-a@dst364npcw6ccok:fo5rp4m \
  --dataset=input-b@dst364npcw6ccok:fo34ram

Viewing Datasets

$ gradient datasets list
+------+-----------------+-------------------------+
| Name | ID              | Storage Provider        |
+------+-----------------+-------------------------+
| test | dst364npcw6ccok | test1 (splgct3arqdh77c) |
+------+-----------------+-------------------------+

$ gradient datasets details --id=dst364npcw6ccok
+-----------------+-------------------------+
| Name            | test                    |
+-----------------+-------------------------+
| ID              | dst364npcw6ccok         |
| Description     |                         |
| StorageProvider | test1 (splgct3arqdh77c) |
+-----------------+-------------------------+

Viewing Dataset files

$ gradient datasets files list --id=dst364npcw6ccok:fo5rp4m
+-----------+------+
| Name      | Size |
+-----------+------+
| hello.txt | 12   |
+-----------+------+
Storage Provider
Gradient Cluster