Versioned Datasets
Overview
Versioned Datasets manage the flow of data through your machine learning workloads. Datasets have immutable versions that let you track your data as it changes. Dataset versions can be used as inputs to Gradient workloads as well as outputs. Data is stored at a Storage Provider and is cached on a Gradient cluster's shared storage for a period of time, so that it is readily available on repeated use.
Gradient datasets require a private Gradient Cluster.
Versions, Tags, and Messages
Datasets have multiple versions that can be referenced. You can attach a message to a new dataset version to describe it. You can also tag a specific dataset version with a custom name. Here are the available ways to reference a dataset:
[dataset-id]:latest uses the latest version of your dataset
[dataset-id]:[dataset-version] uses the specified dataset-version
[dataset-id]:[dataset-tag] uses the dataset version that the dataset-tag points to
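The reference formats above all share the shape [dataset-id]:[selector], where the selector is latest, a version, or a tag. As a minimal sketch, a reference could be split as shown below. The names DatasetRef and parse_dataset_ref are hypothetical helpers for illustration and are not part of Gradient; the rule that a selector is always required is also an assumption.

```python
from typing import NamedTuple


class DatasetRef(NamedTuple):
    """Hypothetical container for a parsed dataset reference."""
    dataset_id: str
    selector: str  # "latest", a dataset version, or a dataset tag


def parse_dataset_ref(ref: str) -> DatasetRef:
    """Split '[dataset-id]:[selector]' into its two parts.

    Assumption: both parts must be present. The reference string alone
    does not reveal whether the selector is a version or a tag.
    """
    dataset_id, sep, selector = ref.partition(":")
    if not dataset_id or not sep or not selector:
        raise ValueError(f"expected [dataset-id]:[version-or-tag], got {ref!r}")
    return DatasetRef(dataset_id, selector)
```

For example, parse_dataset_ref("dst364npcw6ccok:latest") yields a dataset_id of dst364npcw6ccok and a selector of latest.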
Committed state
Dataset versions have an uncommitted and a committed state. While a dataset version is uncommitted, you can modify or add files freely. Once it is committed, it is immutable and does not allow any modifications. This allows workloads that use the provided Datasets to be repeatable and deterministic.
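The uncommitted and committed states can be pictured with a small in-memory model. This is only an illustrative sketch of the behavior described above: the DatasetVersion class and its methods are hypothetical and do not correspond to any Gradient API.

```python
class DatasetVersion:
    """Hypothetical in-memory model of a dataset version's lifecycle."""

    def __init__(self) -> None:
        self.files: dict[str, bytes] = {}
        self.committed = False

    def put_file(self, name: str, data: bytes) -> None:
        # Files may be added or overwritten only while uncommitted.
        if self.committed:
            raise PermissionError("dataset version is committed and immutable")
        self.files[name] = data

    def commit(self) -> None:
        # After this call, the version no longer accepts modifications.
        self.committed = True
```

In this model, calling put_file before commit succeeds, while calling it afterwards raises PermissionError, mirroring the rule that committed versions cannot change.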
Creating a Dataset and Dataset Version
Using Datasets
Jobs
You can use existing Datasets or create new ones with Gradient jobs. In the scenario below, the following dataset actions are specified:
dst364npcw6ccok:fo5rp4m will be mounted to: /datasets/input-a
dst364npcw6ccok:fo34ram will be mounted to: /datasets/input-b
A dataset will be mounted to: /datasets/output-a which will create dataset: dst364npcw6ccok:latest
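As a rough sketch of what a job's own code might do with these mounts, the function below copies every file from a set of input directories into an output directory. It assumes that mounted datasets appear to the job as ordinary directories; merge_inputs is a hypothetical helper, not part of Gradient.

```python
import shutil
from pathlib import Path


def merge_inputs(input_dirs, output_dir):
    """Copy all files from each input directory into output_dir.

    Files land under output_dir/<input directory name>/ so that inputs
    with identical file names do not overwrite each other.
    Returns the list of destination paths that were written.
    """
    output_dir = Path(output_dir)
    copied = []
    for input_dir in map(Path, input_dirs):
        for src in sorted(input_dir.rglob("*")):
            if src.is_file():
                dest = output_dir / input_dir.name / src.relative_to(input_dir)
                dest.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, dest)
                copied.append(dest)
    return copied
```

With the mounts listed above, a job could call merge_inputs(["/datasets/input-a", "/datasets/input-b"], "/datasets/output-a") so that the files written under /datasets/output-a become the contents of the output dataset.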
Viewing Datasets
Viewing Dataset files