# Experiment datasets

## About

When executing an experiment in Gradient, you may optionally supply one or more datasets to be downloaded into your experiment's environment prior to execution. Datasets can be downloaded from an S3 object or folder (including an entire bucket). Gradient enables teams to run reproducible machine learning experiments by taking advantage of S3 ETags and Version IDs, which together let you verify that a dataset exactly matches across training runs and know precisely which version of a dataset you are using.

{% hint style="info" %}
Gradient private datasets with versioning are a Gradient Enterprise feature. [Contact Sales](https://info.paperspace.com/contact-sales) for inquiries!
{% endhint %}
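For objects uploaded to S3 in a single part, the ETag is simply the MD5 hash of the object's contents, so you can compute the expected ETag locally before pinning it in an experiment. The sketch below illustrates that (the multipart-upload ETag format is different, and the file path is hypothetical):

```python
import hashlib

def s3_etag_single_part(path: str) -> str:
    """Compute the MD5 hex digest of a file. For objects uploaded to S3
    in a single part, this equals the object's ETag (without the quotes).
    Multipart uploads produce a different, composite ETag format."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large dataset files don't load into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    return md5.hexdigest()
```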

### S3 Datasets

Datasets are downloaded and mounted read-only at `/data/DATASET` within your experiment jobs using the supplied AWS credentials. Credentials are optional for public buckets. The name of the dataset is the `basename` of the last item in the S3 path; e.g. `s3://my-bucket/mnist.zip` would have the name `mnist`, and `s3://my-bucket` would have the name `my-bucket`. The name may be overridden with the optional `name` parameter.
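The naming rule above can be sketched in Python (an illustration of the described behavior, not Gradient's actual implementation):

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def dataset_name(uri: str) -> str:
    """Sketch of the naming rule: the name is the basename of the last
    item in the S3 path, with any file extension stripped; a bare bucket
    URI falls back to the bucket name."""
    parsed = urlparse(uri)
    path = PurePosixPath(parsed.path)
    if path.name:          # e.g. s3://my-bucket/mnist.zip
        return path.stem   # -> "mnist"
    return parsed.netloc   # e.g. s3://my-bucket -> "my-bucket"
```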

```
datasets: [
    {
        "uri": "s3://my-bucket/mnist-modified.zip",
        "awsSecretAccessKey": "secret:<some_secret_name>",
        "awsAccessKeyId": "secret:<some_other_secret_name>",
        "name": "mnist"
    }
]
```

{% hint style="info" %}
We highly recommend using the secrets feature for S3 datasets, as credential values are otherwise passed in as plain text. Referencing secrets as `secret:<some_secret_name>` ensures that your credentials are encrypted and protected. You can learn more about using secrets [here](https://docs.paperspace.com/gradient/secrets/using-secrets).
{% endhint %}

{% tabs %}
{% tab title="CLI" %}
You can launch an experiment and specify the desired S3 dataset with ETags using the CLI as follows:

```
$ gradient experiments run singlenode \
  --projectId prda8mhcq \
  --workspace https://github.com/Paperspace/mnist-sample.git \
  ... \
  --datasetAwsAccessKeyId secret:<some_secret_name> \
  --datasetAwsSecretAccessKey secret:<some_other_secret_name> \
  --datasetName fashion \
  --datasetUri s3://my-bucket-name/fashion-mnist.zip
```

When launching an experiment using a `config.yaml`, pass in multiple datasets using the following structure:

```
datasetUri:
  - "s3://some.dataset/uri"
  - "s3://some.other.dataset/uri"
datasetName:
  - "some dataset name"
  - null
datasetAwsAccessKeyId:
  - null
  - secret:<some_secret_name>
datasetAwsSecretAccessKey:
  - null
  - secret:<some_other_secret_name>
datasetVersionId:
datasetEtag:
  - "some etag"
  - "some other etag"
```
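Each list in the config is positional: the i-th entry of every list applies to the i-th dataset, with `null` standing in for an omitted value. A minimal Python sketch of how such parallel lists pair up (an illustration only, not the CLI's actual parsing code):

```python
# Parallel lists, as in the YAML above; None stands in for omitted (null) values.
uris = ["s3://some.dataset/uri", "s3://some.other.dataset/uri"]
names = ["some dataset name", None]
etags = ["some etag", "some other etag"]

# zip pairs entries by position: the i-th element of each list
# describes the i-th dataset.
datasets = [
    {"uri": uri, "name": name, "etag": etag}
    for uri, name, etag in zip(uris, names, etags)
]
# datasets[1] has no explicit name, so Gradient would derive one from its URI.
```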

{% endtab %}

{% tab title="SDK" %}

```
import os  # for reading AWS credentials from environment variables

# `client`, `project`, `cluster`, and `bucket` are assumed to have been
# defined earlier (e.g. a Gradient SDK client and your own configuration).

env = {
    "EPOCHS_EVAL": "10",
    "TRAIN_EPOCHS": "40",
    "MAX_STEPS": "50000",
    "EVAL_SECS": "600",
    "BATCH_SIZE": "100",
}

single_node_parameters = {
    "name": "dataset",
    "project_id": project,
    "container": "tensorflow/tensorflow:1.13.1-py3",
    "machine_type": "p2.xlarge",
    "command": "pip install -r requirements.txt && python mnist.py",
    "experiment_env": env,
    "workspace_url": "https://github.com/Paperspace/mnist-sample.git",  # can be a local directory, a git repo or commit, or an S3 bucket
    "cluster_id": cluster,
    "model_type": "Tensorflow",
    "model_path": "/artifacts",
    "datasets": [{
        "uri": bucket,
        "aws_secret_access_key": os.getenv("ACCESS_KEY"),
        "aws_access_key_id": os.getenv("ACCESS_KEY_ID"),
        "etag": "ee1d4fd1b3a97b5384355941ee99d3e4",
        "name": "fashion",
    }],
}

client.experiments.run_single_node(**single_node_parameters)
```

{% endtab %}
{% endtabs %}

The datasets will appear in the web interface under the Environment tab of the experiment you launch.

![](https://1320806315-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LHZRFUkajubOAmgu6Rd%2F-LygcWgRd7El-7C-aLc_%2F-Lygd4dzqCxLhOYxUnVD%2FScreen%20Shot%202020-01-15%20at%2011.37.06%20PM.png?alt=media\&token=c5c34ce8-eab5-4d45-973b-66a51640b17d)
