Registering TensorFlow Models in Gradient
In this tutorial, you will:
- Understand the workflow involved in registering models
- Pass environment variables through the Gradient CLI
- Persist model files in Gradient Storage
Introduction
Experiments in Gradient can generate machine learning models, which are captured in your Project's Models list. This list holds references to the model and checkpoint files generated during training, as well as summary metrics associated with the model's performance, such as accuracy and loss.
In this tutorial, we will create an experiment to generate a Keras model based on the Fashion MNIST dataset. Along the way, we will learn techniques such as passing environment variables to jobs, choosing the right container image, and specifying the path where model artifacts are stored.
The model is trained in Keras but is finally exported as a TensorFlow model through the tf.saved_model.simple_save method. This approach serializes the Keras session into a TensorFlow .pb file.
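The command discussed below was lost from this page; the sketch that follows is a reconstruction assembled from the switch descriptions in the next section. The experiment name, the container image tag, and the exact arguments passed to train.py are illustrative assumptions, so adjust them to your own setup:

```shell
# Reconstructed experiment command; --name, --container, and the
# train.py arguments below are assumptions, while the project ID,
# machine type, model type, and model path match the tutorial.
gradient experiments run singlenode \
  --name fashion-mnist \
  --projectId prioax2c4 \
  --experimentEnv "{\"EPOCHS\":5}" \
  --container tensorflow/tensorflow:1.14.0-gpu-py3 \
  --machineType K80 \
  --command "python train.py /storage/model 1" \
  --modelType Tensorflow \
  --modelPath /storage/model \
  --workspace .
```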
The above command has multiple switches that are important to the job. Let’s understand each of them.
The singlenode subcommand runs the job on a single host.
--name assigns a friendly name to the experiment.
--projectId associates the experiment with an existing project.
--experimentEnv passes environment variables to the script. In our code, we decide the number of epochs based on the value defined in the EPOCHS environment variable.
--container points the job to the container image used for the training job. Notice that we are passing an image that can take advantage of a GPU-based machine.
--machineType schedules the job on one of the preferred instance types. In our case, we are using the K80 machine type, which comes with an NVIDIA K80 GPU. Since both the container and the machine type are GPU-based, the job can take advantage of CUDA and cuDNN for accelerated training.
--command instructs the job to execute the script along with the passed parameters. The script expects the path where the final model artifacts should be stored, along with a version number. Since we are using a sub-directory under the /storage directory, the files stored there persist across experiments. The model files stored here are used to register the TensorFlow model with Gradient. Feel free to explore train.py to understand how environment variables and command-line parameters can be used to target Gradient-specific features while keeping the code portable.
The --modelType Tensorflow switch indicates that the job generates a valid TensorFlow model that can be managed and served by Gradient. Support for other frameworks, such as ONNX and custom models, is expected in the near future.
--modelPath tells Gradient where to look for the model artifacts. This is typically the /artifacts or /storage location. We are passing the /storage/model directory, which is the same path used within the code.
--workspace . tells Gradient to upload your current directory (.) to the experiment. This directory becomes the working directory of your experiment.
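Inside the training script, the EPOCHS variable and the command-line arguments can be consumed along these lines. This is a minimal sketch, not the actual train.py; the argument order and the default epoch count are assumptions:

```python
import os


def resolve_config(argv, env):
    """Return (epochs, export_dir) from CLI args and environment.

    argv is assumed to be [base_path, version]; env is a mapping
    like os.environ. The default of 5 epochs is an assumption.
    """
    epochs = int(env.get("EPOCHS", "5"))
    base_path, version = argv
    # The final export directory is <path>/<version>, so repeated
    # runs against /storage can write versioned artifacts.
    export_dir = os.path.join(base_path, version)
    return epochs, export_dir


# Example mirroring the values used in the tutorial's command line.
epochs, export_dir = resolve_config(["/storage/model", "1"], {"EPOCHS": "5"})
print(epochs, export_dir)  # -> 5 /storage/model/1
# The script would then train for `epochs` epochs and serialize the
# Keras session with tf.saved_model.simple_save(sess, export_dir, ...).
```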
Within a few seconds of running the command, you should see the logs displayed on the screen.
Archiving your working directory for upload as your experiment workspace... (See https://docs.paperspace.com/gradient/experiments/run-experiments for more information.)
Removing existing archive
Creating zip archive: train.zip
100% (1 of 1) |########################################| Elapsed Time: 0:00:00
Finished creating archive: train.zip
Uploading zipped workspace to S3
100% (3108 of 3108) |##################################| Elapsed Time: 0:00:00
Uploading completed
New experiment created and started with ID: e720893n7f5vx
Awaiting logs...
js3v54dfgz1zcu 1 Downloading data from http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
32768/29515 [==============================] - 0s 4us/step
js3v54dfgz1zcu 4 Downloading data from http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
26427392/26421880 [==============================] - 2s 0us/step
js3v54dfgz1zcu 7 Downloading data from http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
16384/5148 [==============================] - 0s 0us/step
js3v54dfgz1zcu 9 Downloading data from http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
4423680/4422102 [==============================] - 1s 0us/step
js3v54dfgz1zcu 12 2019-06-29 06:30:41.922354: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
js3v54dfgz1zcu 13 2019-06-29 06:30:42.014405: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
js3v54dfgz1zcu 14 2019-06-29 06:30:42.014841: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties:
js3v54dfgz1zcu 15 name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
js3v54dfgz1zcu 16 pciBusID: 0000:00:04.0
js3v54dfgz1zcu 17 totalMemory: 11.17GiB freeMemory: 11.09GiB
js3v54dfgz1zcu 18 2019-06-29 06:30:42.014881: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
js3v54dfgz1zcu 19 2019-06-29 06:30:42.345929: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
js3v54dfgz1zcu 20 2019-06-29 06:30:42.345995: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958]      0
js3v54dfgz1zcu 21 2019-06-29 06:30:42.346006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   N
js3v54dfgz1zcu 22 2019-06-29 06:30:42.346329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10748 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
js3v54dfgz1zcu 23 _________________________________________________________________
js3v54dfgz1zcu 24 Layer (type)                 Output Shape              Param #
js3v54dfgz1zcu 25 =================================================================
js3v54dfgz1zcu 26 Conv1 (Conv2D)               (None, 13, 13, 8)         80
js3v54dfgz1zcu 27 _________________________________________________________________
js3v54dfgz1zcu 28 flatten (Flatten)            (None, 1352)              0
js3v54dfgz1zcu 29 _________________________________________________________________
js3v54dfgz1zcu 30 Softmax (Dense)              (None, 10)                13530
js3v54dfgz1zcu 31 =================================================================
js3v54dfgz1zcu 32 Total params: 13,610
js3v54dfgz1zcu 33 Trainable params: 13,610
js3v54dfgz1zcu 34 Non-trainable params: 0
js3v54dfgz1zcu 35 _________________________________________________________________
js3v54dfgz1zcu 36 Epoch 1/5
60000/60000 [==============================] - 7s 112us/step - loss: 0.5406 - acc: 0.8113
js3v54dfgz1zcu 39 Epoch 2/5
60000/60000 [==============================] - 5s 82us/step - loss: 0.4034 - acc: 0.8597
js3v54dfgz1zcu 41 Epoch 3/5
60000/60000 [==============================] - 5s 88us/step - loss: 0.3715 - acc: 0.8698
js3v54dfgz1zcu 44 Epoch 4/5
60000/60000 [==============================] - 6s 92us/step - loss: 0.3514 - acc: 0.8760
js3v54dfgz1zcu 47 Epoch 5/5
60000/60000 [==============================] - 5s 85us/step - loss: 0.3392 - acc: 0.8795
10000/10000 [==============================] - 0s 45us/step
js3v54dfgz1zcu 52 Model accuracy: 0.8657
js3v54dfgz1zcu 54 Model saved to /storage/model
PSEOF
Verifying the Creation of the Model
We can check whether the output of the job has been registered as a valid TensorFlow model with the following command.
gradient models list
+------+-----------------+------------+------------+----------------+
| Name | ID | Model Type | Project ID | Experiment ID |
+------+-----------------+------------+------------+----------------+
| None | mosdnkkv1o1xuem | Tensorflow | prioax2c4 | e720893n7f5vx |
+------+-----------------+------------+------------+----------------+
The project ID prioax2c4 and experiment ID e720893n7f5vx confirm that this is the model associated with the latest experiment.
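When many models accumulate, the list can also be narrowed down rather than scanned by eye. The filter flags below are taken from the Gradient CLI documentation, but verify them against your installed version with `gradient models list --help`:

```shell
# Show only models belonging to a given project or experiment.
gradient models list --projectId prioax2c4
gradient models list --experimentId e720893n7f5vx
```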
You can also visit the Models section of the Gradient UI to see a list of registered models.
Summary
After registering the model, we can turn it into a deployment to perform inference.
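As a preview of that next step, a registered model can be served through the Gradient deployments CLI. The sketch below follows the flags documented for Gradient deployments at the time of writing; the deployment name and serving image are assumptions, so check `gradient deployments create --help` before running it:

```shell
# Serve the registered model (ID from `gradient models list`) with
# TensorFlow Serving; name and imageUrl below are assumptions.
gradient deployments create \
  --deploymentType TFServing \
  --modelId mosdnkkv1o1xuem \
  --name fashion-mnist-serving \
  --machineType K80 \
  --imageUrl tensorflow/serving:latest-gpu \
  --instanceCount 1
```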