Training Built-in Algorithms with SageMaker (Part 4/4)


**Notes**:

* This notebook should be used with the conda_amazonei_mxnet_p36 kernel.
* This notebook is part of a series of notebooks beginning with ``01_download_data``, ``02_structuring_data`` and ``03a_builtin_preprocessing``.
* You can also explore training with TensorFlow and PyTorch by running ``04b_tensorflow_training`` and ``04c_pytorch_training``, respectively.

In this notebook, you will use the SageMaker SDK to create an Estimator for SageMaker’s Built-in Image Classification algorithm and train it on a remote EC2 instance.

Overview

## Dependencies

Import packages and check SageMaker version

[ ]:
import boto3
import shutil
import urllib
import pickle
import pathlib
import tarfile
import subprocess
import sagemaker

Load S3 bucket name & category labels

The category_labels file was generated from the first notebook in this series 01_download_data.ipynb. You will need to run that notebook before running the code here.

An S3 bucket for this guide was created in Part 3.

[ ]:
with open("pickled_data/builtin_bucket_name.pickle", "rb") as f:
    bucket_name = pickle.load(f)
    print("Bucket Name: ", bucket_name)

with open("pickled_data/category_labels.pickle", "rb") as f:
    category_labels = pickle.load(f)

## Built-in Image Classification algorithm

Create SageMaker training and validation channels

[ ]:
train_data = sagemaker.inputs.TrainingInput(
    s3_data=f"s3://{bucket_name}/data/train",
    content_type="application/x-recordio",
    s3_data_type="S3Prefix",
    input_mode="Pipe",
)

val_data = sagemaker.inputs.TrainingInput(
    s3_data=f"s3://{bucket_name}/data/val",
    content_type="application/x-recordio",
    s3_data_type="S3Prefix",
    input_mode="Pipe",
)

data_channels = {"train": train_data, "validation": val_data}

Configure the algorithm’s hyperparameters

Hyperparameter reference: https://docs.aws.amazon.com/sagemaker/latest/dg/IC-Hyperparameter.html

* num_layers - The built-in image classification algorithm is based on the ResNet architecture. There are many versions of this architecture, differing in how many layers they use. We’ll use the smallest one for this guide to speed up training. If the algorithm’s accuracy plateaus and you need better accuracy, increasing the number of layers may help.
* use_pretrained_model - Initializes the weights from a pre-trained model for transfer learning. Otherwise, weights are initialized randomly.
* augmentation_type - Adds augmentations to your training set to help your model generalize better. For small datasets, augmentation can greatly improve training.
* image_shape - The channels, height, and width of all the images.
* num_classes - The number of classes in your dataset.
* num_training_samples - The total number of images in your training set (used to help calculate progress).
* mini_batch_size - The batch size to use during training.
* epochs - An epoch is one cycle through the training set. More epochs mean more opportunities to improve accuracy. Suitable values range from 5 to 25 epochs depending on your time and budget constraints. Ideally, you stop training right before your validation accuracy plateaus.
* learning_rate - After each batch of training, the model’s weights are updated to give the best possible results for that batch. The learning rate controls how large each update is. Typical values fall between 0.001 and 0.2, and the rate should rarely go higher than 1. A higher learning rate converges toward the optimal weights faster, but going too fast can overshoot the target. In this example, we’re using the weights from a pre-trained model, so we start with a lower learning rate: the weights have already been optimized and we don’t want to move too far away from them.
* precision_dtype - Whether to use a 32-bit or a 16-bit float data type for the model’s weights. 16-bit can be used if you’re running into memory issues. However, weights can grow or shrink rapidly, so 32-bit weights make training more robust to overflow and underflow; 32-bit is typically the default in most frameworks.
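Before launching a paid training job, it can help to sanity-check the hyperparameter dictionary locally. The sketch below is illustrative only: `check_hyperparameters` and `approx_batches_per_epoch` are hypothetical helpers, the allowed-value sets are assumptions based on the hyperparameter documentation linked above (verify them there), and the sample counts are placeholders rather than values from this dataset.

```python
import math

# Assumed valid values; confirm against the linked hyperparameter reference.
ALLOWED_NUM_LAYERS = {18, 34, 50, 101, 152, 200, 20, 32, 44, 56, 110}
ALLOWED_AUGMENTATION = {"crop", "crop_color", "crop_color_transform"}
ALLOWED_PRECISION = {"float32", "float16"}


def check_hyperparameters(hp):
    """Return a list of human-readable problems found in hp (empty if none)."""
    problems = []
    if hp["num_layers"] not in ALLOWED_NUM_LAYERS:
        problems.append(f"unexpected num_layers: {hp['num_layers']}")
    if hp["augmentation_type"] not in ALLOWED_AUGMENTATION:
        problems.append(f"unexpected augmentation_type: {hp['augmentation_type']}")
    if hp["precision_dtype"] not in ALLOWED_PRECISION:
        problems.append(f"unexpected precision_dtype: {hp['precision_dtype']}")
    if len(hp["image_shape"].split(",")) != 3:
        problems.append("image_shape should be 'channels,height,width'")
    if hp["mini_batch_size"] > hp["num_training_samples"]:
        problems.append("mini_batch_size exceeds num_training_samples")
    if not 0 < hp["learning_rate"] <= 1:
        problems.append("learning_rate should be in (0, 1]")
    return problems


def approx_batches_per_epoch(hp):
    """Rough estimate; how the algorithm treats a final partial batch is not assumed."""
    return math.ceil(hp["num_training_samples"] / hp["mini_batch_size"])


# Placeholder values for illustration (num_classes and num_training_samples are made up).
example_hp = {
    "num_layers": 18,
    "augmentation_type": "crop_color_transform",
    "image_shape": "3,224,224",
    "num_classes": 10,
    "num_training_samples": 15000,
    "mini_batch_size": 64,
    "learning_rate": 0.001,
    "precision_dtype": "float32",
}

print(check_hyperparameters(example_hp))
print(approx_batches_per_epoch(example_hp))
```

An empty list means none of these basic checks failed; it does not guarantee that the training job will accept the configuration.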

[ ]:
num_classes = len(category_labels)
num_training_samples = len(set(pathlib.Path("data_structured/train").rglob("*.jpg")))
[ ]:
hyperparameters = {
    "num_layers": 18,
    "use_pretrained_model": 1,
    "augmentation_type": "crop_color_transform",
    "image_shape": "3,224,224",
    "num_classes": num_classes,
    "num_training_samples": num_training_samples,
    "mini_batch_size": 64,
    "epochs": 5,
    "learning_rate": 0.001,
    "precision_dtype": "float32",
}

Configure the type of algorithm and resources to use

[ ]:
training_image = sagemaker.image_uris.retrieve(
    "image-classification", sagemaker.Session().boto_region_name
)
[ ]:
algo_config = {
    "hyperparameters": hyperparameters,
    "image_uri": training_image,
    "role": sagemaker.get_execution_role(),
    "instance_count": 1,
    "instance_type": "ml.p3.2xlarge",
    "volume_size": 100,
    "max_run": 360000,
    "output_path": f"s3://{bucket_name}/data/output",
}

Create and train the algorithm

[ ]:
algorithm = sagemaker.estimator.Estimator(**algo_config)
[ ]:
algorithm.fit(inputs=data_channels, logs=True)

## Understanding the training output

[09/14/2020 05:37:38 INFO 139869866030912] Epoch[0] Batch [20]#011Speed: 111.811 samples/sec#011accuracy=0.452381
[09/14/2020 05:37:54 INFO 139869866030912] Epoch[0] Batch [40]#011Speed: 131.393 samples/sec#011accuracy=0.570503
[09/14/2020 05:38:10 INFO 139869866030912] Epoch[0] Batch [60]#011Speed: 139.540 samples/sec#011accuracy=0.617700
[09/14/2020 05:38:27 INFO 139869866030912] Epoch[0] Batch [80]#011Speed: 144.003 samples/sec#011accuracy=0.644483
[09/14/2020 05:38:43 INFO 139869866030912] Epoch[0] Batch [100]#011Speed: 146.600 samples/sec#011accuracy=0.664991
Training has begun:

* Epoch[0]: One epoch corresponds to one training cycle through all the data. Stochastic optimizers like SGD and Adam improve accuracy by running multiple epochs. Random data augmentations are also applied with each new epoch, allowing the training algorithm to learn on modified data.
* Batch: The number of batches processed by the training algorithm. We specified one batch to be 64 images in the mini_batch_size hyperparameter. For algorithms like SGD, the model gets a chance to update itself every batch.
* Speed: The number of images sent to the training algorithm per second. This information is important in determining how changes in your dataset affect the speed of training.
* Accuracy: The training accuracy achieved at each interval (in this case, every 20 batches).

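The batch-level lines described above can be turned into structured records for plotting or comparison across runs. This is a minimal sketch that assumes the log format matches the excerpt shown in this section (including the literal `#011` sequences, which appear where the original log had tab characters); `parse_batch_lines` is a hypothetical helper, not part of the SageMaker SDK.

```python
import re

# Pattern written against the log excerpt in this notebook; other algorithm
# versions may format these lines differently.
BATCH_LINE = re.compile(
    r"Epoch\[(?P<epoch>\d+)\] Batch \[(?P<batch>\d+)\]"
    r"#011Speed: (?P<speed>[\d.]+) samples/sec"
    r"#011accuracy=(?P<accuracy>[\d.]+)"
)


def parse_batch_lines(log_text):
    """Extract epoch, batch, speed, and accuracy from batch-level log lines."""
    records = []
    for match in BATCH_LINE.finditer(log_text):
        records.append(
            {
                "epoch": int(match["epoch"]),
                "batch": int(match["batch"]),
                "speed": float(match["speed"]),
                "accuracy": float(match["accuracy"]),
            }
        )
    return records


sample_log = (
    "[09/14/2020 05:37:38 INFO 139869866030912] Epoch[0] Batch [20]"
    "#011Speed: 111.811 samples/sec#011accuracy=0.452381\n"
    "[09/14/2020 05:37:54 INFO 139869866030912] Epoch[0] Batch [40]"
    "#011Speed: 131.393 samples/sec#011accuracy=0.570503\n"
)

for record in parse_batch_lines(sample_log):
    print(record)
```
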
[09/14/2020 05:38:58 INFO 139869866030912] Epoch[0] Train-accuracy=0.677083
[09/14/2020 05:38:58 INFO 139869866030912] Epoch[0] Time cost=102.745
[09/14/2020 05:39:02 INFO 139869866030912] Epoch[0] Validation-accuracy=0.729492
[09/14/2020 05:39:02 INFO 139869866030912] Storing the best model with validation accuracy: 0.729492
[09/14/2020 05:39:02 INFO 139869866030912] Saved checkpoint to "/opt/ml/model/image-classification-0001.params"

The first epoch of training has ended (the run shown here trained for only one epoch). The final training accuracy is reported along with the accuracy on the validation set. Comparing these two numbers is important for determining whether your model is overfit or underfit, and for reasoning about the bias/variance trade-off. The saved model uses the learned weights from the epoch with the best validation accuracy.
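To make the train-versus-validation comparison concrete, the epoch summary lines can be parsed and the gap computed per epoch. The sketch below assumes the log format shown in this section; `accuracy_by_epoch` and `accuracy_gap` are hypothetical helper names. A training accuracy far above validation accuracy is one common sign of overfitting, but how large a gap matters depends on your dataset.

```python
import re

# Matches lines such as "Epoch[0] Train-accuracy=0.677083".
EPOCH_METRIC = re.compile(r"Epoch\[(\d+)\] (Train|Validation)-accuracy=([\d.]+)")


def accuracy_by_epoch(log_text):
    """Map epoch number -> {'train': float, 'validation': float}."""
    metrics = {}
    for epoch, split, value in EPOCH_METRIC.findall(log_text):
        metrics.setdefault(int(epoch), {})[split.lower()] = float(value)
    return metrics


def accuracy_gap(metrics, epoch):
    """Training accuracy minus validation accuracy for one epoch."""
    return metrics[epoch]["train"] - metrics[epoch]["validation"]


sample_log = (
    "[09/14/2020 05:38:58 INFO 139869866030912] Epoch[0] Train-accuracy=0.677083\n"
    "[09/14/2020 05:38:58 INFO 139869866030912] Epoch[0] Time cost=102.745\n"
    "[09/14/2020 05:39:02 INFO 139869866030912] Epoch[0] Validation-accuracy=0.729492\n"
)

metrics = accuracy_by_epoch(sample_log)
gap = accuracy_gap(metrics, 0)
print(f"Epoch 0 train-minus-validation gap: {gap:+.4f}")
```

In the excerpt above the gap is negative: validation accuracy is higher than training accuracy after the first epoch.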

2020-09-14 05:39:03 Uploading - Uploading generated training model
2020-09-14 05:39:15 Completed - Training job completed
Training seconds: 235
Billable seconds: 235

The final model parameters are saved as a .tar.gz file in S3 under the location specified by output_path in algo_config. Total billable seconds are also reported to help you compute the cost of training, since you are only charged for the time the EC2 instance spends on the training job. Other costs, such as S3 storage, also apply but are not included here.
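As a rough illustration of how billable seconds translate into cost, the arithmetic can be sketched as below. The hourly price is a made-up placeholder, not the actual rate for ml.p3.2xlarge; look up the current price for your region. Multiplying by the instance count is also an assumption about how billable time is reported for multi-instance jobs, and `estimate_training_cost` is a hypothetical helper.

```python
def estimate_training_cost(billable_seconds, hourly_price_usd, instance_count=1):
    """Estimate compute cost in USD from billable seconds and an hourly price."""
    return billable_seconds / 3600 * hourly_price_usd * instance_count


# Placeholder price for illustration only; replace with the real on-demand rate.
HYPOTHETICAL_HOURLY_PRICE_USD = 4.00

cost = estimate_training_cost(235, HYPOTHETICAL_HOURLY_PRICE_USD)
print(f"Estimated training cost: ${cost:.2f}")
```
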

Rollback to default version of SDK

Only do this if you’re done with this guide and want to use the same kernel for other notebooks with an incompatible version of the SageMaker SDK.

[ ]:
# print(f'Original version: {original_sagemaker_version[0]}')
# print(f'Current version:  {sagemaker.__version__}')
# print('')
# print(f'Rolling back to {original_sagemaker_version[0]}. Restart notebook kernel to use this version.')
# print('')
# s = f'sagemaker=={original_sagemaker_version[0]}'
# !{sys.executable} -m pip install {s}

Next Steps

This concludes the Image Data Guide for SageMaker’s Built-in algorithms. If you’d like to deploy your model and get predictions on your test data, all the information you’ll need to get going can be found here: Deploy Models for Inference
