[2]:
!yes | pip uninstall torchvision
!pip install torchvision
Found existing installation: torchvision 0.9.1
Uninstalling torchvision-0.9.1:
  Would remove:
    /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torchvision-0.9.1.dist-info/*
    /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torchvision.libs/libcudart.459720b2.so.10.2
    /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torchvision.libs/libjpeg.ceea7512.so.62
    /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torchvision.libs/libpng16.7f72a3c5.so.16
    /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torchvision.libs/libz.1328edc3.so.1
    /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torchvision/*
Proceed (y/n)?   Successfully uninstalled torchvision-0.9.1
yes: standard output: Broken pipe
Collecting torchvision
  Using cached torchvision-0.9.1-cp36-cp36m-manylinux1_x86_64.whl (17.4 MB)
Requirement already satisfied: torch==1.8.1 in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages (from torchvision) (1.8.1)
Requirement already satisfied: numpy in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages (from torchvision) (1.18.1)
Requirement already satisfied: pillow>=4.1.1 in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages (from torchvision) (7.0.0)
Requirement already satisfied: typing-extensions in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages (from torch==1.8.1->torchvision) (3.7.4.3)
Requirement already satisfied: dataclasses; python_version < "3.7" in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages (from torch==1.8.1->torchvision) (0.7)
Installing collected packages: torchvision
Successfully installed torchvision-0.9.1

Hyperparameter Tuning using SageMaker PyTorch Container

Kernel Python 3 (PyTorch CPU (or GPU) Optimized) works well with this notebook.

Contents

  1. Background

  2. Setup

  3. Data

  4. Train

  5. Host


Background

MNIST is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28x28 pixel grayscale images of hand-written digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). This tutorial will show how to train and test an MNIST model on SageMaker using PyTorch. It also shows how to use SageMaker Automatic Model Tuning to select appropriate hyperparameters in order to get the best model.

For more information about the PyTorch in SageMaker, please visit sagemaker-pytorch-containers and sagemaker-python-sdk github repositories.


Setup

This notebook was created and tested on an ml.m4.xlarge notebook instance.

Let’s start by creating a SageMaker session and specifying:

  • The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting.

  • The IAM role arn used to give training and hosting access to your data. See the documentation for how to create these. Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the sagemaker.get_execution_role() with a the appropriate full IAM role arn string(s).

[7]:
import sagemaker
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/DEMO-pytorch-mnist"

role = sagemaker.get_execution_role()

Data

Getting the data

[5]:
from torchvision.datasets import MNIST
from torchvision import transforms



local_dir = 'data'
MNIST.mirrors = ["https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/"]
MNIST(
    local_dir,
    download=True,
    transform=transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
    )
)


Downloading https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/train-images-idx3-ubyte.gz
Downloading https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/train-images-idx3-ubyte.gz to data/MNIST/raw/train-images-idx3-ubyte.gz

Extracting data/MNIST/raw/train-images-idx3-ubyte.gz to data/MNIST/raw

Downloading https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/train-labels-idx1-ubyte.gz
Downloading https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/train-labels-idx1-ubyte.gz to data/MNIST/raw/train-labels-idx1-ubyte.gz

Extracting data/MNIST/raw/train-labels-idx1-ubyte.gz to data/MNIST/raw

Downloading https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/t10k-images-idx3-ubyte.gz
Downloading https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/t10k-images-idx3-ubyte.gz to data/MNIST/raw/t10k-images-idx3-ubyte.gz

Extracting data/MNIST/raw/t10k-images-idx3-ubyte.gz to data/MNIST/raw

Downloading https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/t10k-labels-idx1-ubyte.gz
Downloading https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/t10k-labels-idx1-ubyte.gz to data/MNIST/raw/t10k-labels-idx1-ubyte.gz

Extracting data/MNIST/raw/t10k-labels-idx1-ubyte.gz to data/MNIST/raw

Processing...
Done!
[5]:
Dataset MNIST
    Number of datapoints: 60000
    Root location: data
    Split: Train
    StandardTransform
Transform: Compose(
               ToTensor()
               Normalize(mean=(0.1307,), std=(0.3081,))
           )

Uploading the data to S3

We are going to use the sagemaker.Session.upload_data function to upload our datasets to an S3 location. The return value inputs identifies the location – we will use later when we start the training job.

[8]:
inputs = sagemaker_session.upload_data(path="data", bucket=bucket, key_prefix=prefix)
print("input spec (in this case, just an S3 path): {}".format(inputs))
input spec (in this case, just an S3 path): s3://sagemaker-us-west-2-688520471316/sagemaker/DEMO-pytorch-mnist

Train

Training script

The mnist.py script provides all the code we need for training and hosting a SageMaker model (model_fn function to load a model). The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as:

  • SM_MODEL_DIR: A string representing the path to the directory to write model artifacts to. These artifacts are uploaded to S3 for model hosting.

  • SM_NUM_GPUS: The number of gpus available in the current container.

  • SM_CURRENT_HOST: The name of the current container on the container network.

  • SM_HOSTS: JSON encoded list containing all the hosts .

Supposing one input channel, ‘training’, was used in the call to the fit() method, the following will be set, following the format SM_CHANNEL_[channel_name]:

  • SM_CHANNEL_TRAINING: A string representing the path to the directory containing data in the ‘training’ channel.

For more information about training environment variables, please visit SageMaker Containers.

A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to model_dir so that it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an argparse.ArgumentParser instance.

Because the SageMaker imports the training script, you should put your training code in a main guard (if __name__=='__main__':) if you are using the same script to host your model as we do in this example, so that SageMaker does not inadvertently run your training code at the wrong point in execution.

For example, the script run by this notebook:

[9]:
!pygmentize mnist.py
import argparse
import json
import logging
import os
import sys

import sagemaker_containers
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data
import torch.utils.data.distributed
from torchvision import datasets, transforms

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))


# Based on https://github.com/pytorch/examples/blob/master/mnist/main.py
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


def _get_train_data_loader(batch_size, training_dir, is_distributed, **kwargs):
    logger.info("Get train data loader")
    dataset = datasets.MNIST(
        training_dir,
        train=True,
        transform=transforms.Compose(
            [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
        ),
    )
    train_sampler = (
        torch.utils.data.distributed.DistributedSampler(dataset) if is_distributed else None
    )
    return torch.utils.data.DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=train_sampler is None,
        sampler=train_sampler,
        **kwargs
    )


def _get_test_data_loader(test_batch_size, training_dir, **kwargs):
    logger.info("Get test data loader")
    return torch.utils.data.DataLoader(
        datasets.MNIST(
            training_dir,
            train=False,
            transform=transforms.Compose(
                [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
            ),
        ),
        batch_size=test_batch_size,
        shuffle=True,
        **kwargs
    )


def _average_gradients(model):
    # Gradient averaging.
    size = float(dist.get_world_size())
    for param in model.parameters():
        dist.all_reduce(param.grad.data, op=dist.reduce_op.SUM, group=0)
        param.grad.data /= size


def train(args):
    is_distributed = len(args.hosts) > 1 and args.backend is not None
    logger.debug("Distributed training - {}".format(is_distributed))
    use_cuda = args.num_gpus > 0
    logger.debug("Number of gpus available - {}".format(args.num_gpus))
    kwargs = {"num_workers": 1, "pin_memory": True} if use_cuda else {}
    device = torch.device("cuda" if use_cuda else "cpu")

    if is_distributed:
        # Initialize the distributed environment.
        world_size = len(args.hosts)
        os.environ["WORLD_SIZE"] = str(world_size)
        host_rank = args.hosts.index(args.current_host)
        dist.init_process_group(backend=args.backend, rank=host_rank, world_size=world_size)
        logger.info(
            "Initialized the distributed environment: '{}' backend on {} nodes. ".format(
                args.backend, dist.get_world_size()
            )
            + "Current host rank is {}. Number of gpus: {}".format(dist.get_rank(), args.num_gpus)
        )

    # set the seed for generating random numbers
    torch.manual_seed(args.seed)
    if use_cuda:
        torch.cuda.manual_seed(args.seed)

    train_loader = _get_train_data_loader(args.batch_size, args.data_dir, is_distributed, **kwargs)
    test_loader = _get_test_data_loader(args.test_batch_size, args.data_dir, **kwargs)

    logger.debug(
        "Processes {}/{} ({:.0f}%) of train data".format(
            len(train_loader.sampler),
            len(train_loader.dataset),
            100.0 * len(train_loader.sampler) / len(train_loader.dataset),
        )
    )

    logger.debug(
        "Processes {}/{} ({:.0f}%) of test data".format(
            len(test_loader.sampler),
            len(test_loader.dataset),
            100.0 * len(test_loader.sampler) / len(test_loader.dataset),
        )
    )

    model = Net().to(device)
    if is_distributed and use_cuda:
        # multi-machine multi-gpu case
        model = torch.nn.parallel.DistributedDataParallel(model)
    else:
        # single-machine multi-gpu case or single-machine or multi-machine cpu case
        model = torch.nn.DataParallel(model)

    optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

    for epoch in range(1, args.epochs + 1):
        model.train()
        for batch_idx, (data, target) in enumerate(train_loader, 1):
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = F.nll_loss(output, target)
            loss.backward()
            if is_distributed and not use_cuda:
                # average gradients manually for multi-machine cpu case only
                _average_gradients(model)
            optimizer.step()
            if batch_idx % args.log_interval == 0:
                logger.info(
                    "Train Epoch: {} [{}/{} ({:.0f}%)] Loss: {:.6f}".format(
                        epoch,
                        batch_idx * len(data),
                        len(train_loader.sampler),
                        100.0 * batch_idx / len(train_loader),
                        loss.item(),
                    )
                )
        test(model, test_loader, device)
    save_model(model, args.model_dir)


def test(model, test_loader, device):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, size_average=False).item()  # sum up batch loss
            pred = output.max(1, keepdim=True)[1]  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    logger.info(
        "Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n".format(
            test_loss, correct, len(test_loader.dataset), 100.0 * correct / len(test_loader.dataset)
        )
    )


def model_fn(model_dir):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.nn.DataParallel(Net())
    with open(os.path.join(model_dir, "model.pth"), "rb") as f:
        model.load_state_dict(torch.load(f))
    return model.to(device)


def save_model(model, model_dir):
    logger.info("Saving the model.")
    path = os.path.join(model_dir, "model.pth")
    # recommended way from http://pytorch.org/docs/master/notes/serialization.html
    torch.save(model.cpu().state_dict(), path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    # Data and model checkpoints directories
    parser.add_argument(
        "--batch-size",
        type=int,
        default=64,
        metavar="N",
        help="input batch size for training (default: 64)",
    )
    parser.add_argument(
        "--test-batch-size",
        type=int,
        default=1000,
        metavar="N",
        help="input batch size for testing (default: 1000)",
    )
    parser.add_argument(
        "--epochs",
        type=int,
        default=10,
        metavar="N",
        help="number of epochs to train (default: 10)",
    )
    parser.add_argument(
        "--lr", type=float, default=0.01, metavar="LR", help="learning rate (default: 0.01)"
    )
    parser.add_argument(
        "--momentum", type=float, default=0.5, metavar="M", help="SGD momentum (default: 0.5)"
    )
    parser.add_argument("--seed", type=int, default=1, metavar="S", help="random seed (default: 1)")
    parser.add_argument(
        "--log-interval",
        type=int,
        default=100,
        metavar="N",
        help="how many batches to wait before logging training status",
    )
    parser.add_argument(
        "--backend",
        type=str,
        default=None,
        help="backend for distributed training (tcp, gloo on cpu and gloo, nccl on gpu)",
    )

    # Container environment
    parser.add_argument("--hosts", type=list, default=json.loads(os.environ["SM_HOSTS"]))
    parser.add_argument("--current-host", type=str, default=os.environ["SM_CURRENT_HOST"])
    parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--data-dir", type=str, default=os.environ["SM_CHANNEL_TRAINING"])
    parser.add_argument("--num-gpus", type=int, default=os.environ["SM_NUM_GPUS"])

    train(parser.parse_args())

Set up hyperparameter tuning job

Note, with the default setting below, the hyperparameter tuning job can take about 20 minutes to complete.

Now that we have prepared the dataset and the script, we are ready to train models. Before we do that, one thing to note is there are many hyperparameters that can dramtically affect the performance of the trained models. For example, learning rate, batch size, number of epochs, etc. Since which hyperparameter setting can lead to the best result depends on the dataset as well, it is almost impossible to pick the best hyperparameter setting without searching for it. Using SageMaker Automatic Model Tuning, we can create a hyperparameter tuning job to search for the best hyperparameter setting in an automated and effective way.

In this example, we are using SageMaker Python SDK to set up and manage a hyperparameter tuning job. Specifically, we specify a range, or a list of possible values in the case of categorical hyperparameters, for each of the hyperparameter that we plan to tune. The hyperparameter tuning job will automatically launch multiple training jobs with different hyperparameter settings, evaluate results of those training jobs based on a predefined “objective metric”, and select the hyperparameter settings for future attempts based on previous results. For each hyperparameter tuning job, we will give it a budget (max number of training jobs) and it will complete once that many training jobs have been executed.

Now we will set up the hyperparameter tuning job using SageMaker Python SDK, following below steps: * Create an estimator to set up the PyTorch training job * Define the ranges of hyperparameters we plan to tune, in this example, we are tuning learning_rate and batch size * Define the objective metric for the tuning job to optimize * Create a hyperparameter tuner with above setting, as well as tuning resource configurations

Similar to training a single PyTorch job in SageMaker, we define our PyTorch estimator passing in the PyTorch script, IAM role, and (per job) hardware configuration.

[10]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="mnist.py",
    role=role,
    py_version='py3',
    framework_version="1.8.0",
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    hyperparameters={"epochs": 1, "backend": "gloo"},
)
[16]:
# test training job

estimator.fit({'training': inputs})
2021-06-04 21:24:13 Starting - Starting the training job...
2021-06-04 21:24:35 Starting - Launching requested ML instancesProfilerReport-1622841853: InProgress
......
2021-06-04 21:25:40 Starting - Preparing the instances for training......
2021-06-04 21:26:36 Downloading - Downloading input data...
2021-06-04 21:27:00 Training - Downloading the training image.bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-06-04 21:27:15,787 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-06-04 21:27:15,789 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-06-04 21:27:15,798 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-06-04 21:27:18,824 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-06-04 21:27:19,104 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-06-04 21:27:19,115 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-06-04 21:27:19,126 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-06-04 21:27:19,135 sagemaker-training-toolkit INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "backend": "gloo",
        "epochs": 1
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "training": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "pytorch-training-2021-06-04-21-24-13-048",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-west-2-688520471316/pytorch-training-2021-06-04-21-24-13-048/source/sourcedir.tar.gz",
    "module_name": "mnist",
    "network_interface_name": "eth0",
    "num_cpus": 8,
    "num_gpus": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "hosts": [
            "algo-1"
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "mnist.py"
}

Environment variables:

SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"backend":"gloo","epochs":1}
SM_USER_ENTRY_POINT=mnist.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["training"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=mnist
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=8
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-west-2-688520471316/pytorch-training-2021-06-04-21-24-13-048/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"backend":"gloo","epochs":1},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"pytorch-training-2021-06-04-21-24-13-048","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-west-2-688520471316/pytorch-training-2021-06-04-21-24-13-048/source/sourcedir.tar.gz","module_name":"mnist","network_interface_name":"eth0","num_cpus":8,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"mnist.py"}
SM_USER_ARGS=["--backend","gloo","--epochs","1"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
SM_HP_BACKEND=gloo
SM_HP_EPOCHS=1
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages

Invoking script with the following command:

/opt/conda/bin/python3.6 mnist.py --backend gloo --epochs 1


Distributed training - False
Number of gpus available - 0
Get train data loader
Get test data loader
Processes 60000/60000 (100%) of train data
Processes 10000/10000 (100%) of test data
[2021-06-04 21:27:20.550 algo-1:25 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2021-06-04 21:27:20.845 algo-1:25 INFO profiler_config_parser.py:102] User has disabled profiler.
[2021-06-04 21:27:20.845 algo-1:25 INFO json_config.py:91] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
[2021-06-04 21:27:20.845 algo-1:25 INFO hook.py:199] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
[2021-06-04 21:27:20.846 algo-1:25 INFO hook.py:253] Saving to /opt/ml/output/tensors
[2021-06-04 21:27:20.846 algo-1:25 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.
[2021-06-04 21:27:20.873 algo-1:25 INFO hook.py:584] name:module.conv1.weight count_params:250
[2021-06-04 21:27:20.873 algo-1:25 INFO hook.py:584] name:module.conv1.bias count_params:10
[2021-06-04 21:27:20.873 algo-1:25 INFO hook.py:584] name:module.conv2.weight count_params:5000
[2021-06-04 21:27:20.873 algo-1:25 INFO hook.py:584] name:module.conv2.bias count_params:20
[2021-06-04 21:27:20.873 algo-1:25 INFO hook.py:584] name:module.fc1.weight count_params:16000
[2021-06-04 21:27:20.873 algo-1:25 INFO hook.py:584] name:module.fc1.bias count_params:50
[2021-06-04 21:27:20.873 algo-1:25 INFO hook.py:584] name:module.fc2.weight count_params:500
[2021-06-04 21:27:20.873 algo-1:25 INFO hook.py:584] name:module.fc2.bias count_params:10
[2021-06-04 21:27:20.873 algo-1:25 INFO hook.py:586] Total Trainable Params: 21840
[2021-06-04 21:27:20.873 algo-1:25 INFO hook.py:413] Monitoring the collections: losses
[2021-06-04 21:27:20.875 algo-1:25 INFO hook.py:476] Hook is writing from the hook with pid: 25

Train Epoch: 1 [6400/60000 (11%)] Loss: 2.010908
Train Epoch: 1 [12800/60000 (21%)] Loss: 1.009527
Train Epoch: 1 [19200/60000 (32%)] Loss: 0.877578
Train Epoch: 1 [25600/60000 (43%)] Loss: 0.805486
Train Epoch: 1 [32000/60000 (53%)] Loss: 0.635695
Train Epoch: 1 [38400/60000 (64%)] Loss: 0.505831
Train Epoch: 1 [44800/60000 (75%)] Loss: 0.537033
Train Epoch: 1 [51200/60000 (85%)] Loss: 0.532630
Train Epoch: 1 [57600/60000 (96%)] Loss: 0.428416

2021-06-04 21:27:36 Uploading - Uploading generated training modelTest set: Average loss: 0.1919, Accuracy: 9424/10000 (94%)

Saving the model.
INFO:__main__:Train Epoch: 1 [6400/60000 (11%)] Loss: 2.010908
INFO:__main__:Train Epoch: 1 [12800/60000 (21%)] Loss: 1.009527
INFO:__main__:Train Epoch: 1 [19200/60000 (32%)] Loss: 0.877578
INFO:__main__:Train Epoch: 1 [25600/60000 (43%)] Loss: 0.805486
INFO:__main__:Train Epoch: 1 [32000/60000 (53%)] Loss: 0.635695
INFO:__main__:Train Epoch: 1 [38400/60000 (64%)] Loss: 0.505831
INFO:__main__:Train Epoch: 1 [44800/60000 (75%)] Loss: 0.537033
INFO:__main__:Train Epoch: 1 [51200/60000 (85%)] Loss: 0.532630
INFO:__main__:Train Epoch: 1 [57600/60000 (96%)] Loss: 0.428416
/opt/conda/lib/python3.6/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
INFO:__main__:Test set: Average loss: 0.1919, Accuracy: 9424/10000 (94%)

INFO:__main__:Saving the model.

2021-06-04 21:27:33,433 sagemaker-training-toolkit INFO     Reporting training SUCCESS

2021-06-04 21:27:56 Completed - Training job completed
Training seconds: 71
Billable seconds: 71

Once we’ve defined our estimator we can specify the hyperparameters we’d like to tune and their possible values. We have three different types of hyperparameters. - Categorical parameters need to take one value from a discrete set. We define this by passing the list of possible values to CategoricalParameter(list) - Continuous parameters can take any real number value between the minimum and maximum value, defined by ContinuousParameter(min, max) - Integer parameters can take any integer value between the minimum and maximum value, defined by IntegerParameter(min, max)

Note, if possible, it’s almost always best to specify a value as the least restrictive type. For example, tuning learning rate as a continuous value between 0.01 and 0.2 is likely to yield a better result than tuning as a categorical parameter with values 0.01, 0.1, 0.15, or 0.2. We did specify batch size as categorical parameter here since it is generally recommended to be the power of 2.

[17]:
hyperparameter_ranges = {
    "lr": ContinuousParameter(0.001, 0.1),
    "batch-size": CategoricalParameter([32, 64, 128, 256, 512]),
}

Next we’ll specify the objective metric that we’d like to tune and its definition, which includes the regular expression (Regex) needed to extract that metric from the CloudWatch logs of the training job. In this particular case, our script emits average loss value and we will use it as the objective metric, we also set the objective_type to be ‘minimize’, so that hyperparameter tuning seeks to minize the objective metric when searching for the best hyperparameter setting. By default, objective_type is set to ‘maximize’.

[18]:
objective_metric_name = "average test loss"
objective_type = "Minimize"
metric_definitions = [{"Name": "average test loss", "Regex": "Test set: Average loss: ([0-9\\.]+)"}]

Now, we’ll create a HyperparameterTuner object, to which we pass: - The PyTorch estimator we created above - Our hyperparameter ranges - Objective metric name and definition - Tuning resource configurations such as Number of training jobs to run in total and how many training jobs can be run in parallel.

[19]:
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=9,
    max_parallel_jobs=3,
    objective_type=objective_type,
)

Launch hyperparameter tuning job

And finally, we can start our hyperprameter tuning job by calling .fit() and passing in the S3 path to our train and test dataset.

After the hyperprameter tuning job is created, you should be able to describe the tuning job to see its progress in the next step, and you can go to SageMaker console->Jobs to check out the progress of the progress of the hyperparameter tuning job.

[20]:
tuner.fit({"training": inputs})
.................................................................................................................................................!

Host

Create endpoint

After training, we use the tuner object to build and deploy a PyTorchPredictor. This creates a Sagemaker Endpoint – a hosted prediction service that we can use to perform inference, based on the best model in the tuner. Remember in previous steps, the tuner launched multiple training jobs during tuning and the resulting model with the best objective metric is defined as the best model.

As mentioned above we have implementation of model_fn in the mnist.py script that is required. We are going to use default implementations of input_fn, predict_fn, output_fn and transform_fm defined in sagemaker-pytorch-containers.

The arguments to the deploy function allow us to set the number and type of instances that will be used for the Endpoint. These do not need to be the same as the values we used for the training job. For example, you can train a model on a set of GPU-based instances, and then deploy the Endpoint to a fleet of CPU-based instances, but you need to make sure that you return or save your model as a cpu model similar to what we did in mnist.py. Here we will deploy the model to a single ml.m4.xlarge instance.

[21]:
predictor = tuner.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

2021-06-04 21:45:08 Starting - Preparing the instances for training
2021-06-04 21:45:08 Downloading - Downloading input data
2021-06-04 21:45:08 Training - Training image download completed. Training in progress.
2021-06-04 21:45:08 Uploading - Uploading generated training model
2021-06-04 21:45:08 Completed - Training job completed
-------!

Evaluate

We can now use this predictor to classify hand-written digits.

You will see an empty image box once you’ve executed cell below. Then you can draw a number in it and pixel data will be loaded into a data variable in this notebook, which we can then pass to the predictor.

[ ]:
import gzip
import numpy as np
import random
import os

data_dir = 'data/MNIST/raw'
with gzip.open(os.path.join(data_dir, "t10k-images-idx3-ubyte.gz"), "rb") as f:
    images = np.frombuffer(f.read(), np.uint8, offset=16).reshape(-1, 28, 28).astype(np.float32)

mask = random.sample(range(len(images)), 16) # randomly select some of the test images
mask = np.array(mask, dtype=np.int)
data = images[mask]
[ ]:
response = predictor.predict(np.expand_dims(data, axis=1))
print("Raw prediction result:")
print(response)
print()

labeled_predictions = list(zip(range(10), response[0]))
print("Labeled predictions: ")
print(labeled_predictions)
print()

labeled_predictions.sort(key=lambda label_and_prob: 1.0 - label_and_prob[1])
print("Most likely answer: {}".format(labeled_predictions[0]))

Cleanup

After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it

[ ]:
tuner.delete_endpoint()