AutoGluon Tabular with SageMaker¶
AutoGluon automates machine learning tasks enabling you to easily achieve strong predictive performance in your applications. With just a few lines of code, you can train and deploy high-accuracy deep learning models on tabular, image, and text data. This notebook shows how to use AutoGluon-Tabular with Amazon SageMaker by creating custom containers.
Prerequisites¶
If using a SageMaker hosted notebook, select kernel conda_mxnet_p36.
[ ]:
# Make sure docker compose is set up properly for local mode
!./setup.sh
[ ]:
# Imports
import os
import boto3
import sagemaker
from time import sleep
from collections import Counter
import numpy as np
import pandas as pd
from sagemaker import get_execution_role, local, Model, utils, fw_utils, s3
from sagemaker.estimator import Estimator
from sagemaker.predictor import RealTimePredictor, csv_serializer, StringDeserializer
from sklearn.metrics import accuracy_score, classification_report
from IPython.core.display import display, HTML
from IPython.core.interactiveshell import InteractiveShell
# Print settings
InteractiveShell.ast_node_interactivity = "all"
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 10)
# Account/s3 setup
session = sagemaker.Session()
local_session = local.LocalSession()
bucket = session.default_bucket()
prefix = 'sagemaker/autogluon-tabular'
region = session.boto_region_name
role = get_execution_role()
client = session.boto_session.client(
"sts", region_name=region, endpoint_url=utils.sts_regional_endpoint(region)
)
account = client.get_caller_identity()['Account']
ecr_uri_prefix = utils.get_ecr_image_uri_prefix(account, region)
registry_id = fw_utils._registry_id(region, 'mxnet', 'py3', account, '1.6.0')
registry_uri = utils.get_ecr_image_uri_prefix(registry_id, region)
Build docker images¶
First, build autogluon package to copy into docker image.
[ ]:
if not os.path.exists('package'):
!pip install PrettyTable -t package
!pip install --upgrade boto3 -t package
!pip install bokeh -t package
!pip install --upgrade matplotlib -t package
!pip install autogluon -t package
Now build the training/inference image and push to ECR
[ ]:
training_algorithm_name = 'autogluon-sagemaker-training'
inference_algorithm_name = 'autogluon-sagemaker-inference'
[ ]:
!./container-training/build_push_training.sh {account} {region} {training_algorithm_name} {ecr_uri_prefix} {registry_id} {registry_uri}
!./container-inference/build_push_inference.sh {account} {region} {inference_algorithm_name} {ecr_uri_prefix} {registry_id} {registry_uri}
Get the data¶
[ ]:
# Download and unzip the data
!aws s3 cp --region {region} s3://sagemaker-sample-data-{region}/autopilot/direct_marketing/bank-additional.zip .
!unzip -qq -o bank-additional.zip
!rm bank-additional.zip
local_data_path = './bank-additional/bank-additional-full.csv'
data = pd.read_csv(local_data_path)
# Split train/test data
train = data.sample(frac=0.7, random_state=42)
test = data.drop(train.index)
# Split test X/y
label = 'y'
y_test = test[label]
X_test = test.drop(columns=[label])
Check the data¶
[ ]:
train.head(3)
train.shape
test.head(3)
test.shape
X_test.head(3)
X_test.shape
Upload the data to s3
[ ]:
train_file = 'train.csv'
train.to_csv(train_file,index=False)
train_s3_path = session.upload_data(train_file, key_prefix='{}/data'.format(prefix))
test_file = 'test.csv'
test.to_csv(test_file,index=False)
test_s3_path = session.upload_data(test_file, key_prefix='{}/data'.format(prefix))
X_test_file = 'X_test.csv'
X_test.to_csv(X_test_file,index=False)
X_test_s3_path = session.upload_data(X_test_file, key_prefix='{}/data'.format(prefix))
Hyperparameter Selection¶
The minimum required settings for training is just a target label, fit_args['label'].
Additional optional hyperparameters can be passed to the autogluon.task.TabularPrediction.fit function via fit_args.
Below shows a more in depth example of AutoGluon-Tabular hyperparameters from the example Predicting Columns in a Table - In Depth. Please see fit parameters for further information. Note that in order for hyperparameter ranges to work in SageMaker, values
passed to the fit_args['hyperparameters'] must be represented as strings.
nn_options = {
'num_epochs': "10",
'learning_rate': "ag.space.Real(1e-4, 1e-2, default=5e-4, log=True)",
'activation': "ag.space.Categorical('relu', 'softrelu', 'tanh')",
'layers': "ag.space.Categorical([100],[1000],[200,100],[300,200,100])",
'dropout_prob': "ag.space.Real(0.0, 0.5, default=0.1)"
}
gbm_options = {
'num_boost_round': "100",
'num_leaves': "ag.space.Int(lower=26, upper=66, default=36)"
}
model_hps = {'NN': nn_options, 'GBM': gbm_options}
fit_args = {
'label': 'y',
'presets': ['best_quality', 'optimize_for_deployment'],
'time_limits': 60*10,
'hyperparameters': model_hps,
'hyperparameter_tune': True,
'search_strategy': 'skopt'
}
hyperparameters = {
'fit_args': fit_args,
'feature_importance': True
}
Note: Your hyperparameter choices may affect the size of the model package, which could result in additional time taken to upload your model and complete training. Including 'optimize_for_deployment' in the list of fit_args['presets'] is recommended to greatly reduce upload times.
[ ]:
# Define required label and optional additional parameters
fit_args = {
'label': 'y',
# Adding 'best_quality' to presets list will result in better performance (but longer runtime)
'presets': ['optimize_for_deployment'],
}
# Pass fit_args to SageMaker estimator hyperparameters
hyperparameters = {
'fit_args': fit_args,
'feature_importance': True
}
Train¶
train_instance_type to local .ml.m5.2xlarge.Note: Depending on how many underlying models are trained, train_volume_size may need to be increased so that they all fit on disk.
[ ]:
%%time
instance_type = 'ml.m5.2xlarge'
#instance_type = 'local'
ecr_image = f'{ecr_uri_prefix}/{training_algorithm_name}:latest'
estimator = Estimator(image_name=ecr_image,
role=role,
train_instance_count=1,
train_instance_type=instance_type,
hyperparameters=hyperparameters,
train_volume_size=100)
# Set inputs. Test data is optional, but requires a label column.
inputs = {'training': train_s3_path, 'testing': test_s3_path}
estimator.fit(inputs)
Create Model¶
[ ]:
# Create predictor object
class AutoGluonTabularPredictor(RealTimePredictor):
def __init__(self, *args, **kwargs):
super().__init__(*args, content_type='text/csv',
serializer=csv_serializer,
deserializer=StringDeserializer(), **kwargs)
[ ]:
ecr_image = f'{ecr_uri_prefix}/{inference_algorithm_name}:latest'
if instance_type == 'local':
model = estimator.create_model(image=ecr_image, role=role)
else:
model_uri = os.path.join(estimator.output_path, estimator._current_job_name, "output", "model.tar.gz")
model = Model(model_uri, ecr_image, role=role, sagemaker_session=session, predictor_cls=AutoGluonTabularPredictor)
Batch Transform¶
For local mode, either s3://<bucket>/<prefix>/output/ or file:///<absolute_local_path> can be used as outputs.
By including the label column in the test data, you can also evaluate prediction performance (In this case, passing test_s3_path instead of X_test_s3_path).
[ ]:
output_path = f's3://{bucket}/{prefix}/output/'
# output_path = f'file://{os.getcwd()}'
transformer = model.transformer(instance_count=1,
instance_type=instance_type,
strategy='MultiRecord',
max_payload=6,
max_concurrent_transforms=1,
output_path=output_path)
transformer.transform(test_s3_path, content_type='text/csv', split_type='Line')
transformer.wait()
Endpoint¶
Deploy remote or local endpoint¶
[ ]:
instance_type = 'ml.m5.2xlarge'
#instance_type = 'local'
predictor = model.deploy(initial_instance_count=1,
instance_type=instance_type)
Attach to endpoint (or reattach if kernel was restarted)¶
[ ]:
# Select standard or local session based on instance_type
if instance_type == 'local':
sess = local_session
else:
sess = session
# Attach to endpoint
predictor = AutoGluonTabularPredictor(predictor.endpoint, sagemaker_session=sess)
Predict on unlabeled test data¶
[ ]:
results = predictor.predict(X_test.to_csv(index=False)).splitlines()
# Check output
print(Counter(results))
Predict on data that includes label column¶
Prediction performance metrics will be printed to endpoint logs.
[ ]:
results = predictor.predict(test.to_csv(index=False)).splitlines()
# Check output
print(Counter(results))
Check that classification performance metrics match evaluation printed to endpoint logs as expected¶
[ ]:
y_results = np.array(results)
print("accuracy: {}".format(accuracy_score(y_true=y_test, y_pred=y_results)))
print(classification_report(y_true=y_test, y_pred=y_results, digits=6))
Clean up endpoint¶
[ ]:
predictor.delete_endpoint()