Preprocessing Images for Built-in Algorithms (Part 3/4)
Notes:
* This notebook should be used with the conda_amazonei_mxnet_p36 kernel.
* This notebook is part of a series of notebooks beginning with 01_download_data and 02_structuring_data. From here on it will focus on SageMaker’s built-in algorithms. The next notebook in this series is 04a_builtin_training.
* You can also explore preprocessing with TensorFlow and PyTorch by running 03b_tensorflow_preprocessing and 03c_pytorch_preprocessing, respectively.
In this notebook we will explore the different ways to format your image dataset for SageMaker’s built-in algorithms. The first involves creating a manifest file for each of the train and validation sets. The other has you create .REC files (RecordIO format), which are single binary files made up of all the images in the train and validation sets. Since the RecordIO format is preferred, we will upload the .REC files to S3 for training in the next notebook.
Overview
Application/x-recordio format (preferred format)
## Dependencies
[ ]:
import uuid
import boto3
import shutil
import urllib.request
import pickle
import pathlib
import sagemaker
import subprocess
Load Category Labels
The category_labels file was generated from the first notebook in this series 01_download_data.ipynb. You will need to run that notebook before running the code here.
[ ]:
with open("pickled_data/category_labels.pickle", "rb") as f:
    category_labels = pickle.load(f)
## Application/x-image format
This format is also referred to as “Image Format” or “LST” format. The benefit of using this format is that it doesn’t require any modification or restructuring of your dataset. Instead, you create a manifest of the images for your training set and another for your validation set. These two manifests are separate .lst files. Each one lists every image with a unique index, the class it belongs to, and the relative path to the image file from the main training folder. The fields in a .lst file are separated by tabs.
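To make the layout concrete, here is a minimal sketch that parses one .lst line into its three fields. The `LstEntry` class, the `parse_lst_line` helper, and the example path are hypothetical names used only for illustration. The label is parsed as a float on the assumption that some tools (im2rec.py among them) may write it in a form such as `0.000000`.

```python
from typing import NamedTuple


class LstEntry(NamedTuple):
    index: int    # unique image index
    label: float  # class id
    path: str     # image path relative to the data folder


def parse_lst_line(line: str) -> LstEntry:
    """Split one tab-separated .lst line into index, class label, and path."""
    index, label, path = line.rstrip("\n").split("\t")
    return LstEntry(int(index), float(label), path)


# Illustrative line only: index, class id, relative image path
example = "0000000007\t3\tclass_a/img_001.jpg\n"
entry = parse_lst_line(example)
print(entry)
```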
While it’s the easiest format to use, it requires SageMaker to do more work behind the scenes. For datasets with many images, this will cause training to take longer. For datasets with fewer images, the performance difference isn’t as pronounced.
Below are two examples of how to create your .LST manifest files. One uses your own code and the other uses a script from MXNet. If you want to create .REC files of your images, you should skip to Option 2.
Option 1: Manually generate the .LST files
[ ]:
category_ids = {name: idx for idx, name in enumerate(sorted(category_labels.values()))}
print(category_ids)
[ ]:
image_paths = pathlib.Path("./data_structured").rglob("*.jpg")
for idx, p in enumerate(image_paths):
    image_id = f"{idx:010}"
    # Paths look like data_structured/<split>/<category>/<image>.jpg
    category = category_ids[p.parts[-2]]
    path = p.as_posix()
    split = p.parts[-3]
    with open(f"{split}.lst", "a") as f:
        line = f"{image_id}\t{category}\t{path}\n"
        f.write(line)
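Because the loop above opens each .lst file in append mode, running the cell a second time duplicates every entry. One possible way around that, sketched below, is to group the lines by split first and then write each file once in write mode. The `write_lst_files` function and its `out_dir` parameter are hypothetical, and the sketch assumes the same `<split>/<category>/<image>.jpg` layout used above.

```python
import pathlib
from collections import defaultdict


def write_lst_files(data_dir, category_ids, out_dir="."):
    """Write one <split>.lst per split, overwriting any existing file."""
    lines_by_split = defaultdict(list)
    # Sorting makes the assigned indices reproducible between runs
    image_paths = sorted(pathlib.Path(data_dir).rglob("*.jpg"))
    for idx, p in enumerate(image_paths):
        split, category_name = p.parts[-3], p.parts[-2]
        category = category_ids[category_name]
        lines_by_split[split].append(f"{idx:010}\t{category}\t{p.as_posix()}\n")

    out_dir = pathlib.Path(out_dir)
    for split, lines in lines_by_split.items():
        # write_text truncates the file, so re-running never duplicates rows
        (out_dir / f"{split}.lst").write_text("".join(lines))
    return {split: len(lines) for split, lines in lines_by_split.items()}


# Usage in this notebook (assumes category_ids from the cell above):
# write_lst_files("./data_structured", category_ids)
```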
View the contents of the train.lst file
[ ]:
!head train.lst
Option 2: Use im2rec.py script to generate the .LST files
[ ]:
script_url = "https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/im2rec.py"
urllib.request.urlretrieve(script_url, "im2rec.py");
python im2rec.py --list --recursive LST_FILE_PREFIX DATA_DIR
* --list - generate an LST file
* --recursive - looks inside subfolders for image data
* LST_FILE_PREFIX - choose the name you want for the .lst file
* DATA_DIR - relative path to directory with the data
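If you would rather launch the script from Python than through the `!` shell escape, the standard `subprocess` module can run the same command. The sketch below only builds the argument list. The `im2rec_list_cmd` helper is a hypothetical name, and it assumes im2rec.py sits in the current working directory.

```python
import subprocess
import sys


def im2rec_list_cmd(lst_prefix, data_dir, script="im2rec.py"):
    """Build the argument list for generating an .lst file with im2rec.py."""
    return [sys.executable, script, "--list", "--recursive", lst_prefix, data_dir]


cmd = im2rec_list_cmd("train", "data_structured/train")
print(" ".join(cmd[1:]))

# To actually run it (requires im2rec.py and its dependencies to be present):
# subprocess.run(cmd, check=True)
```

Passing `check=True` makes `subprocess.run` raise an exception if the script exits with an error, so a failed run does not go unnoticed.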
[ ]:
!python im2rec.py --list --recursive train data_structured/train
[ ]:
!python im2rec.py --list --recursive val data_structured/val
View the contents of the train.lst file
[ ]:
!head train.lst
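Beyond looking at the first few rows, a quick count of images per class can reveal an empty or lopsided class before training. This is a minimal sketch: `count_labels` is a hypothetical helper, the sample rows are made up, and it assumes the label is the second tab-separated column, possibly written as a float.

```python
from collections import Counter


def count_labels(lst_lines):
    """Count how many rows of a .lst file belong to each class label."""
    counts = Counter()
    for line in lst_lines:
        if not line.strip():
            continue  # skip blank lines
        _, label, _ = line.rstrip("\n").split("\t")
        counts[int(float(label))] += 1
    return counts


# Made-up rows standing in for a real file
sample = [
    "0\t0.000000\tclass_a/1.jpg\n",
    "1\t1.000000\tclass_b/2.jpg\n",
    "2\t0.000000\tclass_a/3.jpg\n",
]
print(count_labels(sample))

# Usage with a real file:
# with open("train.lst") as f:
#     print(count_labels(f))
```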
## Application/x-recordio (preferred format)
This format is commonly referred to as RecordIO. It creates a new file with the .rec suffix for each of your training and validation datasets. The .rec file is a single file that contains all of the images in the dataset, so it can be streamed directly to the SageMaker training algorithm without the overhead involved in transferring thousands of individual files. For datasets with many images, this can substantially reduce training time because SageMaker doesn’t need to download all the image files before it can run the training algorithm. If you use the im2rec.py script, it can also resize the images for you. Resizing the files before saving them in the RecordIO format reduces the amount of data you need to transfer to S3 and speeds up training, because the resizing is done ahead of time instead of during training.
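The idea of packing many images into one sequentially readable file can be illustrated with a toy length-prefixed container. This is only a conceptual sketch: it is not the actual RecordIO on-disk format, it cannot be read by MXNet or SageMaker, and the function names are hypothetical.

```python
import io
import struct


def pack_records(blobs, stream):
    """Write each blob preceded by its length as a 4-byte little-endian integer."""
    for blob in blobs:
        stream.write(struct.pack("<I", len(blob)))
        stream.write(blob)


def iter_records(stream):
    """Yield blobs one at a time, reading the stream strictly front to back."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # end of stream
        (length,) = struct.unpack("<I", header)
        yield stream.read(length)


# Three stand-ins for encoded image bytes
images = [b"first-image", b"second", b"third-image-bytes"]
buffer = io.BytesIO()
pack_records(images, buffer)
buffer.seek(0)
unpacked = list(iter_records(buffer))
```

Because the reader never needs to seek or open another file, a consumer can start processing the first record while the rest of the file is still arriving.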
1. Run Option 2 from application/x-image above and copy LST files
Once you’ve run Option 2 from above then proceed below.
[ ]:
recordio_dir = pathlib.Path("./data_recordio")
recordio_dir.mkdir(exist_ok=True)
shutil.copy("train.lst", "data_recordio/")
shutil.copy("val.lst", "data_recordio/");
2. Generate .rec files in the RecordIO Format
Once the .lst file is generated, the same im2rec.py script will also generate the .rec file.
python im2rec.py --resize 224 --quality 90 --num-thread 16 LST_FILE_PREFIX DATA_DIR/
* --resize: Have the script resize the files before saving them all to a .rec file. For the image classification algorithm the default dimensions are 224x224. Resizing now will also reduce the size of your .rec file.
* --quality: Quality setting used when the images are encoded. A lower value compresses more, which keeps the file size of your .rec down, especially if you’re not resizing them.
* --num-thread: Set how many threads to use to parallelize the work
* LST_FILE_PREFIX: Name of the .lst you’re referencing for creating the .rec file
* DATA_DIR: Relative path to the directory which holds the data listed in the .lst file
Training dataset
[ ]:
!python im2rec.py --resize 224 --quality 90 --num-thread 16 data_recordio/train data_structured/train
Validation dataset
[ ]:
!python im2rec.py --resize 224 --quality 90 --num-thread 16 data_recordio/val data_structured/val
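Once both commands have finished, it may be worth confirming that the .rec files were actually written before moving on to the upload. The helper below is a small sketch. `rec_file_sizes` is a hypothetical name, and it assumes the files are named train.rec and val.rec inside the output directory.

```python
import pathlib


def rec_file_sizes(directory, prefixes=("train", "val")):
    """Return {filename: size in bytes} for each expected .rec file.

    Raises FileNotFoundError if one of the files is missing.
    """
    sizes = {}
    for prefix in prefixes:
        rec_path = pathlib.Path(directory) / f"{prefix}.rec"
        if not rec_path.is_file():
            raise FileNotFoundError(f"Expected {rec_path} to exist")
        sizes[rec_path.name] = rec_path.stat().st_size
    return sizes


# Usage after running im2rec.py:
# print(rec_file_sizes("data_recordio"))
```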
## Upload the data to S3
In order for SageMaker’s built-in algorithms to train on the data, it must be stored in an S3 bucket. Here, we will create a bucket, but you can use an existing bucket if you like by replacing the bucket_name variable in the first line of the else statement below.
Create a bucket for your project
[ ]:
if pathlib.Path("pickled_data/builtin_bucket_name.pickle").exists():
    # Reuse the bucket created on a previous run
    with open("pickled_data/builtin_bucket_name.pickle", "rb") as f:
        bucket_name = pickle.load(f)
        print("Bucket Name:", bucket_name)
else:
    bucket_name = f"sagemaker-builtin-ic-{str(uuid.uuid4())}"
    s3 = boto3.resource("s3")
    region = sagemaker.Session().boto_region_name
    if region == "us-east-1":
        # us-east-1 rejects an explicit LocationConstraint
        s3.create_bucket(Bucket=bucket_name)
    else:
        bucket_config = {"LocationConstraint": region}
        s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration=bucket_config)
    # Save the name so later runs and notebooks reuse the same bucket
    with open("pickled_data/builtin_bucket_name.pickle", "wb") as f:
        pickle.dump(bucket_name, f)
        print("Bucket Name:", bucket_name)
Upload .rec files to S3
[ ]:
s3_uploader = sagemaker.s3.S3Uploader()
data_path = recordio_dir / "train.rec"
data_s3_uri = s3_uploader.upload(
    local_path=data_path.as_posix(), desired_s3_uri=f"s3://{bucket_name}/data/train"
)
[ ]:
data_path = recordio_dir / "val.rec"
data_s3_uri = s3_uploader.upload(
    local_path=data_path.as_posix(), desired_s3_uri=f"s3://{bucket_name}/data/val"
)
Rollback to default version of SDK and TensorFlow
Only do this if you’re done with this guide and want to use the same kernel for other notebooks with an incompatible version of the SageMaker SDK or TensorFlow.
[ ]:
# print(f'Original version: {original_sagemaker_version[0]}')
# print(f'Current version: {sagemaker.__version__}')
# print('')
# print(f'Rolling back to {original_sagemaker_version[0]}')
# print('Restart notebook kernel to use changes.')
# print('')
# s = f'sagemaker=={original_sagemaker_version[0]}'
# !{sys.executable} -m pip install -q {s}
Next Steps
Now that the training and validation data have been uploaded to S3, the next notebook will use SageMaker’s built-in Image Classification algorithm to train a deep learning model to classify the animal images.