Structuring Your Data (Part 2/4)

Download | Structure | Preprocessing | Train Model

In this notebook, you will properly structure your image files for ingestion by SageMaker Built-in Algorithms, TensorFlow or PyTorch data loaders. To do this, we will split out data into train, validation and test sets. Then, we will use Python to create the new folder structure and copy the files into the correct set and label folder.

[3]:

import shutil
import pickle
import numpy as np
from tqdm import tqdm
from pathlib import Path

## Proper folder structure ___

Although most tools can accommodate data in any file structure with enough tinkering, it makes most sense to use the sensible defaults that frameworks like MXNet, TensorFlow and PyTorch all share to make data ingestion as smooth as possible. By default, most tools will look for image data in the file structure depicted below:

+-- train
|   +-- class_A
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|   +-- class_B
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|
+-- val
|   +-- class_A
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|   +-- class_B
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|
+-- test
|   +-- class_A
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|   +-- class_B
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg

You will notice that the COCO dataset does not come structured like above so we must use the annotation data to help restructure the folders of the COCO dataset so they match the pattern above. Once the new directory structures are created you can use your desired framework’s data loading tool to gracefully load and define transformation for your image data. Many datasets may already be in this structure in which case you can skip this guide.

## Load annotation category labels ___ The sample_annos and category_labels files were generated from the first notebook in this series 01_download_data.ipynb. You will need to run that notebook before running the code here.

[4]:

with open("pickled_data/sample_annos.pickle", "rb") as f:
    sample_annos = pickle.load(f)

with open("pickled_data/category_labels.pickle", "rb") as f:
    category_labels = pickle.load(f)

## Make train, validation and test splits ___ We should divide our data into train, validation and test splits. A typical split ratio is 80/10/10. Our image classification algorithm will train on the first 80% (training) and evaluate its performance at each epoch with the next 10% (validation) and we’ll give our model’s final accuracy results using the last 10% (test). It’s important that before we split the data we make sure to shuffle it randomly so that class distribution among splits is

roughly proportional.

[5]:

np.random.seed(0)
image_ids = sorted(list(sample_annos.keys()))
np.random.shuffle(image_ids)
first_80 = int(len(image_ids) * 0.8)
next_10 = int(len(image_ids) * 0.9)
train_ids, val_ids, test_ids = np.split(image_ids, [first_80, next_10])

## Make new folder structure and copy image files ___ This new folder structure can then be read by data loaders for SageMaker’s built-in algorithms, TensorFlow or PyTorch for easy loading of the image data into your framework of choice.

[ ]:

unstruct_dir = Path("data_sample_2750")
struct_dir = Path("data_structured")
struct_dir.mkdir(exist_ok=True, parents=True)

for name, split in zip(["train", "val", "test"], [train_ids, val_ids, test_ids]):
    split_dir = struct_dir / name
    split_dir.mkdir(exist_ok=True)
    for image_id in tqdm(split):
        category_dir = split_dir / f'{category_labels[sample_annos[image_id]["category_id"]]}'
        category_dir.mkdir(exist_ok=True)
        source_path = (unstruct_dir / sample_annos[image_id]["file_name"]).as_posix()
        target_path = (category_dir / sample_annos[image_id]["file_name"]).as_posix()
        shutil.copy(source_path, target_path)

Next steps

Now that the images can be easily loaded by the framework of your choice, the next step is to choose a framework. In this series, we cover SageMaker’s built-in algorithms, TensorFlow and PyTorch. You can choose the next notebook depending on the framework you want to learn more about. Once the data is loaded into the framework, we’ll cover preprocessing, file formats, transformations and augmentations.