Training a TensorFlow image classifier on AWS
Amazon SageMaker is an amazing service that allows you to build complete, end-to-end machine learning pipelines, from model training to production deployment. It comes with a large set of built-in algorithms that you can use straight away, but it also lets you train your own custom models via Docker images, giving you total control over the fun part of the process (i.e., model architecture and training logic) while abstracting away the boring part (model serving, instance creation, load balancing, scaling, etc.).
In this tutorial series, I’m going to show you how you can use Amazon SageMaker and Docker to train and save a deep learning model, and then deploy your trained model as an inference endpoint.
In the first part (this post), we’ll go through all the steps necessary to train a TensorFlow model using Amazon SageMaker, and in the second part (coming soon, stay tuned 😊), you’ll learn how to create prediction endpoints that you can later use in your apps.
You will need the AWS CLI and Docker installed on your machine (optional: install the Kaggle CLI to download the dataset). Make sure you have access to an AWS account and have the necessary credentials in your .aws/credentials file (you can use the aws configure command to set up an AWS profile). Don’t forget to run export AWS_PROFILE=<profile name> to tell the AWS CLI which profile you’ll be using.
For this project, you will need to have access to the following AWS services:
- S3 to store your data and model weights;
- Elastic Container Registry (ECR) to store Docker images for your training containers;
- IAM to create roles;
- and, finally, Amazon SageMaker, which will use all of the above services to launch training jobs.
To begin, create a new folder on your machine for this project, cd into it, and let’s go!
Step 1: Upload the dataset to AWS S3
You might want to skip this step if you prefer to download the data directly from your code for each training run, but for the purposes of this tutorial, we are going to use S3 as our data source.
Let’s first create an S3 bucket where we’ll store our training data with the following command (or you can simply use the AWS console). You can choose any available name (S3 bucket names are globally unique), but it should preferably contain the word “sagemaker” - e.g., something like tutorial-sagemaker-data-12345 - as this will make the following steps a little easier.
aws s3 mb s3://YOUR-DATA-BUCKET-NAME
While we’re here, let’s also create another bucket where we’ll store the training output:
aws s3 mb s3://YOUR-OUTPUT-BUCKET-NAME
Then let’s download the Flowers Recognition dataset from Kaggle and upload it to the created S3 bucket:
kaggle datasets download -d alxmamaev/flowers-recognition
unzip flowers-recognition.zip
aws s3 sync flowers s3://YOUR-DATA-BUCKET-NAME
Step 2: Write a training script
We’ll train a simple TensorFlow image classifier with a short training script.
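Here is a minimal sketch of what such a script (let’s call it train.py - the file name is up to you) could look like. The architecture and hyperparameters are just placeholders; the input and output paths follow the SageMaker conventions described in Step 3:

```python
import os
import tensorflow as tf

# SageMaker mounts the "training" input channel here and uploads
# everything written to /opt/ml/model to the output S3 bucket.
INPUT_DIR = "/opt/ml/input/data/training"
MODEL_DIR = "/opt/ml/model"

IMG_SIZE = (180, 180)
BATCH_SIZE = 32
EPOCHS = 10


def main():
    # Load the flower images; each subfolder of INPUT_DIR is treated as a class.
    train_ds = tf.keras.utils.image_dataset_from_directory(
        INPUT_DIR,
        validation_split=0.2,
        subset="training",
        seed=42,
        image_size=IMG_SIZE,
        batch_size=BATCH_SIZE,
    )
    val_ds = tf.keras.utils.image_dataset_from_directory(
        INPUT_DIR,
        validation_split=0.2,
        subset="validation",
        seed=42,
        image_size=IMG_SIZE,
        batch_size=BATCH_SIZE,
    )
    num_classes = len(train_ds.class_names)

    # A deliberately simple CNN - swap in whatever architecture you like.
    model = tf.keras.Sequential([
        tf.keras.layers.Rescaling(1.0 / 255, input_shape=IMG_SIZE + (3,)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(num_classes),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

    model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)

    # Everything saved here ends up in the output S3 bucket.
    model.save(os.path.join(MODEL_DIR, "flowers-model"))


if __name__ == "__main__":
    main()
```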
Let’s also add a requirements.txt file with the Python dependencies for our script.
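For the training sketch above, a single pinned dependency is enough (the version shown is only an example - pick one that’s compatible with the CUDA libraries in your Dockerfile):

```
tensorflow==2.11.0
```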
Step 3: Create a Docker image for your training container
Now we need to create a Docker image that will be used to create a training instance on AWS. We are going to write a Dockerfile describing what gets installed and copied to this training instance. Personally, I found this part to be the longest, but its complexity depends entirely on your training needs, i.e., whether you’d like to train your model on a GPU, whether it requires any special libraries to be installed, etc.
There aren’t really any constraints as to what you can do in this Dockerfile. However, there are a number of directories reserved by SageMaker for its purposes, such as for storing input data or saving output files after the training job has finished. Specifically, your training algorithm needs to look for data under /opt/ml/input/data (one subfolder per input channel), and store model artifacts (and whatever other output you’d like to keep for later) in /opt/ml/model. SageMaker will copy the training data we’ve uploaded to S3 into the input folder, and copy everything from the model folder to the output S3 bucket.
Here’s an example of a Dockerfile that you might use for training TensorFlow models on GPU instances, which is largely based on what I found in this repo. Feel free to modify it to use a different version of CUDA or Python, or to install whatever else you need for training your model.
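A sketch along these lines should work (the base-image tag, Python version, and file names are assumptions - adjust them to match your TensorFlow version and project layout):

```dockerfile
FROM nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu20.04

# Make Python logs show up in CloudWatch without buffering delays.
ENV PYTHONUNBUFFERED=TRUE

# Install Python and pip.
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Install the project dependencies.
COPY requirements.txt /opt/program/requirements.txt
RUN pip3 install --no-cache-dir -r /opt/program/requirements.txt

# Copy the training script into the image.
COPY train.py /opt/program/train.py
WORKDIR /opt/program

# SageMaker starts the container with the argument "train";
# our script takes no command-line arguments, so it simply ignores it.
ENTRYPOINT ["python3", "train.py"]
```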
Alternatively, you can first build a base image from one of the Dockerfiles that best suits your case, and reference it in the FROM directive of your training Dockerfile. This way you only have to add your files and install your project dependencies. Here’s a list of repositories with Dockerfiles for TensorFlow, MXNet, Chainer, and PyTorch.
Step 3.1: Push the Docker Image to Elastic Container Registry
For SageMaker to use our Dockerfile for training containers, we need to build the training image and upload it to AWS’s container registry, Elastic Container Registry (ECR), where SageMaker will look for it.
In order to push images to ECR, we first need to retrieve an authentication token and use it to log in from the Docker client. To do this, run the command below (you’ll need to fill in your AWS account ID and region):
aws ecr get-login-password | docker login --username AWS --password-stdin AWS_ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com
You should see a message with something like “Login Succeeded”.
Let’s create an ECR repository named “sagemaker-images” where we’ll store the training images. You can choose any other name, but - again - it will be easier for the following steps if the name contains the word “sagemaker” in it.
aws ecr create-repository --repository-name sagemaker-images
This should output the new repository information. We’re interested in the value of the repositoryUri field, which we’ll use to tag the Docker image.
Now we can build the Docker image and push it to the new repo:
docker build -t REPOSITORY_URI .
docker push REPOSITORY_URI
You can open the Elastic Container Registry service in the AWS console and check that the image has been successfully pushed.
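You can also verify it from the command line (assuming you kept the repository name from above):
aws ecr describe-images --repository-name sagemaker-images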
Step 4: Create a role for SageMaker training jobs
In order to launch training jobs on SageMaker, we first need to create a role that will give SageMaker permissions to access all the necessary resources. To keep things simple, we’ll create a role based on the AmazonSageMakerFullAccess policy. This policy gives SageMaker access to all S3 buckets and ECR images, along with some other resources, as long as they include the word “sagemaker” in their name. This means that if your data bucket or image repository is named differently, you will need to add additional role policies to give SageMaker access to your resources. With this in mind, I encourage you to read more about SageMaker Roles and only add permissions that you need, especially in a production setting.
Copy the following into a file named role-policy.json:
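This is the standard trust policy that allows the SageMaker service to assume the role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "sagemaker.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```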
Now run the two commands below to create a role and attach a policy to it:
aws iam create-role --role-name SagemakerRole --assume-role-policy-document file://./role-policy.json
(Note the value of the Arn field in the output: you’ll need it later on.)
aws iam attach-role-policy --role-name SagemakerRole --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
If necessary, create additional permissions:
aws iam put-role-policy --role-name SagemakerRole --policy-name SagemakerS3Access --policy-document file://./s3-policy.json
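The --policy-name value above is arbitrary, and s3-policy.json is not prescribed by SageMaker. As a sketch, it could grant access to a data bucket whose name doesn’t contain “sagemaker”, like this (replace the bucket name with yours):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::YOUR-DATA-BUCKET-NAME",
        "arn:aws:s3:::YOUR-DATA-BUCKET-NAME/*"
      ]
    }
  ]
}
```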
Step 5: Configure the SageMaker training job
To train a model with Amazon SageMaker, we need to create a training job. Create a file named training-job-config.json from the template below and fill in the blanks with your data. This file will contain the description of the training job:
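Here’s a template covering the fields used in this tutorial (the account ID, region, bucket names, job name, and instance type are placeholders; the field names come from SageMaker’s CreateTrainingJob API, and StoppingCondition is required by the API, so it’s included too):

```json
{
  "TrainingJobName": "flowers-classifier-001",
  "AlgorithmSpecification": {
    "TrainingImage": "AWS_ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com/sagemaker-images:latest",
    "TrainingInputMode": "File"
  },
  "RoleArn": "arn:aws:iam::AWS_ACCOUNT_ID:role/SagemakerRole",
  "InputDataConfig": [
    {
      "ChannelName": "training",
      "ContentType": "application/x-image",
      "DataSource": {
        "S3DataSource": {
          "S3DataType": "S3Prefix",
          "S3Uri": "s3://YOUR-DATA-BUCKET-NAME/",
          "S3DataDistributionType": "FullyReplicated"
        }
      }
    }
  ],
  "OutputDataConfig": {
    "S3OutputPath": "s3://YOUR-OUTPUT-BUCKET-NAME/"
  },
  "ResourceConfig": {
    "InstanceType": "ml.p2.xlarge",
    "InstanceCount": 1,
    "VolumeSizeInGB": 30
  },
  "StoppingCondition": {
    "MaxRuntimeInSeconds": 86400
  }
}
```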
Here are the fields you’ll need to fill in:
- TrainingJobName - a name for your training job that will be used as its identifier;
- AlgorithmSpecification/TrainingImage - the URI of the image you pushed to ECR;
- RoleArn - the ARN of the role created earlier;
- InputDataConfig/ChannelName - the name of the input channel. Each channel will be mounted as a separate folder under /opt/ml/input/data;
- InputDataConfig/ContentType - the MIME type of your data;
- InputDataConfig/DataSource/S3DataSource/S3Uri - the URI of the S3 location containing your data;
- OutputDataConfig/S3OutputPath - the S3 path where you would like to store model artifacts;
- ResourceConfig/InstanceType - the type of instance you want to use for training. You can compare the capacities of the supported instances here (GPU instances can be expensive, so don’t forget to check the pricing!);
- ResourceConfig/InstanceCount - the number of training instances;
- ResourceConfig/VolumeSizeInGB - the size of the storage you want to provision.
This can also be done manually from the AWS console, but it’s always a good idea to have all the settings you need in one file, especially if you plan on launching a lot of experiments.
Here are a few (but not all) other useful settings you might want to include:
- StoppingCondition - you can set stopping conditions, such as the maximum run time in seconds. By default, Amazon SageMaker shuts down the training instances after 1 day;
- AlgorithmSpecification/MetricDefinitions - a list of regexes that your training logs will be parsed with; the matched values are saved as metrics that you can track in CloudWatch (see the snippet after this list);
- TensorBoardOutputConfig - allows you to specify an S3 bucket for storing TensorBoard logs.
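As an illustration, the metric and TensorBoard settings could be added to training-job-config.json like this (the metric name and regex are only examples matching Keras-style log lines):

```json
"AlgorithmSpecification": {
  "TrainingImage": "AWS_ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com/sagemaker-images:latest",
  "TrainingInputMode": "File",
  "MetricDefinitions": [
    { "Name": "val_accuracy", "Regex": "val_accuracy: ([0-9\\.]+)" }
  ]
},
"TensorBoardOutputConfig": {
  "S3OutputPath": "s3://YOUR-OUTPUT-BUCKET-NAME/tensorboard"
}
```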
You can learn more about the training job configuration in the SageMaker API Reference.
Step 6: Launch SageMaker Training Job
The following command will launch training (finally 😊):
aws sagemaker create-training-job --cli-input-json file://training-job-config.json
You can now monitor the status of the job from the AWS console, where you can also see the training logs and instance metrics.
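You can also check on the job from the command line (using the name you set in TrainingJobName):
aws sagemaker describe-training-job --training-job-name YOUR-TRAINING-JOB-NAME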
That’s how training with SageMaker containers works in a nutshell! This post only covered the basics, so I encourage you to read the documentation and explore the different configuration options that might be useful for your specific case.
After the training job is finished, you can go ahead and download your model weights from the output S3 bucket, as shown below. Don’t delete these files just yet! You’ll need them for the next part of this series, where you’ll learn how to create an inference endpoint for your trained model and use it to make predictions.
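SageMaker packs everything your script wrote to /opt/ml/model into a model.tar.gz archive under the output path you configured, so the download looks something like this (the exact prefix depends on your job name and output path):
aws s3 cp s3://YOUR-OUTPUT-BUCKET-NAME/YOUR-TRAINING-JOB-NAME/output/model.tar.gz .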
If there were any parts of this tutorial that didn’t work for you, please let me know in the comments! Good luck and see you in the next chapter!
If you are looking for Data Engineering experts, don't hesitate to contact us!