Connect with us


Amazon Simple Queue Service (SQS) – 15 Years and Still Queueing!

Time sure does fly! I wrote about the production launch of Amazon Simple Queue Service (SQS) back in 2006, with no inkling that I would still be blogging fifteen years later, or that this service would still be growing rapidly while remaining fundamental to the architecture of so many different types of web-scale applications. The…



Time sure does fly! I wrote about the production launch of Amazon Simple Queue Service (SQS) back in 2006, with no inkling that I would still be blogging fifteen years later, or that this service would still be growing rapidly while remaining fundamental to the architecture of so many different types of web-scale applications.

The first beta of SQS was announced with little fanfare in late 2004. Even though we have added many features since that beta, the original description (“a reliable, highly scalable hosted queue for buffering messages between distributed application components”) still applies. Our customers think of SQS as an infinite buffer in the cloud and use SQS queues to implement the connections between the functional parts of their application architecture.

Over the years, we have worked hard to keep the SQS interface simple and easy to use, even though there’s a lot of complexity inside. In order to make SQS scalable, durable, and reliable, messages are stored in a fleet that consists of thousands of servers in each AWS Region. Within a region, we save three copies of each message, taking care to distribute the messages across storage nodes and Availability Zones. In addition to this built-in redundant storage, SQS is self-healing and resilient to host failures & network interruptions. Even though SQS is now 15 years old, we continue to improve our scaling and load management applications so that you can always count on SQS to handle your mission-critical workloads.

SQS runs at an incredible scale. For example:

Amazon Product Fulfillment – During Prime Day 2021, traffic to SQS set a new record, peaking at 47.7 million messages per second.

Rapid7 – Amazon customer Rapid7 makes extensive use of SQS. According to Jeff Myers (Platform Software Architect):

Amazon SQS provides us with a simple to use, highly performant, and highly scalable messaging service without any management headaches. It is a crucial component of our architecture that has allowed us to scale to handle 10s of billions of messages per day.

You can visit the Amazon SQS home page to read about other high-scale use cases from NASA, BMW Group, Capital One, and other AWS customers.

Serverless Office Hours
Be sure to join us today (July 13th) for Serverless Office Hours at Noon PT on Rumor has it that there will be cake!

Back in Time
Here’s a quick recap of some of the major SQS milestones:

2006 – Production launch. An unlimited number of queues per account and items per queue, with each item up to 8 KB in length. Pay as you go pricing based on messages sent and data transferred.

2007 – Additional functions including control of the visibility timeout and access to the approximate number of queued messages.

2009 – SQS in Europe, and additional control over queue access via AddPermission and RemovePermission. This launch also marked the debut of what we then called the Access Policy Language, which has evolved into today’s IAM Policies.

2010 – A new free tier (100K requests/month then, since expanded to 1 million requests/month), configurable maximum message length (up to 64 KB), and configurable message retention time.

2011 – Additional CloudWatch Metrics for each SQS queue. Later that year we added batch operations (SendMessageBatch and DeleteMessageBatch), delay queues, and message timers.

2012 – Support for long polling, along with SDK support for request batching and client-side buffering.

2013 – Support for even larger payloads (256 KB) and a 50% price reduction.

2014 – Support for dead letter queues, to accept messages that have become stuck due to a timeout or to a processing error within your code.

2015 – Support for extended payloads (up to 2 GB) using the Extended Client Library.

2016 – Support for FIFO queues with exactly-once processing and deduplication and another price reduction.

2017 – Support for server-side encryption of messages and cost allocation tags.

2018 – Support for Amazon VPC Endpoints using AWS PrivateLink and the ability to invoke Lambda functions.

2019 – Support for Tag-on-Create and X-Ray tracing.

2020 – Support for 1-minute metrics for more fine-grained queue monitoring, a new console experience, and result pagination for the ListQueues and ListDeadLetterSourceQueues functions.

2021 – Tiered pricing so that you save money as your usage grows, and a High Throughput mode for FIFO queues.

Today, SQS remains simple, scalable, cost-effective, and highly reliable.

From the AWS Heroes
We asked some of the AWS Heroes to reflect on the success of SQS and to share some of their success stories. Here’s what we learned:

Eric Hammond (Serverless Hero) uses AWS Lambda Dead Letter Queues. They put alarms on the queues and send internal emails as alerts when problems need to be investigated.

Tom McLaughlin (Serverless Hero) has been using SQS since 2015. He said “My favorite use case is anytime someone wants a queue and I don’t want to manage a queuing platform. Which is always.”

Ken Collins (Serverless Hero) was not entirely sure how long he had been using SQS, and offered a range of 2 to 8 years! He uses it to power the Lambdakiq gem, which implements ActiveJob on SQS & Lambda.

Kyuhyun Byun (Serverless Hero) has been using SQS to power a push system that must sustain massive amounts of traffic, and tells us “Thanks to SQS, I don’t consider building a queuing system anymore.”

Prashanth HN (Serverless Hero) has been using SQS since 2017, and considers himself “late to the party.” He used SQS as part of his first serverless application, and finds that it is ideal for connecting services with differing throughput.

Ben Kehoe (Serverless Hero) told us that “I first saw the power of SQS when a colleague showed me how to retain state in a fleet of EC2 Spot instances by storing the state in SQS when an instance was getting shut down, and having new instances check the queue for state on startup.”

Jeremy Daly (Serverless Hero) started using SQS in 2010 as a lightweight queue that fed a facial recognition process on user-supplied images. Today, he often uses it as a buffer to throttle requests to downstream services that are not yet serverless.

Casey Lee (Container Hero) also started to use SQS in 2010 as a replacement for ActiveMQ between loosely coupled Java services. Casey implements auto scaling based on queue depth, and has found it to be an accurate way to handle the load.

Vlad Ionecsu (Container Hero) began his AWS journey with SQS back in 2014. Vlad found that the API was very easy to understand, and used SQS to power his thesis project.

Sheen Brisals (Serverless Hero) started to use SQS in 2018 while building a proof-of-concept that also used Lambda and S3. Sheen appreciates the ability to adjust the characteristics of each queue to create a good match for the message processing functions, and also makes use of both high and low priority queues.

Gojko Adzic (Serverless Hero) began to use SQS in 2013 as a task dispatch for exporters in MindMup. This online mind-mapping application allows large groups of users to collaborate, and requires strict ordering of updates to each document. Gojko used FIFO queues to allow them to process messages for different documents in parallel, with sequential order within each document.

Sebastian Müller (Serverless Hero) started to use SQS in 2016 by building a notification center for website-builder . The center ensures that customers are kept aware of events (orders, support messages, and contact requests) on a timely basis.

Luca Bianchi (Serverless Hero) first used SQS in 2012. He decoupled a pair of microservices running on AWS Elastic Beanstalk, and also created a fan-out processing system for a gamification platform. Today, his favorite SQS use case stores inference jobs and makes them available to a worker process running on Amazon SageMaker.

Peter Hanssens (Serverless Hero) uses SQS to offload tasks that do not need to be processed immediately. Several years ago, while assisting some data scientists, he created an event-driven batch-processing system that used a Lambda function to check a queue every 5 minutes and fire up EC2 instances to build models, while keeping strict control over the maximum number of running instances.

Serkan Ozal (Serverless Hero) has been using SQS since 2013 or so. He focuses on asynchronous message processing and counts on the ability of SQS to handle a huge number of events. He also makes uses of the message visibility feature so that he can re-process messages if necessary.

Matthieu Napoli (Serverless Hero) has been using SQS for about five years, starting out with EC2, and as a replacement for other queues. As he says, “Paired with Lambda, it gives massive parallelism out of the box without having to think too much about it. Plus the built-in failure handling makes it very robust.”

As you can see, there are a multitude of use cases for SQS.

SQS Resources
If you are not already using SQS, then it is time to get in the queue and get started. Here are some resources to help you find your way:

Happy queueing!






Continue Reading
Click to comment

Leave a Reply

Your email address will not be published.


Use a custom image to bring your own development environment to RStudio on Amazon SageMaker

RStudio on Amazon SageMaker is the industry’s first fully managed RStudio Workbench in cloud. You can quickly launch the familiar RStudio integrated development environment (IDE), and dial up and down the underlying compute resources without interrupting your work, making it easy to build machine learning (ML) and analytics solutions in R at scale. RStudio on…




RStudio on Amazon SageMaker is the industry’s first fully managed RStudio Workbench in cloud. You can quickly launch the familiar RStudio integrated development environment (IDE), and dial up and down the underlying compute resources without interrupting your work, making it easy to build machine learning (ML) and analytics solutions in R at scale. RStudio on SageMaker already comes with a built-in image preconfigured with R programming and data science tools; however, you often need to customize your IDE environment. Starting today, you can bring your own custom image with packages and tools of your choice, and make them available to all the users of RStudio on SageMaker in a few clicks.

Bringing your own custom image has several benefits. You can standardize and simplify the getting started experience for data scientists and developers by providing a starter image, preconfigure the drivers required for connecting to data stores, or pre-install specialized data science software for your business domain. Furthermore, organizations that have previously hosted their own RStudio Workbench may have existing containerized environments that they want to continue to use in RStudio on SageMaker.

In this post, we share step-by-step instructions to create a custom image and bring it to RStudio on SageMaker using the AWS Management Console or AWS Command Line Interface (AWS CLI). You can get your first custom IDE environment up and running in few simple steps. For more information on the content discussed in this post, refer to Bring your own RStudio image.

Solution overview

When a data scientist starts a new session in RStudio on SageMaker, a new on-demand ML compute instance is provisioned and a container image that defines the runtime environment (operating system, libraries, R versions, and so on) is run on the ML instance. You can provide your data scientists multiple choices for the runtime environment by creating custom container images and making them available on the RStudio Workbench launcher, as shown in the following screenshot.

The following diagram describes the process to bring your custom image. First you build a custom container image from a Dockerfile and push it to a repository in Amazon Elastic Container Registry (Amazon ECR). Next, you create a SageMaker image that points to the container image in Amazon ECR, and attach that image to your SageMaker domain. This makes the custom image available for launching a new session in RStudio.


To implement this solution, you must have the following prerequisites:

We provide more details on each in this section.

RStudio on SageMaker domain

If you have an existing SageMaker domain with RStudio enabled prior to April 7, 2022, you must delete and recreate the RStudioServerPro app under the user profile name domain-shared to get the latest updates for bring your own custom image capability. The AWS CLI commands are as follows. Note that this action interrupts RStudio users on SageMaker.

aws sagemaker delete-app –domain-id –app-type RStudioServerPro –app-name default –user-profile-name domain-shared aws sagemaker create-app –domain-id –app-type RStudioServerPro –app-name default –user-profile-name domain-shared

If this is your first time using RStudio on SageMaker, follow the step-by-step setup process described in Get started with RStudio on Amazon SageMaker, or run the following AWS CloudFormation template to set up your first RStudio on SageMaker domain. If you already have a working RStudio on SageMaker domain, you can skip this step.

The following RStudio on SageMaker CloudFormation template requires an RStudio license approved through AWS License Manager. For more about licensing, refer to RStudio license. Also note that only one SageMaker domain is permitted per AWS Region, so you’ll need to use an AWS account and Region that doesn’t have an existing domain.

  1. Choose Launch Stack.
    Launch stack button
    The link takes you to the us-east-1 Region, but you can change to your preferred Region.
  2. In the Specify template section, choose Next.
  3. In the Specify stack details section, for Stack name, enter a name.
  4. For Parameters, enter a SageMaker user profile name.
  5. Choose Next.
  6. In the Configure stack options section, choose Next.
  7. In the Review section, select I acknowledge that AWS CloudFormation might create IAM resources and choose Next.
  8. When the stack status changes to CREATE_COMPLETE, go to the Control Panel on the SageMaker console to find the domain and the new user.

IAM policies to interact with Amazon ECR

To interact with your private Amazon ECR repositories, you need the following IAM permissions in the IAM user or role you’ll use to build and push Docker images:

{ “Version”:”2012-10-17″, “Statement”:[ { “Sid”: “VisualEditor0”, “Effect”:”Allow”, “Action”:[ “ecr:CreateRepository”, “ecr:BatchGetImage”, “ecr:CompleteLayerUpload”, “ecr:DescribeImages”, “ecr:DescribeRepositories”, “ecr:UploadLayerPart”, “ecr:ListImages”, “ecr:InitiateLayerUpload”, “ecr:BatchCheckLayerAvailability”, “ecr:PutImage” ], “Resource”: “*” } ] }

To initially build from a public Amazon ECR image as shown in this post, you need to attach the AWS-managed AmazonElasticContainerRegistryPublicReadOnly policy to your IAM user or role as well.

To build a Docker container image, you can use either a local Docker client or the SageMaker Docker Build CLI tool from a terminal within RStudio on SageMaker. For the latter, follow the prerequisites in Using the Amazon SageMaker Studio Image Build CLI to build container images from your Studio notebooks to set up the IAM permissions and CLI tool.

AWS CLI versions

There are minimum version requirements for the AWS CLI tool to run the commands mentioned in this post. Make sure to upgrade AWS CLI on your terminal of choice:

  • AWS CLI v1 >= 1.23.6
  • AWS CLI v2 >= 2.6.2

Prepare a Dockerfile

You can customize your runtime environment in RStudio in a Dockerfile. Because the customization depends on your use case and requirements, we show you the essentials and the most common customizations in this example. You can download the full sample Dockerfile.

Install RStudio Workbench session components

The most important software to install in your custom container image is RStudio Workbench. We download from the public S3 bucket hosted by RStudio PBC. There are many version releases and OS distributions for use. The version of the installation needs to be compatible with the RStudio Workbench version used in RStudio on SageMaker, which is 1.4.1717-3 at the time of writing. The OS (argument OS in the following snippet) needs to match the base OS used in the container image. In our sample Dockerfile, the base image we use is Amazon Linux 2 from an AWS-managed public Amazon ECR repository. The compatible RStudio Workbench OS is centos7.

FROM … ARG RSW_VERSION=1.4.1717-3 ARG RSW_NAME=rstudio-workbench-rhel ARG OS=centos7 ARG RSW_DOWNLOAD_URL=${OS}/x86_64 RUN RSW_VERSION_URL=`echo -n “${RSW_VERSION}” | sed ‘s/+/-/g’` && curl -o rstudio-workbench.rpm ${RSW_DOWNLOAD_URL}/${RSW_NAME}-${RSW_VERSION_URL}-x86_64.rpm && yum install -y rstudio-workbench.rpm

You can find all the OS release options with the following command:

aws s3 ls s3://rstudio-ide-build/server/

Install R (and versions of R)

The runtime for your custom RStudio container image needs at least one version of R. We can first install a version of R and make it the default R by creating soft links to /usr/local/bin/:

# Install main R version ARG R_VERSION=4.1.3 RUN curl -O${R_VERSION}-1-1.x86_64.rpm && yum install -y R-${R_VERSION}-1-1.x86_64.rpm && yum clean all && rm -rf R-${R_VERSION}-1-1.x86_64.rpm RUN ln -s /opt/R/${R_VERSION}/bin/R /usr/local/bin/R && ln -s /opt/R/${R_VERSION}/bin/Rscript /usr/local/bin/Rscript

Data scientists often need multiple versions of R so that they can easily switch between projects and code base. RStudio on SageMaker supports easy switching between R versions, as shown in the following screenshot.

RStudio on SageMaker automatically scans and discovers versions of R in the following directories:

/usr/lib/R /usr/lib64/R /usr/local/lib/R /usr/local/lib64/R /opt/local/lib/R /opt/local/lib64/R /opt/R/* /opt/local/R/*

We can install more versions in the container image, as shown in the following snippet. They will be installed in /opt/R/.

RUN curl -O && yum install -y R-4.0.5-1-1.x86_64.rpm && yum clean all && rm -rf R-4.0.5-1-1.x86_64.rpm RUN curl -O && yum install -y R-3.6.3-1-1.x86_64.rpm && yum clean all && rm -rf R-3.6.3-1-1.x86_64.rpm RUN curl -O && yum install -y R-3.5.3-1-1.x86_64.rpm && yum clean all && rm -rf R-3.5.3-1-1.x86_64.rpm

Install RStudio Professional Drivers

Data scientists often need to access data from sources such as Amazon Athena and Amazon Redshift within RStudio on SageMaker. You can do so using RStudio Professional Drivers and RStudio Connections. Make sure you install the relevant libraries and drivers as shown in the following snippet:

# Install RStudio Professional Drivers —————————————-# RUN yum update -y && yum install -y unixODBC unixODBC-devel && yum clean all ARG DRIVERS_VERSION=2021.10.0-1 RUN curl -O${DRIVERS_VERSION}.el7.x86_64.rpm && yum install -y rstudio-drivers-${DRIVERS_VERSION}.el7.x86_64.rpm && yum clean all && rm -f rstudio-drivers-${DRIVERS_VERSION}.el7.x86_64.rpm && cp /opt/rstudio-drivers/odbcinst.ini.sample /etc/odbcinst.ini RUN /opt/R/${R_VERSION}/bin/R -e ‘install.packages(“odbc”, repos=””)’

Install custom libraries

You can also install additional R and Python libraries so that data scientists don’t need to install them on the fly:

RUN /opt/R/${R_VERSION}/bin/R -e “install.packages(c(‘reticulate’, ‘readr’, ‘curl’, ‘ggplot2’, ‘dplyr’, ‘stringr’, ‘fable’, ‘tsibble’, ‘dplyr’, ‘feasts’, ‘remotes’, ‘urca’, ‘sodium’, ‘plumber’, ‘jsonlite’), repos=’’)” RUN /opt/python/${PYTHON_VERSION}/bin/pip install –upgrade ‘boto3>1.0<2.0' 'awscli>1.0<2.0' 'sagemaker[local]<3' 'sagemaker-studio-image-build' 'numpy'

When you’ve finished your customization in a Dockerfile, it’s time to build a container image and push it to Amazon ECR.

Build and push to Amazon ECR

You can build a container image from the Dockerfile from a terminal where the Docker engine is installed, such as your local terminal or AWS Cloud9. If you’re building it from a terminal within RStudio on SageMaker, you can use SageMaker Studio Image Build. We demonstrate the steps for both approaches.

In a local terminal where the Docker engine is present, you can run the following commands from where the Dockerfile is. You can use the sample script

IMAGE_NAME=r-4.1.3-rstudio-1.4.1717-3 # the name for SageMaker Image REPO=rstudio-custom # ECR repository name TAG=$IMAGE_NAME # login to your Amazon ECR aws ecr get-login-password | docker login –username AWS –password-stdin ${ACCOUNT_ID}.dkr.ecr.${REGION} # create a repo aws ecr create-repository –repository-name ${REPO} # build a docker image and push it to the repo docker build . -t ${REPO}:${TAG} -t ${ACCOUNT_ID}.dkr.ecr.${REGION}${REPO}:${TAG} docker push ${ACCOUNT_ID}.dkr.ecr.${REGION}${REPO}:${TAG}

In a terminal on RStudio on SageMaker, run the following commands:

pip install sagemaker-studio-image-build sm-docker build . –repository ${REPO}:${IMAGE_NAME}

After these commands, you have a repository and a Docker container image in Amazon ECR for our next step, in which we attach the container image for use in RStudio on SageMaker. Note the image URI in Amazon ECR for later use.

Update RStudio on SageMaker through the console

RStudio on SageMaker allows runtime customization through the use of a custom SageMaker image. A SageMaker image is a holder for a set of SageMaker image versions. Each image version represents a container image that is compatible with RStudio on SageMaker and stored in an Amazon ECR repository. To make a custom SageMaker image available to all RStudio users within a domain, you can attach the image to the domain following the steps in this section.

  1. On the SageMaker console, navigate to the Custom SageMaker Studio images attached to domain page, and choose Attach image.
  2. Select New image, and enter your Amazon ECR image URI.
  3. Choose Next.
  4. In the Image properties section, provide an Image name (required), Image display name (optional), Description (optional), IAM role, and tags.
    The image display name, if provided, is shown in the session launcher in RStudio on SageMaker. If the Image display name field is left empty, the image name is shown in RStudio on SageMaker instead.
  5. Leave EFS mount path and Advanced configuration (User ID and Group ID) as default because RStudio on SageMaker manages the configuration for us.
  6. In the Image type section, select RStudio image.
  7. Choose Submit.

You can now see a new entry in the list. It’s worth noting that, with the introduction of the support of custom RStudio images, you can see a new Usage type column in the table to denote whether an image is an RStudio image or an Amazon SageMaker Studio image.

It may take up to 5–10 minutes for the custom images to be available in the session launcher UI. You can then launch a new R session in RStudio on SageMaker with your custom images.

Over time, you may want to retire old and outdated images. To remove the custom images from the list of custom images in RStudio, select the images in the list and choose Detach.

Choose Detach again to confirm.

Update RStudio on SageMaker via the AWS CLI

The following sections describe the steps to create a SageMaker image and attach it for use in RStudio on SageMaker on the SageMaker console and using the AWS CLI. You can use the sample script

Create the SageMaker image and image version

The first step is to create a SageMaker image from the custom container image in Amazon ECR by running the following two commands:

ROLE_ARN= DISPLAY_NAME=RSession-r-4.1.3-rstudio-1.4.1717-3 aws sagemaker create-image –image-name ${IMAGE_NAME} –display-name ${DISPLAY_NAME} –role-arn ${ROLE_ARN} aws sagemaker create-image-version –image-name ${IMAGE_NAME} –base-image “${ACCOUNT_ID}.dkr.ecr.${REGION}${REPO}:${TAG}”

Note that the custom image displayed in the session launcher in RStudio on SageMaker is determined by the input of –display-name. If the optional display name is not provided, the input of –image-name is used instead. Also note that the IAM role allows SageMaker to attach an Amazon ECR image to RStudio on SageMaker.

Create an AppImageConfig

In addition to a SageMaker image, which captures the image URI from Amazon ECR, an app image configuration (AppImageConfig) is required for use in a SageMaker domain. We simplify the configuration for an RSessionApp image so we can just create a placeholder configuration with the following command:

IMAGE_CONFIG_NAME=r-4-1-3-rstudio-1-4-1717-3 aws sagemaker create-app-image-config –app-image-config-name ${IMAGE_CONFIG_NAME}

Attach to a SageMaker domain

With the SageMaker image and the app image configuration created, we’re ready to attach the custom container image to the SageMaker domain. To make a custom SageMaker image available to all RStudio users within a domain, you attach the image to the domain as a default user setting. All existing users and any new users will be able to use the custom image.

For better readability, we place the following configuration into the JSON file default-user-settings.json:

“DefaultUserSettings”: { “RSessionAppSettings”: { “CustomImages”: [ { “ImageName”: “r-4.1.3-rstudio-2022”, “AppImageConfigName”: “r-4-1-3-rstudio-2022” }, { “ImageName”: “r-4.1.3-rstudio-1.4.1717-3”, “AppImageConfigName”: “r-4-1-3-rstudio-1-4-1717-3” } ] } } }

In this file, we can specify the image and AppImageConfig name pairs in a list in DefaultUserSettings.RSessionAppSettings.CustomImages. This preceding snippet assumes two custom images are being created.

Then run the following command to update the SageMaker domain:

aws sagemaker update-domain –domain-id –cli-input-json file://default-user-settings.json

After you update the domaim, it may take up to 5–10 minutes for the custom images to be available in the session launcher UI. You can then launch a new R session in RStudio on SageMaker with your custom images.

Detach images from a SageMaker domain

You can detach images simply by removing the ImageName and AppImageConfigName pairs from default-user-settings.json and updating the domain.

For example, updating the domain with the following default-user-settings.json removes r-4.1.3-rstudio-2022 from the R session launching UI and leaves r-4.1.3-rstudio-1.4.1717-3 as the only custom image available to all users in a domain:

{ “DefaultUserSettings”: { “RSessionAppSettings”: { “CustomImages”: [ { “ImageName”: “r-4.1.3-rstudio-1.4.1717-3”, “AppImageConfigName”: “r-4-1-3-rstudio-1-4-1717-3” } ] } } }

Clean up

To safely remove images and resources in the SageMaker domain, complete the following steps in Clean up image resources.

To safely remove the RStudio on SageMaker and the SageMaker domain, complete the following steps in Delete an Amazon SageMaker Domain to delete any RSessionGateway app, user profile and the domain.

To safely remove images and repositories in Amazon ECR, complete the following steps in Deleting an image.

Finally, to delete the CloudFormation template:

  1. On the AWS CloudFormation console, choose Stacks.
  2. Select the stack you deployed for this solution.
  3. Choose Delete.


RStudio on SageMaker makes it simple for data scientists to build ML and analytic solutions in R at scale, and for administrators to manage a robust data science environment for their developers. Data scientists want to customize the environment so that they can use the right libraries for the right job and achieve the desired reproducibility for each ML project. Administrators need to standardize the data science environment for regulatory and security reasons. You can now create custom container images that meet your organizational requirements and allow data scientists to use them in RStudio on SageMaker.

We encourage you to try it out. Happy developing!

About the Authors

Michael Hsieh is a Senior AI/ML Specialist Solutions Architect. He works with customers to advance their ML journey with a combination of AWS ML offerings and his ML domain knowledge. As a Seattle transplant, he loves exploring the great Mother Nature the city has to offer, such as the hiking trails, scenery kayaking in the SLU, and the sunset at Shilshole Bay.

Declan Kelly is a Software Engineer on the Amazon SageMaker Studio team. He has been working on Amazon SageMaker Studio since its launch at AWS re:Invent 2019. Outside of work, he enjoys hiking and climbing.

Sean MorganSean Morgan is an AI/ML Solutions Architect at AWS. He has experience in the semiconductor and academic research fields, and uses his experience to help customers reach their goals on AWS. In his free time, Sean is an active open-source contributor and maintainer, and is the special interest group lead for TensorFlow Add-ons.


Continue Reading


Diagnose model performance before deployment for Amazon Fraud Detector

With the growth in adoption of online applications and the rising number of internet users, digital fraud is on the rise year over year. Amazon Fraud Detector provides a fully managed service to help you better identify potentially fraudulent online activities using advanced machine learning (ML) techniques, and more than 20 years of fraud detection…




With the growth in adoption of online applications and the rising number of internet users, digital fraud is on the rise year over year. Amazon Fraud Detector provides a fully managed service to help you better identify potentially fraudulent online activities using advanced machine learning (ML) techniques, and more than 20 years of fraud detection expertise from Amazon.

To help you catch fraud faster across multiple use cases, Amazon Fraud Detector offers specific models with tailored algorithms, enrichments, and feature transformations. The model training is fully automated and hassle-free, and you can follow the instructions in the user guide or related blog posts to get started. However, with trained models, you need to decide whether the model is ready for deployment. This requires certain knowledge in ML, statistics, and fraud detection, and it may be helpful to know some typical approaches.

This post will help you to diagnose model performance and pick the right model for deployment. We walk through the metrics provided by Amazon Fraud Detector, help you diagnose potential issues, and provide suggestions to improve model performance. The approaches are applicable to both Online Fraud Insights (OFI) and Transaction Fraud Insights (TFI) model templates.

Solution overview

This post provides an end-to-end process to diagnose your model performance. It first introduces all the model metrics shown on the Amazon Fraud Detector console, including AUC, score distribution, confusion matrix, ROC curve, and model variable importance. Then we present a three-step approach to diagnose model performance using different metrics. Finally, we provide suggestions to improve model performance for typical issues.


Before diving deep into your Amazon Fraud Detector model, you need to complete the following prerequisites:

  1. Create an AWS account.
  2. Create an event dataset for model training.
  3. Upload your data to Amazon Simple Storage Service (Amazon S3) or ingest your event data into Amazon Fraud Detector.
  4. Build an Amazon Fraud Detector model.

Interpret model metrics

After model training is complete, Amazon Fraud Detector evaluates your model using part of the modeling data that wasn’t used in model training. It returns the evaluation metrics on the Model version page for that model. Those metrics reflect the model performance you can expect on real data after deploying to production.

The following screenshot shows example model performance returned by Amazon Fraud Detector. You can choose different thresholds on score distribution (left), and the confusion matrix (right) is updated accordingly.

You can use the following findings to check performance and decide on strategy rules:

  • AUC (area under the curve) – The overall performance of this model. A model with AUC of 0.50 is no better than a coin flip because it represents random chance, whereas a “perfect” model will have a score of 1.0. The higher AUC, the better your model can distinguish between frauds and legitimates.
  • Score distribution – A histogram of model score distributions assuming an example population of 100,000 events. Amazon Fraud Detector generates model scores between 0–1000, where the lower the score, the lower the fraud risk. Better separation between legitimate (green) and fraud (blue) populations typically indicates a better model. For more details, see Model scores.
  • Confusion matrix – A table that describes model performance for the selected given score threshold, including true positive, true negative, false positive, false negative, true positive rate (TPR), and false positive rate (FPR). The count on the table assumes an example population of 100,0000 events. For more details, see Model performance metrics.
  • ROC (Receiver Operator Characteristic) curve – A plot that illustrates the diagnostic ability of the model, as shown in the following screenshot. It plots the true positive rate as a function of false positive rate over all possible model score thresholds. View this chart by choosing Advanced Metrics. If you have trained multiple versions of one model, you can select different FPR thresholds to check the performance change.
  • Model variable importance – The rank of model variables based on their contribution to the generated model, as shown in the following screenshot. The model variable with the highest value is more important to the model than the other model variables in the dataset for that model version, and is listed at the top by default. For more details, see Model variable importance.

Diagnose model performance

Before deploying your model into production, you should use the metrics Amazon Fraud Detector returned to understand the model performance and diagnose the possible issues. The common problems of ML models can be divided into two main categories: data-related issues and model-related issues. Amazon Fraud Detector has taken care of the model-related issues by carefully using validation and testing sets to evaluate and tune your model on the backend. You can complete the following steps to validate if your model is ready for deployment or has possible data-related issues:

  1. Check overall model performance (AUC and score distribution).
  2. Review business requirements (confusion matrix and table).
  3. Check model variable importance.

Check overall model performance: AUC and score distribution

More accurate prediction of future events is always the primary goal of a predictive model. The AUC returned by Amazon Fraud Detector is calculated on a properly sampled test set not used in training. In general, a model with an AUC greater than 0.9 is considered to be a good model.

If you observe a model with performance less than 0.8, it usually means the model has room for improvement (we discuss common issues for low model performance later in this post). Note that the definition of “good” performance highly depends on your business and the baseline model. You can still follow the steps in this post to improve your Amazon Fraud Detector model even though its AUC is greater than 0.8.

On the other hand, if the AUC is over 0.99, it means the model can almost perfectly separate the fraud and legitimate events on the test set. This is sometimes a “too good to be true” scenario (we discuss common issues for very high model performance later in this post).

Besides the overall AUC, the score distribution can also tell you how well the model is fitted. Ideally, you should see the bulk of legitimate and fraud located on the two ends of the scale, which indicates the model score can accurately rank the events on the test set.

In the following example, the score distribution has an AUC of 0.96.

If the legitimate and fraud distribution overlapped or concentrated in the center, it probably means the model doesn’t perform well on distinguishing fraud events from legitimate events, which might indicate historical data distribution changed or that you need more data or features.

The following is an example of score distribution with an AUC of 0.64.

If you can find a split point that can almost perfectly split fraud and legitimate events, there is a high chance that the model has a label leakage issue or the fraud patterns are too easy to detect, which should catch your attention.

In the following example, the score distribution has an AUC of 1.0.

Review business requirements: Confusion matrix and table

Although AUC is a convenient indicator of model performance, it may not directly translate to your business requirement. Amazon Fraud Detector also provides metrics such as fraud capture rate (true positive rate), percentage of legitimate events that are incorrectly predicted as fraud (false positive rate), and more, which are more commonly used as business requirements. After you train a model with a reasonably good AUC, you need to compare the model with your business requirement with those metrics.

The confusion matrix and table provide you with an interface to review the impact and check if it meets your business needs. Note that the numbers depend on the model threshold, where events with scores larger than then threshold are classified as fraud and events with scores lower than the threshold are classified as legit. You can choose which threshold to use depending on your business requirements.

For example, if your goal is to capture 73% of frauds, then (as shown in the example below) you can choose a threshold such as 855, which allows you to capture 73% of all fraud. However, the model will also mis-classify 3% legitimate events to be fraudulent. If this FPR is acceptable for your business, then the model is good for deployment. Otherwise, you need to improve the model performance.

Another example is if the cost for blocking or challenging a legitimate customer is extremely high, then you want a low FPR and high precision. In that case, you can choose a threshold of 950, as shown in the following example, which will miss-classify 1% of legitimate customers as fraud, and 80% of identified fraud will actually be fraudulent.

In addition, you can choose multiple thresholds and assign different outcomes, such as block, investigate, pass. If you can’t find proper thresholds and rules that satisfy all your business requirements, you should consider training your model with more data and attributes.

Check model variable importance

The Model variable importance pane displays how each variable contributes to your model. If one variable has a significantly higher importance value than the others, it might indicate label leakage or that the fraud patterns are too easy to detect. Note that the variable importance is aggregated back to your input variables. If you observe slightly higher importance of IP_ADDRESS, CARD_BIN, EMAIL_ADDRESS, PHONE_NUMBER, BILLING_ZIP, or SHIPPING_ZIP, it might because of the power of enrichment.

The following example shows model variable importance with a potential label leakage using investigation_status.

Model variable importance also gives you hints of what additional variables could potentially bring lift to the model. For example, if you observe low AUC and seller-related features show high importance, you might consider collecting more order features such as SELLER_CATEGORY, SELLER_ADDRESS, and SELLER_ACTIVE_YEARS, and add those variables to your model.

Common issues for low model performance

In this section, we discuss common issues you may encounter regarding low model performance.

Historical data distribution changed

Historical data distribution drift happens when you have a big business change or a data collection issue. For example, if you recently launched your product in a new market, the IP_ADDRESS, EMAIL, and ADDRESS related features could be completely different, and the fraud modus operandi could also change. Amazon Fraud Detector uses EVENT_TIMESTAMP to split data and evaluate your model on the appropriate subset of events in your dataset. If your historical data distribution changes significantly, the evaluation set could be very different from the training data, and the reported model performance could be low.

You can check the potential data distribution change issue by exploring your historical data:

  1. Use the Amazon Fraud Detector Data Profiler tool to check if the fraud rate and the missing rate of the label changed over time.
  2. Check if the variable distribution over time changed significantly, especially for features with high variable importance.
  3. Check the variable distribution over time by target variables. If you observe significantly more fraud events from one category in recent data, you might want to check if the change is reasonable using your business judgments.

If you find the missing rate of the label is very high or the fraud rate consistently dropped during the most recent dates, it might be an indicator of labels not fully matured. You should exclude the most recent data or wait longer to collect the accurate labels, and then retrain your model.

If you observe a sharp spike of fraud rate and variables on specific dates, you might want to double-check if it is an outlier or data collection issue. In that case, you should delete those events and retrain the model.

If you find the outdated data can’t represent your current and future business, you should exclude the old period of data from training. If you’re using stored events in Amazon Fraud Detector, you can simply retrain a new version and select the proper date range while configuring the training job. That may also indicate that the fraud modus operandi in your business changes relatively quickly over time. After model deployment, you may need to re-train your model frequently.

Improper variable type mapping

Amazon Fraud Detector enriches and transforms the data based on the variable types. It’s important that you map your variables to the correct type so that Amazon Fraud Detector model can take the maximum value of your data. For example, if you map IP to the CATEGORICAL type instead of IP_ADDRESS, you don’t get the IP-related enrichments in the backend.

In general, Amazon Fraud Detector suggests the following actions:

  1. Map your variables to specific types, such as IP_ADDRESS, EMAIL_ADDRESS, CARD_BIN, and PHONE_NUMBER, so that Amazon Fraud Detector can extract and enrich additional information.
  2. If you can’t find the specific variable type, map it to one of the three generic types: NUMERIC, CATEGORICAL, or FREE_FORM_TEXT.
  3. If a variable is in text form and has high cardinality, such as a customer review or product description, you should map it to the FREE_FORM_TEXT variable type so that Amazon Fraud Detector extracts text features and embeddings on the backend for you. For example, if you map url_string to FREE_FORM_TEXT, it’s able to tokenize the URL and extract information to feed into the downstream model, which will help it learn more hidden patterns from the URL.

If you find any of your variable types are mapped incorrectly in variable configuration, you can change your variable type and then retrain the model.

Insufficient data or features

Amazon Fraud Detector requires at least 10,000 records to train an Online Fraud Insights (OFI) or Transaction Fraud Insights (TFI) model, with at least 400 of those records identified as fraudulent. TFI also requires that both fraudulent records and legitimate records come from at least 100 different entities each to ensure the diversity of the dataset. Additionally, Amazon Fraud Detector requires the modeling data to have at least two variables. Those are the minimum data requirements to build a useful Amazon Fraud Detector model. However, using more records and variables usually helps the ML models better learn the underlying patterns from your data. When you observe a low AUC or can’t find thresholds that meet your business requirement, you should consider retraining your model with more data or add new features to your model. Usually, we find EMAIL_ADDRESS, IP, PAYMENT_TYPE, BILLING_ADDRESS, SHIPPING_ADDRESS, and DEVICE related variables are important in fraud detection.

Another possible cause is that some of your variables contain too many missing values. To see if that is happening, check the model training messages and refer to Troubleshoot training data issues for suggestions.

Common issues for very high model performance

In this section, we discuss common issues related to very high model performance.

Label leakage

Label leakage occurs when the training datasets use information that would not be expected to be available at prediction time. It overestimates the model’s utility when run in a production environment.

High AUC (close to 1), perfectly separated score distribution, and significantly higher variable importance of one variable could be indicators of potential label leakage issues. You can also check the correlation between the features and the label using the Data Profiler. The Feature and label correlation plot shows the correlation between each feature and the label. If one feature has over 0.99 correlation with the label, you should check if the feature is used properly based on business judgments. For example, to build a risk model to approve or decline a loan application, you shouldn’t use the features like AMOUNT_PAID, because the payments happen after the underwriting process. If a variable isn’t available at the time you make prediction, you should remove that variable from model configuration and retrain a new model.

The following example shows the correlation between each variable and label. investigation_status has a high correlation (close to 1) with the label, so you should double-check if there is a label leakage issue.

Simple fraud patterns

When the fraud patterns in your data are simple, you might also observe very high model performance. For example, suppose all the fraud events in the modeling data come through the same Internal Service Provider; it’s straightforward for the model to pick the IP-related variables and return a “perfect” model with high importance of IP.

Simple fraud patterns don’t always indicate a data issue. It could be true that the fraud modus operandi in your business is easy to capture. However, before making a conclusion, you need to make sure the labels used in model training are accurate, and the modeling data covers as many fraud patterns as possible. For example, if you label your fraud events based on rules, such as labeling all applications from a specific BILLING_ZIP plus PRODUCT_CATEGORY as fraud, the model can easily catch those frauds by simulating the rules and achieving a high AUC.

You can check the label distribution across different categories or bins of each feature using the Data Profiler. For example, if you observe that most fraud events come from one or a few product categories, it might be an indicator of simple fraud patterns, and you need to confirm that it’s not a data collection or process mistake. If the feature is like CUSTOMER_ID, you should exclude the feature in model training.

The following example shows label distribution across different categories of product_category. All fraud comes from two product categories.

Improper data sampling

Improper data sampling may happen when you sampled and only sent part of your data to Amazon Fraud Detector. If the data isn’t sampled properly and isn’t representative of the traffic in production, the reported model performance will be inaccurate and the model could be useless for production prediction. For example, if all fraud events in the modeling data are sampled from Asia and all legit events are sampled from the US, the model might learn to separate fraud and legit based on BILLING_COUNTRY. In that case, the model is not generic to be applied to other populations.

Usually, we suggest sending all the latest events without sampling. Based on the data size and fraud rate, Amazon Fraud Detector does sampling before model training for you. If your data is too large (over 100 GB) and you decide to sample and send only a subset, you should randomly sample your data and make sure the sample is representative of the entire population. For TFI, you should sample your data by entity, which means if one entity is sampled, you should include all its history so that the entity level aggregates are calculated correctly. Note that if you only send a subset of data to Amazon Fraud Detector, the real-time aggregates during inference might be inaccurate if the previous events of the entities aren’t sent.

Another improper data sampling could be only using a short period of data, like one day’s data, to build the model. The data might be biased, especially if your business or fraud attacks have seasonality. We usually recommend including at least two cycles’ (such as 2 weeks or 2 months) worth of data in the modeling to ensure the diversity of fraud types.


After diagnosing and resolving all the potential issues, you should get a useful Amazon Fraud Detector model and be confident about its performance. For the next step, you can create a detector with the model and your business rules, and be ready to deploy it to production for a shadow mode evaluation.


How to exclude variables for model training

After the deep dive, you might identify a variable leak target information, and want to exclude it from model training. You can retrain a model version excluding the variables you don’t want by completing the following steps:

  1. On the Amazon Fraud Detector console, in the navigation pane, choose Models.
  2. On the Models page, choose the model you want to retrain.
  3. On the Actions menu, choose Train new version.
  4. Select the date range you want to use and choose Next.
  5. On the Configure training page, deselect the variable you don’t want to use in model training.
  6. Specify your fraud labels and legitimate labels and how you want Amazon Fraud Detector to use unlabeled events, then choose Next.
  7. Review the model configuration and choose Create and train model.

How to change event variable type

Variables represent data elements used in fraud prevention. In Amazon Fraud Detector, all variables are global and are shared across all events and models, which means one variable could be used in multiple events. For example, IP could be associated with sign-in events, and it could also be associated with transaction events. Naturally, Amazon Fraud Detector locked the variable type and data type once a variable is created. To delete an existing variable, you need to first delete all associated event types and models. You can check the resources associated with the specific variable by navigating to Amazon Fraud Detector, choosing Variables in the navigation pane, and choosing the variable name and Associated resources.

Delete the variable and all associated event types

To delete the variable, complete the following steps:

  1. On the Amazon Fraud Detector console, in the navigation pane, choose Variables.
  2. Choose the variable you want to delete.
  3. Choose Associated resources to view a list of all the event types used this variable.
    You need to delete those associated event types before deleting the variable.
  4. Choose the event types in the list to go to the associated event type page.
  5. Choose Stored events to check if any data is stored under this event type.
  6. If there are events stored in Amazon Fraud Detector, choose Delete stored events to delete the stored events.
    When the delete job is complete, the message “The stored events for this event type were successfully deleted” appears.
  7. Choose Associated resources.
    If detectors and models are associated with this event type, you need to delete those resources first.
  8. If detectors are associated, complete the following steps to delete all associated detectors:
    1. Choose the detector to go to the Detector details page.
    2. In the Model versions pane, choose the detector’s version.
    3. On the detector version page, choose Actions.
    4. If the detector version is active, choose Deactivate, choose Deactivate this detector version without replacing it with a different version, and choose Deactivate detector version.
    5. After the detector version is deactivated, choose Actions and then Delete.
    6. Repeat these steps to delete all detector versions.
    7. On the Detector details page, choose Associated rules.
    8. Choose the rule to delete.
    9. Choose Actions and Delete rule version.
    10. Enter the rule name to confirm and choose Delete version.
    11. Repeat these steps to delete all associated rules.
    12. After all detector versions and associated rules are deleted, go to the Detector details page, choose Actions, and choose Delete detector.
    13. Enter the detector’s name and choose Delete detector.
    14. Repeat these steps to delete the next detector.
  9. If any models are associated with the event type, complete the following steps to delete them:
    1. Choose the name of the model.
    2. In the Model versions pane, choose the version.
    3. If the model status is Active, choose Actions and Undeploy model version.
    4. Enter undeploy to confirm and choose Undeploy model version.
      The status changes to Undeploying. The process takes a few minutes to complete.
    5. After the status becomes Ready to deploy, choose Actions and Delete.
    6. Repeat these steps to delete all model versions.
    7. On the Model details page, choose Actions and Delete model.
    8. Enter the name of the model and choose Delete model.
    9. Repeat these steps to delete the next model.
  10. After all associated detectors and models are deleted, choose Actions and Delete event type on the Event details page.
  11. Enter the name of the event type and choose Delete event type.
  12. In the navigation pane, choose Variables, and choose the variable you want to delete.
  13. Repeat the earlier steps to delete all event types associated with the variable.
  14. On the Variable details page, choose Actions and Delete.
  15. Enter the name of the variable and choose Delete variable.

Create a new variable with the correct variable type

After you have deleted the variable and all associated event types, stored events, models, and detectors from Amazon Fraud Detector, you can create a new variable of the same name and map it to the correct variable type.

  1. On the Amazon Fraud Detector console, in the navigation pane, choose Variables.
  2. Choose Create.
  3. Enter the variable name you want to modify (the one you deleted earlier).
  4. Select the correct variable type you want to change to.
  5. Choose Create variable.

Upload data and retrain the model

After you update the variable type, you can upload the data again and train a new model. For instructions, refer to Detect online transaction fraud with new Amazon Fraud Detector features.

How to add new variables to an existing event type

To add new variables to the existing event type, complete the following steps:

  1. Add the new variables to the previous training CVS file.
  2. Upload the new training data file to an S3 bucket. Note the Amazon S3 location of your training file (for example, s3://bucketname/path/to/some/object.csv) and your role name.
  3. On the Amazon Fraud Detector console, in the navigation pane, choose Events.
  4. On the Event types page, choose the name of the event type you want to add variables.
  5. On the Event type details page, choose Actions, then Add variables.
  6. Under Choose how to define this event’s variables, choose Select variables from a training dataset.
  7. For IAM role, select an existing IAM role or create a new role to access data in Amazon S3.
  8. For Data location, enter the S3 location of the new training file and choose Upload.
    The new variables not present in the existing event type should show up in the list.
  9. Choose Add variables.

Now, the new variables have been added to the existing event type. If you’re using stored events in Amazon Fraud Detector, the new variables of the stored events are still missing. You need to import the training data with the new variables to Amazon Fraud Detector and then retrain a new model version. When uploading the new training data with the same EVENT_ID and EVENT_TIMESTAMP, the new event variables overwrite the previous event variables stored in Amazon Fraud Detector.

About the Authors

Julia Xu is a Research Scientist with Amazon Fraud Detector. She is passionate about solving customer challenges using Machine Learning techniques. In her free time, she enjoys hiking, painting, and exploring new coffee shops.

Hao Zhou is a Research Scientist with Amazon Fraud Detector. He holds a PhD in electrical engineering from Northwestern University, USA. He is passionate about applying machine learning techniques to combat fraud and abuse.

Abhishek Ravi is a Senior Product Manager with Amazon Fraud Detector. He is passionate about leveraging technical capabilities to build products that delight customers.


Continue Reading


Hyperparameter optimization for fine-tuning pre-trained transformer models from Hugging Face

Large attention-based transformer models have obtained massive gains on natural language processing (NLP). However, training these gigantic networks from scratch requires a tremendous amount of data and compute. For smaller NLP datasets, a simple yet effective strategy is to use a pre-trained transformer, usually trained in an unsupervised fashion on very large datasets, and fine-tune…




Large attention-based transformer models have obtained massive gains on natural language processing (NLP). However, training these gigantic networks from scratch requires a tremendous amount of data and compute. For smaller NLP datasets, a simple yet effective strategy is to use a pre-trained transformer, usually trained in an unsupervised fashion on very large datasets, and fine-tune it on the dataset of interest. Hugging Face maintains a large model zoo of these pre-trained transformers and makes them easily accessible even for novice users.

However, fine-tuning these models still requires expert knowledge, because they’re quite sensitive to their hyperparameters, such as learning rate or batch size. In this post, we show how to optimize these hyperparameters with the open-source framework Syne Tune for distributed hyperparameter optimization (HPO). Syne Tune allows us to find a better hyperparameter configuration that achieves a relative improvement between 1-4% compared to default hyperparameters on popular GLUE benchmark datasets. The choice of the pre-trained model itself can also be considered a hyperparameter and therefore be automatically selected by Syne Tune. On a text classification problem, this leads to an additional boost in accuracy of approximately 5% compared to the default model. However, we can automate more decisions a user needs to make; we demonstrate this by also exposing the type of instance as a hyperparameter that we later use to deploy the model. By selecting the right instance type, we can find configurations that optimally trade off cost and latency.

For an introduction to Syne Tune please refer to Run distributed hyperparameter and neural architecture tuning jobs with Syne Tune.

Hyperparameter optimization with Syne Tune

We will use the GLUE benchmark suite, which consists of nine datasets for natural language understanding tasks, such as textual entailment recognition or sentiment analysis. For that, we adapt Hugging Face’s training script. GLUE datasets come with a predefined training and evaluation set with labels as well as a hold-out test set without labels. Therefore, we split the training set into a training and validation sets (70%/30% split) and use the evaluation set as our holdout test dataset. Furthermore, we add another callback function to Hugging Face’s Trainer API that reports the validation performance after each epoch back to Syne Tune. See the following code:

import transformers from import Reporter class SyneTuneReporter(transformers.trainer_callback.TrainerCallback): def __init__(self): = Reporter() def on_evaluate(self, args, state, control, **kwargs): results = kwargs[‘metrics’].copy() results[‘step’] = state.global_step results[‘epoch’] = int(state.epoch)**results)

We start with optimizing typical training hyperparameters: the learning rate, warmup ratio to increase the learning rate, and the batch size for fine-tuning a pretrained BERT (bert-base-cased) model, which is the default model in the Hugging Face example. See the following code:

config_space = dict() config_space[‘learning_rate’] = loguniform(1e-6, 1e-4) config_space[‘per_device_train_batch_size’] = randint(16, 48) config_space[‘warmup_ratio’] = uniform(0, 0.5)

As our HPO method, we use ASHA, which samples hyperparameter configurations uniformly at random and iteratively stops the evaluation of poorly performing configurations. Although more sophisticated methods utilize a probabilistic model of the objective function, such as BO or MoBster exists, we use ASHA for this post because it comes without any assumptions on the search space.

In the following figure, we compare the relative improvement in test error over Hugging Faces’ default hyperparameter configuration.

For simplicity, we limit the comparison to MRPC, COLA, and STSB, but we also observe similar improvements also for other GLUE datasets. For each dataset, we run ASHA on a single ml.g4dn.xlarge Amazon SageMaker instance with a runtime budget of 1,800 seconds, which corresponds to approximately 13, 7, and 9 full function evaluations on these datasets, respectively. To account for the intrinsic randomness of the training process, for example caused by the mini-batch sampling, we run both ASHA and the default configuration for five repetitions with an independent seed for the random number generator and report the average and standard deviation of the relative improvement across the repetitions. We can see that, across all datasets, we can in fact improve predictive performance by 1-3% relative to the performance of the carefully selected default configuration.

Automate selecting the pre-trained model

We can use HPO to not only find hyperparameters, but also automatically select the right pre-trained model. Why do we want to do this? Because no a single model outperforms across all datasets, we have to select the right model for a specific dataset. To demonstrate this, we evaluate a range of popular transformer models from Hugging Face. For each dataset, we rank each model by its test performance. The ranking across datasets (see the following Figure) changes and not one single model that scores the highest on every dataset. As reference we also show the absolute test performance of each model and dataset in the following figure.

To automatically select the right model, we can cast the choice of the model as categorical parameters and add this to our hyperparameter search space:

config_space[‘model_name_or_path’] = choice([‘bert-base-cased’, ‘bert-base-uncased’, ‘distilbert-base-uncased’, ‘distilbert-base-cased’, ‘roberta-base’, ‘albert-base-v2’, ‘distilroberta-base’, ‘xlnet-base-cased’, ‘albert-base-v1’])

Although the search space is now larger, that doesn’t necessarily mean that it’s harder to optimize. The following figure shows the test error of the best observed configuration (based on the validation error) on the MRPC dataset of ASHA over time when we search in the original space (blue line) (with a BERT-base-cased pre-trained model) or in the new augmented search space (orange line). Given the same budget, ASHA is able to find a much better performing hyperparameter configuration in the extended search space than in the smaller space.

Automate selecting the instance type

In practice, we might not just care about optimizing predictive performance. We might also care about other objectives, such as training time, (dollar) cost, latency, or fairness metrics. We also need to make other choices beyond the hyperparameters of the model, for example selecting the instance type.

Although the instance type doesn’t influence predictive performance, it strongly impacts the (dollar) cost, training runtime, and latency. The latter becomes particularly important when the model is deployed. We can phrase HPO as a multi-objective optimization problem, where we aim to optimize multiple objectives simultaneously. However, no single solution optimizes all metrics at the same time. Instead, we aim to find a set of configurations that optimally trade off one objective vs. the other. This is called the Pareto set.

To analyze this setting further, we add the choice of the instance type as an additional categorical hyperparameter to our search space:

config_space[‘st_instance_type’] = choice([‘ml.g4dn.xlarge’, ‘ml.g4dn.2xlarge’, ‘ml.p2.xlarge’, ‘ml.g4dn.4xlarge’, ‘ml.g4dn.8xlarge’, ‘ml.p3.2xlarge’])

We use MO-ASHA, which adapts ASHA to the multi-objective scenario by using non-dominated sorting. In each iteration, MO-ASHA also selects for each configuration also the type of instance we want to evaluate it on. To run HPO on a heterogeneous set of instances, Syne Tune provides the SageMaker backend. With this backend, each trial is evaluated as an independent SageMaker training job on its own instance. The number of workers defines how many SageMaker jobs we run in parallel at a given time. The optimizer itself, MO-ASHA in our case, runs either on the local machine, a Sagemaker notebook or on a separate SageMaker training job. See the following code:

backend = SageMakerBackend( sm_estimator=HuggingFace( entry_point=str(‘’), source_dir=os.getcwd(), base_job_name=’glue-moasha’, # instance-type given here are override by Syne Tune with values sampled from `st_instance_type`. instance_type=’ml.m5.large’, instance_count=1, py_version=”py38″, pytorch_version=’1.9′, transformers_version=’4.12′, max_run=3600, role=get_execution_role(), ), )

The following figures show the latency vs test error on the left and latency vs cost on the right for random configurations sampled by MO-ASHA (we limit the axis for visibility) on the MRPC dataset after running it for 10,800 seconds on four workers. Color indicates the instance type. The dashed black line represents the Pareto set, meaning the set of points that dominate all other points in at least one objective.

We can observe a trade-off between latency and test error, meaning the best configuration with the lowest test error doesn’t achieve the lowest latency. Based on your preference, you can select a hyperparameter configuration that sacrifices on test performance but comes with a smaller latency. We also see the trade off between latency and cost. By using a smaller ml.g4dn.xlarge instance, for example, we only marginally increase latency, but pay a fourth of the cost of an ml.g4dn.8xlarge instance.


In this post, we discussed hyperparameter optimization for fine-tuning pre-trained transformer models from Hugging Face based on Syne Tune. We saw that by optimizing hyperparameters such as learning rate, batch size, and the warm-up ratio, we can improve upon the carefully chosen default configuration. We can also extend this by automatically selecting the pre-trained model via hyperparameter optimization.

With the help of Syne Tune’s SageMaker backend, we can treat the instance type as an hyperparameter. Although the instance type doesn’t affect performance, it has a significant impact on the latency and cost. Therefore, by casting HPO as a multi-objective optimization problem, we’re able to find a set of configurations that optimally trade off one objective vs. the other. If you want to try this out yourself, check out our example notebook.

About the Authors

Aaron Klein is an Applied Scientist at AWS.

Matthias Seeger is a Principal Applied Scientist at AWS.

David Salinas is a Sr Applied Scientist at AWS.

Emily Webber joined AWS just after SageMaker launched, and has been trying to tell the world about it ever since! Outside of building new ML experiences for customers, Emily enjoys meditating and studying Tibetan Buddhism.

Cedric Archambeau is a Principal Applied Scientist at AWS and Fellow of the European Lab for Learning and Intelligent Systems.


Continue Reading


Copyright © 2021 Today's Digital.