Connect with us


Detect mitotic figures in whole slide images with Amazon Rekognition

Even after more than a hundred years after its introduction, histology remains the gold standard in tumor diagnosis and prognosis. Anatomic pathologists evaluate histology to stratify cancer patients into different groups depending on their tumor genotypes and phenotypes, and their clinical outcome [1,2]. However, human evaluation of histological slides is subjective and not repeatable [3].…



[]Even after more than a hundred years after its introduction, histology remains the gold standard in tumor diagnosis and prognosis. Anatomic pathologists evaluate histology to stratify cancer patients into different groups depending on their tumor genotypes and phenotypes, and their clinical outcome [1,2]. However, human evaluation of histological slides is subjective and not repeatable [3]. Furthermore, histological assessment is a time-consuming process that requires highly trained professionals.

[]With significant technological advances in the last decade, techniques such as whole slide imaging (WSI) and deep learning (DL) are now widely available. WSI is the scanning of conventional microscopy glass slides to produce a single, high-resolution image from those slides. This allows for the digitization and collection of large sets of pathology images, which would have been prohibitively time-consuming and expensive. The availability of such datasets creates new and innovative ways of accelerating diagnosis by using techniques such as machine learning (ML) to aid pathologists to accelerate diagnoses by quickly identifying features of interest.

[]In this post, we will explore how developers without previous ML experience can use Amazon Rekognition Custom Labels to train a model that classifies cellular features. Amazon Rekognition Custom Labels is a feature of Amazon Rekognition that enables you to build your own specialized ML-based image analysis capabilities to detect unique objects and scenes integral to your specific use case. In particular, we use a dataset containing whole slide images of canine mammary carcinoma [1] to demonstrate how to process these images and train a model that detects mitotic figures. This dataset has been used with permission from Prof. Dr. Marc Aubreville, who has kindly agreed to allow us to use it for this post. For more information, see the Acknowledgments section at the end of this post.

Solution Overview

[]The solution consists of two components:

  • An Amazon Rekognition Custom Labels model — To enable Amazon Rekognition to detect mitotic figures, we complete the following steps:
    • Sample the WSI dataset to produce adequately sized images using Amazon SageMaker Studio and a Python code running on a Jupyter notebook. Studio is a web-based, integrated development environment (IDE) for ML that provides all the tools you need to take your models from experimentation to production while boosting your productivity. We will use Studio to split the images into smaller ones to train our model.
    • Train a Amazon Rekognition Custom Labels model to recognize mitotic figures in hematoxylin-eosin samples using the data prepared in the previous step.
  • A frontend application — To demonstrate how to use a model like the one we trained in the previous step, we complete the following steps:

[]The following diagram illustrates the solution architecture.


[]All the necessary resources to deploy the implementation discussed in this post and the code for the whole section are available on GitHub. You can clone or fork the repository, make any changes you desire, and run it yourself.

[]In the next steps, we walk through the code to understand the different steps involved in obtaining and preparing the data, training the model, and using it from a sample application.


[]When running the steps in this walkthrough, you incur small costs from using the following AWS services:

  • Amazon Rekognition
  • AWS Fargate
  • Application Load Balancer
  • AWS Secrets Manager

[]Additionally, if no longer within the Free Tier period or conditions, you may incur costs from the following services:

  • CodePipeline
  • CodeBuild
  • Amazon ECR
  • Amazon SageMaker

[]If you complete the cleanup steps correctly after finishing this walkthrough, you may expect costs to be less than 10 USD, if the Amazon Rekognition Custom Labels model and the web application run for one hour or less.


[]To complete all steps, you need the following:

Training the mitotic figure classification model

[]We run all the steps required to train the model from a Studio notebook. If you have never used Studio before, you may need to onboard first. For more information, see Onboard Quickly to Amazon SageMaker Studio.

[]Some of the following steps require more RAM than what is available in a standard ml.t3.medium notebook. Make sure that you have selected an ml.m5.large notebook. You should see a 2 vCPU + 8 GiB indication on the upper right corner of the page.


[]The code for this section is available as a Jupyter notebook file.

[]After onboarding to Studio, follow these instructions to grant Studio the necessary permissions to call Amazon Rekognition on your behalf.


[]To begin with, we need to complete the following steps:

  1. Update Linux packages and install the required dependencies, such as OpenSlide: !apt update > /dev/null && apt dist-upgrade -y > /dev/null !apt install -y build-essential openslide-tools python-openslide libgl1-mesa-glx > /dev/null
  2. Install the fastai and SlideRunner libraries using pip: !pip install SlideRunner SlideRunner_dataAccess fastai==1.0.61 > /dev/null
  3. Download the dataset (we provide a script to do this automatically): from dataset import download_dataset download_dataset()

Process the dataset

[]We will begin by importing some of the packages that we use throughout the data preparation stage. Then, we download and load the annotation database for this dataset. This database contains the positions in the whole slide images of the mitotic figures (the features we want to classify). See the following code:

%reload_ext autoreload %autoreload 2 import os from typing import List import urllib import numpy as np from SlideRunner.dataAccess.database import Database from pathlib import Path DATABASE_URL = ‘’ DATABASE_FILENAME = ‘MITOS_WSI_CMC_MEL.sqlite’ Path(“./databases”).mkdir(parents=True, exist_ok=True) local_filename, headers = urllib.request.urlretrieve( DATABASE_URL, filename=os.path.join(‘databases’, DATABASE_FILENAME), ) []Because we’re using SageMaker, we create a new SageMaker session object to ease tasks such as uploading our dataset to an Amazon Simple Storage Service (Amazon S3) bucket. We also use the S3 bucket that SageMaker creates by default to upload our processed image files.

[]The slidelist_test array contains the IDs of the slides that we use as part of the test dataset to evaluate the performance of the trained model. See the following code:

import sagemaker sm_session = sagemaker.Session() size=512 bucket_name = sm_session.default_bucket() database = Database()‘databases’, DATABASE_FILENAME)) slidelist_test = [’14’,’18’,’3′,’22’,’10’,’15’,’21’] []The next step is to obtain a set of areas of training and test slides, along with the labels in them, from which we can take smaller areas to use to train our model. The code for get_slides is in the file in GitHub.

from sampling import get_slides image_size = 512 lbl_bbox, training_slides, test_slides, files = get_slides(database, slidelist_test, negative_class=1, size=image_size) []We want to randomly sample from the training and test slides. We use the lists of training and test slides and randomly select n_training_images times a file for training, and n_test_images times a file for test:

n_training_images = 500 n_test_images = int(0.2 * n_training_images) training_files = list([ (y, files[y]) for y in np.random.choice( [x for x in training_slides], n_training_images) ]) test_files = list([ (y, files[y]) for y in np.random.choice( [x for x in test_slides], n_test_images) ]) []Next, we create a directory for training images and one for test images:

Path(“rek_slides/training”).mkdir(parents=True, exist_ok=True) Path(“rek_slides/test”).mkdir(parents=True, exist_ok=True) []Before we produce the smaller images needed to train the model, we need some helper code that produces the metadata needed to describe the training and test data. The following code makes sure that a given bounding box surrounding the features of interest (mitotic figures) are well within the zone we’re cutting, and produces a line of JSON that describes the image and the features in it in Amazon SageMaker Ground Truth format, which is the format Amazon Rekognition Custom Labels requires. For more information about this manifest file for object detection, see Object localization in manifest files.

def check_bbox(x_start: int, y_start: int, bbox) -> bool: return (bbox._left > x_start and bbox._right < x_start + image_size and bbox._top > y_start and bbox._bottom < y_start + image_size) def get_annotation_json_line(filename, channel, annotations, labels): objects = list([{'confidence' : 1} for i in range(0, len(annotations))]) return json.dumps({ 'source-ref': f's3://{bucket_name}/data/{channel}/{filename}', 'bounding-box': { 'image_size': [{ 'width': size, 'height': size, 'depth': 3 }], 'annotations': annotations, }, 'bounding-box-metadata': { 'objects': objects, 'class-map': dict({ x: str(x) for x in labels }), 'type': 'groundtruth/object-detection', 'human-annotated': 'yes', 'creation-date':, 'job-name': 'rek-pathology', } }) def generate_annotations(x_start: int, y_start: int, bboxes, labels, filename: str, channel: str): annotations = [] for bbox in bboxes: if check_bbox(x_start, y_start, bbox): # Get coordinates relative to this slide. x0 = bbox.left - x_start y0 = - y_start annotation = { 'class_id': 1, 'top': y0, 'left': x0, 'width': bbox.right - bbox.left, 'height': bbox.bottom - } annotations.append(annotation) return get_annotation_json_line(filename, channel, annotations, labels) []With the generate_annotations function in place, we can write the code to produce the training and test images:

import datetime import json import random from fastai import * from import * from tqdm.notebook import tqdm # Margin size, in pixels, for training images. This is the space we leave on # each side for the bounding box(es) to be well into the image. margin_size = 64 training_annotations = [] test_annotations = [] def check_bbox(x_start: int, y_start: int, bbox) -> bool: return (bbox._left > x_start and bbox._right < x_start + image_size and bbox._top > y_start and bbox._bottom < y_start + image_size) def generate_images(file_list) -> None: for f_idx in tqdm(range(0, len(file_list)), desc=’Writing training images…’): slide_idx, f = file_list[f_idx] bboxes = lbl_bbox[slide_idx][0] labels = lbl_bbox[slide_idx][1] # Calculate the minimum and maximum horizontal and vertical positions # that bounding boxes should have within the image. x_min = min(map(lambda x: x.left, bboxes)) – margin_size y_min = min(map(lambda x:, bboxes)) – margin_size x_max = max(map(lambda x: x.right, bboxes)) + margin_size y_max = max(map(lambda x: x.bottom, bboxes)) + margin_size result = False while not result: x_start = random.randint(x_min, x_max – image_size) y_start = random.randint(y_min, y_max – image_size) for bbox in bboxes: if check_bbox(x_start, y_start, bbox): result = True break filename = f’slide_{f_idx}.png’ channel = ‘test’ if slide_idx in test_slides else ‘training’ annotation = generate_annotations(x_start, y_start, bboxes, labels, filename, channel) if channel == ‘training’: training_annotations.append(annotation) else: test_annotations.append(annotation) img = Image(pil2tensor(f.get_patch(x_start, y_start) / 255., np.float32))’rek_slides/{channel}/{filename}’) generate_images(training_files) generate_images(test_files) []The last step towards having all of the required data is to write a manifest.json file for each of the datasets:

with open(‘rek_slides/training/manifest.json’, ‘w’) as mf: mf.write(“n”.join(training_annotations)) with open(‘rek_slides/test/manifest.json’, ‘w’) as mf: mf.write(“n”.join(test_annotations))

Transfer the files to S3

[]We use the upload_data method that the SageMaker session object exposes to upload the images and manifest files to the default SageMaker S3 bucket:

import sagemaker sm_session = sagemaker.Session() data_location = sm_session.upload_data( ‘./rek_slides’, bucket=bucket_name, )

Train an Amazon Rekognition Custom Labels model

[]With the data already in Amazon S3, we can get to training a custom model. We use the Boto3 library to create an Amazon Rekognition client and create a project:

import boto3 project_name = ‘rek-mitotic-figures-workshop’ rek = boto3.client(‘rekognition’) response = rek.create_project(ProjectName=project_name) # If you have already created the project, use the describe_projects call to # retrieve the project ARN. # response = rek.describe_projects()[‘ProjectDescriptions’][0] project_arn = response[‘ProjectArn’] []With the project ready to use, now you need a project version that points to the training and test datasets in Amazon S3. Each version ideally points to different datasets (or different versions of it). This enables us to have different versions of a model, compare their performance, and switch between them as needed. See the following code:

version_name = ‘1’ output_config = { ‘S3Bucket’: bucket_name, ‘S3KeyPrefix’: ‘output’, } training_dataset = { ‘Assets’: [ { ‘GroundTruthManifest’: { ‘S3Object’: { ‘Bucket’: bucket_name, ‘Name’: ‘data/training/manifest.json’ } }, }, ] } testing_dataset = { ‘Assets’: [ { ‘GroundTruthManifest’: { ‘S3Object’: { ‘Bucket’: bucket_name, ‘Name’: ‘data/test/manifest.json’ } }, }, ] } def describe_project_versions(): describe_response = rek.describe_project_versions( ProjectArn=project_arn, VersionNames=[version_name], ) for model in describe_response[‘ProjectVersionDescriptions’]: print(f”Status: {model[‘Status’]}”) print(f”Message: {model[‘StatusMessage’]}”) return describe_response response = rek.create_project_version( VersionName=version_name, ProjectArn=project_arn, OutputConfig=output_config, TrainingData=training_dataset, TestingData=testing_dataset, ) waiter = rek.get_waiter(‘project_version_training_completed’) waiter.wait( ProjectArn=project_arn, VersionNames=[version_name], ) describe_response = describe_project_versions() []After we create the project version, Amazon Rekognition automatically starts the training process. The training time depends on several features, such as the size of the images and the number of them, the number of classes, and so on. In this case, for 500 images, the training takes about 90 minutes to finish.

Test the model

[]After training, every model in Amazon Rekognition Custom Labels is in the STOPPED state. To use it for inference, you need to start it. We get the project version ARN from the project version description and pass it over to the start_project_version. Notice the MinInferenceUnits parameter — we start with one inference unit. The actual maximum number of transactions per second (TPS) that this inference unit supports depends on the complexity of your model. To learn more about TPS, refer to this blog post.

model_arn = describe_response[‘ProjectVersionDescriptions’][0][‘ProjectVersionArn’] response = rek.start_project_version( ProjectVersionArn=model_arn, MinInferenceUnits=1, ) waiter = rek.get_waiter(‘project_version_running’) waiter.wait( ProjectArn=project_arn, VersionNames=[version_name], ) []When your project version is listed as RUNNING, you can start sending images to Amazon Rekognition for inference.


[]We use one of the files in the test dataset to test the newly started model. You can use any suitable PNG or JPEG file instead.

from matplotlib import pyplot as plt from PIL import Image, ImageDraw # We’ll use one of our test images to try out our model. with open(‘./rek_slides/test/slide_0.png’, ‘rb’) as image_file: # Send the image data to the model. response = rek.detect_custom_labels( ProjectVersionArn=model_arn, Image={ ‘Bytes’: image_bytes } ) img = draw = ImageDraw.Draw(img) for custom_label in response[‘CustomLabels’]: geometry = custom_label[‘Geometry’][‘BoundingBox’] w = geometry[‘Width’] * img.width h = geometry[‘Height’] * img.height l = geometry[‘Left’] * img.width t = geometry[‘Top’] * img.height draw.rectangle([l, t, l + w, t + h], outline=(0, 0, 255, 255), width=5) plt.imshow(np.asarray(img))

Streamlit application

[]To demonstrate the integration with Amazon Rekognition, we use a very simple Python application. We use the Streamlit library to build a spartan user interface, where we prompt the user to upload an image file.

[]We use the Boto3 library and the detect_custom_labels method, together with the project version ARN, to invoke the inference endpoint. The response is a JSON document that contains the positions and classes of the different objects detected in the image. In our case, these are the mitotic figures that the algorithm has found in the image we sent to the endpoint. See the following code:

import os import boto3 import io import streamlit as st from PIL import Image, ImageDraw rek_client = boto3.client(‘rekognition’) uploaded_file = st.file_uploader(‘Image file’) if uploaded_file is not None: image_bytes = result = rek_client.detect_custom_labels( ProjectVersionArn=’‘, Image={ ‘Bytes’: image_bytes } ) img = draw = ImageDraw.Draw(img) st.write(result[‘CustomLabels’]) for custom_label in result[‘CustomLabels’]: st.write(f”Label {custom_label[‘Name’]}, confidence {custom_label[‘Confidence’]}”) geometry = custom_label[‘Geometry’][‘BoundingBox’] w = geometry[‘Width’] * img.width h = geometry[‘Height’] * img.height l = geometry[‘Left’] * img.width t = geometry[‘Top’] * img.height st.write(f”Left, top = ({l}, {t}), width, height = ({w}, {h})”) draw.rectangle([l, t, l + w, t + h], outline=(0, 0, 255, 255), width=5) st_img = st.image(img)

Deploy the application to AWS

[]To deploy the application, we use an AWS CDK script. The whole project can be found on GitHub . Let’s look at the different resources deployed by the script.

Create an Amazon ECR repository

[]As the first step towards setting up our deployment, we create an Amazon ECR repository, where we can store our application container images:

aws ecr create-repository –repository-name rek-wsi

Create and store your GitHub token in AWS Secrets Manager

[]CodePipeline needs a GitHub Personal Access Token to monitor your GitHub repository for changes and pull code. To create the token, follow the instructions in the GitHub documentation. The token requires the following GitHub scopes:

  • The repo scope, which is used for full control to read and pull artifacts from public and private repositories into a pipeline.
  • The admin:repo_hook scope, which is used for full control of repository hooks.

[]After creating the token, store it in a new secret in AWS Secrets Manager as follows:

aws secretsmanager create-secret –name rek-wsi/github –secret-string “{“oauthToken”:”YOUR-TOKEN-VALUE-HERE”}”

Write configuration parameters to AWS Systems Manager Parameter Store

[]The AWS CDK script reads some configuration parameters from AWS Systems Manager Parameter Store, such as the name and owner of the GitHub repository, and target account and Region. Before launching the AWS CDK script, you need to create these parameters in your own account.

[]You can do that by using the AWS CLI. Simply invoke the put-parameter command with a name, a value, and the type of the parameter:

aws ssm put-parameter –name –value –type []The following is a list of all parameters required by the AWS CDK script. All of them are of type String:

  • /rek_wsi/prod/accountId — The ID of the account where we deploy the application.
  • /rek_wsi/prod/ecr_repo_name — The name of the Amazon ECR repository where the container images are stored.
  • /rek_wsi/prod/github/branch — The branch in the GitHub repository from which CodePipeline needs to pull the code.
  • /rek_wsi/prod/github/owner — The owner of the GitHub repository.
  • /rek_wsi/prod/github/repo — The name of the GitHub repository where our code is stored.
  • /rek_wsi/prod/github/token — The name or ARN of the secret in Secrets Manager that contains your GitHub authentication token. This is necessary for CodePipeline to be able to communicate with GitHub.
  • /rek_wsi/prod/region — The region where we will deploy the application.

[]Notice the prod segment in all parameter names. Although we do not need this level of detail for such a simple example, it will enable to reuse this approach with other projects where different environments may be necessary.

Resources created by the AWS CDK script

[]We need our application, running in a Fargate task, to have permissions to invoke Amazon Rekognition. So we first create an AWS Identity and Access Management (IAM) Task Role with the RekognitionReadOnlyPolicy policy attached to it. Notice that the assumed_by parameter in the following code takes the service principal. This is because we’re using Amazon ECS as the orchestrator, so we need Amazon ECS to assume the role and pass the credentials to the Fargate task.

streamlit_task_role = iam.Role( self, ‘StreamlitTaskRole’, assumed_by=iam.ServicePrincipal(‘’), description=’ECS Task Role assumed by the Streamlit task deployed to ECS+Fargate’, managed_policies=[ iam.ManagedPolicy.from_managed_policy_arn( self, ‘RekognitionReadOnlyPolicy’, managed_policy_arn=’arn:aws:iam::aws:policy/AmazonRekognitionReadOnlyAccess’ ), ], ) []Once built, our application container image sits in a private Amazon ECR repository. We need an object that describes it that we can pass when creating the Fargate service:

ecs_container_image = ecs.ContainerImage.from_ecr_repository( repository=ecr.Repository.from_repository_name(self, ‘ECRRepo’, ‘rek-wsi’), tag=’latest’ ) []We create a new VPC and cluster for this application. You can modify this part to use your own VPC by using the from_lookup method of the Vpc class:

vpc = ec2.Vpc(self, ‘RekWSI’, max_azs=3) cluster = ecs.Cluster(self, ‘RekWSICluster’, vpc=vpc) []Now that we have a VPC and cluster to deploy to, we create the Fargate service. We use 0.25 vCPU and 512 MB RAM for this task, and we place a public Application Load Balancer (ALB) in front of it. Once deployed, we use the ALB CNAME to access the application. See the following code:

fargate_service = ecs_patterns.ApplicationLoadBalancedFargateService( self, ‘RekWSIECSApp’, cluster=cluster, cpu=256, memory_limit_mib=512, desired_count=1, task_image_options=ecs_patterns.ApplicationLoadBalancedTaskImageOptions( image=ecs_container_image, container_port=8501, task_role=streamlit_task_role, ), public_load_balancer=True, ) []To automatically build and deploy a new container image every time we push code to our main branch, we create a simple pipeline consisting of a GitHub source action and a build step. Here is where we use the secrets we stored in AWS Secrets Manager and AWS Systems Manager Parameter Store in the previous steps.

pipeline = codepipeline.Pipeline(self, ‘RekWSIPipeline’) # Create an artifact that points at the code pulled from GitHub. source_output = codepipeline.Artifact() # Create a source stage that pulls the code from GitHub. The repo parameters are # stored in SSM, and the OAuth token in Secrets Manager. source_action = codepipeline_actions.GitHubSourceAction( action_name=’GitHub’, output=source_output, oauth_token=SecretValue.secrets_manager( ssm.StringParameter.value_from_lookup(self, ‘/rek_wsi/prod/github/token’), json_field=’oauthToken’), trigger=codepipeline_actions.GitHubTrigger.WEBHOOK, owner=ssm.StringParameter.value_from_lookup(self, ‘/rek_wsi/prod/github/owner’), repo=ssm.StringParameter.value_from_lookup(self, ‘/rek_wsi/prod/github/repo’), branch=ssm.StringParameter.value_from_lookup(self, ‘/rek_wsi/prod/github/branch’), ) # Add the source stage to the pipeline. pipeline.add_stage( stage_name=’GitHub’, actions=[source_action] ) []CodeBuild needs permissions to push container images to Amazon ECR. To grant these permissions, we add the AmazonEC2ContainerRegistryFullAccess policy to a bespoke IAM role that the CodeBuild service principal can assume:

# Create an IAM role that grants CodeBuild access to Amazon ECR to push containers. build_role = iam.Role( self, ‘RekWsiCodeBuildAccessRole’, assumed_by=iam.ServicePrincipal(‘’), ) # Permissions are granted through an AWS managed policy, AmazonEC2ContainerRegistryFullAccess. managed_ecr_policy = iam.ManagedPolicy.from_managed_policy_arn( self, ‘cb_ecr_policy’, managed_policy_arn=’arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess’, ) build_role.add_managed_policy(policy=managed_ecr_policy) []The CodeBuild project logs into the private Amazon ECR repository, builds the Docker image with the Streamlit application, and pushes the image into the repository together with an appspec.yaml and an imagedefinitions.json file.

[]The appspec.yaml file describes the task (port, Fargate platform version, and so on), while the imagedefinitions.json file maps the names of the container images to their corresponding Amazon ECR URI. See the following code:

container_name = fargate_service.task_definition.default_container.container_name build_project = codebuild.PipelineProject( self, ‘RekWSIProject’, build_spec=codebuild.BuildSpec.from_object({ ‘version’: ‘0.2’, ‘phases’: { ‘pre_build’: { ‘commands’: [ ‘env’, ‘COMMIT_HASH=$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | cut -c 1-7)’, ‘export TAG=${COMMIT_HASH:=latest}’, ‘aws ecr get-login-password –region $AWS_DEFAULT_REGION | ‘ ‘docker login –username AWS ‘ ‘–password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$’, ] }, ‘build’: { ‘commands’: [ # Build the Docker image ‘cd streamlit_app && docker build -t $IMAGE_REPO_NAME:$IMAGE_TAG .’, # Tag the image ‘docker tag $IMAGE_REPO_NAME:$IMAGE_TAG ‘ ‘$AWS_ACCOUNT_ID.dkr.ecr.$$IMAGE_REPO_NAME:$IMAGE_TAG’, ] }, ‘post_build’: { ‘commands’: [ # Push the container into ECR. ‘docker push ‘ ‘$AWS_ACCOUNT_ID.dkr.ecr.$$IMAGE_REPO_NAME:$IMAGE_TAG’, # Generate imagedefinitions.json ‘cd ..’, “printf ‘[{“name”:”%s”,”imageUri”:”%s”}]’ ” f”{container_name} ” “$AWS_ACCOUNT_ID.dkr.ecr.$$IMAGE_REPO_NAME:$IMAGE_TAG ” “> imagedefinitions.json”, ‘ls -l’, ‘pwd’, ‘sed -i s”|REGION_NAME|$AWS_DEFAULT_REGION|g” appspec.yaml’, ‘sed -i s”|ACCOUNT_ID|$AWS_ACCOUNT_ID|g” appspec.yaml’, ‘sed -i s”|TASK_NAME|$IMAGE_REPO_NAME|g” appspec.yaml’, f’sed -i s”|CONTAINER_NAME|{container_name}|g” appspec.yaml’, ] } }, ‘artifacts’: { ‘files’: [ ‘imagedefinitions.json’, ‘appspec.yaml’, ], }, }), environment=codebuild.BuildEnvironment( build_image=codebuild.LinuxBuildImage.STANDARD_5_0, privileged=True, ), environment_variables={ ‘AWS_ACCOUNT_ID’: codebuild.BuildEnvironmentVariable(value=self.account), ‘IMAGE_REPO_NAME’: codebuild.BuildEnvironmentVariable( value=ssm.StringParameter.value_from_lookup(self, ‘/rek_wsi/prod/ecr_repo_name’)), ‘IMAGE_TAG’: codebuild.BuildEnvironmentVariable(value=’latest’), }, role=build_role, ) []Finally, we put the different pipeline stages together. The last action is the EcsDeployAction, which takes the container image built in the previous stage and does a rolling update of the tasks in our ECS cluster:

# Create an artifact to store the build output. build_output = codepipeline.Artifact() # Create a build action that ties the build project, the source artifact from the # previous stage, and the output artifact together. build_action = codepipeline_actions.CodeBuildAction( action_name=’Build’, project=build_project, input=source_output, outputs=[build_output], ) # Add the build stage to the pipeline. pipeline.add_stage( stage_name=’Build’, actions=[build_action] ) deploy_action = codepipeline_actions.EcsDeployAction( action_name=’Deploy’, service=fargate_service.service, # image_file=build_output input=build_output, ) pipeline.add_stage( stage_name=’Deploy’, actions=[deploy_action], )


[]To avoid incurring future costs, clean up the resources you created as part of this solution.

Amazon Rekognition Custom Labels model

[]Before you shut down your Studio notebook, make sure you stop the Amazon Rekognition Custom Labels model. If you don’t, it continues to incur costs.

rek.stop_project_version( ProjectVersionArn=model_arn, ) []Alternatively, you can use the Amazon Rekognition console to stop the service:

  1. On the Amazon Rekognition console, choose Use Custom Labels in the navigation pane.
  2. Choose Projects in the navigation pane.
  3. Choose version 1 of the rek-mitotic-figures-workshop project.
  4. On the Use Model tab, choose Stop.


Streamlit application

[]To destroy all resources associated to the Streamlit application, run the following code from the AWS CDK application directory:

AWS Secrets Manager

[]To delete the GitHub token, follow the instructions in the documentation.


[]In this post, we walked through the necessary steps to train a Amazon Rekognition Custom Labels model for a digital pathology application using real-world data. We then learned how to use the model from a simple application deployed from a CI/CD pipeline to Fargate.

[]Amazon Rekognition Custom Labels enables you to build ML-enabled healthcare applications that you can easily build and deploy using services like Fargate, CodeBuild, and CodePipeline.

[]Can you think of any applications to help researchers, doctors, or their patients to make their lives easier? If so, use the code in this walkthrough to build your next application. And if you have any questions, please share them in the comments section.


[]We would like to thank Prof. Dr. Marc Aubreville for kindly giving us permission to use the MITOS_WSI_CMC dataset for this blog post. The dataset can be found on GitHub.


[][1] Aubreville, M., Bertram, C.A., Donovan, T.A. et al. A completely annotated whole slide image dataset of canine breast cancer to aid human breast cancer research. Sci Data 7, 417 (2020).

[][2] Khened, M., Kori, A., Rajkumar, H. et al. A generalized deep learning framework for whole-slide image segmentation and analysis. Sci Rep 11, 11579 (2021).

[][3] PNAS March 27, 2018 115 (13) E2970-E2979; first published March 12, 2018;

About the Author

[]Pablo Nuñez Pölcher, MSc, is a Senior Solutions Architect working for the Public Sector team with Amazon Web Services. Pablo focuses on helping healthcare public sector customers build new, innovative products on AWS in accordance with best practices. He received his M.Sc. in Biological Sciences from Universidad de Buenos Aires. In his spare time, he enjoys cycling and tinkering with ML-enabled embedded devices.

[]Razvan Ionasec, PhD, MBA, is the technical leader for healthcare at Amazon Web Services in Europe, Middle East, and Africa. His work focuses on helping healthcare customers solve business problems by leveraging technology. Previously, Razvan was the global head of artificial intelligence (AI) products at Siemens Healthineers in charge of AI-Rad Companion, the family of AI-powered and cloud-based digital health solutions for imaging. He holds 30+ patents in AI/ML for medical imaging and has published 70+ international peer-reviewed technical and clinical publications on computer vision, computational modeling, and medical image analysis. Razvan received his PhD in Computer Science from the Technical University Munich and MBA from University of Cambridge, Judge Business School.


Continue Reading
Click to comment

Leave a Reply

Your email address will not be published.


Build a risk management machine learning workflow on Amazon SageMaker with no code

Since the global financial crisis, risk management has taken a major role in shaping decision-making for banks, including predicting loan status for potential customers. This is often a data-intensive exercise that requires machine learning (ML). However, not all organizations have the data science resources and expertise to build a risk management ML workflow. Amazon SageMaker…




Since the global financial crisis, risk management has taken a major role in shaping decision-making for banks, including predicting loan status for potential customers. This is often a data-intensive exercise that requires machine learning (ML). However, not all organizations have the data science resources and expertise to build a risk management ML workflow.

Amazon SageMaker is a fully managed ML platform that allows data engineers and business analysts to quickly and easily build, train, and deploy ML models. Data engineers and business analysts can collaborate using the no-code/low-code capabilities of SageMaker. Data engineers can use Amazon SageMaker Data Wrangler to quickly aggregate and prepare data for model building without writing code. Then business analysts can use the visual point-and-click interface of Amazon SageMaker Canvas to generate accurate ML predictions on their own.

In this post, we show how simple it is for data engineers and business analysts to collaborate to build an ML workflow involving data preparation, model building, and inference without writing code.

Solution overview

Although ML development is a complex and iterative process, you can generalize an ML workflow into the data preparation, model development, and model deployment stages.

Data Wrangler and Canvas abstract the complexities of data preparation and model development, so you can focus on delivering value to your business by drawing insights from your data without being an expert in code development. The following architecture diagram highlights the components in a no-code/low-code solution.

Amazon Simple Storage Service (Amazon S3) acts as our data repository for raw data, engineered data, and model artifacts. You can also choose to import data from Amazon Redshift, Amazon Athena, Databricks, and Snowflake.

As data scientists, we then use Data Wrangler for exploratory data analysis and feature engineering. Although Canvas can run feature engineering tasks, feature engineering usually requires some statistical and domain knowledge to enrich a dataset into the right form for model development. Therefore, we give this responsibility to data engineers so they can transform data without writing code with Data Wrangler.

After data preparation, we pass model building responsibilities to data analysts, who can use Canvas to train a model without having to write any code.

Finally, we make single and batch predictions directly within Canvas from the resulting model without having to deploy model endpoints ourselves.

Dataset overview

We use SageMaker features to predict the status of a loan using a modified version of Lending Club’s publicly available loan analysis dataset. The dataset contains loan data for loans issued through 2007–2011. The columns describing the loan and the borrower are our features. The column loan_status is the target variable, which is what we’re trying to predict.

To demonstrate in Data Wrangler, we split the dataset in two CSV files: part one and part two. We’ve removed some columns from Lending Club’s original dataset to simplify the demo. Our dataset contains over 37,000 rows and 21 feature columns, as described in the following table.

Column name Description
loan_status Current status of the loan (target variable).
loan_amount The listed amount of the loan applied for by the borrower. If the credit department reduces the loan amount, it’s reflected in this value.
funded_amount_by_investors The total amount committed by investors for that loan at that time.
term The number of payments on the loan. Values are in months and can be either 36 or 60.
interest_rate Interest rate on the loan.
installment The monthly payment owed by the borrower if the loan originates.
grade LC assigned loan grade.
sub_grade LC assigned loan subgrade.
employment_length Employment length in years. Possible values are between 0–10, where 0 means less than one year and 10 means ten or more years.
home_ownership The home ownership status provided by the borrower during registration. Our values are RENT, OWN, MORTGAGE, and OTHER.
annual_income The self-reported annual income provided by the borrower during registration.
verification_status Indicates if income was verified or not by the LC.
issued_amount The month at which the loan was funded.
purpose A category provided by the borrower for the loan request.
dti A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
earliest_credit_line The month the borrower’s earliest reported credit line was opened.
inquiries_last_6_months The number of inquiries in the past 6 months (excluding auto and mortgage inquiries).
open_credit_lines The number of open credit lines in the borrower’s credit file.
derogatory_public_records The number of derogatory public records.
revolving_line_utilization_rate Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
total_credit_lines The total number of credit lines currently in the borrower’s credit file.

We use this dataset for our data preparation and model training.


Complete the following prerequisite steps:

  1. Upload both loan files to an S3 bucket of your choice.
  2. Make sure you have the necessary permissions. For more information, refer to Get Started with Data Wrangler.
  3. Set up a SageMaker domain configured to use Data Wrangler. For instructions, refer to Onboard to Amazon SageMaker Domain.

Import the data

Create a new Data Wrangler data flow from the Amazon SageMaker Studio UI.

Import data from Amazon S3 by selecting the CSV files from the S3 bucket where you placed your dataset. After you import both files, you can see two separate workflows in the Data flow view.

You can choose several sampling options when importing your data in a Data Wrangler flow. Sampling can help when you have a dataset that is too large to prepare interactively, or when you want to preserve the proportion of rare events in your sampled dataset. Because our dataset is small, we don’t use sampling.

Prepare the data

For our use case, we have two datasets with a common column: id. As a first step in data preparation, we want to combine these files by joining them. For instructions, refer to Transform Data.

We use the Join data transformation step and use the Inner join type on the id column.

As a result of our join transformation, Data Wrangler creates two additional columns: id_0 and id_1. However, these columns are unnecessary for our model building purposes. We drop these redundant columns using the Manage columns transform step.

We’ve imported our datasets, joined them, and removed unnecessary columns. We’re now ready to enrich our data through feature engineering and prepare for model building.

Perform feature engineering

We used Data Wrangler for preparing data. You can also use the Data Quality and Insights Report feature within Data Wrangler to verify your data quality and detect abnormalities in your data. Data scientists often need to use these data insights to efficiently apply the right domain knowledge to engineering features. For this post, we assume we’ve completed these quality assessments and can move on to feature engineering.

In this step, we apply a few transformations to numeric, categorical, and text columns.

We first normalize the interest rate to scale the values between 0–1. We do this using the Process numeric transform to scale the interest_rate column using a min-max scaler. The purpose for normalization (or standardization) is to eliminate bias from our model. Variables that are measured at different scales won’t contribute equally to the model learning process. Therefore, a transformation function like a min-max scaler transform helps normalize features.

To convert a categorial variable into a numeric value, we use one-hot encoding. We choose the Encode categorical transform, then choose One-hot encode. One-hot encoding improves an ML model’s predictive ability. This process converts a categorical value into a new feature by assigning a binary value of 1 or 0 to the feature. As a simple example, if you had one column that held either a value of yes or no, one-hot encoding would convert that column to two columns: a Yes column and a No column. A yes value would have 1 in the Yes column and a 0 in the No column. One-hot encoding makes our data more useful because numeric values can more easily determine a probability for our predictions.

Finally, we featurize the employer_title column to transform its string values into a numerical vector. We apply the Count Vectorizer and a standard tokenizer within the Vectorize transform. Tokenization breaks down a sentence or series of text into words, whereas a vectorizer converts text data into a machine-readable form. These words are represented as vectors.

With all feature engineering steps complete, we can export the data and output the results into our S3 bucket. Alternatively, you can export your flow as Python code, or a Jupyter notebook to create a pipeline with your view using Amazon SageMaker Pipelines. Consider this when you want to run your feature engineering steps at scale or as part of an ML pipeline.

We can now use the Data Wrangler output file as our input for Canvas. We reference this as a dataset in Canvas to build our ML model.

In our case, we exported our prepared dataset to the default Studio bucket with an output prefix. We reference this dataset location when loading the data into Canvas for model building next.

Build and train your ML model with Canvas

On the SageMaker console, launch the Canvas application. To build an ML model from the prepared data in the previous section, we perform the following steps:

  1. Import the prepared dataset to Canvas from the S3 bucket.

We reference the same S3 path where we exported the Data Wrangler results from the previous section.

  1. Create new model in Canvas and name it loan_prediction_model.
  2. Select the imported dataset and add it to the model object.

To have Canvas build a model, we must select the target column.

  1. Because our goal is to predict the probability of a lender’s ability to repay a loan, we choose the loan_status column.

Canvas automatically identifies the type of ML problem statement. At the time of writing, Canvas supports regression, classification, and time series forecasting problems. You can specify the type of problem or have Canvas automatically infer the problem from your data.

  1. Choose your option to start the model building process: Quick build or Standard build.

The Quick build option uses your dataset to train a model within 2–15 minutes. This is useful when you’re experimenting with a new dataset to determine if the dataset you have will be sufficient to make predictions. We use this option for this post.

The Standard build option choses accuracy over speed and uses approximately 250 model candidates to train the model. The process usually takes 1–2 hours.

After the model is built, you can review the results of the model. Canvas estimates that your model is able to predict the right outcome 82.9% of the time. Your own results may vary due to the variability in training models.

In addition, you can dive deep into details analysis of the model to learn more about the model.

Feature importance represents the estimated importance of each feature in predicting the target column. In this case, the credit line column has the most significant impact in predicting if a customer will pay back the loan amount, followed by interest rate and annual income.

The confusion matrix in the Advanced metrics section contains information for users that want a deeper understanding of their model performance.

Before you can deploy your model for production workloads, use Canvas to test the model. Canvas manages our model endpoint and allows us to make predictions directly in the Canvas user interface.

  1. Choose Predict and review the findings on either the Batch prediction or Single prediction tab.

In the following example, we make a single prediction by modifying values to predict our target variable loan_status in real time

We can also select a larger dataset and have Canvas generate batch predictions on our behalf.


End-to-end machine learning is complex and iterative, and often involves multiple personas, technologies, and processes. Data Wrangler and Canvas enable collaboration between teams without requiring these teams to write any code.

A data engineer can easily prepare data using Data Wrangler without writing any code and pass the prepared dataset to a business analyst. A business analyst can then easily build accurate ML models with just a few click using Canvas and get accurate predictions in real time or in batch.

Get started with Data Wrangler using these tools without having to manage any infrastructure. You can set up Canvas quickly and immediately start creating ML models to support your business needs.

About the Authors

Peter Chung is a Solutions Architect for AWS, and is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions in both the public and private sectors. He holds all AWS certifications as well as two GCP certifications.

 Meenakshisundaram Thandavarayan is a Senior AI/ML specialist with AWS. He helps hi-tech strategic accounts on their AI and ML journey. He is very passionate about data-driven AI.

Dan Ferguson is a Solutions Architect at AWS, based in New York, USA. As a machine learning services expert, Dan works to support customers on their journey to integrating ML workflows efficiently, effectively, and sustainably.


Continue Reading


Optimize F1 aerodynamic geometries via Design of Experiments and machine learning

FORMULA 1 (F1) cars are the fastest regulated road-course racing vehicles in the world. Although these open-wheel automobiles are only 20–30 kilometers (or 12–18 miles) per-hour faster than top-of-the-line sports cars, they can speed around corners up to five times as fast due to the powerful aerodynamic downforce they create. Downforce is the vertical force…




FORMULA 1 (F1) cars are the fastest regulated road-course racing vehicles in the world. Although these open-wheel automobiles are only 20–30 kilometers (or 12–18 miles) per-hour faster than top-of-the-line sports cars, they can speed around corners up to five times as fast due to the powerful aerodynamic downforce they create. Downforce is the vertical force generated by the aerodynamic surfaces that presses the car towards the road, increasing the grip from the tires. F1 aerodynamicists must also monitor the air resistance or drag, which limits straight-line speed.

The F1 engineering team is in charge of designing the next generation of F1 cars and putting together the technical regulation for the sport. Over the last 3 years, they have been tasked with designing a car that maintains the current high levels of downforce and peak speeds, but is also not adversely affected by driving behind another car. This is important because the previous generation of cars can lose up to 50% of their downforce when racing closely behind another car due to the turbulent wake generated by wings and bodywork.

Instead of relying on time-consuming and costly track or wind tunnel tests, F1 uses Computational Fluid Dynamics (CFD), which provides a virtual environment to study the flow of fluids (in this case the air around the F1 car) without ever having to manufacture a single part. With CFD, F1 aerodynamicists test different geometry concepts, assess their aerodynamic impact, and iteratively optimize their designs. Over the past 3 years, the F1 engineering team has collaborated with AWS to set up a scalable and cost-efficient CFD workflow that has tripled the throughput of CFD runs and cut the turnaround time per run by half.

F1 is in the process of looking into AWS machine learning (ML) services such as Amazon SageMaker to help optimize the design and performance of the car by using the CFD simulation data to build models with additional insights. The aim is to uncover promising design directions and reduce the number of CFD simulations, thereby reducing the time taken to converge to optimal designs.

In this post, we explain how F1 collaborated with the AWS Professional Services team to develop a bespoke Design of Experiments (DoE) workflow powered by ML to advise F1 aerodynamicists on which design concepts to test in CFD to maximize learning and performance.

Problem statement

When exploring new aerodynamic concepts, F1 aerodynamicists sometimes employ a process called Design of Experiments (DoE). This process systematically studies the relationship between multiple factors. In the case of a rear wing, this might be wing chord, span, or camber, with respect to aerodynamic metrics such as downforce or drag. The goal of a DoE process is to efficiently sample the design space and minimize the number of candidates tested before converging to an optimal result. This is achieved by iteratively changing multiple design factors, measuring the aerodynamic response, studying the impact and relationship between factors, and then continuing testing in the most optimum or informative direction. In the following figure, we present an example rear wing geometry that F1 has kindly shared with us from their UNIFORM baseline. Four design parameters which F1 aerodynamicists could investigate in a DoE routine are labeled.

In this project, F1 worked with AWS Professional Services to investigate using ML to enhance DoE routines. Traditional DoE methods require a well-populated design space in order to understand the relationship between design parameters and therefore rely on a large number of upfront CFD simulations. ML regression models could use the results from previous CFD simulations to predict the aerodynamic response given the set of design parameters, as well as give you an indication of the relative importance of each design variable. You could use these insights to predict optimal designs and help designers converge to optimum solutions with fewer upfront CFD simulations. Secondly, you could use data science techniques to understand which regions in the design space haven’t been explored and could potentially hide optimal designs.

To illustrate the bespoke ML-powered DoE workflow, we walk through a real example of designing a front wing.

Designing a front wing

F1 cars rely on wings such as the front and rear wings to generate most of their downforce, which we refer to throughout this example by the coefficient Cz. Throughout this example, the downforce values have been normalized. In this example, F1 aerodynamicists used their domain expertise to parameterize the wing geometry as follows (refer to the following figure for a visual representation):

  • LE-Height – Leading edge height
  • Min-Z – Minimum ground clearance
  • Mid-LE-Angle – Leading edge angle of the third element
  • TE-Angle – Trailing edge angle
  • TE-Height – Trailing edge height

This front wing geometry was shared by F1 and is part of the UNIFORM baseline.

These parameters were selected because they are sufficient to describe the main aspects of the geometry efficiently and because in the past, aerodynamic performance has shown notable sensitivity with respect to these parameters. The goal of this DoE routine was to find the combination of the five design parameters that would maximize aerodynamic downforce (Cz). The design freedom is also limited by setting maximum and minimum values to the design parameters, as shown in the following table.

. Minimum Maximum
TE-Height 250.0 300.0
TE-Angle 145.0 165.0
Mid-LE-Angle 160.0 170.0
Min-Z 5.0 50.0
LE-Height 100.0 150.0

Having established the design parameters, the target output metric, and the bounds of our design space, we have all we need to get started with the DoE routine. A workflow diagram of our solution is presented in the following image. In the following section, we dive deep into the different stages.

Initial sampling of the design space

The first step of the DoE workflow is to run in CFD an initial set of candidates that efficiently sample the design space and allow us to build the first set of ML regression models to study the influence of each feature. First, we generate a pool of N samples using Latin Hypercube Sampling (LHS) or a regular grid method. Then, we select k candidates to test in CFD by means of a greedy inputs algorithm, which aims to maximize the exploration of the design space. Starting with a baseline candidate (the current design), we iteratively select candidates furthest away from all the previously tested candidates. Suppose that we already tested k designs; for the remaining design candidates, we find the minimum distance d with respect to the tested k designs:

The greedy inputs algorithm selects the candidate that maximizes the distance in the feature space to the previously tested candidates:

In this DoE, we selected three greedy inputs candidates and ran those in CFD to assess their aerodynamic downforce (Cz). The greedy inputs candidates explore the bounds of the design space and at this stage, none of them proved superior to the baseline candidate in terms of aerodynamic downforce (Cz). The results of this initial round of CFD testing together with the design parameters are displayed in the following table.

. TE-Height TE-Angle Mid-LE-Angle Min-Z LE-Height Normalized Cz
Baseline 292.25 154.86 166 5 130 0.975
GI 0 250 165 160 50 100 0.795
GI 1 300 145 170 50 100 0.909
GI 2 250 145 170 5 100 0.847

Initial ML regression models

The goal of the regression model is to predict Cz for any combination of the five design parameters. With such a small dataset, we prioritized simple models, applied model regularization to avoid overfitting, and combined the predictions of different models where possible. The following ML models were constructed:

  • Ordinary Least Squares (OLS)
  • Support Vector Regression (SVM) with an RBF kernel
  • Gaussian Process Regression (GP) with a Matérn kernel
  • XGBoost

In addition, a two-level stacked model was built, where the predictions of the GP, SVM, and XGBoost models are assimilated by a Lasso algorithm to produce the final response. This model is referred to throughout this post as the stacked model. To rank the predictive capabilities of the five models we described, a repeated k-fold cross validation routine was implemented.

Generating the next design candidate to test in CFD

Selecting which candidate to test next requires careful consideration. The F1 aerodynamicist must balance the benefit of exploiting options predicted by the ML model to provide high downforce with the cost of failing to explore uncharted regions of the design space, which may provide even higher downforce. For that reason, in this DoE routine, we propose three candidates: one performance-driven and two exploration-driven. The purpose of the exploration-driven candidates is also to provide additional data points to the ML algorithm in regions of the design space where the uncertainty around the prediction is highest. This in turn leads to more accurate predictions in the next round of design iteration.

Genetic algorithm optimization to maximize downforce

To obtain the candidate with the highest expected aerodynamic downforce, we could run a prediction over all possible design candidates. However, this wouldn’t be efficient. For this optimization problem, we use a genetic algorithm (GA). The goal is to efficiently search through a huge solution space (obtained via the ML prediction of Cz) and return the most optimal candidate. GAs are advantageous when the solution space is complex and non-convex, so that classical optimization methods such as gradient descent are an ineffective means to find a global solution. GA is a subset of evolutionary algorithms and inspired by concepts from natural selection, genetic crossover, and mutation to solve the search problem. Over a series of iterations (known as generations), the best candidates of an initially randomly selected set of design candidates are combined (much like reproduction). Eventually, this mechanism allows you to find the most optimal candidates in an efficient manner. For more information about GAs, refer to Using genetic algorithms on AWS for optimization problems.

Generating exploration-driven candidates

In generating what we term exploration-driven candidates, a good sampling strategy must be able to adapt to a situation of effect sparsity, where only a subset of the parameters significantly affects the solution. Therefore, the sampling strategy should spread out the candidates across the input design space but also avoid unnecessary CFD runs, changing variables that have little effect on performance. The sampling strategy must take into account the response surface predicted by the ML regressor. Two sampling strategies were employed to obtain exploration-driven candidates.

In the case of Gaussian Process Regressors (GP), the standard deviation of the predicted response surface can be used as an indication of the uncertainty of the model. The sampling strategy consists of selecting out of the pool of N samples , the candidate that maximizes . By doing so, we’re sampling in the region of the design space where the regressor is least confident about its prediction. In mathematical terms, we select the candidate that satisfies the following equation:

Alternatively, we employ a greedy inputs and outputs sampling strategy, which maximizes both the distances in the feature space and in the response space between the proposed candidate and the already tested designs. This tackles the effect sparsity situation because candidates that modify a design parameter of little relevance have a similar response, and therefore the distances in the response surface are minimal. In mathematical terms, we select the candidate that satisfies the following equation, where the function f is the ML regression model:

Candidate selection, CFD testing, and optimization loop

At this stage, the user is presented with both performance-driven and exploration-driven candidates. The next step consists of selecting a subset of the proposed candidates, running CFD simulations with those design parameters, and recording the aerodynamic downforce response.

After this, the DoE workflow retrains the ML regression models, runs the genetic algorithm optimization, and proposes a new set of performance-driven and exploration-driven candidates. The user runs a subset of the proposed candidates and continues iterating in this fashion until the stopping criteria is met. The stopping criteria is generally met when a candidate deemed optimum is obtained.


In the following figure, we record the normalized aerodynamic downforce (Cz) from the CFD simulation (blue) and the one predicted beforehand using the ML regression model of choice (pink) for each iteration of the DoE workflow. The goal was to maximize aerodynamic downforce (Cz). The first four runs (to the left of the red line) were the baseline and the three greedy inputs candidates outlined previously. From there on, a combination of performance-driven and exploration-driven candidates were tested. In particular, the candidates at iterations 6 and 8 were exploratory candidates, both showing lower levels of downforce than the baseline candidate (iteration 1). As expected, as we recorded more candidates, the ML prediction became increasingly accurate, as denoted by the decreasing distance between the predicted and actual Cz. At iteration 9, the DoE workflow managed to find a candidate with a similar performance to the baseline, and at iteration 12, the DoE workflow was concluded when the performance-driven candidate surpassed the baseline.

The final design parameters together with the resultant normalized downforce value is presented in the following table. The normalized downforce level for the baseline candidate was 0.975, whereas the optimum candidate for the DoE workflow recorded a normalized downforce level of 1.000. This is an important 2.5% relative increase.

For context, a traditional DoE approach with five variables would require 25 upfront CFD simulations before achieving a good enough fit to predict an optimum. On the other hand, this active learning approach converged to an optimum in 12 iterations.

. TE-Height TE-Angle Mid-LE-Angle Min-Z LE-Height Normalized Cz
Baseline 292.25 154.86 166 5 130 0.975
Optimal 299.97 156.79 166.27 5.01 135.26 1.000

Feature importance

Understanding the relative feature importance for a predictive model can provide a useful insight into the data. It can help feature selection with less important variables being removed, thereby reducing the dimensionality of the problem and potentially improving the predictive powers of the regression model, particularly in the small data regime. In this design problem, it provides F1 aerodynamicists an insight into which variables are the most sensitive and therefore require more careful tuning.

In this routine, we implemented a model-agnostic technique called permutation importance. The relative importance of each variable is measured by calculating the increase in the model’s prediction error after randomly shuffling the values for that variable alone. If a feature is important for the model, the prediction error increases greatly, and vice versa for lesser important features. In the following figure, we present the permutation importance for a Gaussian Process Regressor (GP) predicting aerodynamic downforce (Cz). The trailing edge height (TE-Height) was deemed the most important.


In this post, we explained how F1 aerodynamicists are using ML regression models in DoE workflows when designing novel aerodynamic geometries. The ML-powered DoE workflow developed by AWS Professional Services provides insights into which design parameters will maximize performance or explore uncharted regions in the design space. As opposed to iteratively testing candidates in CFD in a grid search fashion, the ML-powered DoE workflow is able to converge to optimal design parameters in fewer iterations. This saves both time and resources because fewer CFD simulations are required.

Whether you’re a pharmaceutical company looking to speed up chemical composition optimization or a manufacturing company looking to find the design dimensions for the most robust designs, DoE workflows can help reach optimal candidates more efficiently. AWS Professional Services is ready to supplement your team with specialized ML skills and experience to develop the tools to streamline DoE workflows and help you achieve better business outcomes. For more information, see AWS Professional Services, or reach out through your account manager to get in touch.

About the Authors

Pablo Hermoso Moreno is a Data Scientist in the AWS Professional Services Team. He works with clients across industries using Machine Learning to tell stories with data and reach more informed engineering decisions faster. Pablo’s background is in Aerospace Engineering and having worked in the motorsport industry he has an interest in bridging physics and domain expertise with ML. In his spare time, he enjoys rowing and playing guitar.


Continue Reading


Detect social media fake news using graph machine learning with Amazon Neptune ML

In recent years, social media has become a common means for sharing and consuming news. However, the spread of misinformation and fake news on these platforms has posed a major challenge to the well-being of individuals and societies. Therefore, it is imperative that we develop robust and automated solutions for early detection of fake news…




In recent years, social media has become a common means for sharing and consuming news. However, the spread of misinformation and fake news on these platforms has posed a major challenge to the well-being of individuals and societies. Therefore, it is imperative that we develop robust and automated solutions for early detection of fake news on social media. Traditional approaches rely purely on the news content (using natural language processing) to mark information as real or fake. However, the social context in which the news is published and shared can provide additional insights into the nature of fake news on social media and improve the predictive capabilities of fake news detection tools. In this post, we demonstrate how to use Amazon Neptune ML to detect fake news based on the content and social context of the news on social media.

Neptune ML is a new capability of Amazon Neptune that uses graph neural networks (GNNs), a machine learning (ML) technique purpose-built for graphs, to make easy, fast, and accurate predictions using graph data. Making accurate predictions on graphs with billions of relationships requires expertise. Existing ML approaches such as XGBoost can’t operate effectively on graphs because they’re designed for tabular data. As a result, using these methods on graphs can take time, require specialized skills, and produce suboptimal predictions.

Neptune ML uses the Deep Graph Library (DGL), an open-source library to which AWS contributes, and Amazon SageMaker to build and train GNNs, including Relational Graph Convolutional Networks (R-GCNs) for tasks such as node classification, node regression, link prediction, or edge classification.

The DGL makes it easy to apply deep learning to graph data, and Neptune ML automates the heavy lifting of selecting and training the best ML model for graph data. It provides fast and memory-efficient message passing primitives for training GNNs. Neptune ML uses the DGL to automatically choose and train the best ML model for your workload. This enables you to make ML-based predictions on graph data in hours instead of weeks. For more information, see Amazon Neptune ML for machine learning on graphs.

Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to prepare, build, train, and deploy ML models quickly.

Overview of GNNs

GNNs are neural networks that take graphs as input. These models operate on the relational information in data to produce insights not possible in other neural network architectures and algorithms. A graph (sometimes called a network) is a data structure that highlights the relationships between components in the data. It consists of nodes (or vertices) and edges (or links) that act as connections between the nodes. Such a data structure has an advantage when dealing with entities that have multiple relationships. Graph data structures have been around for centuries, with a wide variety of modern use cases.

GNNs are emerging as an important class of deep learning (DL) models. GNNs learn embeddings on nodes, edges, and graphs. GNNs have been around for about 20 years, but interest in them has dramatically increased in the last 5 years. In this time, we’ve seen new architectures emerge, novel applications realized, and new platforms and libraries enter the scene. There are several potential research and industry use cases for GNNs, including the following:

  • Computer vision – Generating scene graphs
  • Forecasting – Predicting traffic volume
  • Node classification – Implementing targeted campaigns, detecting fake news
  • Graph classification – Predicting the properties of a chemical compound
  • Link prediction – Building recommendation systems
  • Other – Predicting adversarial attacks


For this post, we use the BuzzFeed dataset from the 2018 version of FakeNewsNet. The BuzzFeed dataset consists of a sample of news articles shared on Facebook from nine news agencies over 1 week leading up to the 2016 US election. Every post and the corresponding news article have been fact-checked by BuzzFeed journalists. The following table summarizes key statistics about the BuzzFeed dataset from FakeNewsNet.

Category Amount
Users 15,257
Authors 126
Publishers 28
Social Links 634,750
Engagements 25,240
News Articles 182
Fake News 91
Real News 91

To get the raw data, you can complete the following steps:

  1. Clone the FakeNewsNet repository from GitHub.
  2. Check out the old version branch.
  3. Change the directory to Data/BuzzFeed.

Each row in the Users.txt file provides a UUID for the corresponding user.

Each row in the News.txt file provides a name and ID for the corresponding news in the dataset.

In the BuzzFeedNewsUser.txt file, the news_id in the first column is posted or shared by the user_id in the second column n times, where n is the value in the third column.

In the BuzzFeedUserUser.txt file, the user_id in the first column follows the user_id in the second column.

User features such as age, gender, and historical social media activities (109,626 features for each user) are made available in UserFeature.mat file. Sample news content files, shown in the following screenshot, contain information such as news title, news text, author name, and publisher web address.

We processed the raw data from the FakeNewsNet repository and converted it into CSV format for vertices and edges in a heterogeneous property graph that can be readily loaded into a Neptune database with Apache TinkerPop Gremlin. The constructed property graph is composed of four vertex types and five edge types, as demonstrated in the following schematic, which together describe the social context in which each news item is published and shared. The News vertices have two properties: news_title and news_type (Fake or Real). The edges connecting News and User vertices have a weight property describing how many times the user has shared the news. The User vertices have a 100-dimension property representing user features such as age, gender, and historical social media activities (reduced from 109,626 to 100 using principal coordinate analysis).

The following screenshot shows the first 10 rows of the processed nodes.csv file.

The following screenshot shows the first 10 rows of the processed edges.csv file.

To follow along with this post, start by using the following AWS CloudFormation quick-start template to quickly spin up an associated Neptune cluster and AWS graph notebook, and set up all the configurations needed to work with Neptune ML in a graph notebook. You then need to download and save the sample dataset in the default Amazon Simple Storage Service (Amazon S3) bucket associated with your SageMaker session, or in an S3 bucket of your choice. For rapid experimentation and initial data exploration, you can save a copy of the dataset under the home directory of the local volume attached to your SageMaker notebook instance, and follow the create_graph_dataset.ipynb Jupyter notebook. After you generate the processed nodes and edges files, you can run the following commands to upload the transformed graph data to Amazon S3:

bucket = ‘‘ prefix = ‘fake-news-detection/data’ s3_client = boto3.client(‘s3’) resp = s3_client.upload_file(‘./Data/upload/nodes.csv’, bucket, f”{prefix}/nodes.csv”) resp = s3_client.upload_file(‘./Data/upload/edges.csv’, bucket, f”{prefix}/edges.csv”)

You can use the %load magic command, which is available as part of the AWS graph notebook, to bulk load data to Neptune:

%load -s {s3_uri} -f csv -p OVERSUBSCRIBE –run

You can use the %graph_notebook_config magic command to see information about the Neptune cluster associated with your graph notebook. You can also use the %status magic command to see the status of your Neptune cluster, as shown in the following screenshot.

Solution overview

Neptune ML uses graph neural network technology to automatically create, train, and deploy ML models on your graph data. Neptune ML supports common graph prediction tasks, such as node classification and regression, edge classification and regression, and link prediction. In our solution, we use node classification to classify news nodes according to the news_type property.

The following diagram illustrates the high-level process flow to develop the best model for fake news detection.

Graph ML with Neptune ML involves five main steps:

  1. Export and configure the data – The data export step uses the Neptune-Export service to export data from Neptune into Amazon S3 in CSV format. A configuration file named training-data-configuration.json is automatically generated, which specifies how the exported data can be loaded into a trainable graph.
  2. Preprocess the data – The exported dataset is preprocessed using standard techniques to prepare it for model training. Feature normalization can be performed for numeric data, and text features can be encoded using word2vec. At the end of this step, a DGL graph is generated from the exported dataset for the model training step. This step is implemented using a SageMaker processing job, and the resulting data is stored in an Amazon S3 location that you have specified.
  3. Train the model – This step trains the ML model that will be used for predictions. Model training is done in two stages:
    1. The first stage uses a SageMaker processing job to generate a model training strategy configuration set that specifies what type of model and model hyperparameter ranges are used for the model training.
    2. The second stage uses a SageMaker model tuning job to try different hyperparameter configurations and select the training job that produced the best-performing model. The tuning job runs a pre-specified number of model training job trials on the processed data. At the end of this stage, the trained model parameters of the best training job are used to generate model artifacts for inference.
  4. Create an inference endpoint in SageMaker – The inference endpoint is a SageMaker endpoint instance that is launched with the model artifacts produced by the best training job. The endpoint is able to accept incoming requests from the graph database and return the model predictions for inputs in the requests.
  5. Query the ML model using Gremlin – You can use extensions to the Gremlin query language to query predictions from the inference endpoint.

Before we proceed with the first step of machine learning, let’s verify that the graph dataset is loaded in the Neptune cluster. Run the following Gremlin traversal to see the count of nodes by label:

%%gremlin g.V().groupCount().by(label).unfold().order().by(keys)

If nodes are loaded correctly, the output is as follows:

  • 126 author nodes
  • 182 news nodes
  • 28 publisher nodes
  • 15,257 user nodes

Use the following code to see the count edges by label:

%%gremlin g.E().groupCount().by(label).unfold().order().by(keys)

If edges are loaded correctly, the output is as follows:

  • 634,750 follows edges
  • 174 published edges
  • 250 wrote edges
  • 250 wrote_for edges

Now let’s go through the ML development process in detail.

Export and configure the data

The export process is triggered by calling to the Neptune-Export service endpoint. This call contains a configuration object that specifies the type of ML model to build, in our case node classification, as well as any feature configurations required.

The configuration options provided to the Neptune-Export service are broken into two main sections: selecting the target and configuring features. Here we want to classify news nodes according to the news_type property.

The second section of the configuration, configuring features, is where we specify details about the types of data stored in our graph and how the ML model should interpret that data. When data is exported from Neptune, all properties of all nodes are included. Each property is treated as a separate feature for the ML model. Neptune ML does its best to infer the correct type of feature for a property, but in many cases, the accuracy of the model can be improved by specifying information about the property used for a feature. We use word2vec to encode the news_title property of news nodes, and the numerical type for user_features property of user nodes. See the following code:

export_params={ “command”: “export-pg”, “params”: { “endpoint”: neptune_ml.get_host(), “profile”: “neptune_ml”, “useIamAuth”: neptune_ml.get_iam(), “cloneCluster”: False }, “outputS3Path”: f”{s3_uri}/neptune-export”, “additionalParams”: { “neptune_ml”: { “version”: “v2.0”, “targets”: [ { “node”: “news”, “property”: “news_type”, “type”: “classification” } ], “features”: [ { “node”: “news”, “property”: “news_title”, “type”: “text_word2vec” }, { “node”: “user”, “property”: “user_features”, “type”: “numerical” } ] } }, “jobSize”: “medium”}

Start the export process by running the following command:

%%neptune_ml export start –export-url {neptune_ml.get_export_service_host()} –export-iam –wait –store-to export_results ${export_params}

Preprocess the data

When the export job is complete, we’re ready to train our ML model. There are three machine learning steps in Neptune ML. The first step (data processing) processes the exported graph dataset using standard feature preprocessing techniques to prepare it for use by the DGL. This step performs functions such as feature normalization for numeric data and encoding text features using word2vec. At the conclusion of this step, the dataset is formatted for model training. This step is implemented using a SageMaker processing job, and data artifacts are stored in a pre-specified Amazon S3 location when the job is complete. Run the following code to create the data processing configuration and begin the processing job:

# The training_job_name can be set to a unique value below, otherwise one will be auto generated training_job_name=neptune_ml.get_training_job_name(‘fake-news-detection’) processing_params = f””” –config-file-name training-data-configuration.json –job-id {training_job_name} –s3-input-uri {export_results[‘outputS3Uri’]} –s3-processed-uri {str(s3_uri)}/preloading “””

Train the model

Now that you have the data processed in the desired format, this step trains the ML model that is used for predictions. The model training is done in two stages. The first stage uses a SageMaker processing job to generate a model training strategy. A model training strategy is a configuration set that specifies what type of model and model hyperparameter ranges are used for the model training. After the first stage is complete, the SageMaker processing job launches a SageMaker hyperparameter tuning job. The hyperparameter tuning job runs a pre-specified number of model training job trials on the processed data, and stores the model artifacts generated by the training in the output Amazon S3 location. When all the training jobs are complete, the hyperparameter tuning job also notes the training job that produced the best performing model.

We use the following training parameters:

training_params=f””” –job-id {training_job_name} –data-processing-id {training_job_name} –instance-type ml.c5.18xlarge –s3-output-uri {str(s3_uri)}/training –max-hpo-number 20 –max-hpo-parallel 4 “””

The hyperparameter tuning finds the best version of a model by running many training jobs on the dataset. You can summarize hyperparameters of the five best training jobs and their respective model performance as follows:

tuning_job_name = training_results[‘hpoJob’][‘name’] tuner = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name) full_df = tuner.dataframe() if len(full_df) > 0: df = full_df[full_df[“FinalObjectiveValue”] > -float(“inf”)] if len(df) > 0: df = df.sort_values(“FinalObjectiveValue”, ascending=False) print(“Number of training jobs with valid objective: %d” % len(df)) print({“lowest”: min(df[“FinalObjectiveValue”]), “highest”: max(df[“FinalObjectiveValue”])}) pd.set_option(“display.max_colwidth”, None) # Don’t truncate TrainingJobName else: print(“No training jobs have reported valid results yet.”)

We can see that the best performing training job achieved an accuracy of approximately 94%. This training job will be automatically selected by Neptune ML for creating an endpoint in the next step.

Create an endpoint

The final step of machine learning is to create an inference endpoint, which is a SageMaker endpoint instance that is launched with the model artifacts produced by the best training job. We use this endpoint in our graph queries to return the model predictions for the inputs in the request. After the endpoint is created, it stays active until it’s manually deleted. Create the endpoint with the following code:

endpoint_params=f””” –id {training_job_name} –model-training-job-id {training_job_name} “”” #Create endpoint %neptune_ml endpoint create –wait –store-to endpoint_results {endpoint_params}

Our new endpoint is now up and running.

Query the ML model

Now let’s query your trained graph to see how the model predicts news_type for one unseen news node:

# Random fake news: test node: Actual %%gremlin g.V().has(‘news_title’, ‘BREAKING: Steps to FORCE FBI Director Comey to Resign In Process – Hearing Decides His Fate Sept 28’).properties(“news_type”).value() # Random fake news: test node: Predicted %%gremlin g.with(“Neptune#ml.endpoint”, “${endpoint}”). V().has(‘news_title’, “BREAKING: Steps to FORCE FBI Director Comey to Resign In Process – Hearing Decides His Fate Sept 28”).properties(“news_type”).with(“Neptune#ml.classification”).value()

If your graph is continuously changing, you may need to update ML predictions frequently using the newest data. Although you can do this simply by rerunning the earlier steps (from data export and configuration to creating your inference endpoint), Neptune ML supports simpler ways to update your ML predictions using new data. See Workflows for handling evolving graph data for more details.


In this post, we showed how Neptune ML and GNNs can help detect social media fake news using node classification on graph data by combining information from the complex interaction patterns in the graph. For instructions on implementing this solution, see the GitHub repo. You can also clone and extend this solution with additional data sources for model retraining and tuning. We encourage you to reach out and discuss your use cases with the authors via your AWS account manager.

Additional references

For more information related to Neptune ML and detecting fake news in social media, see the following resources:

About the Authors

Hasan Shojaei is a Data Scientist with AWS Professional Services, where he helps customers across different industries such as sports, insurance, and financial services solve their business challenges through the use of big data, machine learning, and cloud technologies. Prior to this role, Hasan led multiple initiatives to develop novel physics-based and data-driven modeling techniques for top energy companies. Outside of work, Hasan is passionate about books, hiking, photography, and ancient history.

Sarita Joshi is a Senior Data Science Manager with the AWS Professional Services Intelligence team. Together with her team, Sarita plays a strategic role for our customers and partners by helping them achieve their business outcomes through machine learning and artificial intelligence solutions at scale. She has several years of experience as a consultant advising clients across many industries and technical domains, including AI, ML, analytics, and SAP. She holds a master’s degree in Computer Science, Specialty Data Science from Northeastern University.


Continue Reading


Copyright © 2021 Today's Digital.