Connect with us


Build a risk management machine learning workflow on Amazon SageMaker with no code

Since the global financial crisis, risk management has taken a major role in shaping decision-making for banks, including predicting loan status for potential customers. This is often a data-intensive exercise that requires machine learning (ML). However, not all organizations have the data science resources and expertise to build a risk management ML workflow. Amazon SageMaker…



Since the global financial crisis, risk management has taken a major role in shaping decision-making for banks, including predicting loan status for potential customers. This is often a data-intensive exercise that requires machine learning (ML). However, not all organizations have the data science resources and expertise to build a risk management ML workflow.

Amazon SageMaker is a fully managed ML platform that allows data engineers and business analysts to quickly and easily build, train, and deploy ML models. Data engineers and business analysts can collaborate using the no-code/low-code capabilities of SageMaker. Data engineers can use Amazon SageMaker Data Wrangler to quickly aggregate and prepare data for model building without writing code. Then business analysts can use the visual point-and-click interface of Amazon SageMaker Canvas to generate accurate ML predictions on their own.

In this post, we show how simple it is for data engineers and business analysts to collaborate to build an ML workflow involving data preparation, model building, and inference without writing code.

Solution overview

Although ML development is a complex and iterative process, you can generalize an ML workflow into the data preparation, model development, and model deployment stages.

Data Wrangler and Canvas abstract the complexities of data preparation and model development, so you can focus on delivering value to your business by drawing insights from your data without being an expert in code development. The following architecture diagram highlights the components in a no-code/low-code solution.

Amazon Simple Storage Service (Amazon S3) acts as our data repository for raw data, engineered data, and model artifacts. You can also choose to import data from Amazon Redshift, Amazon Athena, Databricks, and Snowflake.

As data scientists, we then use Data Wrangler for exploratory data analysis and feature engineering. Although Canvas can run feature engineering tasks, feature engineering usually requires some statistical and domain knowledge to enrich a dataset into the right form for model development. Therefore, we give this responsibility to data engineers so they can transform data without writing code with Data Wrangler.

After data preparation, we pass model building responsibilities to data analysts, who can use Canvas to train a model without having to write any code.

Finally, we make single and batch predictions directly within Canvas from the resulting model without having to deploy model endpoints ourselves.

Dataset overview

We use SageMaker features to predict the status of a loan using a modified version of Lending Club’s publicly available loan analysis dataset. The dataset contains loan data for loans issued through 2007–2011. The columns describing the loan and the borrower are our features. The column loan_status is the target variable, which is what we’re trying to predict.

To demonstrate in Data Wrangler, we split the dataset in two CSV files: part one and part two. We’ve removed some columns from Lending Club’s original dataset to simplify the demo. Our dataset contains over 37,000 rows and 21 feature columns, as described in the following table.

Column name Description
loan_status Current status of the loan (target variable).
loan_amount The listed amount of the loan applied for by the borrower. If the credit department reduces the loan amount, it’s reflected in this value.
funded_amount_by_investors The total amount committed by investors for that loan at that time.
term The number of payments on the loan. Values are in months and can be either 36 or 60.
interest_rate Interest rate on the loan.
installment The monthly payment owed by the borrower if the loan originates.
grade LC assigned loan grade.
sub_grade LC assigned loan subgrade.
employment_length Employment length in years. Possible values are between 0–10, where 0 means less than one year and 10 means ten or more years.
home_ownership The home ownership status provided by the borrower during registration. Our values are RENT, OWN, MORTGAGE, and OTHER.
annual_income The self-reported annual income provided by the borrower during registration.
verification_status Indicates if income was verified or not by the LC.
issued_amount The month at which the loan was funded.
purpose A category provided by the borrower for the loan request.
dti A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
earliest_credit_line The month the borrower’s earliest reported credit line was opened.
inquiries_last_6_months The number of inquiries in the past 6 months (excluding auto and mortgage inquiries).
open_credit_lines The number of open credit lines in the borrower’s credit file.
derogatory_public_records The number of derogatory public records.
revolving_line_utilization_rate Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
total_credit_lines The total number of credit lines currently in the borrower’s credit file.

We use this dataset for our data preparation and model training.


Complete the following prerequisite steps:

  1. Upload both loan files to an S3 bucket of your choice.
  2. Make sure you have the necessary permissions. For more information, refer to Get Started with Data Wrangler.
  3. Set up a SageMaker domain configured to use Data Wrangler. For instructions, refer to Onboard to Amazon SageMaker Domain.

Import the data

Create a new Data Wrangler data flow from the Amazon SageMaker Studio UI.

Import data from Amazon S3 by selecting the CSV files from the S3 bucket where you placed your dataset. After you import both files, you can see two separate workflows in the Data flow view.

You can choose several sampling options when importing your data in a Data Wrangler flow. Sampling can help when you have a dataset that is too large to prepare interactively, or when you want to preserve the proportion of rare events in your sampled dataset. Because our dataset is small, we don’t use sampling.

Prepare the data

For our use case, we have two datasets with a common column: id. As a first step in data preparation, we want to combine these files by joining them. For instructions, refer to Transform Data.

We use the Join data transformation step and use the Inner join type on the id column.

As a result of our join transformation, Data Wrangler creates two additional columns: id_0 and id_1. However, these columns are unnecessary for our model building purposes. We drop these redundant columns using the Manage columns transform step.

We’ve imported our datasets, joined them, and removed unnecessary columns. We’re now ready to enrich our data through feature engineering and prepare for model building.

Perform feature engineering

We used Data Wrangler for preparing data. You can also use the Data Quality and Insights Report feature within Data Wrangler to verify your data quality and detect abnormalities in your data. Data scientists often need to use these data insights to efficiently apply the right domain knowledge to engineering features. For this post, we assume we’ve completed these quality assessments and can move on to feature engineering.

In this step, we apply a few transformations to numeric, categorical, and text columns.

We first normalize the interest rate to scale the values between 0–1. We do this using the Process numeric transform to scale the interest_rate column using a min-max scaler. The purpose for normalization (or standardization) is to eliminate bias from our model. Variables that are measured at different scales won’t contribute equally to the model learning process. Therefore, a transformation function like a min-max scaler transform helps normalize features.

To convert a categorial variable into a numeric value, we use one-hot encoding. We choose the Encode categorical transform, then choose One-hot encode. One-hot encoding improves an ML model’s predictive ability. This process converts a categorical value into a new feature by assigning a binary value of 1 or 0 to the feature. As a simple example, if you had one column that held either a value of yes or no, one-hot encoding would convert that column to two columns: a Yes column and a No column. A yes value would have 1 in the Yes column and a 0 in the No column. One-hot encoding makes our data more useful because numeric values can more easily determine a probability for our predictions.

Finally, we featurize the employer_title column to transform its string values into a numerical vector. We apply the Count Vectorizer and a standard tokenizer within the Vectorize transform. Tokenization breaks down a sentence or series of text into words, whereas a vectorizer converts text data into a machine-readable form. These words are represented as vectors.

With all feature engineering steps complete, we can export the data and output the results into our S3 bucket. Alternatively, you can export your flow as Python code, or a Jupyter notebook to create a pipeline with your view using Amazon SageMaker Pipelines. Consider this when you want to run your feature engineering steps at scale or as part of an ML pipeline.

We can now use the Data Wrangler output file as our input for Canvas. We reference this as a dataset in Canvas to build our ML model.

In our case, we exported our prepared dataset to the default Studio bucket with an output prefix. We reference this dataset location when loading the data into Canvas for model building next.

Build and train your ML model with Canvas

On the SageMaker console, launch the Canvas application. To build an ML model from the prepared data in the previous section, we perform the following steps:

  1. Import the prepared dataset to Canvas from the S3 bucket.

We reference the same S3 path where we exported the Data Wrangler results from the previous section.

  1. Create new model in Canvas and name it loan_prediction_model.
  2. Select the imported dataset and add it to the model object.

To have Canvas build a model, we must select the target column.

  1. Because our goal is to predict the probability of a lender’s ability to repay a loan, we choose the loan_status column.

Canvas automatically identifies the type of ML problem statement. At the time of writing, Canvas supports regression, classification, and time series forecasting problems. You can specify the type of problem or have Canvas automatically infer the problem from your data.

  1. Choose your option to start the model building process: Quick build or Standard build.

The Quick build option uses your dataset to train a model within 2–15 minutes. This is useful when you’re experimenting with a new dataset to determine if the dataset you have will be sufficient to make predictions. We use this option for this post.

The Standard build option choses accuracy over speed and uses approximately 250 model candidates to train the model. The process usually takes 1–2 hours.

After the model is built, you can review the results of the model. Canvas estimates that your model is able to predict the right outcome 82.9% of the time. Your own results may vary due to the variability in training models.

In addition, you can dive deep into details analysis of the model to learn more about the model.

Feature importance represents the estimated importance of each feature in predicting the target column. In this case, the credit line column has the most significant impact in predicting if a customer will pay back the loan amount, followed by interest rate and annual income.

The confusion matrix in the Advanced metrics section contains information for users that want a deeper understanding of their model performance.

Before you can deploy your model for production workloads, use Canvas to test the model. Canvas manages our model endpoint and allows us to make predictions directly in the Canvas user interface.

  1. Choose Predict and review the findings on either the Batch prediction or Single prediction tab.

In the following example, we make a single prediction by modifying values to predict our target variable loan_status in real time

We can also select a larger dataset and have Canvas generate batch predictions on our behalf.


End-to-end machine learning is complex and iterative, and often involves multiple personas, technologies, and processes. Data Wrangler and Canvas enable collaboration between teams without requiring these teams to write any code.

A data engineer can easily prepare data using Data Wrangler without writing any code and pass the prepared dataset to a business analyst. A business analyst can then easily build accurate ML models with just a few click using Canvas and get accurate predictions in real time or in batch.

Get started with Data Wrangler using these tools without having to manage any infrastructure. You can set up Canvas quickly and immediately start creating ML models to support your business needs.

About the Authors

Peter Chung is a Solutions Architect for AWS, and is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions in both the public and private sectors. He holds all AWS certifications as well as two GCP certifications.

 Meenakshisundaram Thandavarayan is a Senior AI/ML specialist with AWS. He helps hi-tech strategic accounts on their AI and ML journey. He is very passionate about data-driven AI.

Dan Ferguson is a Solutions Architect at AWS, based in New York, USA. As a machine learning services expert, Dan works to support customers on their journey to integrating ML workflows efficiently, effectively, and sustainably.


Continue Reading
Click to comment

Leave a Reply

Your email address will not be published.


Break through language barriers with Amazon Transcribe, Amazon Translate, and Amazon Polly

Imagine a surgeon taking video calls with patients across the globe without the need of a human translator. What if a fledgling startup could easily expand their product across borders and into new geographical markets by offering fluid, accurate, multilingual customer support and sales, all without the need of a live human translator? What happens…




[]Imagine a surgeon taking video calls with patients across the globe without the need of a human translator. What if a fledgling startup could easily expand their product across borders and into new geographical markets by offering fluid, accurate, multilingual customer support and sales, all without the need of a live human translator? What happens to your business when you’re no longer bound by language?

[]It’s common today to have virtual meetings with international teams and customers that speak many different languages. Whether they’re internal or external meetings, meaning often gets lost in complex discussions and you may encounter language barriers that prevent you from being as effective as you could be.

[]In this post, you will learn how to use three fully managed AWS services (Amazon Transcribe, Amazon Translate, and Amazon Polly) to produce a near-real-time speech-to-speech translator solution that can quickly translate a source speaker’s live voice input into a spoken, accurate, translated target language, all with zero machine learning (ML) experience.

Overview of solution

[]Our translator consists of three fully managed AWS ML services working together in a single Python script by using the AWS SDK for Python (Boto3) for our text translation and text-to-speech portions, and an asynchronous streaming SDK for audio input transcription.

Amazon Transcribe: Streaming speech to text

[]The first service you use in our stack is Amazon Transcribe, a fully managed speech-to-text service that takes input speech and transcribes it to text. Amazon Transcribe has flexible ingestion methods, batch or streaming, because it accepts either stored audio files or streaming audio data. In this post, you use the asynchronous Amazon Transcribe streaming SDK for Python, which uses the HTTP/2 streaming protocol to stream live audio and receive live transcriptions.

[]When we first built this prototype, Amazon Transcribe streaming ingestion didn’t support automatic language detection, but this is no longer the case as of November 2021. Both batch and streaming ingestion now support automatic language detection for all supported languages. In this post, we show how a parameter-based solution though a seamless multi-language parameterless design is possible through the use of streaming automatic language detection. After our transcribed speech segment is returned as text, you send a request to Amazon Translate to translate and return the results in our Amazon Transcribe EventHandler method.

Amazon Translate: State-of-the-art, fully managed translation API

[]Next in our stack is Amazon Translate, a neural machine translation service that delivers fast, high-quality, affordable, and customizable language translation. As of June of 2022, Amazon Translate supports translation across 75 languages, with new language pairs and improvements being made constantly. Amazon Translate uses deep learning models hosted on a highly scalable and resilient AWS Cloud architecture to quickly deliver accurate translations either in real time or batched, depending on your use case. Using Amazon Translate is straightforward and requires no management of underlying architecture or ML skills. Amazon Translate has several features, like creating and using a custom terminology to handle mapping between industry-specific terms. For more information on Amazon Translate service limits, refer to Guidelines and limits. After the application receives the translated text in our target language, it sends the translated text to Amazon Polly for immediate translated audio playback.

Amazon Polly: Fully managed text-to-speech API

[]Finally, you send the translated text to Amazon Polly, a fully managed text-to-speech service that can either send back lifelike audio clip responses for immediate streaming playback or batched and saved in Amazon Simple Storage Service (Amazon S3) for later use. You can control various aspects of speech such as pronunciation, volume, pitch, speech rate, and more using standardized Speech Synthesis Markup Language (SSML).

[]You can synthesize speech for certain Amazon Polly Neural voices using the Newscaster style to make them sound like a TV or radio newscaster. You can also detect when specific words or sentences in the text are being spoken based on the metadata included in the audio stream. This allows the developer to synchronize graphical highlighting and animations, such as the lip movements of an avatar, with the synthesized speech.

[]You can modify the pronunciation of particular words, such as company names, acronyms, foreign words, or neologisms, for example “P!nk,” “ROTFL,” or “C’est la vie” (when spoken in a non-French voice), using custom lexicons.

Architecture overview

[]The following diagram illustrates our solution architecture.

Architectural Diagram []This diagram shows the data flow from the client device to Amazon Transcribe, Amazon Translate, and Amazon Polly

[]The workflow is as follows:

  1. Audio is ingested by the Python SDK.
  2. Amazon Polly converts the speech to text, in 39 possible languages.
  3. Amazon Translate converts the languages.
  4. Amazon Live Transcribe converts text to speech.
  5. Audio is outputted to speakers.


[]You need a host machine set up with a microphone, speakers, and reliable internet connection. A modern laptop should work fine for this because no additional hardware is needed. Next, you need to set up the machine with some software tools.

[]You must have Python 3.7+ installed to use the asynchronous Amazon Transcribe streaming SDK and for a Python module called pyaudio, which you use to control the machine’s microphone and speakers. This module depends on a C library called portaudio.h. If you encounter issues with pyaudio errors, we suggest checking your OS to see if you have the portaudio.h library installed.

[]For authorization and authentication of service calls, you create an AWS Identity and Access Management (IAM) service role with permissions to call the necessary AWS services. By configuring the AWS Command Line Interface (AWS CLI) with this IAM service role, you can run our script on your machine without having to pass in keys or passwords, because the AWS libraries are written to use the configured AWS CLI user’s credentials. This is a convenient method for rapid prototyping and ensures our services are being called by an authorized identity. As always, follow the principle of least privilege when assigning IAM policies when creating an IAM user or role.

[]To summarize, you need the following prerequisites:

  • A PC, Mac, or Linux machine with microphone, speakers, and internet connection
  • The portaudio.h C library for your OS (brew, apt get, wget), which is needed for pyaudio to work
  • AWS CLI 2.0 with properly authorized IAM user configured by running aws configure in the AWS CLI
  • Python 3.7+
  • The asynchronous Amazon Transcribe Python SDK
  • The following Python libraries:
    • boto3
    • amazon-transcribe
    • pyaudio
    • asyncio
    • concurrent

Implement the solution

[]You will be relying heavily on the asynchronous Amazon Transcribe streaming SDK for Python as a starting point, and are going to build on top of that specific SDK. After you have experimented with the streaming SDK for Python, you add streaming microphone input by using pyaudio, a commonly used Python open-source library used for manipulating audio data. Then you add Boto3 calls to Amazon Translate and Amazon Polly for our translation and text-to-speech functionality. Finally, you stream out translated speech through the computer’s speakers again with pyaudio. The Python module concurrent gives you the ability to run blocking code in its own asynchronous thread to play back your returned Amazon Polly speech in a seamless, non-blocking way.

[]Let’s import all our necessary modules, transcribe streaming classes, and instantiate some globals:

import boto3 import asyncio import pyaudio import concurrent from amazon_transcribe.client import TranscribeStreamingClient from amazon_transcribe.handlers import TranscriptResultStreamHandler from amazon_transcribe.model import TranscriptEvent polly = boto3.client(‘polly’, region_name = ‘us-west-2′) translate = boto3.client(service_name=’translate’, region_name=’us-west-2′, use_ssl=True) pa = pyaudio.PyAudio() #for mic stream, 1024 should work fine default_frames = 1024 #current params are set up for English to Mandarin, modify to your liking params[‘source_language’] = “en” params[‘target_language’] = “zh” params[‘lang_code_for_polly’] = “cmn-CN” params[‘voice_id’] = “Zhiyu” params[‘lang_code_for_transcribe’] = “en-US” []First, you use pyaudio to obtain the input device’s sampling rate, device index, and channel count:

#try grabbing the default input device and see if we get lucky default_indput_device = pa.get_default_input_device_info() # verify this is your microphone device print(default_input_device) #if correct then set it as your input device and define some globals input_device = default_input_device input_channel_count = input_device[“maxInputChannels”] input_sample_rate = input_device[“defaultSampleRate”] input_dev_index = input_device[“index”] []If this isn’t working, you can also loop through and print your devices as shown in the following code, and then use the device index to retrieve the device information with pyaudio:

print (“Available devices:n”) for i in range(0, pa.get_device_count()): info = pa.get_device_info_by_index(i) print (str(info[“index”]) + “: t %s n t %s n” % (info[“name”], p.get_host_api_info_by_index(info[“hostApi”])[“name”])) # select the correct index from the above returned list of devices, for example zero dev_index = 0 input_device = pa.get_device_info_by_index(dev_index) #set globals for microphone stream input_channel_count = input_device[“maxInputChannels”] input_sample_rate = input_device[“defaultSampleRate”] input_dev_index = input_device[“index”] []You use channel_count, sample_rate, and dev_index as parameters in a mic stream. In that stream’s callback function, you use an asyncio nonblocking thread-safe callback to put the input bytes of the mic stream into an asyncio input queue. Take note of the loop and input_queue objects created with asyncio and how they’re used in the following code:

async def mic_stream(): # This function wraps the raw input stream from the microphone forwarding # the blocks to an asyncio.Queue. loop = asyncio.get_event_loop() input_queue = asyncio.Queue() def callback(indata, frame_count, time_info, status): loop.call_soon_threadsafe(input_queue.put_nowait, indata) return (indata, pyaudio.paContinue) # Be sure to use the correct parameters for the audio stream that matches # the audio formats described for the source language you’ll be using: # print(input_device) #Open stream stream = = pyaudio.paInt16, channels = input_channel_count, rate = int(input_sample_rate), input = True, frames_per_buffer = default_frames, input_device_index = input_dev_index, stream_callback=callback) # Initiate the audio stream and asynchronously yield the audio chunks # as they become available. stream.start_stream() print(“started stream”) while True: indata = await input_queue.get() yield indata []Now when the generator function mic_stream() is called, it continually yields input bytes as long as there is microphone input data in the input queue.

[]Now that you know how to get input bytes from the microphone, let’s look at how to write Amazon Polly output audio bytes to a speaker output stream:

#text will come from MyEventsHandler def aws_polly_tts(text): response = polly.synthesize_speech( Engine = ‘standard’, LanguageCode = params[‘lang_code_for_polly’], Text=text, VoiceId = params[‘voice_id’], OutputFormat = “pcm”, ) output_bytes = response[‘AudioStream’] #play to the speakers write_to_speaker_stream(output_bytes) #how to write audio bytes to speakers def write_to_speaker_stream(output_bytes): “””Consumes bytes in chunks to produce the response’s output'””” print(“Streaming started…”) chunk_len = 1024 channels = 1 sample_rate = 16000 if output_bytes: polly_stream = format = pyaudio.paInt16, channels = channels, rate = sample_rate, output = True, ) #this is a blocking call – will sort this out with concurrent later while True: data = polly_stream.write(data) #If there’s no more data to read, stop streaming if not data: output_bytes.close() polly_stream.stop_stream() polly_stream.close() break print(“Streaming completed.”) else: print(“Nothing to stream.”) []Now let’s expand on what you built in the post Asynchronous Amazon Transcribe Streaming SDK for Python. In the following code, you create an executor object using the ThreadPoolExecutor subclass with three workers with concurrent. You then add an Amazon Translate call on the finalized returned transcript in the EventHandler and pass that translated text, the executor object, and our aws_polly_tts() function into an asyncio loop with loop.run_in_executor(), which runs our Amazon Polly function (with translated input text) asynchronously at the start of next iteration of the asyncio loop.

#use concurrent package to create an executor object with 3 workers ie threads executor = concurrent.futures.ThreadPoolExecutor(max_workers=3) class MyEventHandler(TranscriptResultStreamHandler): async def handle_transcript_event(self, transcript_event: TranscriptEvent): #If the transcription is finalized, send it to translate results = transcript_event.transcript.results if len(results) > 0: if len(results[0].alternatives) > 0: transcript = results[0].alternatives[0].transcript print(“transcript:”, transcript) print(results[0].channel_id) if hasattr(results[0], “is_partial”) and results[0].is_partial == False: #translate only 1 channel. the other channel is a duplicate if results[0].channel_id == “ch_0”: trans_result = translate.translate_text( Text = transcript, SourceLanguageCode = params[‘source_language’], TargetLanguageCode = params[‘target_language’] ) print(“translated text:” + trans_result.get(“TranslatedText”)) text = trans_result.get(“TranslatedText”) #we run aws_polly_tts with a non-blocking executor at every loop iteration await loop.run_in_executor(executor, aws_polly_tts, text) []Finally, we have the loop_me() function. In it, you define write_chunks(), which takes an Amazon Transcribe stream as an argument and asynchronously writes chunks of streaming mic input to it. You then use MyEventHandler() with the output transcription stream as its argument and create a handler object. Then you use await with asyncio.gather() and pass in the write_chunks() and handler with the handle_events() method to handle the eventual futures of these coroutines. Lastly, you gather all event loops and loop the loop_me() function with run_until_complete(). See the following code:

async def loop_me(): # Setup up our client with our chosen AWS region client = TranscribeStreamingClient(region=”us-west-2″) stream = await client.start_stream_transcription( language_code=params[‘lang_code_for_transcribe’], media_sample_rate_hz=int(device_info[“defaultSampleRate”]), number_of_channels = 2, enable_channel_identification=True, media_encoding=”pcm”, ) recorded_frames = [] async def write_chunks(stream): # This connects the raw audio chunks generator coming from the microphone # and passes them along to the transcription stream. print(“getting mic stream”) async for chunk in mic_stream(): t.tic() recorded_frames.append(chunk) await stream.input_stream.send_audio_event(audio_chunk=chunk) t.toc(“chunks passed to transcribe: “) await stream.input_stream.end_stream() handler = MyEventHandler(stream.output_stream) await asyncio.gather(write_chunks(stream), handler.handle_events()) #write a proper while loop here loop = asyncio.get_event_loop() loop.run_until_complete(loop_me()) loop.close() []When the preceding code is run together without errors, you can speak into the microphone and quickly hear your voice translated to Mandarin Chinese. The automatic language detection feature for Amazon Transcribe and Amazon Translate translates any supported input language into the target language. You can speak for quite some time and because of the non-blocking nature of the function calls, all your speech input is translated and spoken, making this an excellent tool for translating live speeches.


[]Although this post demonstrated how these three fully managed AWS APIs can function seamlessly together, we encourage you to think about how you could use these services in other ways to deliver multilingual support for services or media like multilingual closed captioning for a fraction of the current cost. Medicine, business, and even diplomatic relations could all benefit from an ever-improving, low-cost, low-maintenance translation service.

[]For more information about the proof of concept code base for this use case check out our Github.

About the Authors

[]Michael Tran is a Solutions Architect with Envision Engineering team at Amazon Web Services. He provides technical guidance and helps customers accelerate their ability to innovate through showing the art of the possible on AWS. He has built multiple prototypes around AI/ML, and IoT for our customers. You can contact me @Mike_Trann on Twitter.

[]Cameron Wilkes is a Prototyping Architect on the AWS Industry Accelerator team. While on the team he delivered several ML based prototypes to customers to demonstrate the “Art of the Possible” of ML on AWS. He enjoys music production, off-roading and design.


Continue Reading


Use Amazon SageMaker Data Wrangler in Amazon SageMaker Studio with a default lifecycle configuration

If you use the default lifecycle configuration for your domain or user profile in Amazon SageMaker Studio and use Amazon SageMaker Data Wrangler for data preparation, then this post is for you. In this post, we show how you can create a Data Wrangler flow and use it for data preparation in a Studio environment…




If you use the default lifecycle configuration for your domain or user profile in Amazon SageMaker Studio and use Amazon SageMaker Data Wrangler for data preparation, then this post is for you. In this post, we show how you can create a Data Wrangler flow and use it for data preparation in a Studio environment with a default lifecycle configuration.

Data Wrangler is a capability of Amazon SageMaker that makes it faster for data scientists and engineers to prepare data for machine learning (ML) applications via a visual interface. Data preparation is a crucial step of the ML lifecycle, and Data Wrangler provides an end-to-end solution to import, explore, transform, featurize, and process data for ML in a visual, low-code experience. It lets you easily and quickly connect to AWS components like Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, and AWS Lake Formation, and external sources like Snowflake and DataBricks DeltaLake. Data Wrangler supports standard data types such as CSV, JSON, ORC, and Parquet.

Studio apps are interactive applications that enable Studio’s visual interface, code authoring, and run experience. App types can be either Jupyter Server or Kernel Gateway:

  • Jupyter Server – Enables access to the visual interface for Studio. Every user in Studio gets their own Jupyter Server app.
  • Kernel Gateway – Enables access to the code run environment and kernels for your Studio notebooks and terminals. For more information, see Jupyter Kernel Gateway.

Lifecycle configurations (LCCs) are shell scripts to automate customization for your Studio environments, such as installing JupyterLab extensions, preloading datasets, and setting up source code repositories. LCC scripts are triggered by Studio lifecycle events, such as starting a new Studio notebook. To set a lifecycle configuration as the default for your domain or user profile programmatically, you can create a new resource or update an existing resource. To associate a lifecycle configuration as a default, you first need to create a lifecycle configuration following the steps in Creating and Associating a Lifecycle Configuration

Note: Default lifecycle configurations set up at the domain level are inherited by all users, whereas those set up at the user level are scoped to a specific user. If you apply both domain-level and user profile-level lifecycle configurations at the same time, the user profile-level lifecycle configuration takes precedence and is applied to the application irrespective of what lifecycle configuration is applied at the domain level. For more information, see Setting Default Lifecycle Configurations.

Data Wrangler accepts the default Kernel Gateway lifecycle configuration, but some of the commands defined in the default Kernel Gateway lifecycle configuration aren’t applicable to Data Wrangler, which can cause Data Wrangler to fail to start. The following screenshot shows an example of an error message you might get when launching the Data Wrangler flow. This may happen only with default lifecycle configurations and not with lifecycle configurations.

Data Wrangler Error

Solution overview

Customers using the default lifecycle configuration in Studio can follow this post and use the supplied code block within the lifecycle configuration script to launch a Data Wrangler app without any errors.

Set up the default lifecycle configuration

To set up a default lifecycle configuration, you must add it to the DefaultResourceSpec of the appropriate app type. The behavior of your lifecycle configuration depends on whether it’s added to the DefaultResourceSpec of a Jupyter Server or Kernel Gateway app:

  • Jupyter Server apps – When added to the DefaultResourceSpec of a Jupyter Server app, the default lifecycle configuration script runs automatically when the user logs in to Studio for the first time or restarts Studio. You can use this to automate one-time setup actions for the Studio developer environment, such as installing notebook extensions or setting up a GitHub repo. For an example of this, see Customize Amazon SageMaker Studio using Lifecycle Configurations.
  • Kernel Gateway apps – When added to the DefaultResourceSpec of a Kernel Gateway app, Studio defaults to selecting the lifecycle configuration script from the Studio launcher. You can launch a notebook or terminal with the default script or choose a different one from the list of lifecycle configurations.

A default Kernel Gateway lifecycle configuration specified in DefaultResourceSpec applies to all Kernel Gateway images in the Studio domain unless you choose a different script from the list presented in the Studio launcher.

When you work with lifecycle configurations for Studio, you create a lifecycle configuration and attach it to either your Studio domain or user profile. You can then launch a Jupyter Server or Kernel Gateway application to use the lifecycle configuration.

The following table summarizes these errors you may encounter when launching a Data Wrangler application with default lifecycle configurations.

Level at Which the Lifecycle Configuration Is Applied

Create Data Wrangler Flow Works (or) Error

Domain Bad Request Error Apply the script (see below)
User Profile Bad Request Error Apply the script (see below)
Application Works—No issue Not required

When you use the default lifecycle configuration associated with Studio and Data Wrangler (Kernel Gateway app), you might encounter Kernel Gateway app failure. In this post, we demonstrate how to set the default lifecycle configuration properly to exclude running commands in a Data Wrangler application so you don’t encounter Kernel Gateway app failure.

Let’s say you want to install a git-clone-repo script as the default lifecycle configuration that checks out a Git repository under the user’s home folder automatically when the Jupyter server starts. Let’s look at each scenario of applying a lifecycle configuration (Studio domain, user profile, or application level).

Apply lifecycle configuration at the Studio domain or user profile level

To apply the default Kernel Gateway lifecycle configuration at the Studio domain or user profile level, complete the steps in this section. We start with instructions for the user profile level.

In your lifecycle configuration script, you have to include the following code block that checks and skips the Data Wrangler Kernel Gateway app:

#!/bin/bash set -eux STATUS=$( python3 -c “import sagemaker_dataprep” echo $? ) if [ “$STATUS” -eq 0 ]; then echo ‘Instance is of Type Data Wrangler’ else echo ‘Instance is not of Type Data Wrangler’ fi

For example, let’s use the following script as our original (note that the folder to clone the repo is changed to /root from /home/sagemaker-user):

# Clones a git repository into the user’s home folder #!/bin/bash set -eux # Replace this with the URL of your git repository export REPOSITORY_URL=”” git -C /root clone $REPOSITORY_URL

The new modified script looks like the following:

#!/bin/bash set -eux STATUS=$( python3 -c “import sagemaker_dataprep” echo $? ) if [ “$STATUS” -eq 0 ]; then echo ‘Instance is of Type Data Wrangler’ else echo ‘Instance is not of Type Data Wrangler’ # Replace this with the URL of your git repository export REPOSITORY_URL=”” git -C /root clone $REPOSITORY_URL fi

You can save this script as

Now you run a series of commands in your terminal or command prompt. You should configure the AWS Command Line Interface (AWS CLI) to interact with AWS. If you haven’t set up the AWS CLI, refer to Configuring the AWS CLI.

  1. Convert your file into Base64 format. This requirement prevents errors due to the encoding of spacing and line breaks. LCC_GIT=openssl base64 -A -in /Users/abcde/Downloads/
  2. Create a Studio lifecycle configuration. The following command creates a lifecycle configuration that runs on launch of an associated Kernel Gateway app: aws sagemaker create-studio-lifecycle-config —region us-east-2 —studio-lifecycle-config-name lcc-git —studio-lifecycle-config-content $LCC_GIT —studio-lifecycle-config-app-type KernelGateway
  3. Use the following API call to create a new user profile with an associated lifecycle configuration: aws sagemaker create-user-profile –domain-id d-vqc14vvvvvvv –user-profile-name test –region us-east-2 –user-settings ‘{ “KernelGatewayAppSettings”: { “LifecycleConfigArns” : [“arn:aws:sagemaker:us-east-2:000000000000:studio-lifecycle-config/lcc-git”], “DefaultResourceSpec”: { “InstanceType”: “ml.m5.xlarge”, “LifecycleConfigArn”: “arn:aws:sagemaker:us-east-2:00000000000:studio-lifecycle-config/lcc-git” } } }’

    Alternatively, if you want to create a Studio domain to associate your lifecycle configuration at the domain level, or update the user profile or domain, you can follow the steps in Setting Default Lifecycle Configurations.

  4. Now you can launch your Studio app from the SageMaker Control Panel.control Panel
  5. In your Studio environment, on the File menu, choose New and Data Wrangler Flow.The new Data Wrangler flow should open without any issues.
    New Data Wrangler Flow
  6. To validate the Git clone, you can open a new Launcher in Studio.
  7. Under Notebooks and compute resources, choose the Python 3 notebook and the Data Science SageMaker image to start your script as your default lifecycle configuration script.
    Notebook and Compute

You can see the Git cloned to /root in the following screenshot.

Git cloned to /root

We have successfully applied the default Kernel lifecycle configuration at the user profile level and created a Data Wrangler flow. To configure at the Studio domain level, the only change is instead of creating a user profile, you pass the ARN of the lifecycle configuration in a create-domain call.

Apply lifecycle configuration at the application level

If you apply the default Kernel Gateway lifecycle configuration at the application level, you won’t have any issues because Data Wrangler skips the lifecycle configuration applied at the application level.


In this post, we showed how to configure your default lifecycle configuration properly for Studio when you use Data Wrangler for data preparation and visualization requirements.

To summarize, if you need to use the default lifecycle configuration for Studio to automate customization for your Studio environments and use Data Wrangler for data preparation, you can apply the default Kernel Gateway lifecycle configuration at the user profile or Studio domain level with the appropriate code block included in your lifecycle configuration so that the default lifecycle configuration checks it and skips the Data Wrangler Kernel Gateway app.

For more information, see the following resources:

About the Authors

Rajakumar Sampathkumar is a Principal Technical Account Manager at AWS, providing customers guidance on business-technology alignment and supporting the reinvention of their cloud operation models and processes. He is passionate about cloud and machine learning. Raj is also a machine learning specialist and works with AWS customers to design, deploy, and manage their AWS workloads and architectures.

Vicky Zhang is a Software Development Engineer at Amazon SageMaker. She is passionate about problem solving. In her spare time, she enjoys watching detective movies and playing badminton.

Rahul Nabera is a Data Analytics Consultant in AWS Professional Services. His current work focuses on enabling customers build their data and machine learning workloads on AWS. In his spare time, he enjoys playing cricket and volleyball.


Continue Reading


AWS Week in Review – July 4, 2022

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS! Summer has arrived in Finland, and these last few days have been hotter than in the Canary Islands! Today in the US it is Independence Day. I hope that if…




This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Summer has arrived in Finland, and these last few days have been hotter than in the Canary Islands! Today in the US it is Independence Day. I hope that if you are celebrating, you’re having a great time. This week I’m very excited about some developer experience and artificial intelligence launches.

Last Week’s Launches
Here are some launches that got my attention during the previous week:

AWS SAM Accelerate is now generally available – SAM Accelerate is a new capability of the AWS Serverless Application Model CLI, which makes it easier for serverless developers to test code changes against the cloud. You can do a hot swap of code directly in the cloud when making a change in your local development environment. This allows you to develop applications faster. Learn more about this launch in the What’s New post.

Amplify UI for React is generally available – Amplify UI is an open-source UI library that helps developers build cloud-native applications. Amplify UI for React comes with over 35 components that you can use, an authentication component that allows you to connect to your backend with no extra configuration, theming for your components. You can also build your UI using Figma. Check the Amplify UI for React site to learn more about all the capabilities offered.

Amazon Connect has new announcements – First, Amazon Connect added support to personalize the flows of the customer experience using Amazon Lex sentiment analysis. It also added support to branch out the flows depending on Amazon Lex confidence scores. Lastly, it added confidence scores to Amazon Connect Customer Profiles to help companies merge duplicate customer records.

Amazon QuickSight – QuickSight authors can now learn and experience Q before signing up. Authors can choose from six different sample topics and explore different visualizations. In addition, QuickSight now supports Level Aware Calculations (LAC) and rolling date functionality. These two new features bring flexibility and simplification to customers to build advanced calculation and dashboards.

Amazon SageMaker – RStudio on SageMaker now allows you to bring your own development environment in a custom image. RStudio on SageMaker is a fully managed RStudio Workbench in the cloud. In addition, SageMaker added four new tabular data modeling algorithms: LightGBM, CatBoost, AutoGluon-Tabular, and TabTransformer to the existing set of built-in algorithms, pre-trained models and pre-built solution templates it provides.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Some other updates and news that you may have missed:

AWS Support announced an improved experience when creating a case – There is a new interface for creating support cases in the AWS Support Center console. Now you can create a case with a simplified three-step process that guides you through the flow. Learn more about this new process in the What’s new post.

New AWS Step Functions workflows collection on Serverless Land – The Step Functions workflows collection is a new experience that makes it easier to discover, deploy, and share AWS Step Functions workflows. In this collection, you can find opinionated templates that implement the best practices to build using Step Functions. Learn more about this new collection in Ben’s blog post.

Podcast Charlas Técnicas de AWS – If you understand Spanish, this podcast is for you. Podcast Charlas Técnicas is one of the official AWS Podcasts in Spanish, which shares a new episode ever other week. The podcast is meant for builders, and it shares stories about how customers implement and learn AWS, how to architect applications, and how to use new services. You can listen to all the episodes directly from your favorite podcast app or from the AWS Podcasts en español website.

AWS open-source news and updates – A newsletter curated by my colleague Ricardo brings you the latest open-source projects, posts, events, and more.

Upcoming AWS Events
Check your calendars and sign up for these AWS events:

AWS Summit New York – Join us on July 12 for the in-person AWS Summit. You can register on the AWS Summit page for free.

AWS re:Inforce – This is an in-person learning conference with a focus on security, compliance, identity, and privacy. You can register now to access hundreds of technical sessions, and other content. It will take place July 26 and 27 in Boston, MA.

That’s all for this week. Check back next Monday for another Week in Review!

— Marcia


Continue Reading


Copyright © 2021 Today's Digital.