

How Imperva expedites ML development and collaboration via Amazon SageMaker notebooks




This is a guest post by Imperva, a solutions provider for cybersecurity.

Imperva is a cybersecurity leader, headquartered in California, USA, whose mission is to protect data and all paths to it. In the last few years, we’ve been working on integrating machine learning (ML) into our products. This includes detecting malicious activities in databases, automatically configuring security policies, and clustering security events into meaningful stories.

As we push to advance our detection capabilities, we’re investing in ML models for our solutions. For example, Imperva provides an API Security service. This service aims to protect all APIs from various attacks, including attacks that a traditional web application firewall (WAF) can’t easily stop, such as those described in the OWASP top 10. This is a significant investment area for us, so we took steps to speed up our ML development process in order to cover more ground, research API attacks efficiently, and deliver value to our customers faster.

In this post, we share how we expedited ML development and collaboration via Amazon SageMaker notebooks.

Jupyter Notebooks: The common research ground

Data science research has drawn the attention of big tech companies and the development community to new heights. It’s now easier than ever to kick off a data-driven project using managed ML services. A great example of this is the rise of citizen data scientists, which according to Gartner are “power users who can perform both simple and moderately sophisticated analytical tasks that would previously have required more expertise.”

With the expected growth of ML users, sharing experiments across teams becomes a critical factor in development velocity. For most data scientists, one of the first steps in kicking off a project is to open a new Jupyter notebook and dive into the challenge ahead.

Jupyter notebooks are a cross between an IDE and a document. They give the researcher an easy and interactive way to test different approaches, plot the results, and present and export them, while using a language and interface of their choice such as Python, R, Spark, Bash, or others.

Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy ML models. SageMaker includes this exact capability and more as part of its SageMaker notebooks feature.

Anyone who has tried to use Jupyter notebooks in a team has probably reached a point where they attempted to use a notebook belonging to someone else, only to find out it’s not as easy as it sounds. Often, you just don’t have access to the required notebook. On other occasions, notebooks are used locally for research, so the code is often littered with hardcoded paths and isn’t committed to any repository. Even if the code is committed to a repository of some sort, the data it requires (hopefully) isn’t. To sum things up, it ain’t easy to collaborate with Jupyter notebooks.

In this post, we show you how we share data science research code at Imperva, and how we use SageMaker notebooks with additional features we’ve added to support our custom requirements and enhance collaboration. We also share how all these efforts have led to a significant reduction in costs and time spent on housekeeping. Although this architecture is a good fit for us, you can choose different configurations, such as complete resource isolation with a separate file system for each user.

How we expedited our ML development

Our workflow is pretty standard: we take a subset of data, load it into a Jupyter notebook, and start exploring the data. After we have a decent understanding of the data, we start experimenting and combining different algorithms until we come up with a decent initial solution. When we have a good enough proof of concept (POC), we proceed to validate the results over time, experimenting and adjusting the algorithm as we go. Eventually, when we reach a high level of confidence, we deliver the model and continue to validate the results.

At first this process made perfect sense. We had small projects that didn’t require much computing power, and we had enough time to work on them solo until we reached a POC. The projects were simple enough for us to deploy, serve, and monitor the model ourselves, or in other cases, deliver the model as a Docker container. When performance and scale were important, we would pass ownership of the model to a dev team using a specification document with pseudo-code. But times change, and as the team and projects grew, we needed a better way to do things. We had to scale our projects when massive computing resources were required, and find a better way to pass ownership without using dull and extensive specification documents.

Furthermore, when everyone uses some remote virtual machine or Amazon Elastic Compute Cloud (Amazon EC2) instance to run their Jupyter notebooks, their projects tend to lack documentation and get messy.

SageMaker notebooks

In come SageMaker notebooks: a managed Jupyter notebooks platform hosted on AWS, where you can easily create a notebook instance, which is an EC2 (virtual computer) instance that runs a Jupyter notebook server. Besides the notebook now being in the cloud and accessible from everywhere, you can easily rescale the notebook instance, giving it as many computing resources as you require.

Having virtually unlimited computing resources is great, but it wasn’t why we decided to start using SageMaker notebooks. We can summarize the objectives we wanted to achieve into three main points:

  • Making research easier – Creating an easy, user-friendly work environment that can be quickly accessed and shared within the research team.
  • Organizing data and code – Cutting the mess by making it easier to access data and creating a structured way to keep code.
  • Delivering projects – Creating a better way to separate research playground and production, and finding a better way to share our ideas with development teams without using extensive, dull documents.

Easier research

SageMaker notebooks reside in the cloud, making them inherently accessible from almost anywhere. Starting a Jupyter notebook takes just a few minutes and all your output from the previous run is saved, making it very simple to jump right back in. However, our research requirements included a few additional aspects that needed a solution:

  • Quick views – Having the notebooks available at all times in order to review results of previous runs. If the instance where you keep your code is down, you have to start it just to look at the output. This can be frustrating, especially if you’re using an expensive instance and you just want to look at your results. Solving this cut the time each team member spent waiting for an instance to start from 5–15 minutes to 0.
  • Shared views – Having the ability to explore cross-instance notebooks. SageMaker notebook instances are provided with dedicated storage by default. We wanted to break this wall and enable the team to work together.
  • Persistent libraries – Libraries are stored temporarily in SageMaker notebook instances. We wanted to change that to eliminate the time it takes to reinstall all the required libraries, from approximately 5 minutes down to 0.
  • Cost-effective service – Optimizing costs while minimizing researchers’ involvement. By default, turning an instance on and off is done manually. This could lead to unnecessary charges caused by human error.

To bridge the gap between the default SageMaker configuration and what we were looking for, we used just two main ingredients: Amazon Elastic File System (Amazon EFS) and lifecycle configuration in SageMaker. The first, as the name implies, is a file system. The second is basically a script that runs when the notebook instance is first created or each time it starts.

Shared and quick views

We connected this file system to all our notebook instances so that they all have a shared file system. This way we can save our code in Amazon EFS, instead of using the notebook instance’s file system, and access it from any notebook instance.

This made things easier because we can now create a read-only, small, super cheap notebook instance (for this post, let’s call it the viewer instance) that always stays on, and use it to easily access code and results without needing to start the notebook instance that ran the code. Furthermore, we can now easily share code between ourselves because it’s stored in a shared location instead of being kept in multiple different notebook instances.

So, how do you actually connect a file system to a notebook instance?

We created a lifecycle configuration that connects an EFS to a notebook instance, and attached this configuration to every notebook instance we wanted to be part of the shared environment.
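As a rough illustration of that step, the following sketch registers a start-up script as a lifecycle configuration and attaches it to a set of notebook instances using boto3. The function names, the configuration name, and the instance names are hypothetical; only the boto3 operations `create_notebook_instance_lifecycle_config` and `update_notebook_instance` are real API calls, and we assume the instances are stopped when the configuration is attached.

```python
import base64


def encode_script(script_text: str) -> str:
    """Base64-encode a shell script, the form the SageMaker API expects."""
    return base64.b64encode(script_text.encode("utf-8")).decode("ascii")


def build_lifecycle_request(name: str, on_start_script: str) -> dict:
    """Build keyword arguments for create_notebook_instance_lifecycle_config."""
    return {
        "NotebookInstanceLifecycleConfigName": name,
        "OnStart": [{"Content": encode_script(on_start_script)}],
    }


def register_and_attach(name: str, on_start_script: str, instance_names: list) -> None:
    """Create the configuration, then attach it to each notebook instance."""
    import boto3  # imported lazily so the helpers above work without AWS access

    sagemaker = boto3.client("sagemaker")
    sagemaker.create_notebook_instance_lifecycle_config(
        **build_lifecycle_request(name, on_start_script)
    )
    for instance in instance_names:
        # Assumption: the instance is stopped, which updates normally require.
        sagemaker.update_notebook_instance(
            NotebookInstanceName=instance, LifecycleConfigName=name
        )
```

Keeping the request construction separate from the AWS calls makes it easy to inspect exactly what would be sent before touching any instance.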

In this section, we walk you through the lifecycle configuration script we wrote, or to be more accurate, stole shamelessly from the examples provided by AWS and mashed together.

The following script prefix is standard boilerplate:

Now we connect the notebook to Amazon EFS. Make sure you know the EFS instance’s name:

    mkdir -p /home/ec2-user/SageMaker/efs
    sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport $EFS_NAME:/ /home/ec2-user/SageMaker/efs
    sudo chmod go+rw /home/ec2-user/SageMaker/efs

Persistent and cost-effective service

After we connected the file system, we started thinking about how we work with notebooks. Because AWS charges for every hour the instance is running, we decided it would be good practice to automatically shut down the SageMaker notebook if it’s idle for a while. We started with a default value of 1 hour, but by using the instance’s tags, users can set any value that suits them from the SageMaker GUI. The default 1-hour setting can be defined as a global lifecycle configuration, and the override as a local lifecycle configuration. This policy effectively prevented researchers from accidentally leaving unused instances on, reducing the cost of SageMaker instances by 25%.
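The override rule (use the instance’s `idle` tag when it is set, otherwise fall back to one hour) can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, and we assume the tags arrive as the list of `Key`/`Value` pairs that SageMaker’s list-tags call returns.

```python
DEFAULT_IDLE_SECONDS = 3600  # the 1-hour default described above


def resolve_idle_seconds(tags, default=DEFAULT_IDLE_SECONDS):
    """Return the idle timeout (in seconds) from a list of Key/Value tags.

    Falls back to the default when the "idle" tag is missing or is not a
    positive whole number.
    """
    for tag in tags:
        if tag.get("Key") == "idle":
            value = str(tag.get("Value", "")).strip()
            if value.isdigit() and int(value) > 0:
                return int(value)
    return default
```

Rejecting non-numeric values is a design choice of this sketch, so that a typo in a tag can’t silently disable the automatic shutdown.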


    # get instance tags location
    NOTEBOOK_ARN=$(jq '.ResourceArn' /opt/ml/metadata/resource-metadata.json --raw-output)
    # extract idle time parameter value from tags list
    IDLE_TIME=$(aws sagemaker list-tags --resource-arn "$NOTEBOOK_ARN" | jq --raw-output '.Tags[] | select(.Key=="idle") | .Value')
    # in case idle time not specified set to one hour (3600 sec)
    [[ -z "$IDLE_TIME" ]] && IDLE_TIME=3600
    # fetch the auto stop script from AWS samples repo
    wget
    # starting the SageMaker autostop script in cron
    (crontab -l 2>/dev/null; echo "*/5 * * * * /usr/bin/python $PWD/ --time $IDLE_TIME --ignore-connections") | crontab -

So now the notebook is connected to Amazon EFS and automatically shuts down when idle. But this raised another issue: by default, Python libraries in SageMaker notebook instances are installed in ephemeral storage, meaning they get deleted when the instance is stopped and have to be reinstalled the next time the instance is started. This means we have to reinstall libraries at least once a day, which isn’t the best experience and can take anywhere from a few seconds to a few minutes per package. We decided to add a script that changes this behavior and makes all library installations persistent by changing the Python library installation path to the notebook instance’s storage (Amazon Elastic Block Store), effectively eliminating any time wasted on reinstalling packages.

This script runs every time the notebook instance starts, installs miniconda and some basic Python libraries in the persistent storage, and activates miniconda:

    sudo -u ec2-user -i <<'EOF'
    unset SUDO_UID
    # use an address within the notebook instance's file system
    WORKING_DIR=/home/ec2-user/SageMaker/custom-miniconda
    # if this is the first time the lifecycle config is running - install miniconda
    if [ ! -d "$WORKING_DIR" ]; then
      mkdir -p "$WORKING_DIR"
      # download miniconda
      wget -O "$WORKING_DIR/"
      # install miniconda
      bash "$WORKING_DIR/" -b -u -p "$WORKING_DIR/miniconda"
      # delete miniconda installer
      rm -rf "$WORKING_DIR/"
      # create a custom conda environment
      source "$WORKING_DIR/miniconda/bin/activate"
      KERNEL_NAME="custom_python"
      PYTHON="3.9"
      conda create --yes --name "$KERNEL_NAME" python="$PYTHON"
      conda activate "$KERNEL_NAME"
      pip install --quiet ipykernel
      conda install --yes numpy
      # scikit-learn is the maintained package name; the "sklearn" alias on PyPI is deprecated
      pip install --quiet boto3 pandas matplotlib scikit-learn dill
    fi
    # activate miniconda
    source "$WORKING_DIR/miniconda/bin/activate"
    # register every conda environment as a Jupyter kernel
    for env in $WORKING_DIR/miniconda/envs/*; do
      BASENAME=$(basename "$env")
      source activate "$BASENAME"
      python -m ipykernel install --user --name "$BASENAME" --display-name "Custom ($BASENAME)"
    done
    # disable SageMaker-provided Conda functionality, leaving in only what we've installed
    echo "c.EnvironmentKernelSpecManager.use_conda_directly = False" >> /home/ec2-user/.jupyter/
    rm /home/ec2-user/.condarc
    EOF

Quick restart and we’re done!

    # restart the Jupyter server
    restart jupyter-server

Data and code organization

Remember the EFS that we just talked about? It’s here for more.

After storing all our code in the same location, we thought it might be better to organize it a bit.

We decided that each team member should create their own notebook instance that only they use. However, instead of using the instance’s file system, we use Amazon EFS and implement the following hierarchy:

    Team member




This way we can all easily access each other’s code, but we still know what belongs to whom.

But what about completed projects? We decided to add an additional branch for projects that have been fully documented and delivered:

    Team member

    Completed projects




So now that our code is organized neatly, how do we access our data?

We keep our data in Amazon Simple Storage Service (Amazon S3) and access it via Amazon Athena. This made it very easy to set a role for our notebook instances with permissions to access Athena and Amazon S3. With that role in place, a few lines of code are enough to query Athena and pull data to work on, without messing around with credentials.
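Those “few lines of code” might look something like the following sketch. It assumes the notebook instance’s role already grants access to Athena and to the S3 output location; the function names and all arguments (SQL text, database name, output location) are hypothetical placeholders, while the boto3 Athena operations are real. Only the first page of results is read here, which is enough for small exploratory queries.

```python
import time


def rows_to_dicts(result_set: dict) -> list:
    """Convert an Athena ResultSet into a list of {column: value} dicts.

    The first row of a SELECT result holds the column names; empty cells
    (returned as {}) become None.
    """
    rows = result_set.get("Rows", [])
    if not rows:
        return []
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    return [
        dict(zip(header, (cell.get("VarCharValue") for cell in row["Data"])))
        for row in rows[1:]
    ]


def run_query(sql: str, database: str, output_location: str, poll_seconds: float = 1.0) -> list:
    """Run a query with the instance role's credentials and return its first page of rows."""
    import boto3  # credentials come from the notebook instance's role

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if status["State"] != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {status['State']}")
    results = athena.get_query_results(QueryExecutionId=query_id)
    return rows_to_dicts(results["ResultSet"])
```

The returned list of dictionaries can be passed straight to `pandas.DataFrame` for exploration.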

On top of that, we created a dedicated network using Amazon Virtual Private Cloud (Amazon VPC), which gave the notebook instances access to our internal Git repository and private PyPI repository. This made it easy to access useful internal code and packages. The following diagram shows how it all looks in our notebooks platform.



Finally, how do we utilize these notebooks to easily deliver projects?

One of the great things about Jupyter notebooks is that, in addition to writing code and displaying the output, you can easily add text and headlines, thereby creating an interactive document.

In the next few paragraphs, we describe our delivery processes when we hand over the model to a dev team, and when we deploy the model ourselves.

On projects where scale, performance, and reliability are a high priority, we hand over the model to be rewritten by a dev team. After we reach a mature POC, we share the notebook with the developers assigned to the project using the previously mentioned read-only notebook instance.

The developers can now read the document, see the input and output for each block of code, and have a better understanding of how it works and why, which makes it easier for them to implement. In the past, we had to write a specification document for these types of cases, which basically means rewriting the code as pseudo-code with lots of comments and explanations. Now we can simply integrate our comments and explanations into the SageMaker notebook, which saves many days of work on each project.

On projects that don’t require a dev team to rewrite the code, we reorganize the code inside a Docker container, and deploy it in a Kubernetes cluster. Although it might seem like a hassle to transform code from a notebook into a Dockerized, standard Python project, this process has its own benefits:

  • Explainability and visibility – Instead of explaining what your algorithm does by diving through your messy project, you can just use the notebook you worked on during the research phase.
  • Purpose separation – The research code is in the notebook, and the production code is in the Python project. You can keep researching without touching the production code and only update it when you’ve had a breakthrough.
  • Debuggability – If your model runs into trouble, you can easily debug it in the notebook.
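As a small illustration of the first step in that transformation, the sketch below pulls the code cells out of a notebook file so they can be reorganized into a Python project. The function name is hypothetical and this is not necessarily how we do it; in practice `jupyter nbconvert --to script` does the same job. The sketch only relies on the notebook file being JSON with a `cells` list, and it doesn’t handle notebook magics such as `%matplotlib`.

```python
import json


def extract_code_cells(notebook_json: str) -> str:
    """Return the source of all non-empty code cells, separated by blank lines."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        if isinstance(source, list):  # the format allows a list of lines
            source = "".join(source)
        if source.strip():
            chunks.append(source.rstrip())
    return "\n\n".join(chunks) + "\n"
```

The markdown cells stay behind in the notebook, which is exactly the purpose separation described above: explanations live with the research, and only code moves to production.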

What’s next

Jupyter notebooks provide a great playground for data scientists. On a smaller scale, they’re very convenient to use on your local machine. However, when you start working on larger projects in larger teams, there are many advantages to moving to a managed Jupyter notebooks server. The great thing about SageMaker notebooks is that you can customize your notebook instances, including instance size, code sharing, automation scripts, kernel selection, and more, which helps you save tremendous amounts of time and money.

Simply put, we created a process that expedites ML development and collaboration while reducing the cost of SageMaker notebooks by at least 25%, and reducing the overhead time researchers spend on installations and waiting for instances to be ready to work.

Our current SageMaker notebooks environment contains the following:

  • Managed Jupyter notebook instances
  • Separate, customizable computing instances for each user
  • Shared file system used to organize projects and easily share code with peers
  • Lifecycle configurations that reduce costs and make it easier to start working
  • Connection to data sources, code repositories, and package indexes

We plan on making this environment even better by adding a few additional features:

  • Cost monitoring – To monitor our budget, we’ll add a special tag to each instance in order to track their cost.
  • Auto save state – We’ll create a lifecycle configuration that automatically saves a notebook’s state, allowing users to easily restore the notebook’s state even after it was shut down.
  • Restricted permissions system – We want to enable users from different groups to participate in our research and explore our data by letting them create notebook instances and access our data, but under predefined restrictions. For example, they’ll only be able to create small, inexpensive notebook instances, and access only a part of the data.

As a next step, we encourage you to try out SageMaker notebooks. For more examples, check out the SageMaker examples GitHub repo.

About the Authors

Matan Lion is the Data Science team leader at Imperva’s Threat Research Group. His team is responsible for delivering data-driven solutions and cybersecurity innovation across the company’s product portfolio, including application and data security frontlines, leveraging big data and machine learning.

Johnathan Azaria is a Data Scientist and a member of Imperva Research Labs, a premier research organization for security analysis, vulnerability discovery, and compliance expertise. Prior to the data science role, Johnathan was a security researcher specializing in network and application-based attacks. Johnathan holds a B.Sc and an M.Sc in Bioinformatics from Bar Ilan University.

Yaniv Vaknin is a Machine Learning Specialist at Amazon Web Services. Prior to AWS, Yaniv held leadership positions with AI startups and enterprises, including as a co-founder and CEO. Yaniv works with AWS customers to harness the power of machine learning to solve real-world tasks and derive value. In his spare time, Yaniv enjoys playing soccer with his boys.




Customize pronunciation using lexicons in Amazon Polly





Amazon Polly is a text-to-speech service that uses advanced deep learning technologies to synthesize natural-sounding human speech. It is used in a variety of use cases, such as contact center systems, delivering conversational user experiences with human-like voices for automated real-time status check, automated account and billing inquiries, and by news agencies like The Washington Post to allow readers to listen to news articles.

As of today, Amazon Polly provides over 60 voices in 30+ language variants. Amazon Polly also uses context to pronounce certain words differently based upon the verb tense and other contextual information. For example, “read” in “I read a book” (past tense) and “I will read a book” (future tense) is pronounced differently.

However, in some situations you may want to customize the way Amazon Polly pronounces a word. For example, you may need to match the pronunciation with local dialect or vernacular. Names of things (e.g., Tomato can be pronounced as tom-ah-to or tom-ay-to), people, streets, or places are often pronounced in many different ways.

In this post, we demonstrate how you can leverage lexicons for creating custom pronunciations. You can apply lexicons for use cases such as publishing, education, or call centers.

Customize pronunciation using SSML tag

Let’s say you stream a popular podcast from Australia and you use the Amazon Polly Australian English (Olivia) voice to convert your script into human-like speech. In one of your scripts, you want to use words that are unknown to the Amazon Polly voice. For example, you want to send Mātariki (Māori New Year) greetings to your New Zealand listeners. For such scenarios, Amazon Polly supports phonetic pronunciation, which you can use to achieve a pronunciation that is close to the correct pronunciation in the foreign language.

You can use the Speech Synthesis Markup Language (SSML) <phoneme> tag to suggest a phonetic pronunciation in its ph attribute. Let’s look at how you can use this SSML tag.

First, log in to your AWS console and search for Amazon Polly in the search bar at the top. Select Amazon Polly and then choose the Try Polly button.

In the Amazon Polly console, select Australian English from the language dropdown, enter the following text in the Input text box, and then choose Listen to test the pronunciation.

I’m wishing you all a very Happy Mātariki.

Sample speech without applying phonetic pronunciation:

If you listen to the sample speech above, you’ll notice that the pronunciation of Mātariki, a word which is not part of Australian English, isn’t quite spot-on. Now, let’s look at how in such scenarios we can use phonetic pronunciation with the SSML <phoneme> tag to customize the speech produced by Amazon Polly.

To use SSML tags, turn on the SSML option in the Amazon Polly console. Then copy and paste the following SSML script, which contains the phonetic pronunciation for Mātariki specified inside the ph attribute of the <phoneme> tag.

    <speak>I’m wishing you all a very Happy <phoneme alphabet="…" ph="…">Mātariki</phoneme>.</speak>

With the <phoneme> tag, Amazon Polly uses the pronunciation specified by the ph attribute instead of the standard pronunciation associated by default with the language used by the selected voice.
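Outside the console, the same idea can be sketched with boto3. The helper names below are hypothetical; `synthesize_speech` with `TextType="ssml"` is the real Polly operation. Because the Mātariki transcription isn’t reproduced here, the usage note borrows the “pecan” IPA transcription from the AWS SSML documentation as a stand-in.

```python
from xml.sax.saxutils import escape, quoteattr


def phoneme_ssml(before: str, word: str, ph: str, after: str = "", alphabet: str = "ipa") -> str:
    """Wrap one word in an SSML <phoneme> tag inside a <speak> document."""
    if alphabet not in ("ipa", "x-sampa"):
        raise ValueError("alphabet must be 'ipa' or 'x-sampa'")
    return (
        "<speak>"
        f"{escape(before)}"
        f"<phoneme alphabet={quoteattr(alphabet)} ph={quoteattr(ph)}>{escape(word)}</phoneme>"
        f"{escape(after)}"
        "</speak>"
    )


def synthesize(ssml: str, voice_id: str = "Olivia", path: str = "speech.mp3") -> None:
    """Send the SSML to Amazon Polly and save the audio (requires AWS credentials)."""
    import boto3

    polly = boto3.client("polly")
    response = polly.synthesize_speech(
        Text=ssml,
        TextType="ssml",
        VoiceId=voice_id,
        Engine="neural",  # assumption: the chosen voice is a neural voice
        OutputFormat="mp3",
    )
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())
```

For example, `phoneme_ssml("You say, ", "pecan", "pɪˈkɑːn", ".")` produces a document you could pass to `synthesize`. Escaping the surrounding text keeps characters such as `&` from breaking the SSML.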

Sample speech after applying phonetic pronunciation:

If you listen to the sample, you’ll notice that we opted for a different pronunciation for some of the vowels (e.g., ā) to make Amazon Polly synthesize sounds that are closer to the correct pronunciation. Now you might ask: how do I generate the phonetic transcription for the word Mātariki?

You can create phonetic transcriptions by referring to the Phoneme and Viseme tables for the supported languages. In the example above we have used the phonemes for Australian English.

Amazon Polly supports two phonetic alphabets: IPA and X-SAMPA. The benefit of X-SAMPA is that it uses only standard ASCII characters, so it is easier to type the phonetic transcription on a normal keyboard. You can use either IPA or X-SAMPA to generate your transcriptions, but make sure to stay consistent with your choice, especially when you use a lexicon file, which we’ll cover in the next section.

Each phoneme in the phoneme table represents a speech sound. The bolded letters in the “Example” column of the Phoneme/Viseme table in the Australian English page linked above represent the part of the word the “Phoneme” corresponds to. For example, the phoneme /j/ represents the sound that an Australian English speaker makes when pronouncing the letter “y” in “yes.”

Customize pronunciation using lexicons

Phoneme tags are suitable for customizing isolated, one-off cases, but they don’t scale. If you process a huge volume of text, managed by different editors and reviewers, we recommend using lexicons. Using lexicons, you can achieve consistency in adding custom pronunciations and simultaneously reduce the manual effort of inserting phoneme tags into the script.

A good practice is to first test the custom pronunciation on the Amazon Polly console using the <phoneme> tag, and then create a library of customized pronunciations using lexicons. Once a lexicon file is uploaded, Amazon Polly automatically applies the phonetic pronunciations specified in it, which eliminates the need to manually provide a <phoneme> tag.

Create a lexicon file

A lexicon file contains the mapping between words and their phonetic pronunciations. Pronunciation Lexicon Specification (PLS) is a W3C recommendation for specifying interoperable pronunciation information. The following is an example PLS document:

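    <lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="…" xml:lang="en-AU">
      <lexeme>
        <grapheme>Matariki</grapheme>
        <grapheme>Mātariki</grapheme>
        <phoneme>…</phoneme>
      </lexeme>
      <lexeme>
        <grapheme>NZ</grapheme>
        <alias>New Zealand</alias>
      </lexeme>
    </lexicon>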

Make sure that you use the correct value for the xml:lang field. Use en-AU if you’re uploading the lexicon file to use with the Amazon Polly Australian English voice. For a complete list of supported languages, refer to Languages Supported by Amazon Polly.

To specify a custom pronunciation, you need to add a <lexeme> element, which is a container for a lexical entry with one or more <grapheme> elements and the pronunciation information provided inside one or more <phoneme> elements.

The <grapheme> element contains the text describing the orthography of the <lexeme> element. You can use a <grapheme> element to specify the word whose pronunciation you want to customize. You can add multiple <grapheme> elements to specify all word variations, for example with or without macrons. The <grapheme> element is case-sensitive, and during speech synthesis Amazon Polly string-matches it against the words in the script that you’re converting to speech. If a match is found, it uses the <phoneme> element, which describes how the <grapheme> is pronounced, to generate the phonetic transcription.

You can also use <alias> for commonly used abbreviations. In the preceding example of a lexicon file, NZ is used as an alias for New Zealand. This means that whenever Amazon Polly comes across “NZ” (with matching case) in the body of the text, it’ll read those two letters as “New Zealand”.

For more information on the lexicon file format, see Pronunciation Lexicon Specification (PLS) Version 1.0 on the W3C website.

You can save a lexicon file as a .pls or .xml file before uploading it to Amazon Polly.
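If you maintain many entries, you might generate the lexicon file rather than edit XML by hand. The following is a minimal sketch using Python’s standard library; the function name and the shape of the `entries` argument are hypothetical, while the element names and namespace follow the PLS 1.0 specification.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_lexicon(entries, lang="en-AU", alphabet="ipa") -> str:
    """Build a PLS document.

    entries: list of dicts with a "graphemes" list and either a "phoneme"
    (phonetic transcription) or an "alias" (replacement text).
    """
    ET.register_namespace("", PLS_NS)  # serialize PLS as the default namespace
    root = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": alphabet, XML_LANG: lang},
    )
    for entry in entries:
        lexeme = ET.SubElement(root, f"{{{PLS_NS}}}lexeme")
        for grapheme in entry["graphemes"]:
            ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = grapheme
        if "alias" in entry:
            ET.SubElement(lexeme, f"{{{PLS_NS}}}alias").text = entry["alias"]
        else:
            ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme").text = entry["phoneme"]
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + ET.tostring(root, encoding="unicode")
```

Calling `build_lexicon([{"graphemes": ["NZ"], "alias": "New Zealand"}])` yields a document you can write to a .pls file with UTF-8 encoding. Whichever `alphabet` you pass must match the transcriptions you supply.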

Upload and apply the lexicon file

Upload your lexicon file to Amazon Polly using the following instructions:

  1. On the Amazon Polly console, choose Lexicons in the navigation pane.
  2. Choose Upload lexicon.
  3. Enter a name for the lexicon and then choose a lexicon file.
  4. Choose the file to upload.
  5. Choose Upload lexicon.

If a lexicon by the same name (whether a .pls or .xml file) already exists, uploading the lexicon overwrites the existing lexicon.

Now you can apply the lexicon to customize pronunciation.

  1. Choose Text-to-Speech in the navigation pane.
  2. Expand Additional settings.
  3. Turn on Customize pronunciation.
  4. Choose the lexicon on the drop-down menu.

You can also choose Upload lexicon to upload a new lexicon file (or a new version).
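The same upload-and-apply steps can be scripted. The sketch below uses the real boto3 Polly operations `put_lexicon` and `synthesize_speech` (with its `LexiconNames` parameter); the function names, default voice, and output path are hypothetical choices. The name check mirrors the constraint in the Polly API reference that lexicon names are 1 to 20 alphanumeric characters.

```python
import re


def validate_lexicon_name(name: str) -> str:
    """Return the name if Polly would accept it as a lexicon name."""
    if not re.fullmatch(r"[0-9A-Za-z]{1,20}", name):
        raise ValueError("lexicon names must be 1-20 alphanumeric characters")
    return name


def upload_and_speak(name: str, pls_document: str, text: str,
                     voice_id: str = "Olivia", path: str = "speech.mp3") -> None:
    """Upload (or overwrite) a lexicon, then synthesize text with it applied."""
    import boto3  # requires AWS credentials with Polly permissions

    polly = boto3.client("polly")
    polly.put_lexicon(Name=validate_lexicon_name(name), Content=pls_document)
    response = polly.synthesize_speech(
        Text=text,
        VoiceId=voice_id,
        Engine="neural",  # assumption: the chosen voice is a neural voice
        OutputFormat="mp3",
        LexiconNames=[name],
    )
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())
```

A lexicon is only applied when its language matches the language of the selected voice, so keep the xml:lang value and the voice in sync.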

It’s a good practice to version control the lexicon file in a source code repository. Keeping the custom pronunciations in a lexicon file ensures that you can consistently refer to phonetic pronunciations for certain words across the organization. Also, keep in mind the pronunciation lexicon limits listed on the Quotas in Amazon Polly page.

Test the pronunciation after applying the lexicon

Let’s perform a quick test using “Wishing all my listeners in NZ, a very Happy Mātariki” as the input text.

We can compare the audio files before and after applying the lexicon.

Before applying the lexicon:

After applying the lexicon:


In this post, we discussed how you can customize pronunciations of commonly used acronyms or words not found in the selected language in Amazon Polly. You can use the SSML <phoneme> tag, which is great for inserting one-off customizations or for testing purposes. We recommend using lexicons to create a consistent set of pronunciations for frequently used words across your organization. This enables your content writers to spend time on writing instead of the tedious task of repeatedly adding phonetic pronunciations to the script. You can try this in your AWS account on the Amazon Polly console.


About the Authors

Ratan Kumar is a Solutions Architect based out of Auckland, New Zealand. He works with large enterprise customers helping them design and build secure, cost-effective, and reliable internet-scale applications using the AWS cloud. He is passionate about technology and likes sharing knowledge through blog posts and Twitch sessions.

Maciek Tegi is a Principal Audio Designer and a Product Manager for Polly Brand Voices. He has worked in a professional capacity in the tech industry, movies, commercials, and game localization. In 2013, he was the first audio engineer hired to the Alexa Text-to-Speech team. Maciek was involved in releasing 12 Alexa TTS voices across different countries, over 20 Polly voices, and 4 Alexa celebrity voices. Maciek is a triathlete, and an avid acoustic guitar player.




AWS Week in Review – May 16, 2022





This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

I have been on the road for the last five weeks and attended many of the AWS Summits in Europe. It was great to talk to so many of you in person. The Serverless Developer Advocates are going around many of the AWS Summits with the Serverlesspresso booth. If you attend an event that has the booth, say “Hi” to my colleagues, and have a coffee while asking all your serverless questions. You can find all the upcoming AWS Summits in the events section at the end of this post.

Last week’s launches
Here are some launches that got my attention during the previous week.

AWS Step Functions announced a new console experience to debug your state machine executions – Now you can opt in to the new console experience of Step Functions, which makes it easier to analyze, debug, and optimize Standard Workflows. The new page allows you to inspect executions using three different views (graph, table, and event view) and adds many new features to enhance the navigation and analysis of the executions. To learn about all the features and how to use them, read Ben’s blog post.

Example of how the Graph View looks

AWS Lambda now supports Node.js 16.x runtime – Now you can start using the Node.js 16 runtime when you create a new function or update your existing functions to use it. You can also use the new container image base that supports this runtime. To learn more about this launch, check Dan’s blog post.

AWS Amplify announces its Android library designed for Kotlin – The Amplify Android library has been rewritten for Kotlin and is now available in preview. This new library provides better debugging capabilities and visibility into underlying state management. It also uses the new AWS SDK for Kotlin, which was released in preview last year. Read the What’s New post for more information.

Three new APIs for batch data retrieval in AWS IoT SiteWise – With this launch, AWS IoT SiteWise supports batch data retrieval from multiple asset properties. The new APIs allow you to retrieve current values, historical values, and aggregated values. Read the What’s New post to learn how you can start using the new APIs.

AWS Secrets Manager now publishes secret usage metrics to Amazon CloudWatch – This launch is very useful to see the number of secrets in your account and set alarms for any unexpected increase or decrease in the number of secrets. Read the documentation on Monitoring Secrets Manager with Amazon CloudWatch for more information.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Some other launches and news that you may have missed:

IBM signed a deal with AWS to offer its software portfolio as a service on AWS. This allows customers using AWS to access IBM software for automation, data and artificial intelligence, and security that is built on Red Hat OpenShift Service on AWS.

Podcast Charlas Técnicas de AWS – If you understand Spanish, this podcast is for you. Podcast Charlas Técnicas is one of the official AWS podcasts in Spanish. This week’s episode introduces you to Amazon DynamoDB and shares stories on how different customers use this database service. You can listen to all the episodes directly from your favorite podcast app or the podcast web page.

AWS Open Source News and Updates – Ricardo Sueiras, my colleague from the AWS Developer Relations team, runs this newsletter. It brings you all the latest open-source projects, posts, and more. Read edition #112 here.

Upcoming AWS Events
It’s AWS Summits season and here are some virtual and in-person events that might be close to you:

You can register for re:MARS to get fresh ideas on topics such as machine learning, automation, robotics, and space. The conference will be in person in Las Vegas, June 21–24.

That’s all for this week. Check back next Monday for another Week in Review!

— Marcia




Personalize your machine translation results by using fuzzy matching with Amazon Translate





A person’s vernacular is part of the characteristics that make them unique. There are often countless different ways to express one specific idea. When a firm communicates with its customers, it’s critical that the message is delivered in a way that best represents the information the firm is trying to convey. This becomes even more important when it comes to professional language translation. Customers of translation systems and services expect accurate and highly customized outputs. To achieve this, they often reuse previous translation outputs, called translation memory (TM), and compare them to new input text.

In computer-assisted translation, this technique is known as fuzzy matching. The primary function of fuzzy matching is to assist the translator by speeding up the translation process. When an exact match can’t be found in the TM database for the text being translated, translation management systems (TMSs) often have the option to search for a match that is less than exact. Potential matches are provided to the translator as additional input for final translation. Translators who enhance their workflow with machine translation capabilities such as Amazon Translate often expect fuzzy matching data to be used as part of the automated translation solution.

In this post, you learn how to customize output from Amazon Translate according to translation memory fuzzy match quality scores.

Translation Quality Match

The XML Localization Interchange File Format (XLIFF) standard is often used as a data exchange format between TMSs and Amazon Translate. XLIFF files produced by TMSs include source and target text data along with match quality scores based on the available TM. These scores—usually expressed as a percentage—indicate how close the translation memory is to the text being translated.

Some customers with very strict requirements only want machine translation to be used when match quality scores are below a certain threshold. Beyond this threshold, they expect their own translation memory to take precedence. Translators often need to apply these preferences manually, either within their TMS or by altering the text data. This flow is illustrated in the following diagram. The machine translation system processes the translation data (text and fuzzy match scores), which is then reviewed and manually edited by translators based on their desired quality thresholds. Applying thresholds as part of the machine translation step allows you to remove these manual steps, which improves efficiency and optimizes cost.

Figure 1: Machine Translation Review Flow

The solution presented in this post allows you to enforce rules based on match quality score thresholds to drive whether a given input text should be machine translated by Amazon Translate or not. When not machine translated, the resulting text is left to the discretion of the translators reviewing the final output.

Solution Architecture

The solution architecture illustrated in Figure 2 leverages the following services:

  • Amazon Simple Storage Service – Amazon S3 buckets contain the following content:
    • Fuzzy match threshold configuration files
    • Source text to be translated
    • Amazon Translate input and output data locations
  • AWS Systems Manager – We use Parameter Store parameters to store match quality threshold configuration values
  • AWS Lambda – We use two Lambda functions:
    • One function preprocesses the quality match threshold configuration files and persists the data into Parameter Store
    • One function automatically creates the asynchronous translation jobs
  • Amazon Simple Queue Service – An Amazon SQS queue triggers the translation flow as a result of new files coming into the source bucket
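As a rough illustration of how the job initialization function might read the notifications delivered through the SQS queue, the following Python sketch pulls the bucket and object key out of each message. It assumes the queue receives standard S3 event notifications as JSON message bodies. The function names `extract_s3_objects` and `job_name_from_key` are hypothetical and not part of the deployed solution.

```python
import json
from urllib.parse import unquote_plus


def extract_s3_objects(sqs_event):
    """Return (bucket, key) pairs from an SQS event that wraps S3 notifications."""
    objects = []
    for message in sqs_event.get("Records", []):
        # Each SQS message body is the JSON text of an S3 event notification.
        notification = json.loads(message["body"])
        # Messages without a "Records" list (such as S3 test events) are skipped.
        for record in notification.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded (for example, spaces become "+").
            key = unquote_plus(record["s3"]["object"]["key"])
            objects.append((bucket, key))
    return objects


def job_name_from_key(key):
    """Derive the XLIFF file name without its folder or extension."""
    file_name = key.rsplit("/", 1)[-1]
    return file_name.rsplit(".", 1)[0]
```

The derived file name is what you would compare against Column 1 of the threshold configuration file.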

Figure 2: Solution Architecture

You first set up quality thresholds for your translation jobs by editing a configuration file and uploading it into the fuzzy match threshold configuration S3 bucket. The following is a sample configuration in CSV format. We chose CSV for simplicity, although you can use any format. Each line represents a threshold to be applied to either a specific translation job or as a default value to any job.

default, 75
SourceMT-Test, 80

The specifications of the configuration file are as follows:

  • Column 1 should be populated with the name of the XLIFF file—without extension—provided to the Amazon Translate job as input data.
  • Column 2 should be populated with the quality match percentage threshold. For any score below this value, machine translation is used.
  • For all XLIFF files whose name doesn’t match any name listed in the configuration file, the default threshold is used—the line with the keyword default set in Column 1.
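To make these rules concrete, here is a minimal Python sketch of how such a file could be parsed and queried. It assumes the two-column CSV layout described above. The helper names `parse_thresholds` and `threshold_for` are ours for illustration and aren't taken from the solution’s Lambda code.

```python
import csv
import io


def parse_thresholds(config_text):
    """Map each name in Column 1 to its integer threshold in Column 2."""
    thresholds = {}
    for row in csv.reader(io.StringIO(config_text)):
        if len(row) < 2 or not row[0].strip():
            continue  # skip blank or incomplete lines
        thresholds[row[0].strip()] = int(row[1].strip())
    return thresholds


def threshold_for(thresholds, xliff_name):
    """Return the threshold for a file name, falling back to the default line."""
    return thresholds.get(xliff_name, thresholds["default"])


sample = "default, 75\nSourceMT-Test, 80\n"
config = parse_thresholds(sample)
print(threshold_for(config, "SourceMT-Test"))  # 80
print(threshold_for(config, "AnotherFile"))    # 75
```

In this sketch, a configuration file without a default line raises a KeyError for unlisted file names, which surfaces the misconfiguration early.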

Figure 3: Auto-generated parameter in Systems Manager Parameter Store

When a new file is uploaded, Amazon S3 triggers the Lambda function in charge of processing the parameters. This function reads the threshold parameters and stores them in Parameter Store for future use. Using Parameter Store avoids performing redundant Amazon S3 GET requests each time a new translation job is initiated. The sample configuration file produces the parameters shown in Figure 3.
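The following sketch shows one way the thresholds could be written to Parameter Store with the Systems Manager PutParameter operation. The root path `/translate/thresholds` and the function names are assumptions made for this example. In the deployed stack, the root comes from the ParameterStoreRoot value you provide. The client is passed in as an argument so the naming logic can be exercised without AWS credentials.

```python
def build_parameters(thresholds, root="/translate/thresholds"):
    """Turn {name: threshold} into sorted (parameter name, value) pairs."""
    root = root.rstrip("/")
    return [(f"{root}/{name}", str(value))
            for name, value in sorted(thresholds.items())]


def store_thresholds(ssm_client, thresholds, root="/translate/thresholds"):
    """Write each threshold as a String parameter, replacing older values."""
    for name, value in build_parameters(thresholds, root):
        ssm_client.put_parameter(
            Name=name, Value=value, Type="String", Overwrite=True
        )


# Typical use (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# store_thresholds(boto3.client("ssm"), {"default": 75, "SourceMT-Test": 80})
```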

The job initialization Lambda function uses these parameters to preprocess the data prior to invoking Amazon Translate. We use an English-to-Spanish translation XLIFF input file, as shown in the following code. It contains the initial text to be translated, broken down into what is referred to as segments, represented in the source tags.

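Segment 1
  Source: Consent Form
  Alternative source: CONSENT FORM
  Alternative target: FORMULARIO DE CONSENTIMIENTO
Segment 2
  Source: Screening Visit:
  Alternative source: Screening Visit
  Alternative target: Selección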

The source text has been pre-matched with the translation memory beforehand. The data contains potential translation alternatives—represented as tags—alongside a match quality attribute, expressed as a percentage. The business rule is as follows:

  • Segments received with alternative translations and a match quality below the threshold are untouched or empty. This signals to Amazon Translate that they must be translated.
  • Segments received with alternative translations with a match quality above the threshold are pre-populated with the suggested target text. Amazon Translate skips those segments.
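The business rule above can be sketched in Python with the standard library XML parser. This is an illustration under stated assumptions, not the solution’s actual Lambda code: it assumes XLIFF 1.2 input in which alternatives appear as alt-trans elements carrying a match-quality attribute, and it treats a score equal to the threshold as retained, because machine translation is used only for scores below the threshold. The 70% score on the second segment of the sample is illustrative.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
NS = {"x": XLIFF_NS}


def apply_threshold(xliff_text, threshold):
    """Pre-populate targets whose best suggestion meets the threshold.

    Returns the updated XML text and the ids of the pre-populated segments.
    """
    ET.register_namespace("", XLIFF_NS)
    root = ET.fromstring(xliff_text)
    kept = []
    for unit in root.iter(f"{{{XLIFF_NS}}}trans-unit"):
        best_score, best_target = -1.0, None
        for alt in unit.findall("x:alt-trans", NS):
            score = float(alt.get("match-quality", "0").rstrip("%"))
            alt_target = alt.find("x:target", NS)
            if alt_target is not None and score > best_score:
                best_score, best_target = score, alt_target.text
        if best_target is None or best_score < threshold:
            continue  # leave the segment for machine translation
        target = unit.find("x:target", NS)
        if target is None:
            # Place the new <target> right after <source>.
            source = unit.find("x:source", NS)
            position = list(unit).index(source) + 1 if source is not None else 0
            target = ET.Element(f"{{{XLIFF_NS}}}target")
            unit.insert(position, target)
        target.text = best_target
        kept.append(unit.get("id"))
    return ET.tostring(root, encoding="unicode"), kept


SAMPLE = """<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2">
<file original="demo" source-language="en" target-language="es" datatype="plaintext"><body>
<trans-unit id="1"><source>Consent Form</source>
<alt-trans match-quality="99%"><source>CONSENT FORM</source><target>FORMULARIO DE CONSENTIMIENTO</target></alt-trans>
</trans-unit>
<trans-unit id="2"><source>Screening Visit:</source>
<alt-trans match-quality="70%"><source>Screening Visit</source><target>Selección</target></alt-trans>
</trans-unit>
</body></file></xliff>"""

updated, kept = apply_threshold(SAMPLE, 80)
print(kept)  # ['1']
```

With a threshold of 80, only the first segment receives a pre-populated target. The second segment keeps no target, so it would be machine translated.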

Let’s assume the quality match threshold configured for this job is 80%. The first segment with 99% match quality isn’t machine translated, whereas the second segment is, because its match quality is below the defined threshold. In this configuration, Amazon Translate produces the following output:

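Segment 1
  Source: Consent Form
  Target: FORMULARIO DE CONSENTIMIENTO
  Alternative source: CONSENT FORM
  Alternative target: FORMULARIO DE CONSENTIMIENTO
Segment 2
  Source: Screening Visit:
  Target: Visita de selección
  Alternative source: Screening Visit
  Alternative target: Selección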

In the second segment, Amazon Translate overwrites the target text initially suggested (Selección) with a higher quality translation: Visita de selección.

One possible extension to this use case could be to reuse the translated output and create our own translation memory. Amazon Translate supports customization of machine translation using translation memory thanks to the parallel data feature. Text segments previously machine translated due to their initial low-quality score could then be reused in new translation projects.
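As a starting point for that extension, the following sketch serializes source and target pairs into a CSV file whose first row holds the language codes, which is our understanding of the layout that parallel data accepts for CSV input. Treat that layout as an assumption and check the Amazon Translate parallel data documentation before relying on it. The function name `to_parallel_data_csv` is hypothetical.

```python
import csv
import io


def to_parallel_data_csv(pairs, source_lang="en", target_lang="es"):
    """Serialize (source, target) text pairs as CSV with a language-code header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([source_lang, target_lang])
    for source_text, target_text in pairs:
        writer.writerow([source_text, target_text])
    return buffer.getvalue()


print(to_parallel_data_csv([("Screening Visit:", "Visita de selección")]))
```

The csv module quotes any segment that contains commas or line breaks, so segment text doesn’t need to be escaped by hand.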

In the following sections, we walk you through the process of deploying and testing this solution. You use AWS CloudFormation scripts and data samples to launch an asynchronous translation job personalized with a configurable quality match threshold.


Prerequisites

For this walkthrough, you must have an AWS account. If you don’t have an account yet, you can create and activate one.

Launch AWS CloudFormation stack

  1. Choose Launch Stack:
  2. For Stack name, enter a name.
  3. For ConfigBucketName, enter the S3 bucket containing the threshold configuration files.
  4. For ParameterStoreRoot, enter the root path of the parameters created by the parameters processing Lambda function.
  5. For QueueName, enter the SQS queue that you create to post new file notifications from the source bucket to the job initialization Lambda function. This is the function that reads the configuration file.
  6. For SourceBucketName, enter the S3 bucket containing the XLIFF files to be translated. If you prefer to use a preexisting bucket, you need to change the value of the CreateSourceBucket parameter to No.
  7. For WorkingBucketName, enter the S3 bucket Amazon Translate uses for input and output data.
  8. Choose Next.

    Figure 4: CloudFormation stack details

  9. Optionally, on the Stack Options page, add key names and values for any tags you want to assign to the resources about to be created.
  10. Choose Next.
  11. On the Review page, select I acknowledge that this template might cause AWS CloudFormation to create IAM resources.
  12. Review the other settings, then choose Create stack.

AWS CloudFormation takes several minutes to create the resources on your behalf. You can watch the progress on the Events tab on the AWS CloudFormation console. When the stack has been created, you can see a CREATE_COMPLETE message in the Status column on the Overview tab.

Test the solution

Let’s go through a simple example.

  1. Download the following sample data.
  2. Unzip the content.

There should be two files: an .xlf file in XLIFF format, and a threshold configuration file with .cfg as the extension. The following is an excerpt of the XLIFF file.

Figure 5: English to French sample file extract

  3. On the Amazon S3 console, upload the quality threshold configuration file into the configuration bucket you specified earlier.

The value set for test_En_to_Fr is 75%. You should be able to see the parameters on the Systems Manager console in the Parameter Store section.

  4. Still on the Amazon S3 console, upload the .xlf file into the S3 bucket you configured as source. Make sure the file is under a folder named translate (for example, /translate/test_En_to_Fr.xlf).

This starts the translation flow.

  5. Open the Amazon Translate console.

A new job should appear with a status of In Progress.
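For reference, a job like this one can be submitted with the Amazon Translate StartTextTranslationJob operation. The following sketch assembles the request and submits it through a client passed in as an argument. The function names, S3 locations, and role ARN are placeholders for illustration, and the solution’s own Lambda function may build its request differently.

```python
def build_job_request(job_name, input_uri, output_uri, role_arn,
                      source_lang="en", target_lang="fr"):
    """Assemble keyword arguments for StartTextTranslationJob."""
    return {
        "JobName": job_name,
        "InputDataConfig": {
            "S3Uri": input_uri,
            # Content type that Amazon Translate documents for XLIFF 1.2 input.
            "ContentType": "application/x-xliff+xml",
        },
        "OutputDataConfig": {"S3Uri": output_uri},
        "DataAccessRoleArn": role_arn,
        "SourceLanguageCode": source_lang,
        "TargetLanguageCodes": [target_lang],
    }


def start_job(translate_client, **job_details):
    """Submit the job and return its identifier."""
    request = build_job_request(**job_details)
    response = translate_client.start_text_translation_job(**request)
    return response["JobId"]


# Typical use (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# start_job(boto3.client("translate"), job_name="test_En_to_Fr",
#           input_uri="s3://my-working-bucket/input/",
#           output_uri="s3://my-working-bucket/output/",
#           role_arn="arn:aws:iam::123456789012:role/my-translate-role")
```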

Figure 6: In progress translation jobs on Amazon Translate console

  6. When the job is complete, choose the job’s link and review the output. All segments should have been translated.

In the translated XLIFF file, look for segments with additional attributes named lscustom:match-quality, as shown in the following screenshot. These custom attributes identify segments where the suggested translation was retained based on its score.

Figure 7: Custom attributes identifying segments where suggested translation was retained based on score

These were derived from the translation memory according to the quality threshold. All other segments were machine translated.
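If you prefer to check the output programmatically, the following sketch lists the segments that carry a namespaced match-quality attribute. It makes several assumptions: the output file declares the lscustom prefix (otherwise the parser raises an error), the attribute sits on the trans-unit element or one of its descendants, and the namespace URI and French text in the sample are made up for the example.

```python
import xml.etree.ElementTree as ET


def retained_segments(xliff_text):
    """Return {trans-unit id: score} for segments with a custom match-quality."""
    root = ET.fromstring(xliff_text)
    found = {}
    for element in root.iter():
        # Match trans-unit elements with or without a namespace.
        if element.tag != "trans-unit" and not element.tag.endswith("}trans-unit"):
            continue
        for node in element.iter():
            for attr, value in node.attrib.items():
                # Namespaced attributes are exposed as "{uri}local-name".
                if attr.startswith("{") and attr.endswith("}match-quality"):
                    found[element.get("id")] = value
    return found


SAMPLE_OUTPUT = """<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2"
       xmlns:lscustom="http://example.com/lscustom" version="1.2">
<file original="demo" source-language="en" target-language="fr" datatype="plaintext"><body>
<trans-unit id="1"><source>Consent Form</source>
<target lscustom:match-quality="99">Formulaire de consentement</target></trans-unit>
<trans-unit id="2"><source>Screening Visit:</source>
<target>Visite de sélection</target></trans-unit>
</body></file></xliff>"""

print(retained_segments(SAMPLE_OUTPUT))  # {'1': '99'}
```

The plain match-quality attribute on alternative translations has no namespace, so this check ignores it and reports only the custom attribute.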

You have now deployed and tested an automated asynchronous translation job assistant that enforces configurable translation memory match quality thresholds. Great job!


Clean up

If you deployed the solution into your account, don’t forget to delete the CloudFormation stack to avoid any unexpected cost. You need to empty the S3 buckets manually beforehand.


Conclusion

In this post, you learned how to customize your Amazon Translate translation jobs based on standard XLIFF fuzzy matching quality metrics. With this solution, you can greatly reduce the manual labor involved in reviewing machine translated text while also optimizing your usage of Amazon Translate. You can also extend the solution with data ingestion automation and workflow orchestration capabilities, as described in Speed Up Translation Jobs with a Fully Automated Translation System Assistant.

About the Authors

Narcisse Zekpa is a Solutions Architect based in Boston. He helps customers in the Northeast U.S. accelerate their adoption of the AWS Cloud by providing architectural guidance and designing innovative, scalable solutions. When Narcisse is not building, he enjoys spending time with his family, traveling, cooking, and playing basketball.

Dimitri Restaino is a Solutions Architect at AWS, based out of Brooklyn, New York. He works primarily with Healthcare and Financial Services companies in the North East, helping to design innovative and creative solutions to best serve their customers. Coming from a software development background, he is excited by the new possibilities that serverless technology can bring to the world. Outside of work, he loves to hike and explore the NYC food scene.




Copyright © 2021 Today's Digital.