

The development of Bundesliga Match Fact Passing Profile, a deep dive into passing in football




This post was authored by Simon Rolfes. Simon played 288 Bundesliga games as a central midfielder, scored 41 goals, and won 26 caps for Germany. Currently, he serves as Sporting Director at Bayer 04 Leverkusen, where he oversees and develops the pro player roster, the scouting department, and the club’s youth development. Simon also writes weekly columns about the latest Bundesliga Match Facts powered by AWS. There he offers his expertise as a former player, captain, and TV analyst to highlight the impact of advanced statistics and machine learning in the world of football. In this post, Simon analyzes the importance of some of the new Bundesliga Match Facts powered by AWS that fans can see during the 2021-2022 season. The AWS Professional Services team then details the AWS technology used behind these advanced stats.

Passing the ball is one of the most common actions on the pitch. It’s simply one player maneuvering the ball to another on his team. “Simple” is a word every coach uses daily when it comes to passing: “Keep it simple.” Yet, looking at the effect a pass can have on a match, nothing is simple.

Consider this for example: in the 2020-2021 Bundesliga season, on average 917 passes were completed per match. The records for the highest and lowest number of completed passes were set in the same match, on Match Day 26, when Arminia Bielefeld hosted RB Leipzig. Arminia completed 152 passes compared to 865 by Leipzig. In traditional football analysis, the number of completed passes is widely seen as an indicator of team dominance. Given that, you might be surprised that Leipzig won that match only by the slightest of margins: 1:0.

Or take an individual player’s performance: In 2020-2021, the average Bundesliga player completed 86% of his passes, but individual numbers vary from 22% completion rate to 100%. If a higher completion rate really indicates dominance, what then if the player with 22% brought a striker into a scoring position with every pass, ideally bypassing several defenders each time? And what does it say if the player with the 100% rate was a defender passing the ball horizontally to another defender, because he couldn’t find a teammate to pass to up the pitch? Then again, a horizontal pass can be a great tool for offense too, opening up the field by moving the ball to the other side of the pitch.

So, there are myriad types of passes, played for a great variety of reasons. What is it then that makes a pass special, and how are players using passes in different situations to move the ball around?

The new Bundesliga Match Fact Passing Profile uncovers exactly that by providing real-time insights into the passing capabilities of all players and teams in the Bundesliga. Machine learning (ML) models trained on Amazon SageMaker analyzed nearly 2 million passes from previous Bundesliga seasons to construct an algorithm that can compute a difficulty score for each pass at any moment in time. It does this by first computing 26 pass characteristics for each pass in AWS Glue. These passing features, developed in collaboration with a group of football experts, include the distance to the receiver, the number of defending players in between, the pressure a player is under, and many more. They’re then input to train an ML model in SageMaker to calculate the effect of each feature on the chance that a pass completes.

After the model is trained, it can estimate a difficulty score for each pass it sees. These difficulty scores can then be used in a variety of ways. For example, they can be aggregated for each player to form a passing profile, showing which passing decisions players make. Do they prioritize an offensive pass? Are they looking for a safe option by passing the ball back? Or do they seek to open up the play with a long ball? All of this is captured in the Match Fact “Passing Profile”, which shows the direction of passes and the difficulty score for each player live on television and in the Bundesliga app.

In the following section, the AWS Professional Services team, who worked with Bundesliga to bring these Match Facts to life, explains how this advanced stat came to fruition.

How does it work?

Building an ML model that can predict the difficulty of a given pass requires us to create a large dataset filled with both successful and unsuccessful passes from the past. Although much is known about successful passes (for example, the receiver, the location where the ball was controlled, the duration and distance of the pass), little is known about an unsuccessful pass because it simply didn’t reach its intended target. We therefore adopt an approach proposed by Anzer and Bauer (2021) to identify the intended receiver of an unsuccessful pass using a ball trajectory and motion model so that we can add these entries into our passing dataset.

Identifying the intended target

Although it sometimes doesn’t look like it, a ball has to adhere to the laws of physics. We can use gravity, air drag, and rolling drag to map the trajectory of a pass. With a physical model as proposed by Spearman and Basye (2017), we can use the first 0.4 seconds after a pass is given to map the entire trajectory of the ball. The physical model in the following figure estimates the trajectory based on this 0.4-second timeframe. This computed path is shown in the image on the left in orange. In this example, player 11 from the blue team attempts to initiate an attack with a pass to player 32, who is making a run on the right flank. To evaluate our physical model, we can compare the estimated ball trajectory to the actual trajectory provided by the tracking data, shown in black. Comparing both trajectories shows that the estimated trajectory is fairly close to reality. However, the model doesn’t account for the drag force after a pass meets the pitch (due to weather conditions), and doesn’t consider curve balls because this information isn’t available within the tracking data.
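To make the idea of a physical trajectory model concrete, here is a minimal sketch. It is not the Spearman and Basye model itself: it assumes a point-mass ball, quadratic air drag with an illustrative constant, no spin, and no bouncing or rolling after landing. It also assumes the initial velocity has already been estimated from the first 0.4 seconds of tracking data. The names `simulate_flight`, `GRAVITY`, and `AIR_DRAG` are hypothetical.

```python
import math

GRAVITY = 9.81    # m/s^2
AIR_DRAG = 0.013  # 1/m, lumped quadratic drag constant (illustrative assumption)


def simulate_flight(position, velocity, dt=0.01, max_time=10.0):
    """Step a point-mass ball forward in time until it reaches the ground.

    Returns a list of (t, x, y, z) samples. If the ball lands within
    max_time, the last sample is the landing point.
    """
    x, y, z = position
    vx, vy, vz = velocity
    t = 0.0
    path = [(t, x, y, z)]
    while t < max_time:
        speed = math.sqrt(vx * vx + vy * vy + vz * vz)
        # Quadratic air drag opposes the velocity; gravity pulls the ball down.
        vx -= AIR_DRAG * speed * vx * dt
        vy -= AIR_DRAG * speed * vy * dt
        vz -= (GRAVITY + AIR_DRAG * speed * vz) * dt
        x += vx * dt
        y += vy * dt
        z += vz * dt
        t += dt
        if z <= 0.0:
            path.append((t, x, y, 0.0))  # clamp to pitch level and stop
            break
        path.append((t, x, y, z))
    return path


# Hypothetical lofted pass: 15 m/s along the pitch, 8 m/s upward.
trajectory = simulate_flight((0.0, 0.0, 0.0), (15.0, 0.0, 8.0))
landing_time, landing_x, landing_y, _ = trajectory[-1]
print(f"lands after {landing_time:.2f} s at x={landing_x:.1f} m")
```

The estimated landing point is what the next step, the motion model, compares against each player's reachable area.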

After the trajectory of the ball is modeled, we know where it’s estimated to land. The next step is to calculate who could reach the ball. This is done by a motion model. This motion model estimates the area a player can reach within a pre-defined time window and is largely based on a player’s speed and direction. The model is compared to the movement of players in the previous three seasons of Bundesliga data to understand how players move. The results can be visualized into four circles around each player, representing the area they can reach within 0.5, 1, 1.5, and 2 seconds.

Each player’s potential movement is computed and compared to the estimated landing location of the ball. Given the assumption that a ball can be controlled when it’s below 1.5 meters in height, we can make an educated guess of which player could reach the ball first. Now, to estimate the intended receiver of a given pass, we combine the ball trajectory model with the motion model. If we map the trajectory of an unsuccessful pass, we can use our motion model to determine which player could reach the ball first. This player is likely to be the intended receiver of a pass. We can then use this information to add relevant data points (such as the receiver, the location where the ball was controlled, the duration and distance of the pass) for unsuccessful passes to our dataset.
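The combination of the two models can be sketched as follows. This is a simplified stand-in for the motion model described above, not the production version. It assumes each player keeps their current velocity for a short reaction time and then runs straight to the landing point at a fixed top speed, and that only teammates of the passer are candidates. `MAX_SPEED`, `REACTION_TIME`, and all function names are assumptions made for illustration.

```python
import math
from dataclasses import dataclass

MAX_SPEED = 8.0      # m/s, assumed top running speed
REACTION_TIME = 0.3  # s, assumed time during which a player keeps the current velocity


@dataclass
class Player:
    shirt: int
    team: str
    x: float
    y: float
    vx: float = 0.0
    vy: float = 0.0


def time_to_reach(player, target):
    """Rough time in seconds for a player to get to target, an (x, y) point."""
    # Drift with the current velocity during the reaction time ...
    px = player.x + player.vx * REACTION_TIME
    py = player.y + player.vy * REACTION_TIME
    # ... then run straight at the target at top speed.
    distance = math.hypot(target[0] - px, target[1] - py)
    return REACTION_TIME + distance / MAX_SPEED


def intended_receiver(passer, players, landing_point):
    """Teammate of the passer who could reach the landing point first."""
    teammates = [
        p for p in players
        if p.team == passer.team and p.shirt != passer.shirt
    ]
    return min(teammates, key=lambda p: time_to_reach(p, landing_point))


# Hypothetical situation loosely based on the example above.
passer = Player(11, "blue", 0.0, 0.0)
players = [
    passer,
    Player(32, "blue", 20.0, 10.0, vx=5.0),  # making a run down the flank
    Player(7, "blue", 5.0, -10.0),
    Player(4, "red", 29.0, 10.0),            # opponent, not a candidate
]
receiver = intended_receiver(passer, players, (30.0, 10.0))
print(receiver.shirt)
```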

Passing difficulty

We can use ML to estimate the difficulty of each pass. We use the passing dataset that contains nearly 2 million passes from previous Bundesliga seasons to train a supervised ML model that computes a pass completion probability for each of those passes. This is computed by finding patterns in a set of tailored features that are available at the time of a pass. These features were developed in collaboration with football experts to understand the relevant aspects impacting the difficulty of a pass. The ML algorithm decides which features truly have an impact and which are negligible. This results in a model that takes a pass and predicts its chance of completion.
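The post doesn’t name the algorithm behind the model, so the following is only an illustration of how a handful of pass features could map to a completion probability. It uses a logistic function as a stand-in, with three of the features mentioned earlier. The coefficients are made up rather than learned, and treating difficulty as one minus the completion probability is also an assumption.

```python
import math

# Hypothetical coefficients; a real model would learn these from historical passes.
INTERCEPT = 3.0
WEIGHTS = {
    "distance_m": -0.05,         # longer passes are harder
    "defenders_between": -0.6,   # each defender in the passing lane lowers the odds
    "pressure": -1.5,            # pressure on the passer, scaled 0..1
}


def completion_probability(features):
    """Logistic stand-in for the trained model: features -> probability in (0, 1)."""
    score = INTERCEPT + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-score))


def difficulty(features):
    """Assumed definition: the less likely a pass is to complete, the harder it is."""
    return 1.0 - completion_probability(features)


safe_pass = {"distance_m": 8.0, "defenders_between": 0, "pressure": 0.1}
risky_pass = {"distance_m": 40.0, "defenders_between": 3, "pressure": 0.8}
print(round(completion_probability(safe_pass), 2))
print(round(completion_probability(risky_pass), 2))
```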

Passing profile and efficiency

We can use this passing or xPass (expected passes) model to estimate the passing profile of a player and his passing efficiency. The passing profile consists of the passing decisions a player makes. Does the player look for short balls or long balls, pass left or right, and how difficult are the passes the player attempts? We can use the xPass model to evaluate how effective a player is in his passing decisions and therefore estimate his impact in the game.

The passing profile is displayed in two ways in the live broadcast. The graphic on the left displays the direction of play a player favors in the current game, featuring the main passing direction and the distribution of passes until that moment in time. The graphic on the right shows additional statistics that complement the passing direction, such as the number of difficult passes a player has attempted so far, including the completion rate and the ratio between short and long passes.

These statistics are further explored at the end of the first half and the start of the second half using a graphical comparison format. In this comparison template, the passing profiles of two players can be compared, showcasing the difference in passing choices as well as completion rate.

In addition to the passing profiles, fans can also explore the passing efficiency of players across Bundesliga games. In the stats section of the Bundesliga app, viewers can see the efficiency of players by comparing their actual completed passes with their expected completed passes (as estimated by the xPass model). This provides a much more objective view on the passing capabilities of players than simply looking at the number of passes and the completion rate.

In this overview, the difficulty of a pass is taken into account. For example, let’s say we have two players that both complete two passes. Player A completes two difficult passes, each with an expected completion rate of 40%. Player B completes two simple passes, each with an expected completion rate of 95%. Evaluating both players using the old metrics would result in both players having completed two passes with a completion rate of 100%. With the new xPass model, we can see that Player A was expected to complete 0.8 passes (40% + 40%) but actually completed two passes, which results in an efficiency score of 2 – 0.8 = 1.2 passes. He is therefore over-performing by 1.2 passes. Player B completed two 95% passes, so we expect him to complete 1.9 passes. He actually completed two passes. This results in an over-performance of 2 – 1.9 = 0.1 passes. Player B is pretty much performing as expected, whereas Player A is putting on a top performance.
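The arithmetic in this example can be written as a short function: efficiency is the number of completed passes minus the sum of the expected completion probabilities. The function name and the tuple layout are illustrative choices, not taken from the production system.

```python
def passing_efficiency(passes):
    """passes: list of (expected_completion_probability, completed) tuples.

    Returns completed passes minus expected completed passes.
    """
    expected = sum(probability for probability, _ in passes)
    actual = sum(1 for _, completed in passes if completed)
    return actual - expected


player_a = [(0.40, True), (0.40, True)]  # two difficult passes, both completed
player_b = [(0.95, True), (0.95, True)]  # two simple passes, both completed

print(round(passing_efficiency(player_a), 2))  # 1.2
print(round(passing_efficiency(player_b), 2))  # 0.1
```

A failed pass still adds its probability to the expected total, so a player who misses many easy passes ends up with a negative score.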

Let’s look at an example of two players that play at the same position (right-back), Lars Bender (Bayer 04 Leverkusen) and Stefan Lainer (Borussia Mönchengladbach) on Match Day 23 from season 2020-2021. When looking at pure passing completion rates, Bender seems to be outperforming Lainer by completing about 90% of his passes. Lainer only completes 60% of his passes and seems to be falling behind. However, if we take a closer look, we find that Lainer plays about 80% of his passes forward. Bender, on the other hand, plays only 15% of his passes forward and seems to be prioritizing safe passes backward. Spotting this kind of risk-taking behavior and the attacking intention of players wasn’t possible before with the standard metrics.
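A comparison like this relies on grouping each player’s passes by direction. The sketch below shows one way to do that. It assumes the team attacks in the positive x direction, that positive y is to the passer’s left, and that each direction covers a 90-degree sector. None of these conventions come from the post, and the sample passes are invented.

```python
import math
from collections import Counter

DIRECTIONS = ("forward", "left", "right", "backward")


def direction_of(dx, dy):
    """Classify a pass by its displacement, assuming the team attacks toward +x."""
    angle = math.degrees(math.atan2(dy, dx))
    if -45.0 <= angle <= 45.0:
        return "forward"
    if 45.0 < angle < 135.0:
        return "left"
    if -135.0 < angle < -45.0:
        return "right"
    return "backward"


def passing_profile(passes):
    """passes: list of (dx, dy, completed) tuples for one player."""
    if not passes:
        return {}
    total = len(passes)
    counts = Counter(direction_of(dx, dy) for dx, dy, _ in passes)
    profile = {name: counts[name] / total for name in DIRECTIONS}
    profile["completion_rate"] = sum(1 for _, _, done in passes if done) / total
    return profile


# Invented sample: two forward passes, one backward, one to the left.
sample = [(10.0, 0.0, True), (5.0, 1.0, False), (-8.0, 0.0, True), (0.0, 6.0, True)]
profile = passing_profile(sample)
print(profile)
```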

Passing profile and efficiency allow us to make this comparison between players that wasn’t previously possible. They allow us to see which players are demonstrating exceptional passing skills and which players aren’t finding their teammates.

Training the passing profile model

The passing profile model is only the tip of the iceberg; behind the scenes we need to account for several important operations, such as continuous training, continuous improvements to the model, continuous deployment of new models, model monitoring, metadata tracking, model lineage, and multi-account deployment. To address these particularities of industrializing ML models, we created training and deployment pipelines. Moreover, looking towards the future development of additional Match Facts, we invested additional time in developing reusable model training and deployment pipelines. These generic pipelines are designed and implemented using the AWS Cloud Development Kit (AWS CDK). Templatizing these pipelines ensures the consistent development of new Match Facts while reducing effort and time to market.

Our architecture considers all three of our environments: development, staging, and production. Given the experimental nature of model training, the actual training pipeline resides in our development environment. This allows our data and ML engineers to freely work and experiment with new features and analysis.

After the team tests the new model and is satisfied with the results, we promote the model from development to staging through an approval chain (pull requests) on Bitbucket. After we test further on staging, we use the same process from staging to production to make the new model available for a live setting.

For the end-to-end workflow, we use AWS Step Functions; all the steps are defined using the AWS CDK. The AWS CDK generates an AWS CloudFormation template containing the final state definitions for the Step Functions state machine in Amazon States Language.

Using AWS CDK and Step Functions allows us to instantiate the same base training pipeline definition for different Match Facts. This setup is flexible and adapts to different Match Fact requirements. For example, we can adjust parameters in a certain step, such as the underlying type of ML algorithm. We can also add, remove, and adjust new steps without needing to change the underlying core structure of the training pipeline. In this manner, our data scientists can focus on creating the best models for the Match Facts, without the burden of creating infrastructure and handling operations.

We have two main workflows (state machines) for any given Match Fact model training pipeline instance: one for the data preprocessing pipeline, and another for the actual training pipeline. This setup avoids running the preprocessing over thousands of matches every time we want to train a new model. Therefore, we can experiment with different parameters for training the model while saving time and money on data preprocessing. Conversely, we can experiment with creating new features without needing to incur costs for training the model immediately afterwards.

The following diagram shows our data preprocessing pipeline.

The state machines consist of various jobs in AWS Glue, functions in AWS Lambda, and SageMaker jobs to provide the end-to-end flexibility to our data scientists. The preprocessing workflow is responsible for the data preprocessing, where the defined Lambda function (Step 1) dynamically fetches the match data from the stored match information, which is then fed to a processing job in AWS Glue (Step 2) that handles the feature extraction from the fetched raw match data. With the nature of positional match data, we have plenty of data that needs to be preprocessed before training. Thanks to the mapping feature of Step Functions, we can run jobs in AWS Glue in parallel, which allows us to save time in preprocessing. Finally, AWS Glue saves the processed match data to Amazon Simple Storage Service (Amazon S3) to be used by the model training state machine.
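To illustrate the fan-out described here, the following sketch builds an Amazon States Language definition as a Python dictionary: a Lambda task that lists the matches, followed by a Map state that starts one AWS Glue job run per match. In the setup described in this post, the definition is generated by the AWS CDK. The state names, the `match_id` field, the job argument name, and the ARN and job name in the usage lines are hypothetical.

```python
import json


def preprocessing_definition(fetch_lambda_arn, glue_job_name, max_concurrency=10):
    """Return a state machine definition that fans out one Glue job run per match."""
    return {
        "Comment": "Sketch of a preprocessing workflow (names are illustrative)",
        "StartAt": "FetchMatches",
        "States": {
            # Step 1: a Lambda function returns the list of matches to process.
            "FetchMatches": {
                "Type": "Task",
                "Resource": fetch_lambda_arn,
                "ResultPath": "$.matches",
                "Next": "ExtractFeatures",
            },
            # Step 2: the Map state runs the Glue job once per match, in parallel.
            "ExtractFeatures": {
                "Type": "Map",
                "ItemsPath": "$.matches",
                "MaxConcurrency": max_concurrency,
                "Iterator": {
                    "StartAt": "RunGlueJob",
                    "States": {
                        "RunGlueJob": {
                            "Type": "Task",
                            # The .sync suffix makes the state wait for the job run to finish.
                            "Resource": "arn:aws:states:::glue:startJobRun.sync",
                            "Parameters": {
                                "JobName": glue_job_name,
                                "Arguments": {"--match_id.$": "$.match_id"},
                            },
                            "End": True,
                        }
                    },
                },
                "End": True,
            },
        },
    }


definition = preprocessing_definition(
    "arn:aws:lambda:eu-central-1:123456789012:function:fetch-matches",  # placeholder ARN
    "passing-feature-extraction",                                        # placeholder job name
)
print(json.dumps(definition, indent=2))
```

`MaxConcurrency` caps how many Glue job runs are in flight at once, which helps keep the fan-out within account limits.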

The following diagram illustrates our training pipeline.

The training pipeline workflow starts the training with a single AWS Glue job (Step 3) that aggregates all the processed match data from the previous step, and shuffles and splits the data into three datasets: training, validation, and test.
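The shuffle-and-split step is shown below in plain Python rather than as an AWS Glue job. The 70/15/15 ratio and the fixed seed are assumptions, because the post doesn’t state the actual proportions.

```python
import random


def split_dataset(rows, train=0.70, validation=0.15, seed=42):
    """Shuffle rows reproducibly and split them into train, validation, and test lists."""
    rows = list(rows)                  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(rows)  # a fixed seed keeps the split repeatable
    n_train = round(len(rows) * train)
    n_validation = round(len(rows) * validation)
    train_rows = rows[:n_train]
    validation_rows = rows[n_train:n_train + n_validation]
    test_rows = rows[n_train + n_validation:]  # whatever remains is the test set
    return train_rows, validation_rows, test_rows


train_rows, validation_rows, test_rows = split_dataset(range(100))
print(len(train_rows), len(validation_rows), len(test_rows))  # 70 15 15
```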

The training and validation datasets are used to train and find the best hyperparameters for the model using SageMaker automatic model tuning (Step 4). Testing data is used by our data scientists to evaluate and analyze the model outcomes and metrics; for instance, to detect problems in our training such as over- or under-fitting. The outcome of the SageMaker tuning job is the model with the hyperparameters that has the best performance.

After we produce the best model, we use several Lambda functions (Step 5) to clean the output and start the process of verifying and registering the new model to the SageMaker Model Registry (Step 7). This allows us to promote the same successfully verified and tested model to the other environments such as staging and production while also having a conditional state that can deploy or update the corresponding SageMaker model endpoints (Step 6).

We train the models in one account (development) and deploy them to different accounts because there’s no need to retrain the models. The deployment pipeline (see the following diagram) allows us to move the trained ML model to other accounts and is driven by the SageMaker Model Registry and Bitbucket custom pipelines.

For governance purposes, we defined a manual approval process using pull requests that can be approved by product owners. After the pull request is approved in Bitbucket (Step 8), we perform cross-account deployments using the SageMaker Model Registry for the desired model to the target environment, such as staging or production (Step 9). This allows us to have a single trail of truth with a consistent model that is tested from the beginning and that we can trace back to its initial release. It also provides an approval process whenever we want to release a new model to the live production environment, for example.

With the aforementioned training and deployment architecture, the Passing Profile Match Fact has fostered faster modifications, faster bug-fixing, faster integration of successful experiments to other environments, and lower operational and development costs.


In this post, we demonstrated how the Bundesliga Match Fact Passing Profile makes it possible to objectively compare the difficulty of passes. We used historical data of nearly 2 million passes to build an ML model on SageMaker, which computes the difficulty of a pass. The model is based on 26 factors, such as distance the ball travels or the pressure the passer is under (for more information about pressure, see the Match Fact Most Pressed Player). We’ve shown how to build a reusable model training pipeline and facilitate multi- and cross-account deployments of ML models with the click of a button.

Passing Profile will be on display in Bundesliga broadcasts and the Bundesliga app starting September 11, 2021. We hope you enjoy the insights this advanced stat will provide. Learn more about the partnership between AWS and Bundesliga by visiting the webpage!

About the Authors

Simon Rolfes played 288 Bundesliga games as a central midfielder, scored 41 goals, and won 26 caps for Germany. Currently, Rolfes serves as Sporting Director at Bayer 04 Leverkusen, where he oversees and develops the pro player roster, the scouting department, and the club’s youth development. Simon also writes weekly columns about the latest Bundesliga Match Facts powered by AWS.

Luuk Figdor is a Senior Sports Technology Specialist in the AWS Professional Services team. He works with players, clubs, leagues and media companies such as the Bundesliga and Formula 1 to help them tell stories with data using machine learning. In his spare time he likes to learn all about the mind and the intersection between psychology, economics, and AI.

Gabriel Anzer is the lead data scientist at Sportec Solutions AG, a subsidiary of the DFL. He works on extracting interesting insights from football data using AI/ML for both fans and clubs. Gabriel’s background is in Mathematics and Machine Learning, but he is additionally pursuing his PhD in Sports Analytics at the University of Tübingen and working on his football coaching license.

Gabriella Hernandez Larios is a data scientist at AWS Professional Services. She works with customers across industries unveiling the power of AI/ML to achieve their business outcomes. Gabriella loves football (soccer) and in her spare time she likes to do sports like running, swimming, yoga, CrossFit and hiking.

Jakub Michalczyk is a Data Scientist at Sportec Solutions AG. Several years ago, he chose Math studies over playing football, as he came to the conclusion he was not good enough at the latter. Now he combines both these passions in his professional career by applying machine learning methods to gain a better insight into this beautiful game. In his spare time, he still enjoys playing seven-a-side football, watching crime movies, and listening to film music.

Murat Eksi is a full-stack technologist at AWS Professional Services. He has worked with various industries including finance, sports and media, gaming, manufacturing, and automotive to accelerate their business outcomes through Application Development, Security, IoT, Analytics, DevOps and Infrastructure. Outside of work, he loves traveling around the world, learning new languages while setting up local events for entrepreneurs and business owners in Stockholm. He also recently started taking flight lessons.




AWS Week in Review – May 16, 2022





This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

I have been on the road for the last five weeks and attended many of the AWS Summits in Europe. It was great to talk to so many of you in person. The Serverless Developer Advocates are going around many of the AWS Summits with the Serverlesspresso booth. If you attend an event that has the booth, say “Hi” to my colleagues, and have a coffee while asking all your serverless questions. You can find all the upcoming AWS Summits in the events section at the end of this post.

Last week’s launches
Here are some launches that got my attention during the previous week.

AWS Step Functions announced a new console experience to debug your state machine executions – Now you can opt in to the new console experience of Step Functions, which makes it easier to analyze, debug, and optimize Standard Workflows. The new page allows you to inspect executions using three different views (graph, table, and event view) and adds many new features to enhance the navigation and analysis of the executions. To learn about all the features and how to use them, read Ben’s blog post.

Example on how the Graph View looks


AWS Lambda now supports Node.js 16.x runtime – Now you can start using the Node.js 16 runtime when you create a new function or update your existing functions to use it. You can also use the new container image base that supports this runtime. To learn more about this launch, check Dan’s blog post.

AWS Amplify announces its Android library designed for Kotlin – The Amplify Android library has been rewritten for Kotlin, and it is now available in preview. This new library provides better debugging capabilities and visibility into underlying state management. It also uses the new AWS SDK for Kotlin that was released last year in preview. Read the What’s New post for more information.

Three new APIs for batch data retrieval in AWS IoT SiteWise – With this new launch AWS IoT SiteWise now supports batch data retrieval from multiple asset properties. The new APIs allow you to retrieve current values, historical values, and aggregated values. Read the What’s New post to learn how you can start using the new APIs.

AWS Secrets Manager now publishes secret usage metrics to Amazon CloudWatch – This launch is very useful to see the number of secrets in your account and set alarms for any unexpected increase or decrease in the number of secrets. Read the documentation on Monitoring Secrets Manager with Amazon CloudWatch for more information.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Some other launches and news that you may have missed:

IBM signed a deal with AWS to offer its software portfolio as a service on AWS. This allows customers using AWS to access IBM software for automation, data and artificial intelligence, and security that is built on Red Hat OpenShift Service on AWS.

Podcast Charlas Técnicas de AWS – If you understand Spanish, this podcast is for you. Podcast Charlas Técnicas is one of the official AWS podcasts in Spanish. This week’s episode introduces you to Amazon DynamoDB and shares stories on how different customers use this database service. You can listen to all the episodes directly from your favorite podcast app or the podcast web page.

AWS Open Source News and Updates – Ricardo Sueiras, my colleague from the AWS Developer Relations team, runs this newsletter. It brings you all the latest open-source projects, posts, and more. Read edition #112 here.

Upcoming AWS Events
It’s AWS Summits season and here are some virtual and in-person events that might be close to you:

You can register for re:MARS to get fresh ideas on topics such as machine learning, automation, robotics, and space. The conference will be in person in Las Vegas, June 21–24.

That’s all for this week. Check back next Monday for another Week in Review!

— Marcia




Personalize your machine translation results by using fuzzy matching with Amazon Translate





A person’s vernacular is part of the characteristics that make them unique. There are often countless different ways to express one specific idea. When a firm communicates with their customers, it’s critical that the message is delivered in a way that best represents the information they’re trying to convey. This becomes even more important when it comes to professional language translation. Customers of translation systems and services expect accurate and highly customized outputs. To achieve this, they often reuse previous translation outputs—called translation memory (TM)—and compare them to new input text. In computer-assisted translation, this technique is known as fuzzy matching. The primary function of fuzzy matching is to assist the translator by speeding up the translation process. When an exact match can’t be found in the TM database for the text being translated, translation management systems (TMSs) often have the option to search for a match that is less than exact. Potential matches are provided to the translator as additional input for final translation. Translators who enhance their workflow with machine translation capabilities such as Amazon Translate often expect fuzzy matching data to be used as part of the automated translation solution.

In this post, you learn how to customize output from Amazon Translate according to translation memory fuzzy match quality scores.

Translation Quality Match

The XML Localization Interchange File Format (XLIFF) standard is often used as a data exchange format between TMSs and Amazon Translate. XLIFF files produced by TMSs include source and target text data along with match quality scores based on the available TM. These scores—usually expressed as a percentage—indicate how close the translation memory is to the text being translated.

Some customers with very strict requirements only want machine translation to be used when match quality scores are below a certain threshold. At or above this threshold, they expect their own translation memory to take precedence. Translators often need to apply these preferences manually either within their TMS or by altering the text data. This flow is illustrated in the following diagram. The machine translation system processes the translation data (text and fuzzy match scores), which is then reviewed and manually edited by translators, based on their desired quality thresholds. Applying thresholds as part of the machine translation step allows you to remove these manual steps, which improves efficiency and optimizes cost.


Figure 1: Machine Translation Review Flow

The solution presented in this post allows you to enforce rules based on match quality score thresholds to drive whether a given input text should be machine translated by Amazon Translate or not. When not machine translated, the resulting text is left to the discretion of the translators reviewing the final output.

Solution Architecture

The solution architecture illustrated in Figure 2 leverages the following services:

  • Amazon Simple Storage Service – Amazon S3 buckets contain the following content:
    • Fuzzy match threshold configuration files
    • Source text to be translated
    • Amazon Translate input and output data locations
  • AWS Systems Manager – We use Parameter Store parameters to store match quality threshold configuration values
  • AWS Lambda – We use two Lambda functions:
    • One function preprocesses the quality match threshold configuration files and persists the data into Parameter Store
    • One function automatically creates the asynchronous translation jobs
  • Amazon Simple Queue Service – An Amazon SQS queue triggers the translation flow as a result of new files coming into the source bucket


Figure 2: Solution Architecture

You first set up quality thresholds for your translation jobs by editing a configuration file and uploading it into the fuzzy match threshold configuration S3 bucket. The following is a sample configuration in CSV format. We chose CSV for simplicity, although you can use any format. Each line represents a threshold to be applied to either a specific translation job or as a default value to any job.

default, 75
SourceMT-Test, 80

The specifications of the configuration file are as follows:

  • Column 1 should be populated with the name of the XLIFF file—without extension—provided to the Amazon Translate job as input data.
  • Column 2 should be populated with the quality match percentage threshold. For any score below this value, machine translation is used.
  • For all XLIFF files whose name doesn’t match any name listed in the configuration file, the default threshold is used—the line with the keyword default set in Column 1.
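The lookup rules in this list can be sketched in a few lines. The function names are hypothetical, and the code assumes exactly the two-column layout of the sample configuration.

```python
import csv
import io
import os


def load_thresholds(csv_text):
    """Parse 'name, threshold' lines into a dict such as {'default': 75}."""
    thresholds = {}
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        thresholds[row[0].strip()] = int(row[1].strip())
    return thresholds


def threshold_for(xliff_filename, thresholds):
    """Use the file-specific threshold if present, otherwise the default line."""
    stem = os.path.splitext(os.path.basename(xliff_filename))[0]
    return thresholds.get(stem, thresholds["default"])


config = load_thresholds("default, 75\nSourceMT-Test, 80\n")
print(threshold_for("SourceMT-Test.xlf", config))  # 80
print(threshold_for("AnotherJob.xlf", config))     # 75
```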


Figure 3: Auto-generated parameter in Systems Manager Parameter Store

When a new file is uploaded, Amazon S3 triggers the Lambda function in charge of processing the parameters. This function reads the threshold parameters and stores them in Parameter Store for future use. Using Parameter Store avoids performing redundant Amazon S3 GET requests each time a new translation job is initiated. The sample configuration file produces the parameter shown in Figure 3.
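The post doesn’t show how the Lambda function writes the values, so the following is a guess at the general shape of that step rather than the actual code. It uses the `put_parameter` and `get_parameter` calls of the Systems Manager client. The parameter name prefix is hypothetical, and the client is passed in so the functions can be exercised without AWS credentials.

```python
PREFIX = "/translate/fuzzy-match/"  # hypothetical naming scheme for the parameters


def store_thresholds(ssm_client, thresholds, prefix=PREFIX):
    """Write each threshold as a String parameter, overwriting any older value."""
    names = []
    for job_name, value in thresholds.items():
        name = prefix + job_name
        ssm_client.put_parameter(
            Name=name, Value=str(value), Type="String", Overwrite=True
        )
        names.append(name)
    return names


def read_threshold(ssm_client, job_name, prefix=PREFIX):
    """Read one threshold back as an integer percentage."""
    response = ssm_client.get_parameter(Name=prefix + job_name)
    return int(response["Parameter"]["Value"])


# Usage inside a Lambda function, assuming boto3 is available:
#   import boto3
#   ssm = boto3.client("ssm")
#   store_thresholds(ssm, {"default": 75, "SourceMT-Test": 80})
#   read_threshold(ssm, "default")
```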

The job initialization Lambda function uses these parameters to preprocess the data prior to invoking Amazon Translate. We use an English-to-Spanish translation XLIFF input file, as shown in the following code. It contains the initial text to be translated, broken down into what is referred to as segments, represented in the source tags.

Segment 1
  Source: Consent Form
  Alternative translation: CONSENT FORM → FORMULARIO DE CONSENTIMIENTO
Segment 2
  Source: Screening Visit:
  Alternative translation: Screening Visit → Selección

The source text has been pre-matched with the translation memory beforehand. The data contains potential translation alternatives, represented as alt-trans tags, alongside a match quality attribute, expressed as a percentage. The business rule is as follows:

  • Segments received with alternative translations and a match quality below the threshold are untouched or empty. This signals to Amazon Translate that they must be translated.
  • Segments received with alternative translations and a match quality at or above the threshold are pre-populated with the suggested target text. Amazon Translate skips those segments.
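The business rule can be sketched with Python's standard XML library. This is an illustration under stated assumptions, not the solution's actual preprocessing code: it assumes XLIFF 1.2 markup (alt-trans elements carrying a match-quality attribute), the sample document is hand-written, and the 40% score on the second segment is invented for the example.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
NS = {"x": XLIFF_NS}

# Hand-written sample; the 40% score is an assumption for illustration.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="consent" source-language="en" target-language="es" datatype="plaintext">
    <body>
      <trans-unit id="1">
        <source>Consent Form</source>
        <alt-trans match-quality="99%">
          <source>CONSENT FORM</source>
          <target>FORMULARIO DE CONSENTIMIENTO</target>
        </alt-trans>
      </trans-unit>
      <trans-unit id="2">
        <source>Screening Visit:</source>
        <alt-trans match-quality="40%">
          <source>Screening Visit</source>
          <target>Selección</target>
        </alt-trans>
      </trans-unit>
    </body>
  </file>
</xliff>"""


def apply_threshold(xliff_text, threshold):
    """Pre-populate targets whose best alternative meets the threshold; empty the others."""
    ET.register_namespace("", XLIFF_NS)  # keep the default namespace on output
    root = ET.fromstring(xliff_text)
    for unit in root.iterfind(".//x:trans-unit", NS):
        # Pick the alternative translation with the highest match quality.
        best_quality, best_text = None, None
        for alt in unit.findall("x:alt-trans", NS):
            alt_target = alt.find("x:target", NS)
            if alt_target is None:
                continue
            quality = float(alt.get("match-quality", "0").rstrip("%"))
            if best_quality is None or quality > best_quality:
                best_quality, best_text = quality, alt_target.text
        if best_quality is None:
            continue  # no alternatives: leave the segment as it is
        target = unit.find("x:target", NS)
        if best_quality >= threshold:
            if target is None:
                # Insert the new target right after the segment's source element.
                target = ET.Element(f"{{{XLIFF_NS}}}target")
                source = unit.find("x:source", NS)
                position = list(unit).index(source) + 1 if source is not None else 0
                unit.insert(position, target)
            target.text = best_text
        elif target is not None:
            target.text = None  # leave empty so the segment is machine translated
    return ET.tostring(root, encoding="unicode")


print(apply_threshold(SAMPLE, 80))
```

With a threshold of 80, the first segment receives the suggested target and the second is left without one, so only the second would be sent for machine translation.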

Let’s assume the quality match threshold configured for this job is 80%. The first segment with 99% match quality isn’t machine translated, whereas the second segment is, because its match quality is below the defined threshold. In this configuration, Amazon Translate produces the following output:

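Segment 1
  Source: Consent Form
  Target: FORMULARIO DE CONSENTIMIENTO
  Alternative translation: CONSENT FORM → FORMULARIO DE CONSENTIMIENTO
Segment 2
  Source: Screening Visit:
  Target: Visita de selección
  Alternative translation: Screening Visit → Selección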

In the second segment, Amazon Translate overwrites the target text initially suggested (Selección) with a higher quality translation: Visita de selección.

One possible extension to this use case could be to reuse the translated output and create our own translation memory. Amazon Translate supports customization of machine translation using translation memory thanks to the parallel data feature. Text segments previously machine translated due to their initial low-quality score could then be reused in new translation projects.

In the following sections, we walk you through the process of deploying and testing this solution. You use an AWS CloudFormation template and data samples to launch an asynchronous translation job personalized with a configurable quality match threshold.


Prerequisites

For this walkthrough, you must have an AWS account. If you don’t have an account yet, you can create and activate one.

Launch AWS CloudFormation stack

  1. Choose Launch Stack:
  2. For Stack name, enter a name.
  3. For ConfigBucketName, enter the S3 bucket containing the threshold configuration files.
  4. For ParameterStoreRoot, enter the root path of the parameters created by the parameters processing Lambda function.
  5. For QueueName, enter the name of the SQS queue that is created to post new file notifications from the source bucket to the job initialization Lambda function. This is the function that reads the threshold configuration.
  6. For SourceBucketName, enter the S3 bucket containing the XLIFF files to be translated. If you prefer to use a preexisting bucket, you need to change the value of the CreateSourceBucket parameter to No.
  7. For WorkingBucketName, enter the S3 bucket Amazon Translate uses for input and output data.
  8. Choose Next.

    Figure 4: CloudFormation stack details

  9. Optionally, on the Stack Options page, add key names and values for any tags you want to assign to the resources about to be created.
  10. Choose Next.
  11. On the Review page, select I acknowledge that this template might cause AWS CloudFormation to create IAM resources.
  12. Review the other settings, then choose Create stack.

AWS CloudFormation takes several minutes to create the resources on your behalf. You can watch the progress on the Events tab on the AWS CloudFormation console. When the stack has been created, you can see a CREATE_COMPLETE message in the Status column on the Overview tab.
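If you prefer to script the deployment instead of using the console, the stack parameters listed above could be assembled as follows. This is a sketch only: the bucket, queue, and path values are hypothetical placeholders, and the helper name is not part of the solution.

```python
def stack_parameters(values):
    """Convert a {name: value} mapping into the list shape CloudFormation expects."""
    return [
        {"ParameterKey": key, "ParameterValue": str(value)}
        for key, value in values.items()
    ]


# All values below are hypothetical placeholders.
params = stack_parameters({
    "ConfigBucketName": "example-threshold-config",
    "ParameterStoreRoot": "/example/thresholds",
    "QueueName": "example-new-file-queue",
    "SourceBucketName": "example-xliff-source",
    "CreateSourceBucket": "Yes",
    "WorkingBucketName": "example-translate-working",
})
print(params[0])
```

The resulting list can be passed as the `Parameters` argument of the CloudFormation `create_stack` call in boto3, together with the template location (not reproduced in this post) and the IAM capability acknowledgment that corresponds to step 11.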

Test the solution

Let’s go through a simple example.

  1. Download the following sample data.
  2. Unzip the content.

There should be two files: an .xlf file in XLIFF format, and a threshold configuration file with .cfg as the extension. The following is an excerpt of the XLIFF file.

Figure 5: English to French sample file extract

  3. On the Amazon S3 console, upload the quality threshold configuration file into the configuration bucket you specified earlier.

The value set for test_En_to_Fr is 75%. You should be able to see the parameters on the Systems Manager console in the Parameter Store section.

  4. Still on the Amazon S3 console, upload the .xlf file into the S3 bucket you configured as source. Make sure the file is under a folder named translate (for example, /translate/test_En_to_Fr.xlf).

This starts the translation flow.

  5. Open the Amazon Translate console.

A new job should appear with a status of In Progress.

Figure 6: In progress translation jobs on Amazon Translate console

  6. When the job is complete, choose the job’s link and review the output.

All segments should have been translated. In the translated XLIFF file, look for segments with additional attributes named lscustom:match-quality, as shown in the following screenshot. These custom attributes identify segments where suggested translation was retained based on score.

Figure 7: Custom attributes identifying segments where suggested translation was retained based on score

These segments were derived from the translation memory according to the quality threshold. All other segments were machine translated.

You have now deployed and tested an automated asynchronous translation job assistant that enforces configurable translation memory match quality thresholds. Great job!
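For readers curious about what happens behind the upload, the following sketch shows how a request for the asynchronous job might be assembled. The keyword names match the Amazon Translate StartTextTranslationJob operation, but the helper name, the input and output prefixes, and the role ARN are assumptions for illustration.

```python
from pathlib import PurePosixPath


def translation_job_request(working_bucket, xliff_key, role_arn, source_lang, target_lang):
    """Build keyword arguments for an asynchronous XLIFF translation job."""
    # The file name without extension doubles as the threshold lookup name.
    job_name = PurePosixPath(xliff_key).stem
    return {
        "JobName": job_name,
        "InputDataConfig": {
            # Hypothetical layout: one input folder per job in the working bucket.
            "S3Uri": f"s3://{working_bucket}/input/{job_name}/",
            "ContentType": "application/x-xliff+xml",
        },
        "OutputDataConfig": {"S3Uri": f"s3://{working_bucket}/output/{job_name}/"},
        "DataAccessRoleArn": role_arn,
        "SourceLanguageCode": source_lang,
        "TargetLanguageCodes": [target_lang],
    }


request = translation_job_request(
    "example-translate-working",              # hypothetical bucket
    "translate/test_En_to_Fr.xlf",
    "arn:aws:iam::123456789012:role/example", # hypothetical role
    "en",
    "fr",
)
print(request["JobName"])
```

In a Lambda function, the dictionary could be passed to `boto3.client("translate").start_text_translation_job(**request)` after the preprocessed XLIFF file has been copied to the input folder.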


Clean up

If you deployed the solution into your account, delete the CloudFormation stack to avoid unexpected costs. You need to empty the S3 buckets manually beforehand, because CloudFormation cannot delete a bucket that still contains objects.


Conclusion

In this post, you learned how to customize your Amazon Translate translation jobs based on standard XLIFF fuzzy matching quality metrics. With this solution, you can greatly reduce the manual labor involved in reviewing machine translated text while also optimizing your usage of Amazon Translate. You can also extend the solution with data ingestion automation and workflow orchestration capabilities, as described in Speed Up Translation Jobs with a Fully Automated Translation System Assistant.

About the Authors

Narcisse Zekpa is a Solutions Architect based in Boston. He helps customers in the Northeast U.S. accelerate their adoption of the AWS Cloud by providing architectural guidelines and designing innovative, scalable solutions. When Narcisse is not building, he enjoys spending time with his family, traveling, cooking, and playing basketball.

Dimitri Restaino is a Solutions Architect at AWS, based out of Brooklyn, New York. He works primarily with Healthcare and Financial Services companies in the North East, helping to design innovative and creative solutions to best serve their customers. Coming from a software development background, he is excited by the new possibilities that serverless technology can bring to the world. Outside of work, he loves to hike and explore the NYC food scene.


Enhance the caller experience with hints in Amazon Lex

We understand speech input better if we have some background on the topic of conversation. Consider a customer service agent at an auto parts wholesaler helping with orders. If the agent knows that the customer is looking for tires, they’re more likely to recognize responses (for example, “Michelin”) on the phone. Agents often pick up such clues or hints based on their domain knowledge and access to business intelligence dashboards. Amazon Lex now supports a hints capability to enhance the recognition of relevant phrases in a conversation. You can programmatically provide phrases as hints during a live interaction to influence the transcription of spoken input. Better recognition drives efficient conversations, reduces agent handling time, and ultimately increases customer satisfaction.

In this post, we review the runtime hints capability and use it to implement verification of callers based on their mother’s maiden name.

Overview of the runtime hints capability

You can provide a list of phrases or words to help your bot with the transcription of speech input. You can use these hints with built-in slot types such as first and last names, street names, city, state, and country. You can also configure these for your custom slot types.

You can use the capability to transcribe names that may be difficult to pronounce or understand. For example, in the following sample conversation, we use it to transcribe the name “Loreck.”

Conversation 1

IVR: Welcome to ACME bank. How can I help you today?

Caller: I want to check my account balance.

IVR: Sure. Which account should I pull up?

Caller: Checking

IVR: What is the account number?

Caller: 1111 2222 3333 4444

IVR: For verification purposes, what is your mother’s maiden name?

Caller: Loreck

IVR: Thank you. The balance on your checking account is 123 dollars.

Words provided as hints are preferred over other similar words. For example, in the second sample conversation, the runtime hint (“Smythe”) is selected over a more common transcription (“Smith”).

Conversation 2

IVR: Welcome to ACME bank. How can I help you today?

Caller: I want to check my account balance.

IVR: Sure. Which account should I pull up?

Caller: Checking

IVR: What is the account number?

Caller: 5555 6666 7777 8888

IVR: For verification purposes, what is your mother’s maiden name?

Caller: Smythe

IVR: Thank you. The balance on your checking account is 456 dollars.

If the name doesn’t match the runtime hint, you can fail the verification and route the call to an agent.

Conversation 3

IVR: Welcome to ACME bank. How can I help you today?

Caller: I want to check my account balance.

IVR: Sure. Which account should I pull up?

Caller: Savings

IVR: What is the account number?

Caller: 5555 6666 7777 8888

IVR: For verification purposes, what is your mother’s maiden name?

Caller: Jane

IVR: There is an issue with your account. For support, you will be forwarded to an agent.

Solution overview

Let’s review the overall architecture for the solution (see the following diagram):

  • We use an Amazon Lex bot integrated with an Amazon Connect contact flow to deliver the conversational experience.
  • We use a dialog code hook in the Amazon Lex bot to invoke an AWS Lambda function that provides the runtime hint at the previous turn of the conversation, that is, one turn before the caller speaks the name.
  • For the purposes of this post, the mother’s maiden name data used for authentication is stored in an Amazon DynamoDB table.
  • After the caller is authenticated, control is passed to the bot to perform transactions (for example, checking the balance).

In addition to the Lambda function, you can also send runtime hints to Amazon Lex V2 using the PutSession, RecognizeText, RecognizeUtterance, or StartConversation operations. The runtime hints can be set at any point in the conversation and are persisted at every turn until cleared.
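As a rough sketch, a dialog code hook could attach hints to its response as shown below. The `runtimeHints` structure follows the Lex V2 session state format; the intent name `CheckBalance` and slot name `mothersMaidenName` used in the example are hypothetical, because this post doesn't list the sample bot's intent and slot names.

```python
def elicit_slot_with_hints(event, slot_name, phrases):
    """Build a Lex V2 code hook response that elicits a slot and sets runtime hints for it."""
    session_state = event["sessionState"]
    intent = session_state["intent"]
    return {
        "sessionState": {
            "sessionAttributes": session_state.get("sessionAttributes", {}),
            "dialogAction": {"type": "ElicitSlot", "slotToElicit": slot_name},
            "intent": intent,
            # Hints are keyed by intent name, then by slot name.
            "runtimeHints": {
                "slotHints": {
                    intent["name"]: {
                        slot_name: {
                            "runtimeHintValues": [{"phrase": p} for p in phrases]
                        }
                    }
                }
            },
        }
    }


# Hypothetical event fragment; a real Lex V2 event carries more fields.
event = {"sessionState": {"intent": {"name": "CheckBalance", "slots": {}, "state": "InProgress"}}}
response = elicit_slot_with_hints(event, "mothersMaidenName", ["Loreck"])
print(response["sessionState"]["runtimeHints"])
```

Because the hint travels with the response that asks the question, it is in place when the caller's next utterance is transcribed.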

Deploy the sample Amazon Lex bot

To create the sample bot and configure the runtime phrase hints, perform the following steps. This creates an Amazon Lex bot called BankingBot, and one slot type (accountNumber).

  1. Download the Amazon Lex bot.
  2. On the Amazon Lex console, choose Actions, Import.
  3. Choose the file that you downloaded, and choose Import.
  4. Choose the bot BankingBot on the Amazon Lex console.
  5. Choose the language English (GB).
  6. Choose Build.
  7. Download the supporting Lambda code.
  8. On the Lambda console, create a new function and select Author from scratch.
  9. For Function name, enter BankingBotEnglish.
  10. For Runtime, choose Python 3.8.
  11. Choose Create function.
  12. In the Code source section, open and delete the existing code.
  13. Download the function code and open it in a text editor.
  14. Copy the code and enter it into the empty function code field.
  15. Choose Deploy.
  16. On the Amazon Lex console, select the bot BankingBot.
  17. Choose Deployment and then Aliases, then choose the alias TestBotAlias.
  18. On the Aliases page, choose Languages and choose English (GB).
  19. For Source, select the Lambda function BankingBotEnglish.
  20. For Lambda version or alias, enter $LATEST.
  21. On the DynamoDB console, choose Create table.
  22. Provide the name as customerDatabase.
  23. Provide the partition key as accountNumber.
  24. Add an item with accountNumber “1111222233334444” and mothersMaidenName “Loreck”.
  25. Add an item with accountNumber “5555666677778888” and mothersMaidenName “Smythe”.
  26. Make sure the Lambda function has permissions to read from the DynamoDB table customerDatabase.
  27. On the Amazon Connect console, choose Contact flows.
  28. In the Amazon Lex section, select your Amazon Lex bot and make it available for use in the Amazon Connect contact flow.
  29. Download the contact flow to integrate with the Amazon Lex bot.
  30. Choose the contact flow to load it into the application.
  31. Make sure the right bot is configured in the “Get Customer Input” block.
  32. Choose a queue in the “Set working queue” block.
  33. Add a phone number to the contact flow.
  34. Test the IVR flow by calling in to the phone number.
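Steps 21 to 26 set up the table that the Lambda function reads. A minimal sketch of the lookup and comparison might look like the following; it is not the downloadable function code. The table object is passed in so the logic can be tried without AWS access, and in a Lambda function it would typically be `boto3.resource("dynamodb").Table("customerDatabase")`.

```python
def verify_caller(table, account_number, spoken_name):
    """Return True when spoken_name matches the stored maiden name for the account."""
    # Callers read the number in groups, so drop spaces before the lookup.
    key = {"accountNumber": account_number.replace(" ", "")}
    item = table.get_item(Key=key).get("Item")
    if item is None:
        return False  # unknown account: fail verification
    stored = item["mothersMaidenName"].strip().lower()
    return stored == spoken_name.strip().lower()
```

A `False` result corresponds to the third sample conversation, where the call is forwarded to an agent.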

Test the solution

You can now call in to the Amazon Connect phone number and interact with the bot.


Conclusion

Runtime hints allow you to influence the transcription of words or phrases dynamically in the conversation. You can use business logic to identify the hints as the conversation evolves. Better recognition of the user input allows you to deliver an enhanced experience. You can configure runtime hints via the Lex V2 SDK. The capability is available in all AWS Regions where Amazon Lex operates in the English (Australia), English (UK), and English (US) locales.

To learn more, refer to runtime hints.

About the Authors

Kai Loreck is a professional services Amazon Connect consultant. He works on designing and implementing scalable customer experience solutions. In his spare time, he can be found playing sports, snowboarding, or hiking in the mountains.

Anubhav Mishra is a Product Manager with AWS. He spends his time understanding customers and designing product experiences to address their business challenges.

Sravan Bodapati is an Applied Science Manager at AWS Lex. He focuses on building cutting-edge artificial intelligence and machine learning solutions for AWS customers in the ASR and NLP space. In his spare time, he enjoys hiking, learning economics, watching TV shows, and spending time with his family.


Copyright © 2021 Today's Digital.