

Augment search with metadata by chaining Amazon Textract, Amazon Comprehend, and Amazon Kendra




Amazon Kendra is an intelligent search service powered by machine learning (ML). Amazon Kendra reimagines enterprise search for your websites and applications so your employees and customers can easily find the content they’re looking for, even when it’s scattered across multiple locations and content repositories within your organization. With Amazon Kendra, you can stop searching through troves of unstructured data and discover the right answers to your questions, when you need them.

Although Amazon Kendra is a great search tool, it only performs as well as the quality of the documents in its index. As with all things AI/ML related, the better the quality of the data that goes into Amazon Kendra, the more targeted and precise the search results. So how can we improve the documents in our Amazon Kendra index to maximize search result performance? To allow Amazon Kendra to return more targeted search results, we enrich the documents with metadata attributes such as main language, named entities, key phrases, and more.

In this post, we address the following use case: With large amounts of raw historical documents to search on, how do we connect metadata to the documents to take advantage of Amazon Kendra’s boosting and filtering features? We aim to demonstrate a way in which you can enrich your historical data by adding metadata to searchable documents with Amazon Textract and Amazon Comprehend, to get more targeted and flexible searches with Amazon Kendra.

Note that the following tutorial is written using Amazon SageMaker Notebooks as a code platform. However, these API calls can be made from any IDE of your choice. To save costs, and for the sake of your own familiarity, feel free to use your favorite IDE in place of SageMaker Notebooks to follow along.

Solution overview

For this post, we examine a hypothetical use case for a media entertainment company. We have many documents about movies and television shows and want to use Amazon Kendra to query the data. For demonstration purposes, we pull public data from Wikipedia to create PDF documents that act as the company data that we want to query. We use an Amazon SageMaker notebook instance as our code platform. We use Python, along with the Boto3 Python library, to connect to and use the Amazon Textract, Amazon Comprehend, and Amazon Kendra APIs.

We walk you through the following high-level steps:

  1. Create our media PDF documents through Wikipedia.
  2. Create metadata using Amazon Textract and Amazon Comprehend.
  3. Configure the Amazon Kendra index and load the data.
  4. Run a sample query and experiment with boosting query performance.


As a prerequisite, we first set up a SageMaker notebook instance and a Python notebook within it.

Create a SageMaker notebook instance

To create a SageMaker notebook instance, you can follow the instructions in the documentation Create a Notebook Instance, or follow the configuration that we use in this post.

  1. Create a notebook instance with the following configuration:
    1. Notebook instance name – KendraAugmentation
    2. Notebook instance class – ml.t2.medium
    3. Elastic inference – None

Next, we create an AWS Identity and Access Management (IAM) role.

  1. Choose Create a new role.
  2. Choose Next to create a role.

The role name starts with AmazonSageMaker-ExecutionRole-xxxxxxxx. For this example, we create a role called AmazonSageMaker-ExecutionRole-Kendra-Blog.

  1. For Root access, select Enable.
  2. Leave the remaining options at their default.
  3. Choose Create notebook instance.

You’re redirected to a page that shows that your notebook instance is being created. The process takes a few minutes. When you see a green InService state, the notebook is ready.

Create a Python3 notebook in your SageMaker notebook instance

When your SageMaker instance is ready, choose the version of Jupyter you prefer to use. For this post, we use the original Jupyter notebook as opposed to JupyterLab.

When inside, create a new conda_python3 notebook.

With this, we’re ready to start writing and running Python code. Run the following code in the first Jupyter notebook cell to import the modules that the rest of the code needs:

    # Module imports.
    import boto3
    import os

As we go through each of the sections, we import other modules as necessary.

Create Media PDF documents through Wikipedia

Run the following code in one of the Jupyter notebook cells to create an Amazon Simple Storage Service (Amazon S3) bucket where we store all the media documents that we search on:

    # Instantiate Amazon S3 client.
    s3_client = boto3.client('s3')

    # Create bucket.
    # Outside us-east-1, also pass
    # CreateBucketConfiguration={'LocationConstraint': '<your Region>'}.
    bucket_name = "kendra-augmentation-documents-jp"
    s3_client.create_bucket(Bucket=bucket_name)

    # List buckets to make sure bucket was created.
    response = s3_client.list_buckets()
    print('Existing buckets:')
    for bucket in response['Buckets']:
        print(f'  {bucket["Name"]}')

For this post, our bucket is kendra-augmentation-documents-jp. Because S3 bucket names are globally unique, you will likely need to update the code with a different name.

As we mentioned earlier, we create mock PDF documents from public Wikipedia content that represent the media data that we augment and perform searches on. I’ve pre-selected movies and TV shows from the entertainment industry in the following code, but you can choose different topics in your notebook.

    # Install the third-party packages this cell relies on.
    !pip install wikipedia fpdf

    # Import wikipedia and fpdf modules.
    import wikipedia
    from fpdf import FPDF

    # Movie and TV show topics for media documents.
    topics = ["Dumb & Dumber Movie", "Black Panther Movie", "Star Wars The Last Jedi Movie",
              "Mary Poppins Movie", "Kung Fu Panda Movie", "I Love Lucy", "The Office TV Show",
              "Star Trek: The Original Series", "NCIS TV Show", "Game of Thrones TV Show"]

    # Create PDF documents out of each topic and store into Amazon S3 bucket.
    for topic in topics:
        # Define text and PDF document names.
        text_filename = f"{topic}.txt".replace(" ", "_")
        pdf_filename = text_filename.replace("txt", "pdf")

        # Write to text first.
        with open(text_filename, "w+") as f:
            f.write(wikipedia.summary(topic))

        # Convert text to pdf.
        pdf = FPDF()
        with open(text_filename, 'rb') as text_file:
            txt = text_file.read().decode('latin-1')
        pdf.set_font('Times', '', 12)
        pdf.add_page()
        pdf.multi_cell(0, 5, txt)
        pdf.output(pdf_filename, 'F')

        # Upload to Amazon S3 bucket.
        s3_client = boto3.client('s3')
        s3_client.upload_file(pdf_filename, "kendra-augmentation-documents-jp", pdf_filename)

        # Remove the local copies.
        os.remove(text_filename)
        os.remove(pdf_filename)

When this code block finishes running, we have 10 media PDF documents that we can augment with metadata using Amazon Textract and Amazon Comprehend, then run queries on with Amazon Kendra.
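The PDF step above decodes the text file as Latin-1 because the built-in FPDF fonts such as Times only cover that character range, which can turn accented or typographic characters from Wikipedia into stray symbols. The following is a minimal sketch, not part of the original notebook, of one way to clean a summary before handing it to multi_cell. The helper name to_latin1 and the sample string are hypothetical.

```python
def to_latin1(text: str, placeholder: str = "?") -> str:
    """Return text with every character outside Latin-1 replaced by placeholder."""
    # Latin-1 covers code points 0 through 255.
    return "".join(ch if ord(ch) < 256 else placeholder for ch in text)


# Made-up sample: the accented e is Latin-1, the star is not.
sample = "Pok\u00e9mon \u2605 sample summary"
clean = to_latin1(sample)

# The cleaned string can always be encoded as Latin-1.
clean.encode("latin-1")
print(clean)
```

With a helper like this, the loop could pass to_latin1(wikipedia.summary(topic)) straight to pdf.multi_cell and skip the intermediate text file.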

Create metadata using Amazon Textract and Amazon Comprehend

To create metadata for each of our PDF files, we must first extract the text portions of each PDF using Amazon Textract. We run the extracted text through Amazon Comprehend to attach attributes (metadata) to the PDFs, such as named entities, dominant language, and key phrases. Note that Amazon Comprehend will be able to read PDF files directly in a future feature release.

  1. Use the following helper function (s3_get_filenames) to get all the file names in a specific bucket or prefix folder in Amazon S3:

    def s3_get_filenames(bucket, prefix=None):
        """
        Gets all the filenames in a specific bucket/prefix folder in Amazon S3.

        Parameters:
        ----------
        bucket : str
            String representing bucket name you want to get filenames from.
        prefix : str
            String representing prefix within the bucket that you want to
            get filenames from.

        Returns:
        -------
        file_list : list[str]
            List containing all filenames within the bucket/prefix location
        """
        # Set Amazon S3 client and get file objects.
        s3 = boto3.client('s3')
        if prefix is None:
            prefix = ''
        result = s3.list_objects(Bucket=bucket, Prefix=prefix)

        # Put all file names into one list.
        file_list = []
        for obj in result['Contents']:
            # Only take objects that are not the folder.
            if 'metadata/' not in obj['Key']:
                file_list += [obj['Key']]
        return file_list

We run Amazon Textract on each of our PDF files to extract the text of each file and transform the data into a format that we later ingest into Amazon Comprehend.
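The s3_get_filenames helper calls list_objects once, which returns at most 1,000 keys and omits the Contents field when nothing matches. That is fine for our 10 documents. For a larger historical archive, a paginated variant along the following lines may be safer. This is a sketch under assumptions: the function name list_keys and the skip_substring parameter are hypothetical, and the client is passed in so the function can be exercised without AWS access.

```python
def list_keys(s3_client, bucket, prefix="", skip_substring="metadata/"):
    """Return every object key under bucket/prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page (or the whole listing) is empty.
        for obj in page.get("Contents", []):
            if skip_substring not in obj["Key"]:
                keys.append(obj["Key"])
    return keys


# Usage inside the notebook (requires AWS credentials):
#   input_documents = list_keys(boto3.client("s3"), "kendra-augmentation-documents-jp")
```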

Next, we create the S3 bucket and service role settings needed to run Amazon Textract through SageMaker notebook instances.

  1. Create a new S3 bucket to store our Amazon Textract output, called kendra-augmentation-textract-output-jp:

    # Create a new bucket for Amazon Textract text outputs:
    bucket_name = "kendra-augmentation-textract-output-jp"
    s3_client.create_bucket(Bucket=bucket_name)

  1. Attach the AmazonTextractFullAccess policy to the same AmazonSageMaker-ExecutionRole-Kendra-Blog role.

This policy allows SageMaker to access Amazon Textract.

  1. Run the following code to run Amazon Textract on our PDF files, create new .txt files for Amazon Comprehend to use, and send these files to the S3 bucket we created:

    # Amazon Textract specific imports.
    !pip install amazon-textract-caller amazon-textract-prettyprinter
    from textractcaller.t_call import call_textract, Textract_Features
    from textractprettyprinter.t_pretty_print import Textract_Pretty_Print, get_lines_string
    import textractprettyprinter.t_pretty_print as prettyprint

    # Transform and output documents from s3.
    input_bucket = 'kendra-augmentation-documents-jp'
    output_bucket = 'kendra-augmentation-textract-output-jp'
    s3 = boto3.client('s3')

    # Get all the names of the media documents we want to run Textract on.
    input_documents = s3_get_filenames(input_bucket)

    # Loop through input documents, transform textract outputs into LINE
    # transformations, and output to S3 for ingestion into Comprehend.
    for document_name in input_documents:
        # Define input document to read.
        input_document = f's3://{input_bucket}/{document_name}'

        # Get text using Textract call_textract function.
        textract_json = call_textract(input_document=input_document)

        # Convert response from Textract using get_lines_string function.
        line_transformation_text = get_lines_string(textract_json=textract_json)

        # Put text into text file to send back to S3.
        filename = f'{document_name}_LINE.txt'
        with open(filename, 'w+') as f:
            f.write(line_transformation_text)

        # Send text file to S3 to be ingested into Comprehend.
        with open(filename, 'rb') as data:
            s3.upload_fileobj(data, output_bucket, filename)

We now have a .txt Amazon Textract output file for each of our PDFs in the kendra-augmentation-textract-output-jp bucket.
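To make the LINE transformation less of a black box, here is a small sketch of the idea behind it: a Textract response carries a list of Blocks, and the LINE blocks hold the text of each detected line. This is not the implementation of get_lines_string, only an approximation of what it yields for simple pages. The function name lines_from_textract and the sample response are made up for illustration.

```python
def lines_from_textract(response: dict) -> str:
    """Join the Text of every LINE block in a Textract response, one per row."""
    lines = [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]
    return "\n".join(lines)


# Made-up response with the same overall shape as Textract output.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "First line of the page."},
        {"BlockType": "WORD", "Text": "First"},
        {"BlockType": "LINE", "Text": "Second line of the page."},
    ]
}
print(lines_from_textract(sample_response))
```

Working from LINE blocks rather than WORD blocks keeps sentences intact, which matters because Amazon Comprehend detects entities and key phrases from running text.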

Now that we have our .txt files with the text representation of our PDF files, we can create metadata out of them using Amazon Comprehend.

  1. Attach the ComprehendFullAccess policy to the AmazonSageMaker-ExecutionRole-Kendra-Blog role.

We extract and attach the following Amazon Comprehend metadata attributes to each document for Amazon Kendra to index on:

  • Dominant language – The language that’s being used the most in the document
  • Named entities – A textual reference to the unique name of a real-world object, such as people, places, and commercial items, and precise references to measures such as dates and quantities
  • Key phrases – A string containing a noun phrase that describes a particular thing
  • Sentiment – The positive, negative, neutral, and mixed sentiment score of the entire document

  1. Use the following ComprehendAnalyzer Python class to simplify and unify the Amazon Comprehend API calls. Either copy and paste the code into one of the notebook cells and run it, or create a separate .py file and import it in the notebook.

    import boto3
    import json

    class ComprehendAnalyzer:
        """
        Class that takes a document in Amazon S3 and uses Amazon Comprehend to
        attach metadata to it as attributes for the purpose of being used by
        Amazon Kendra downstream.
        """

        def __init__(self, s3_bucket, document=None, lang=None):
            """
            Instantiates Amazon S3 and Amazon Comprehend clients plus class attributes.
            """
            # Instantiate Amazon S3 and Amazon Comprehend.
            self.s3 = boto3.client('s3')
            self.comprehend = boto3.client('comprehend')

            # Instantiate class attributes.
            self.s3_bucket = s3_bucket
            self.document = document
            # Default to English when no language code is passed in.
            self.lang = 'en' if lang is None else lang

            # Attribute list that will be used by Amazon Kendra downstream.
            self.attribute_list = []

        def set_document(self, document):
            """
            Sets self.document whenever you want to analyze a new document without
            having to instantiate another comprehend_analyzer object.

            Parameters:
            ----------
            document : str
                String representing document filepath to process.

            Returns:
            -------
            Void
            """
            # Set self.document to new document and reset self.attribute_list.
            self.document = document
            self.attribute_list = []

        def get_dominant_languages(self, confidence_threshold=.75):
            """
            Gets the dominant languages in self.document in self.s3_bucket in
            string format. Only add languages with a confidence score that is
            greater than or equal to the confidence_threshold input parameter.

            Parameters:
            ----------
            confidence_threshold : float
                Float representing confidence threshold for adding languages
                to metadata. Defaults to .75.

            Returns:
            -------
            languages_text : str
                String representing all the dominant languages found in the
                document that are greater than, or equal to, the
                confidence_threshold input parameter.
            """
            # Grab text from document.
            test_text = self.s3.get_object(Bucket=self.s3_bucket, Key=self.document)['Body'].read()

            # Detect language using Amazon Comprehend.
            comprehend_response = self.comprehend.detect_dominant_language(Text = test_text.decode('utf-8'))

            # Take languages over confidence_threshold from comprehend_response.
            languages = []
            for l in comprehend_response['Languages']:
                if l['Score'] >= confidence_threshold:
                    languages.append(l['LanguageCode'])
            languages_text = ', '.join(languages)

            # Attribute dictionary input.
            attribute_format = {'Key' : 'Languages',
                                'Value' : {'StringValue' : languages_text}}

            # Add languages to self.attribute_list.
            self.attribute_list.append(attribute_format)
            return languages_text

        def get_named_entities(self, confidence_threshold=.75):
            """
            Gets the named entities in self.document in Amazon S3 as a list.
            Only add named entities with a confidence score that is greater
            than or equal to the confidence_threshold input parameter.

            Parameters:
            ----------
            confidence_threshold : float
                Float representing confidence threshold for adding named
                entities to metadata. Defaults to .75.

            Returns:
            -------
            named_entities : list
                List representing all the named entities found in the document
                that are greater than, or equal to, the confidence_threshold
                input parameter.
            """
            # Grab text from document.
            test_text = self.s3.get_object(Bucket=self.s3_bucket, Key=self.document)['Body'].read()

            # Detect named entities using Amazon Comprehend.
            comprehend_response = self.comprehend.detect_entities(Text = test_text.decode('utf-8'),
                                                                  LanguageCode=self.lang)

            # Take named entities over confidence_threshold from comprehend_response.
            named_entities = []
            for entity in comprehend_response['Entities']:
                if entity['Score'] >= confidence_threshold:
                    named_entities.append(entity['Text'])

            # Attribute dictionary input. Only the first 10 entities are attached.
            attribute_format = {'Key' : 'Named_Entities',
                                'Value' : {'StringListValue' : named_entities[0:10]}}

            # Add named entities to self.attribute_list.
            self.attribute_list.append(attribute_format)
            return named_entities

        def get_key_phrases(self, confidence_threshold=.75):
            """
            Gets key phrases in self.document in Amazon S3 as a list. Only add
            key phrases with a confidence score that is greater than or equal
            to the confidence_threshold input parameter.

            Parameters:
            ----------
            confidence_threshold : float
                Float representing confidence threshold for adding key phrases
                to metadata. Defaults to .75.

            Returns:
            -------
            key_phrases : list
                List representing all the key phrases found in the document
                that are greater than, or equal to, the confidence_threshold
                input parameter.
            """
            # Grab text from document.
            test_text = self.s3.get_object(Bucket=self.s3_bucket, Key=self.document)['Body'].read()

            # Detect key phrases using Amazon Comprehend.
            comprehend_response = self.comprehend.detect_key_phrases(Text = test_text.decode('utf-8'),
                                                                     LanguageCode=self.lang)

            # Take key phrases over confidence_threshold from comprehend_response.
            key_phrases = []
            for phrase in comprehend_response['KeyPhrases']:
                if phrase['Score'] >= confidence_threshold:
                    key_phrases.append(phrase['Text'])

            # Attribute dictionary input. Only the first 10 phrases are attached.
            attribute_format = {'Key' : 'Key_Phrases',
                                'Value' : {'StringListValue' : key_phrases[0:10]}}

            # Add key phrases to self.attribute_list.
            self.attribute_list.append(attribute_format)
            return key_phrases

        def get_sentiment(self):
            """
            Gets the sentiment of self.document in Amazon S3 as a dictionary.
            Scores are converted to whole-number percentages (0 to 100).

            Parameters:
            ----------
            None

            Returns:
            -------
            sentiment_dict : dict
                Dictionary representing the sentiment found in the document,
                broken down into overall sentiment and individual scores for
                positive, negative, neutral, and mixed sentiments.
            """
            # Grab text from document.
            test_text = self.s3.get_object(Bucket=self.s3_bucket, Key=self.document)['Body'].read()

            # Detect sentiment using Amazon Comprehend.
            comprehend_response = self.comprehend.detect_sentiment(Text = test_text.decode('utf-8'),
                                                                   LanguageCode=self.lang)

            # Add sentiment scores to self.attribute_list.
            attribute_format = [
                {'Key' : 'Sentiment',
                 'Value' : {'StringValue' : comprehend_response['Sentiment']}},
                {'Key' : 'Positive_Score',
                 'Value' : {'LongValue' : int(comprehend_response['SentimentScore']['Positive']*100)}},
                {'Key' : 'Negative_Score',
                 'Value' : {'LongValue' : int(comprehend_response['SentimentScore']['Negative']*100)}},
                {'Key' : 'Neutral_Score',
                 'Value' : {'LongValue' : int(comprehend_response['SentimentScore']['Neutral']*100)}},
                {'Key' : 'Mixed_Score',
                 'Value' : {'LongValue' : int(comprehend_response['SentimentScore']['Mixed']*100)}}]
            self.attribute_list += attribute_format

            # Add same information to sentiment_dict.
            sentiment_dict = {}
            sentiment_dict['Sentiment'] = comprehend_response['Sentiment']
            sentiment_dict['Positive_Score'] = int(comprehend_response['SentimentScore']['Positive']*100)
            sentiment_dict['Negative_Score'] = int(comprehend_response['SentimentScore']['Negative']*100)
            sentiment_dict['Neutral_Score'] = int(comprehend_response['SentimentScore']['Neutral']*100)
            sentiment_dict['Mixed_Score'] = int(comprehend_response['SentimentScore']['Mixed']*100)
            return sentiment_dict

We now have everything we need to create an Amazon Kendra index, create and add metadata to the index, and start boosting and filtering our Amazon Kendra searches!
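One detail of ComprehendAnalyzer worth a closer look is how it caps list attributes: it keeps the first 10 entities or key phrases in document order, so repeated mentions of the same entity can crowd out others. The following sketch shows an alternative selection rule that ranks by confidence and drops case-insensitive duplicates. It is an illustration, not the class's behavior. The function name top_unique_texts and the sample scores are hypothetical; only the Text and Score fields mirror the Amazon Comprehend response.

```python
def top_unique_texts(items, confidence_threshold=0.75, limit=10):
    """Return up to `limit` distinct Text values, highest Score first."""
    kept = [item for item in items if item["Score"] >= confidence_threshold]
    # sorted() is stable, so ties keep their original document order.
    kept = sorted(kept, key=lambda item: item["Score"], reverse=True)

    seen, result = set(), []
    for item in kept:
        if len(result) >= limit:
            break
        key = item["Text"].casefold()
        if key not in seen:
            seen.add(key)
            result.append(item["Text"])
    return result


# Made-up entities shaped like comprehend_response['Entities'].
sample_entities = [
    {"Text": "Star Wars", "Score": 0.99},
    {"Text": "star wars", "Score": 0.95},
    {"Text": "Disney", "Score": 0.80},
    {"Text": "2017", "Score": 0.50},
]
print(top_unique_texts(sample_entities))
```

Inside get_named_entities, the StringListValue could then be built from top_unique_texts(comprehend_response['Entities'], confidence_threshold) instead of named_entities[0:10].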

Configure our Amazon Kendra index

Now that we’ve got our Amazon Textract outputs and our Amazon Comprehend class in ComprehendAnalyzer, we can put everything together with Amazon Kendra.

Configure Amazon Kendra IAM access

Like in the previous steps, we need to give SageMaker access to use Amazon Kendra by attaching the AmazonKendraFullAccess policy to the AmazonSageMaker-ExecutionRole-Kendra-Blog role. Then we create an IAM policy and service role.

  1. Attach the AmazonKendraFullAccess policy to the AmazonSageMaker-ExecutionRole-Kendra-Blog role.

To create an index with Amazon Kendra, we first create an IAM policy that lets Amazon Kendra access our CloudWatch Logs, and then create an Amazon Kendra service role. For full instructions, see the Amazon Kendra Developer Guide. I outline the exact steps in this section for your convenience.

  1. On the IAM console, choose Policies in the navigation pane.
  2. Choose Create policy.
  3. Choose JSON and replace the default policy with the following:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "cloudwatch:PutMetricData"
                ],
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        "cloudwatch:namespace": "AWS/Kendra"
                    }
                }
            },
            {
                "Effect": "Allow",
                "Action": [
                    "logs:DescribeLogGroups"
                ],
                "Resource": "*"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup"
                ],
                "Resource": [
                    "arn:aws:logs:region:account ID:log-group:/aws/kendra/*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "logs:DescribeLogStreams",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents"
                ],
                "Resource": [
                    "arn:aws:logs:region:account ID:log-group:/aws/kendra/*:log-stream:*"
                ]
            }
        ]
    }

  1. Choose Review policy.
  2. Name the policy KendraPolicyForGettingStartedIndex and choose Create policy.
  3. In the navigation pane, choose Roles.
  4. Choose Create role.
  5. Choose Another AWS account and enter your account ID.
  6. Choose Next: Permissions.
  7. Choose the policy that you just created and choose Next: Tags.
  8. Don’t add any tags and choose Next: Review.
  9. Name the role KendraRoleForGettingStartedIndex and choose Create role.
  10. Find the role that you just created and open the role summary.
  11. Choose Trust relationships and then choose Edit trust relationship.
  12. Replace the existing trust relationship with the following:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "kendra.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

  1. Choose Update trust policy.
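The permissions policy in the steps above contains two placeholders, region and account ID, that must be replaced before IAM accepts it. As a sketch of one way to avoid hand-editing the JSON, the following builds the same policy document in Python. The function name kendra_logging_policy is hypothetical, and the Region and account number in the example are dummy values.

```python
import json


def kendra_logging_policy(region: str, account_id: str) -> dict:
    """Build the CloudWatch permissions policy for the Amazon Kendra index role."""
    log_group = f"arn:aws:logs:{region}:{account_id}:log-group:/aws/kendra/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["cloudwatch:PutMetricData"],
                "Resource": "*",
                "Condition": {"StringEquals": {"cloudwatch:namespace": "AWS/Kendra"}},
            },
            {"Effect": "Allow", "Action": ["logs:DescribeLogGroups"], "Resource": "*"},
            {"Effect": "Allow", "Action": ["logs:CreateLogGroup"], "Resource": [log_group]},
            {
                "Effect": "Allow",
                "Action": [
                    "logs:DescribeLogStreams",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": [f"{log_group}:log-stream:*"],
            },
        ],
    }


# Dummy Region and account number; substitute your own.
policy = kendra_logging_policy("us-east-1", "123456789012")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the policy editor on the IAM console.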

Create your Amazon Kendra Index

Now that we’ve got all the policies and roles that we need, let’s create our Amazon Kendra index using the following code. You have to update the role ARN with your AWS account number.

    import time

    # Instantiate Amazon Kendra client.
    kendra = boto3.client('kendra')

    # Set index name and input service role.
    # Replace {your_account_no} with your AWS account number. The ARN must point
    # to the role (KendraRoleForGettingStartedIndex), not the policy. If in doubt,
    # copy the exact ARN from the role summary page on the IAM console.
    index_name = "blog-media-company-index"
    index_role_arn = "arn:aws:iam::{your_account_no}:role/KendraRoleForGettingStartedIndex"

    # Create Amazon Kendra index.
    index_response = kendra.create_index(
        Name = index_name,
        RoleArn = index_role_arn
    )

    # Get index ID for reference.
    index_id = index_response["Id"]

    # Check status of index.
    while True:
        # Get index description.
        index_description = kendra.describe_index(Id = index_id)
        # When status is not CREATING quit.
        status = index_description["Status"]
        print("    Creating index. Status: " + status)
        if status != "CREATING":
            break
        time.sleep(60)

When this code block is done running, the last status printed should be ACTIVE, which means your Amazon Kendra index has been created.
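The polling loop above runs until the status changes, which can leave a notebook cell spinning if index creation stalls. Here is a hedged sketch of the same idea wrapped in a function with a timeout. The name wait_for_index and its parameters are hypothetical. The only Amazon Kendra call it relies on is describe_index, the same one used above.

```python
import time


def wait_for_index(kendra_client, index_id, poll_seconds=60,
                   timeout_seconds=3600, sleep=time.sleep):
    """Poll describe_index until the index leaves CREATING, then return its status."""
    waited = 0
    while True:
        status = kendra_client.describe_index(Id=index_id)["Status"]
        if status != "CREATING":
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Index {index_id} still CREATING after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage inside the notebook (requires AWS credentials):
#   final_status = wait_for_index(kendra, index_id)
#   print(final_status)
```

Returning the final status lets the caller distinguish ACTIVE from FAILED instead of assuming success.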

Define the Amazon Kendra index metadata configuration

We now define the metadata configuration for the index blog-media-company-index we just made. It follows the Amazon Comprehend attributes we defined in our Python class ComprehendAnalyzer. See the following code:

    # Since comprehend_analyzer has the ability to create 8 new attributes with Amazon
    # Textract and Amazon Comprehend, we'll update the metadata configuration of Amazon
    # Kendra to reflect those attributes.
    meta_config_dict = {
        'Id': index_id,
        'DocumentMetadataConfigurationUpdates': [
            {'Name': 'Languages',
             'Type': 'STRING_VALUE',
             'Search': {'Facetable': True, 'Searchable': True, 'Displayable': True},
             'Relevance': {'Importance': 1},
            },
            {'Name': 'Key_Phrases',
             'Type': 'STRING_LIST_VALUE',
             'Search': {'Facetable': True, 'Searchable': True, 'Displayable': True},
            },
            {'Name': 'Named_Entities',
             'Type': 'STRING_LIST_VALUE',
             'Search': {'Facetable': True, 'Searchable': True, 'Displayable': True},
            },
            {'Name': 'Sentiment',
             'Type': 'STRING_VALUE',
             'Search': {'Facetable': True, 'Searchable': True, 'Displayable': True},
             'Relevance': {'Importance': 1},
            },
            {'Name': 'Positive_Score',
             'Type': 'LONG_VALUE',
             'Search': {'Facetable': True, 'Searchable': False, 'Displayable': True},
             'Relevance': {'Importance': 1, 'RankOrder': 'DESCENDING'},
            },
            {'Name': 'Negative_Score',
             'Type': 'LONG_VALUE',
             'Search': {'Facetable': True, 'Searchable': False, 'Displayable': True},
             'Relevance': {'Importance': 1, 'RankOrder': 'DESCENDING'},
            },
            {'Name': 'Neutral_Score',
             'Type': 'LONG_VALUE',
             'Search': {'Facetable': True, 'Searchable': False, 'Displayable': True},
             'Relevance': {'Importance': 1, 'RankOrder': 'DESCENDING'},
            },
            {'Name': 'Mixed_Score',
             'Type': 'LONG_VALUE',
             'Search': {'Facetable': True, 'Searchable': False, 'Displayable': True},
             'Relevance': {'Importance': 1, 'RankOrder': 'DESCENDING'},
            },
        ]
    }

    response = kendra.update_index(**meta_config_dict)
    print(response)
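The eight entries in meta_config_dict differ only in name, type, searchability, and relevance settings, so they can also be generated. The following is a minimal sketch of that idea, assuming the same field layout as the configuration above. The helper name attribute_config is hypothetical.

```python
def attribute_config(name, value_type, searchable=True, importance=None, rank_order=None):
    """Build one DocumentMetadataConfigurationUpdates entry."""
    config = {
        "Name": name,
        "Type": value_type,
        "Search": {"Facetable": True, "Searchable": searchable, "Displayable": True},
    }
    relevance = {}
    if importance is not None:
        relevance["Importance"] = importance
    if rank_order is not None:
        relevance["RankOrder"] = rank_order
    # Omit the Relevance key entirely when nothing was requested.
    if relevance:
        config["Relevance"] = relevance
    return config


score_names = ["Positive_Score", "Negative_Score", "Neutral_Score", "Mixed_Score"]
metadata_updates = [
    attribute_config("Languages", "STRING_VALUE", importance=1),
    attribute_config("Key_Phrases", "STRING_LIST_VALUE"),
    attribute_config("Named_Entities", "STRING_LIST_VALUE"),
    attribute_config("Sentiment", "STRING_VALUE", importance=1),
] + [
    attribute_config(name, "LONG_VALUE", searchable=False,
                     importance=1, rank_order="DESCENDING")
    for name in score_names
]
print(len(metadata_updates))
```

The resulting list can be used as the value of DocumentMetadataConfigurationUpdates, and later examples that change Importance or RankOrder become one-argument changes.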

Create metadata using ComprehendAnalyzer

Now that we’ve created our index blog-media-company-index and defined and set our metadata configuration, we use ComprehendAnalyzer to extract metadata from our media files in Amazon S3:

    # Define input parameters.
    textract_output_bucket = 'kendra-augmentation-textract-output-jp'
    textract_documents = s3_get_filenames(textract_output_bucket)
    analyzer = ComprehendAnalyzer(s3_bucket=textract_output_bucket)

    # Instantiate Amazon S3 client.
    s3 = boto3.client('s3')

    # Instantiate document list to be ingested into Amazon Kendra index.
    documents = []

    # Loop through each document in Amazon S3:
    # 1. Create document by using the Amazon Textract output in addition to
    #    the metadata defined with comprehend_analyzer.
    # 2. Append document to document list to be ingested into Amazon Kendra.
    for d in textract_documents:
        # Set document.
        analyzer.set_document(d)

        # Get metadata.
        analyzer.get_dominant_languages()
        analyzer.get_key_phrases()
        analyzer.get_named_entities()
        analyzer.get_sentiment()

        # Remove either "_LINE.txt" or "_WORD.txt" from the document filename.
        document_id = d[0:-9]

        # Grab text from the Amazon Textract output.
        text = s3.get_object(Bucket=textract_output_bucket, Key=d)['Body'].read()

        # Define document with Amazon Textract text and Amazon Comprehend attributes.
        document = {
            'Id': document_id,
            'Title': document_id,
            'Blob': text,
            'Attributes': analyzer.attribute_list,
            'ContentType': 'PLAIN_TEXT'
        }
        documents.append(document)

If you want to see what the metadata looks like, look at the first item in the documents Python list by running the following code:

    # Take a look at the first of the metadata documents you've prepared.
    documents[0]['Attributes']

Load metadata into the Amazon Kendra index

The last step is to load the documents, together with the metadata we extracted using ComprehendAnalyzer, into the blog-media-company-index index by running the following code:

    kendra.batch_put_document(
        IndexId = index_id,
        Documents = documents)

Now we’re ready to start querying and boosting some of the metadata attributes!
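The single batch_put_document call above works for our 10 documents. For a larger document set, the call likely needs to be split into batches, and the response is worth checking because documents that fail validation are reported in a FailedDocuments list rather than raised as an exception. The sketch below assumes a per-call limit of 10 documents (please verify against the BatchPutDocument API reference). The function name put_documents_in_batches is hypothetical.

```python
def put_documents_in_batches(kendra_client, index_id, documents, batch_size=10):
    """Send documents in slices of batch_size and collect FailedDocuments entries."""
    failed = []
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        response = kendra_client.batch_put_document(IndexId=index_id, Documents=batch)
        failed.extend(response.get("FailedDocuments", []))
    return failed


# Usage inside the notebook (requires AWS credentials):
#   failures = put_documents_in_batches(kendra, index_id, documents)
#   for failure in failures:
#       print(failure["Id"], failure["ErrorMessage"])
```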

Query the index and boost metadata attributes

We now have everything set up to start querying our data. We’re able to weigh attributes differently in terms of significance, make metadata attributes searchable, influence the order of results coming back from the query by improving the sentiment metadata, and much more.

Run a sample query

Before we get into a few examples that demonstrate the power and flexibility this metadata attachment gives us, let’s run the following code to query the blog-media-company-index index:

    # Print function to more easily visualize Amazon Kendra query results.
    # Note: print_results reads the global variable query for its header line.
    def print_results(response, result_number):
        print('\nSearch results for query: ' + query + '\n')
        count = 0
        for query_result in response['ResultItems']:
            print('-------------------')
            print('Type: ' + str(query_result['Type']))

            if query_result['Type'] == 'ANSWER':
                count += 1
                answer_text = query_result['DocumentExcerpt']['Text']
                print(answer_text)

            if query_result['Type'] == 'DOCUMENT':
                if 'DocumentTitle' in query_result:
                    document_title = query_result['DocumentTitle']['Text']
                    print('Title: ' + document_title)
                document_text = query_result['DocumentExcerpt']['Text']
                print(document_text)
                count += 1

            print('------------------\n\n')
            if count >= result_number:
                break

    # Print each attribute key followed by its values, one per line.
    def print_list(dict_list):
        for attribute in dict_list:
            text = attribute['Key'] + ": "
            for k, v in attribute['Value'].items():
                text += str(v) + "\n"
            print(text)

We can test the following query to get a sense of how to query our new index:

    # Search Query.
    query = 'who was star wars produced by'
    response = kendra.query(
        QueryText = query,
        IndexId = index_id)
    print_results(response, 3)

The output prints the top results for the query, each with its result type, its title (for DOCUMENT results), and a text excerpt.
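The sample query above searches the whole index. Because the metadata attributes are configured as facetable, the same query call can also restrict results by passing an AttributeFilter. The following sketch builds such a filter from the attributes defined earlier. The helper names are hypothetical and the filter values are only examples. The keys EqualsTo, ContainsAny, and AndAllFilters are part of the Amazon Kendra query API.

```python
def equals_filter(key, string_value):
    """Match documents whose string attribute equals string_value."""
    return {"EqualsTo": {"Key": key, "Value": {"StringValue": string_value}}}


def contains_any_filter(key, values):
    """Match documents whose string-list attribute contains any of values."""
    return {"ContainsAny": {"Key": key, "Value": {"StringListValue": list(values)}}}


def all_of(*filters):
    """Combine filters so that a document must satisfy every one of them."""
    return {"AndAllFilters": list(filters)}


attribute_filter = all_of(
    equals_filter("Languages", "en"),
    contains_any_filter("Named_Entities", ["Star Wars"]),
)
print(attribute_filter)

# Usage inside the notebook (requires AWS credentials):
#   response = kendra.query(QueryText=query, IndexId=index_id,
#                           AttributeFilter=attribute_filter)
#   print_results(response, 3)
```

Filtering removes documents from the result set outright, whereas the Importance and RankOrder settings discussed next only change how the remaining results are ordered.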

Now that you know how to query, let’s get into some examples of how we can use our metadata to influence our searches.

Improve metadata

This section contains some examples of how we can influence and control our search for more targeted results. For each of the examples, we update our blog-media-company-index index by modifying the relevant entries in meta_config_dict and rerunning the kendra.update_index(**meta_config_dict) call from the metadata configuration step.


Example 1: Weighing attributes

To weigh attributes by significance, update the Importance value of the attributes. Importance ranges from 1 to 10, with 1 being the lowest and 10 the highest.

For example, let’s say we have a use case where we have different country entities and we have documents in many different languages. We can increase the significance of the Languages metadata attribute to account for this by updating its Importance to 10, and making sure Searchable is set to True. This makes the text in the Languages field searchable. See the following code:

    {'Name': 'Languages',
     'Type': 'STRING_VALUE',
     'Search': {'Facetable': True, 'Searchable': True, 'Displayable': True},
     'Relevance': {'Importance': 10},
    }

Now let’s say that we’re looking for more positive context results. We increase the Importance value of the metadata attribute Sentiment to 10:

    {'Name': 'Sentiment',
     'Type': 'STRING_VALUE',
     'Search': {'Facetable': True, 'Searchable': True, 'Displayable': True},
     'Relevance': {'Importance': 10},
    }

Example 2: Ranking search results

Let’s say we want to influence the rank of the search results by a particular sentiment metadata attribute. We can simply configure the Importance and RankOrder of the sentiment we want. For example, if we want to increase the significance of the positive results and rank those results higher than the negative, we update the Positive_Score attribute to have an Importance of 10 and a RankOrder of DESCENDING to put the most positive results at the top. We leave the Importance of Negative_Score at 1 and update its RankOrder to ASCENDING to make sure the least negative sentiment results show up higher. See the following code:

    {'Name': 'Positive_Score',
     'Type': 'LONG_VALUE',
     'Search': {'Facetable': True, 'Searchable': False, 'Displayable': True},
     'Relevance': {'Importance': 10, 'RankOrder': 'DESCENDING'},
    },
    {'Name': 'Negative_Score',
     'Type': 'LONG_VALUE',
     'Search': {'Facetable': True, 'Searchable': False, 'Displayable': True},
     'Relevance': {'Importance': 1, 'RankOrder': 'ASCENDING'},
    }
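Editing entries inside meta_config_dict by hand for each experiment is error prone. As a sketch of a more repeatable approach, the following helper returns a modified copy of a metadata configuration with one attribute's relevance settings changed. The function name with_relevance is hypothetical, and the small configuration in the example stands in for the full meta_config_dict.

```python
import copy


def with_relevance(meta_config, name, importance=None, rank_order=None):
    """Return a copy of meta_config with the named attribute's Relevance updated."""
    updated = copy.deepcopy(meta_config)
    for attribute in updated["DocumentMetadataConfigurationUpdates"]:
        if attribute["Name"] == name:
            relevance = attribute.setdefault("Relevance", {})
            if importance is not None:
                if not 1 <= importance <= 10:
                    raise ValueError("Importance must be between 1 and 10")
                relevance["Importance"] = importance
            if rank_order is not None:
                relevance["RankOrder"] = rank_order
            return updated
    raise KeyError(f"No attribute named {name!r} in the metadata configuration")


# Stand-in for meta_config_dict with a single attribute.
example_config = {
    "Id": "example-index-id",
    "DocumentMetadataConfigurationUpdates": [
        {"Name": "Positive_Score", "Type": "LONG_VALUE",
         "Search": {"Facetable": True, "Searchable": False, "Displayable": True},
         "Relevance": {"Importance": 1, "RankOrder": "DESCENDING"}},
    ],
}
boosted = with_relevance(example_config, "Positive_Score", importance=10)
print(boosted["DocumentMetadataConfigurationUpdates"][0]["Relevance"])

# Usage inside the notebook (requires AWS credentials):
#   kendra.update_index(**with_relevance(meta_config_dict, "Positive_Score", importance=10))
```

Working on a copy keeps the baseline configuration intact, so it is easy to compare query results before and after a change.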

Get creative!

At this point, you’ve got your Amazon Kendra index and metadata attributes set up. Go ahead and play around with querying, weighing metadata, and ranking results by creating your own creative combinations!

Clean up

To avoid extra charges, shut down the SageMaker and Amazon Kendra resources when you’re done.

  1. On the SageMaker console, choose Notebook and Notebook instances.
  2. Select the notebook that you created.
  3. On the Actions menu, choose Stop.
  4. Choose Delete.

Alternatively, you can keep the instance stopped indefinitely and not be charged.

  1. On the Amazon Kendra console, choose Indexes.
  2. Select the index you created.
  3. On the Actions menu, choose Delete.

Because we used Amazon Textract and Amazon Comprehend via API, there are no shutdown steps necessary for those resources.

In this post, we showed how to do the following:

  • Use Amazon Textract on PDF files to extract text from documents
  • Use Amazon Comprehend to extract metadata attributes from Amazon Textract output
  • Perform targeted searches with Amazon Kendra using the metadata attributes extracted by Amazon Comprehend

Although this may have been a mock media company example using public sample data, I hope you were able to have some fun following along and realize the potential, and the power, of chaining Amazon Textract, Amazon Comprehend, and Amazon Kendra together. Use this new knowledge and start augmenting your historical data! To learn more about how Amazon Kendra’s fully managed intelligent search service can help your business, visit our webpage or dive into our documentation and tutorials!

About the Author

James Poquiz is a Data Scientist with AWS Professional Services based in Orange County, California. He has a BS in Computer Science from the University of California, Irvine and has several years of experience working in the data domain, having played many different roles. Today he works on implementing and deploying scalable ML solutions to achieve business outcomes for AWS clients.



Customize pronunciation using lexicons in Amazon Polly

Amazon Polly is a text-to-speech service that uses advanced deep learning technologies to synthesize natural-sounding human speech. It is used in a variety of use cases, such as contact center systems, delivering conversational user experiences with human-like voices for automated real-time status check, automated account and billing inquiries, and by news agencies like The Washington Post to allow readers to listen to news articles.

As of today, Amazon Polly provides over 60 voices in 30+ language variants. Amazon Polly also uses context to pronounce certain words differently based upon the verb tense and other contextual information. For example, “read” in “I read a book” (past tense) and “I will read a book” (future tense) is pronounced differently.

However, in some situations you may want to customize the way Amazon Polly pronounces a word. For example, you may need to match the pronunciation with local dialect or vernacular. Names of things (e.g., Tomato can be pronounced as tom-ah-to or tom-ay-to), people, streets, or places are often pronounced in many different ways.

In this post, we demonstrate how you can leverage lexicons for creating custom pronunciations. You can apply lexicons for use cases such as publishing, education, or call centers.

Customize pronunciation using the SSML <phoneme> tag

Let’s say you stream a popular podcast from Australia and you use the Amazon Polly Australian English (Olivia) voice to convert your script into human-like speech. In one of your scripts, you want to use words that are unknown to the Amazon Polly voice. For example, you want to send Mātariki (Māori New Year) greetings to your New Zealand listeners. For such scenarios, Amazon Polly supports phonetic pronunciation, which you can use to achieve a pronunciation that is close to the correct pronunciation in the foreign language.

You can use the Speech Synthesis Markup Language (SSML) <phoneme> tag to suggest a phonetic pronunciation in its ph attribute. Let me show you how to use it.

First, log in to the AWS console and search for Amazon Polly in the search bar at the top. Select Amazon Polly, and then choose the Try Polly button.

In the Amazon Polly console, select Australian English from the language dropdown, enter the following text in the Input text box, and then choose Listen to test the pronunciation.

I’m wishing you all a very Happy Mātariki.

Sample speech without applying phonetic pronunciation:

If you listen to the sample speech above, you’ll notice that the pronunciation of Mātariki, a word that is not part of Australian English, isn’t quite spot-on. Now, let’s look at how we can use phonetic pronunciation with the SSML <phoneme> tag to customize the speech produced by Amazon Polly in such scenarios.

To use SSML tags, turn on the SSML option in the Amazon Polly console. Then copy and paste the following SSML script, which contains the phonetic pronunciation for Mātariki specified inside the ph attribute of the <phoneme> tag.

I’m wishing you all a very Happy Mātariki.

With the <phoneme> tag, Amazon Polly uses the pronunciation specified by the ph attribute instead of the standard pronunciation associated by default with the language used by the selected voice.
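
As a rough illustration of the same idea outside the console, the following sketch builds an SSML string with a <phoneme> tag and passes it to the Polly synthesize_speech API. The helper names are assumptions, and so is the choice of Engine="neural" for the Olivia voice. The transcription for Mātariki is not reproduced here; the usage note below borrows the “pecan” IPA example from the Amazon Polly SSML documentation instead.

```python
from xml.sax.saxutils import escape, quoteattr


def phoneme_ssml(before, word, ph, alphabet="ipa", after=""):
    """Wrap `word` in an SSML <phoneme> tag inside a <speak> document.

    `ph` is the phonetic transcription, written in the chosen alphabet.
    """
    if alphabet not in ("ipa", "x-sampa"):
        raise ValueError("alphabet must be 'ipa' or 'x-sampa'")
    return (
        "<speak>"
        + escape(before)
        + "<phoneme alphabet=" + quoteattr(alphabet) + " ph=" + quoteattr(ph) + ">"
        + escape(word)
        + "</phoneme>"
        + escape(after)
        + "</speak>"
    )


def speak(polly_client, ssml, voice_id="Olivia"):
    """Synthesize SSML to MP3; assumes the neural engine for the chosen voice."""
    return polly_client.synthesize_speech(
        Text=ssml,
        TextType="ssml",
        VoiceId=voice_id,
        Engine="neural",
        OutputFormat="mp3",
    )
```

For example, phoneme_ssml("Say ", "pecan", "pɪˈkɑːn", after=".") produces a document that can be passed to speak(boto3.client("polly"), ...). For your own words, substitute a transcription built from the phoneme table of the target language.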

Sample speech after applying phonetic pronunciation:

If you listen to the sample, you’ll notice that we opted for a different pronunciation for some of the vowels (e.g., ā) to make Amazon Polly synthesize sounds that are closer to the correct pronunciation. Now you might have a question: how do I generate the phonetic transcription for the word Mātariki?

You can create phonetic transcriptions by referring to the Phoneme and Viseme tables for the supported languages. In the example above we have used the phonemes for Australian English.

Amazon Polly supports two phonetic alphabets: IPA and X-SAMPA. The benefit of X-SAMPA is that it uses standard ASCII characters, so it is easier to type the phonetic transcription with a normal keyboard. You can use either IPA or X-SAMPA to generate your transcriptions, but make sure to stay consistent with your choice, especially when you use a lexicon file, which we’ll cover in the next section.

Each phoneme in the phoneme table represents a speech sound. The bolded letters in the “Example” column of the Phoneme/Viseme table in the Australian English page linked above represent the part of the word the “Phoneme” corresponds to. For example, the phoneme /j/ represents the sound that an Australian English speaker makes when pronouncing the letter “y” in “yes.”

Customize pronunciation using lexicons

Phoneme tags are suitable for customizing isolated, one-off cases, but they don’t scale. If you process a large volume of text managed by different editors and reviewers, we recommend using lexicons. Using lexicons, you can achieve consistency in adding custom pronunciations and simultaneously reduce the manual effort of inserting phoneme tags into the script.

A good practice is to test the custom pronunciation on the Amazon Polly console using the <phoneme> tag, and then create a library of customized pronunciations using lexicons. Once the lexicon file is uploaded, Amazon Polly automatically applies the phonetic pronunciations specified in it, eliminating the need to manually provide a <phoneme> tag.

Create a lexicon file

A lexicon file contains the mapping between words and their phonetic pronunciations. Pronunciation Lexicon Specification (PLS) is a W3C recommendation for specifying interoperable pronunciation information. The following is an example PLS document:

Matariki Mātariki NZ New Zealand

Make sure that you use the correct value for the xml:lang field. Use en-AU if you’re uploading the lexicon file to use with the Amazon Polly Australian English voice. For a complete list of supported languages, refer to Languages Supported by Amazon Polly.

To specify a custom pronunciation, you need to add a <lexeme> element, which is a container for a lexical entry with one or more <grapheme> elements and pronunciation information provided inside a <phoneme> element.

The <grapheme> element contains the text describing the orthography of the <lexeme>. You can use a <grapheme> element to specify the word whose pronunciation you want to customize. You can add multiple <grapheme> elements to specify all word variations, for example with or without macrons. The <grapheme> element is case-sensitive, and during speech synthesis Amazon Polly string-matches the words inside the script that you’re converting to speech. If a match is found, it uses the <phoneme> element, which describes how the <lexeme> is pronounced, to generate the phonetic transcription.

You can also use <alias> for commonly used abbreviations. In the preceding example of a lexicon file, NZ is used as an alias for New Zealand. This means that whenever Amazon Polly comes across “NZ” (with matching case) in the body of the text, it’ll read those two letters as “New Zealand”.
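
To make the element structure concrete, here is a small sketch that generates a PLS document with Python’s standard library. The build_lexicon helper and its entry format are assumptions made for illustration; the element names and the namespace URI come from the W3C PLS 1.0 recommendation. The caller supplies each phonetic transcription, so nothing here stands in for the Mātariki transcription used in this post.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_lexicon(entries, lang="en-AU", alphabet="ipa"):
    """Return a PLS document as a string.

    Each entry is a dict with a "graphemes" list plus either an "alias"
    or a "phoneme" value (hypothetical input format for this sketch).
    """
    # Serialize PLS elements without a prefix (process-wide ElementTree setting).
    ET.register_namespace("", PLS_NS)
    root = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": alphabet, f"{{{XML_NS}}}lang": lang},
    )
    for entry in entries:
        lexeme = ET.SubElement(root, f"{{{PLS_NS}}}lexeme")
        # One <grapheme> per spelling variation, matched case-sensitively.
        for spelling in entry["graphemes"]:
            ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = spelling
        if "alias" in entry:
            ET.SubElement(lexeme, f"{{{PLS_NS}}}alias").text = entry["alias"]
        else:
            ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme").text = entry["phoneme"]
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

A call such as build_lexicon([{"graphemes": ["NZ"], "alias": "New Zealand"}]) yields a one-entry lexicon; an entry for Mātariki would list both spellings under "graphemes" and carry your tested transcription under "phoneme".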

For more information on lexicon file format, see Pronunciation Lexicon Specification (PLS) Version 1.0 on the W3C website.

You can save a lexicon file as a .pls or .xml file before uploading it to Amazon Polly.

Upload and apply the lexicon file

Upload your lexicon file to Amazon Polly using the following instructions:

  1. On the Amazon Polly console, choose Lexicons in the navigation pane.
  2. Choose Upload lexicon.
  3. Enter a name for the lexicon and then choose a lexicon file.
  4. Choose the file to upload.
  5. Choose Upload lexicon.

If a lexicon by the same name (whether a .pls or .xml file) already exists, uploading the lexicon overwrites the existing lexicon.

Now you can apply the lexicon to customize pronunciation.

  1. Choose Text-to-Speech in the navigation pane.
  2. Expand Additional settings.
  3. Turn on Customize pronunciation.
  4. Choose the lexicon on the drop-down menu.

You can also choose Upload lexicon to upload a new lexicon file (or a new version).
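
The same upload-and-apply flow can be sketched with the Polly API. The helper names below are hypothetical, and Engine="neural" is an assumption for the Olivia voice; put_lexicon and the LexiconNames parameter of synthesize_speech are the actual API pieces involved. Lexicon names are restricted to 1 to 20 alphanumeric characters, which the sketch checks up front.

```python
import re

# Lexicon names must be 1-20 alphanumeric characters.
LEXICON_NAME = re.compile(r"[0-9A-Za-z]{1,20}")


def upload_lexicon(polly_client, name, pls_content):
    """Store a PLS document under `name`, overwriting any lexicon with that name."""
    if not LEXICON_NAME.fullmatch(name):
        raise ValueError("lexicon name must be 1-20 alphanumeric characters")
    polly_client.put_lexicon(Name=name, Content=pls_content)


def synthesize_with_lexicon(polly_client, text, lexicon_name, voice_id="Olivia"):
    """Synthesize plain text with the named lexicon applied."""
    return polly_client.synthesize_speech(
        Text=text,
        VoiceId=voice_id,
        Engine="neural",
        OutputFormat="mp3",
        LexiconNames=[lexicon_name],
    )
```

A lexicon is only applied when its xml:lang matches the language of the selected voice, so an en-AU lexicon pairs with the Australian English voice.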

It’s a good practice to version control the lexicon file in a source code repository. Keeping the custom pronunciations in a lexicon file ensures that you can consistently refer to phonetic pronunciations for certain words across the organization. Also, keep in mind the pronunciation lexicon limits mentioned on the Quotas in Amazon Polly page.

Test the pronunciation after applying the lexicon

Let’s perform a quick test using “Wishing all my listeners in NZ, a very Happy Mātariki” as the input text.

We can compare the audio files before and after applying the lexicon.

Before applying the lexicon:

After applying the lexicon:

Conclusion

In this post, we discussed how you can customize pronunciations of commonly used acronyms or words not found in the selected language in Amazon Polly. You can use the SSML <phoneme> tag, which is great for inserting one-off customizations or for testing purposes. We recommend using lexicons to create a consistent set of pronunciations for frequently used words across your organization. This enables your content writers to spend time on writing instead of the tedious task of repeatedly adding phonetic pronunciations in the script. You can try this in your AWS account on the Amazon Polly console.

About the Authors

Ratan Kumar is a Solutions Architect based out of Auckland, New Zealand. He works with large enterprise customers, helping them design and build secure, cost-effective, and reliable internet-scale applications using the AWS cloud. He is passionate about technology and likes sharing knowledge through blog posts and Twitch sessions.

Maciek Tegi is a Principal Audio Designer and a Product Manager for Polly Brand Voices. He has worked professionally in the tech industry, movies, commercials, and game localization. In 2013, he was the first audio engineer hired to the Alexa Text-to-Speech team. Maciek was involved in releasing 12 Alexa TTS voices across different countries, over 20 Polly voices, and 4 Alexa celebrity voices. Maciek is a triathlete and an avid acoustic guitar player.



AWS Week in Review – May 16, 2022

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

I had been on the road for the last five weeks and attended many of the AWS Summits in Europe. It was great to talk to so many of you in person. The Serverless Developer Advocates are going around many of the AWS Summits with the Serverlesspresso booth. If you attend an event that has the booth, say “Hi” to my colleagues, and have a coffee while asking all your serverless questions. You can find all the upcoming AWS Summits in the events section at the end of this post.

Last week’s launches
Here are some launches that got my attention during the previous week.

AWS Step Functions announced a new console experience to debug your state machine executions – Now you can opt in to the new console experience of Step Functions, which makes it easier to analyze, debug, and optimize Standard Workflows. The new page allows you to inspect executions using three different views (graph, table, and event view) and adds many new features to enhance the navigation and analysis of the executions. To learn about all the features and how to use them, read Ben’s blog post.

Example on how the Graph View looks


AWS Lambda now supports Node.js 16.x runtime – Now you can start using the Node.js 16 runtime when you create a new function or update your existing functions to use it. You can also use the new container image base that supports this runtime. To learn more about this launch, check Dan’s blog post.

AWS Amplify announces its Android library designed for Kotlin – The Amplify Android library has been rewritten for Kotlin, and it is now available in preview. This new library provides better debugging capabilities and visibility into underlying state management. It also uses the new AWS SDK for Kotlin that was released last year in preview. Read the What’s New post for more information.

Three new APIs for batch data retrieval in AWS IoT SiteWise – With this new launch AWS IoT SiteWise now supports batch data retrieval from multiple asset properties. The new APIs allow you to retrieve current values, historical values, and aggregated values. Read the What’s New post to learn how you can start using the new APIs.

AWS Secrets Manager now publishes secret usage metrics to Amazon CloudWatch – This launch is very useful to see the number of secrets in your account and set alarms for any unexpected increase or decrease in the number of secrets. Read the documentation on Monitoring Secrets Manager with Amazon CloudWatch for more information.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Some other launches and news that you may have missed:

IBM signed a deal with AWS to offer its software portfolio as a service on AWS. This allows customers using AWS to access IBM software for automation, data and artificial intelligence, and security that is built on Red Hat OpenShift Service on AWS.

Podcast Charlas Técnicas de AWS – If you understand Spanish, this podcast is for you. Podcast Charlas Técnicas is one of the official AWS podcasts in Spanish. This week’s episode introduces you to Amazon DynamoDB and shares stories on how different customers use this database service. You can listen to all the episodes directly from your favorite podcast app or the podcast web page.

AWS Open Source News and Updates – Ricardo Sueiras, my colleague from the AWS Developer Relations team, runs this newsletter. It brings you all the latest open-source projects, posts, and more. Read edition #112 here.

Upcoming AWS Events
It’s AWS Summits season and here are some virtual and in-person events that might be close to you:

You can register for re:MARS to get fresh ideas on topics such as machine learning, automation, robotics, and space. The conference will be in person in Las Vegas, June 21–24.

That’s all for this week. Check back next Monday for another Week in Review!

— Marcia



Personalize your machine translation results by using fuzzy matching with Amazon Translate

A person’s vernacular is part of the characteristics that make them unique. There are often countless different ways to express one specific idea. When a firm communicates with their customers, it’s critical that the message is delivered in a way that best represents the information they’re trying to convey. This becomes even more important when it comes to professional language translation. Customers of translation systems and services expect accurate and highly customized outputs. To achieve this, they often reuse previous translation outputs—called translation memory (TM)—and compare them to new input text. In computer-assisted translation, this technique is known as fuzzy matching. The primary function of fuzzy matching is to assist the translator by speeding up the translation process. When an exact match can’t be found in the TM database for the text being translated, translation management systems (TMSs) often have the option to search for a match that is less than exact. Potential matches are provided to the translator as additional input for final translation. Translators who enhance their workflow with machine translation capabilities such as Amazon Translate often expect fuzzy matching data to be used as part of the automated translation solution.

In this post, you learn how to customize output from Amazon Translate according to translation memory fuzzy match quality scores.

Translation Quality Match

The XML Localization Interchange File Format (XLIFF) standard is often used as a data exchange format between TMSs and Amazon Translate. XLIFF files produced by TMSs include source and target text data along with match quality scores based on the available TM. These scores—usually expressed as a percentage—indicate how close the translation memory is to the text being translated.

Some customers with very strict requirements only want machine translation to be used when match quality scores are below a certain threshold. Beyond this threshold, they expect their own translation memory to take precedence. Translators often need to apply these preferences manually, either within their TMS or by altering the text data. This flow is illustrated in the following diagram. The machine translation system processes the translation data (text and fuzzy match scores), which is then reviewed and manually edited by translators, based on their desired quality thresholds. Applying thresholds as part of the machine translation step allows you to remove these manual steps, which improves efficiency and optimizes cost.

Figure 1: Machine Translation Review Flow

The solution presented in this post allows you to enforce rules based on match quality score thresholds to drive whether a given input text should be machine translated by Amazon Translate or not. When not machine translated, the resulting text is left to the discretion of the translators reviewing the final output.

Solution Architecture

The solution architecture illustrated in Figure 2 leverages the following services:

  • Amazon Simple Storage Service – Amazon S3 buckets contain the following content:
    • Fuzzy match threshold configuration files
    • Source text to be translated
    • Amazon Translate input and output data locations
  • AWS Systems Manager – We use Parameter Store parameters to store match quality threshold configuration values
  • AWS Lambda – We use two Lambda functions:
    • One function preprocesses the quality match threshold configuration files and persists the data into Parameter Store
    • One function automatically creates the asynchronous translation jobs
  • Amazon Simple Queue Service – An Amazon SQS queue triggers the translation flow as a result of new files coming into the source bucket

Figure 2: Solution Architecture

You first set up quality thresholds for your translation jobs by editing a configuration file and uploading it into the fuzzy match threshold configuration S3 bucket. The following is a sample configuration in CSV format. We chose CSV for simplicity, although you can use any format. Each line represents a threshold to be applied to either a specific translation job or as a default value to any job.

default, 75
SourceMT-Test, 80

The specifications of the configuration file are as follows:

  • Column 1 should be populated with the name of the XLIFF file—without extension—provided to the Amazon Translate job as input data.
  • Column 2 should be populated with the quality match percentage threshold. For any score below this value, machine translation is used.
  • For all XLIFF files whose name doesn’t match any name listed in the configuration file, the default threshold is used—the line with the keyword default set in Column 1.
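
As an illustration of these rules, the following sketch parses a configuration in the format shown above and resolves the threshold for a given XLIFF file name. The function names are assumptions and are not taken from the solution’s Lambda code.

```python
import csv
import io


def parse_thresholds(config_text):
    """Parse 'name, threshold' lines into a dict of job name -> percentage."""
    thresholds = {}
    for row in csv.reader(io.StringIO(config_text), skipinitialspace=True):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        thresholds[row[0].strip()] = float(row[1])
    return thresholds


def threshold_for(thresholds, xliff_name):
    """Return the threshold for an XLIFF file name (without extension).

    Falls back to the 'default' line when the name is not listed.
    """
    return thresholds.get(xliff_name, thresholds["default"])
```

With the sample configuration, threshold_for returns 80 for SourceMT-Test and 75 for any other file name.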

Figure 3: Auto-generated parameter in Systems Manager Parameter Store

When a new file is uploaded, Amazon S3 triggers the Lambda function in charge of processing the parameters. This function reads and stores the threshold parameters into Parameter Store for future usage. Using Parameter Store avoids performing redundant Amazon S3 GET requests each time a new translation job is initiated. The sample configuration file produces the parameter tags shown in the following screenshot.
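
A minimal sketch of that persistence step, assuming one String parameter per configuration line under a root path, could look like the following. The parameter naming scheme and function names are hypothetical; put_parameter and get_parameter are the standard Systems Manager client calls.

```python
def store_thresholds(ssm_client, root, thresholds):
    """Write each threshold as '<root>/<job name>' and return the names written."""
    root = "/" + root.strip("/")
    names = []
    for job_name, value in thresholds.items():
        name = f"{root}/{job_name}"
        ssm_client.put_parameter(
            Name=name,
            Value=str(value),
            Type="String",
            Overwrite=True,  # re-uploading the config file replaces old values
        )
        names.append(name)
    return names


def load_threshold(ssm_client, root, job_name):
    """Read a single threshold back as a float."""
    name = "/" + root.strip("/") + "/" + job_name
    response = ssm_client.get_parameter(Name=name)
    return float(response["Parameter"]["Value"])
```

Reading a value back with load_threshold is a single Parameter Store call, which is what lets the job initialization function avoid fetching the configuration file from Amazon S3 on every run.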

The job initialization Lambda function uses these parameters to preprocess the data prior to invoking Amazon Translate. We use an English-to-Spanish translation XLIFF input file, as shown in the following code. It contains the initial text to be translated, broken down into what is referred to as segments, represented in the source tags.

Consent Form CONSENT FORM FORMULARIO DE CONSENTIMIENTO Screening Visit: Screening Visit Selección

The source text has been pre-matched with the translation memory beforehand. The data contains potential translation alternatives, represented as <alt-trans> tags, alongside a match quality attribute expressed as a percentage. The business rule is as follows:

  • Segments received with alternative translations and a match quality below the threshold are untouched or empty. This signals to Amazon Translate that they must be translated.
  • Segments received with alternative translations with a match quality above the threshold are pre-populated with the suggested target text. Amazon Translate skips those segments.
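
The rule above can be sketched as a preprocessing step over an XLIFF 1.2 document. This is an illustration under assumptions, not the solution’s actual Lambda code: the apply_threshold helper is hypothetical, the sample markup is a reconstruction, and the 70% score on the second segment is made up for the example.

```python
import xml.etree.ElementTree as ET

XLIFF_URI = "urn:oasis:names:tc:xliff:document:1.2"
NS = {"x": XLIFF_URI}

# Hypothetical sample: structure reconstructed, second score (70%) invented.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
<file original="sample" source-language="en" target-language="es" datatype="plaintext">
<body>
<trans-unit id="1"><source>Consent Form</source>
<alt-trans match-quality="99%"><source>CONSENT FORM</source>
<target>FORMULARIO DE CONSENTIMIENTO</target></alt-trans></trans-unit>
<trans-unit id="2"><source>Screening Visit:</source>
<alt-trans match-quality="70%"><source>Screening Visit</source>
<target>Selección</target></alt-trans></trans-unit>
</body></file></xliff>"""


def apply_threshold(xliff_text, threshold):
    """Pre-populate <target> for segments whose best match meets the threshold.

    Returns the updated XLIFF text and the ids of the segments that were
    filled from translation memory; all others are left for machine translation.
    """
    ET.register_namespace("", XLIFF_URI)  # keep output free of ns0: prefixes
    root = ET.fromstring(xliff_text)
    kept = []
    for unit in root.iter(f"{{{XLIFF_URI}}}trans-unit"):
        best = None
        for alt in unit.findall("x:alt-trans", NS):
            quality = float(alt.get("match-quality", "0").rstrip("%"))
            alt_target = alt.find("x:target", NS)
            if alt_target is not None and (best is None or quality > best[0]):
                best = (quality, alt_target.text)
        if best is None or best[0] < threshold:
            continue  # below threshold: leave the segment untouched
        target = unit.find("x:target", NS)
        if target is None:
            source = unit.find("x:source", NS)
            position = list(unit).index(source) + 1 if source is not None else 0
            target = ET.Element(f"{{{XLIFF_URI}}}target")
            unit.insert(position, target)
        target.text = best[1]
        kept.append(unit.get("id"))
    return ET.tostring(root, encoding="unicode"), kept
```

Calling apply_threshold(SAMPLE, 80) fills only the first segment, leaving the second one empty so that it is machine translated.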

Let’s assume the quality match threshold configured for this job is 80%. The first segment with 99% match quality isn’t machine translated, whereas the second segment is, because its match quality is below the defined threshold. In this configuration, Amazon Translate produces the following output:

Consent Form FORMULARIO DE CONSENTIMIENTO CONSENT FORM FORMULARIO DE CONSENTIMIENTO Screening Visit: Visita de selección Screening Visit Selección

In the second segment, Amazon Translate overwrites the target text initially suggested (Selección) with a higher quality translation: Visita de selección.

One possible extension to this use case could be to reuse the translated output and create our own translation memory. Amazon Translate supports customization of machine translation using translation memory thanks to the parallel data feature. Text segments previously machine translated due to their initial low-quality score could then be reused in new translation projects.

In the following sections, we walk you through the process of deploying and testing this solution. You use AWS CloudFormation scripts and data samples to launch an asynchronous translation job personalized with a configurable quality match threshold.

Prerequisites

For this walkthrough, you must have an AWS account. If you don’t have an account yet, you can create and activate one.

Launch AWS CloudFormation stack

  1. Choose Launch Stack:
  2. For Stack name, enter a name.
  3. For ConfigBucketName, enter the S3 bucket containing the threshold configuration files.
  4. For ParameterStoreRoot, enter the root path of the parameters created by the parameters processing Lambda function.
  5. For QueueName, enter the SQS queue that you create to post new file notifications from the source bucket to the job initialization Lambda function. This is the function that reads the configuration file.
  6. For SourceBucketName, enter the S3 bucket containing the XLIFF files to be translated. If you prefer to use a preexisting bucket, you need to change the value of the CreateSourceBucket parameter to No.
  7. For WorkingBucketName, enter the S3 bucket Amazon Translate uses for input and output data.
  8. Choose Next.

    Figure 4: CloudFormation stack details

  9. Optionally on the Stack Options page, add key names and values for the tags you may want to assign to the resources about to be created.
  10. Choose Next.
  11. On the Review page, select I acknowledge that this template might cause AWS CloudFormation to create IAM resources.
  12. Review the other settings, then choose Create stack.

AWS CloudFormation takes several minutes to create the resources on your behalf. You can watch the progress on the Events tab on the AWS CloudFormation console. When the stack has been created, you can see a CREATE_COMPLETE message in the Status column on the Overview tab.

Test the solution

Let’s go through a simple example.

  1. Download the following sample data.
  2. Unzip the content.

There should be two files: an .xlf file in XLIFF format, and a threshold configuration file with .cfg as the extension. The following is an excerpt of the XLIFF file.

Figure 5: English to French sample file extract

  3. On the Amazon S3 console, upload the quality threshold configuration file into the configuration bucket you specified earlier.

The value set for test_En_to_Fr is 75%. You should be able to see the parameters on the Systems Manager console in the Parameter Store section.

  4. Still on the Amazon S3 console, upload the .xlf file into the S3 bucket you configured as source. Make sure the file is under a folder named translate (for example, /translate/test_En_to_Fr.xlf).

This starts the translation flow.

  5. Open the Amazon Translate console.

A new job should appear with a status of In Progress.

Figure 6: In progress translation jobs on Amazon Translate console

  6. Once the job is complete, choose the job’s link and review the output.

All segments should have been translated. In the translated XLIFF file, look for segments with additional attributes named lscustom:match-quality, as shown in the following screenshot. These custom attributes identify segments where the suggested translation was retained based on its score.

Figure 7: Custom attributes identifying segments where suggested translation was retained based on score

These were derived from the translation memory according to the quality threshold. All other segments were machine translated.

You have now deployed and tested an automated asynchronous translation job assistant that enforces configurable translation memory match quality thresholds. Great job!

Clean up

If you deployed the solution into your account, don’t forget to delete the CloudFormation stack to avoid any unexpected cost. You need to empty the S3 buckets manually beforehand.

Conclusion

In this post, you learned how to customize your Amazon Translate translation jobs based on standard XLIFF fuzzy matching quality metrics. With this solution, you can greatly reduce the manual labor involved in reviewing machine translated text while also optimizing your usage of Amazon Translate. You can also extend the solution with data ingestion automation and workflow orchestration capabilities, as described in Speed Up Translation Jobs with a Fully Automated Translation System Assistant.

About the Authors

Narcisse Zekpa is a Solutions Architect based in Boston. He helps customers in the Northeast U.S. accelerate their adoption of the AWS Cloud by providing architectural guidelines and designing innovative, scalable solutions. When Narcisse is not building, he enjoys spending time with his family, traveling, cooking, and playing basketball.

Dimitri Restaino is a Solutions Architect at AWS, based out of Brooklyn, New York. He works primarily with Healthcare and Financial Services companies in the North East, helping to design innovative and creative solutions to best serve their customers. Coming from a software development background, he is excited by the new possibilities that serverless technology can bring to the world. Outside of work, he loves to hike and explore the NYC food scene.



Copyright © 2021 Today's Digital.