

Use SEC text for ratings classification using multimodal ML in Amazon SageMaker JumpStart

Starting today, we’re releasing new tools for multimodal financial analysis within Amazon SageMaker JumpStart. SageMaker JumpStart helps you quickly and easily get started with machine learning (ML) and provides a set of solutions for the most common use cases that can be trained and deployed readily with just a few clicks. You can now access a collection of multimodal financial text analysis tools, including example notebooks, text models, and solutions.

With these new tools, you can enhance your tabular ML workflows with new insights from financial text documents, potentially saving weeks of development time. With the new SageMaker JumpStart Industry SDK, you can easily retrieve common public financial documents, including SEC filings, and further process financial text documents with features such as summarization and scoring of the text for various attributes, such as sentiment, litigiousness, risk, and readability. In addition, you can access pre-trained language models trained on financial text for transfer learning, and use example notebooks for data retrieval, text feature engineering, and multimodal classification and regression models.

In this post, we show how to curate a dataset of SEC filings and financial variables, use natural language processing (NLP) for feature engineering on the dataset, and undertake multimodal ML to build a better ratings classifier.

The new financial analysis features include an example notebook that demonstrates APIs to retrieve parsed SEC filings, APIs for summarizers, and APIs to score text for various attributes (see SageMaker JumpStart SEC Filings Retrieval w/Summarizer and Scoring). A second notebook (Multi-category ML on SEC Filings Data) demonstrates multicategory classification on SEC filings. A third notebook (ML on a TabText (Multimodal) Dataset) shows how to undertake ML on multimodal financial data using the Paycheck Protection Program (PPP) as an example. Four additional text models (RoBERTa-SEC-Base, RoBERTa-SEC-WIKI-Base, RoBERTa-SEC-Large, and RoBERTa-SEC-WIKI-Large) are provided to generate embeddings for transfer learning using pre-trained financial models that have been trained on Wiki text and 10 years of SEC filings.

Finally, a SageMaker JumpStart solution (Corporate Credit Rating Prediction) demonstrates how to use the pipeline of SEC filings (long-form text data) and financial ratios (tabular data) to build corporate credit rating prediction models. This is the solution discussed in this post, the first in a series of posts that describe these new financial analysis ML tools. In this post, we explain how you can use this solution for credit scoring; the solution is fully customizable, so you can accelerate your ML journey.

Credit assessment using ML: SageMaker JumpStart solution

We’re all familiar with individual credit scoring, especially our own FICO credit scores. In this solution, we revisit one of the oldest and most widely used models for corporate credit scoring, the Altman Z-score. The Altman model generates a credit score, where higher scores denote higher credit quality and lower scores denote lower-quality firms.

Altman developed his model in 1968, using data from just 66 firms to fit an accurate bankruptcy prediction model that predicted which firms would default within 1 year. Altman fit this model using Linear Discriminant Analysis (LDA), arguably the first instance of the use of an ML algorithm in academic finance. This seminal paper has generated a family of Altman Z-score models that are used all over the globe. The model requires only a few inputs from a company’s financials and therefore may be applied to public and private firms, small and large. It remains in widespread use today, and it uses tabular data.

In this post, you learn how to use a credit scoring model such as Altman’s Z-score, and enhance the model with financial text from SEC filings. The entire model is presented in the SageMaker JumpStart solution model card titled Corporate Credit Rating Prediction.


The preceding model card appears in SageMaker JumpStart. You can access this model card through SageMaker Studio.

Navigate to that card and deploy the model by choosing Launch.


The following page appears.


You can see a model deployed for inference and an endpoint. Wait until they’re ready and show the status Complete. Choose Open Notebook to open the first notebook, which covers training and endpoint deployment. You can work through this notebook to learn how to use this solution and then modify it for any other application you may want to build on your own data. The solution comes with synthetic data and uses a subset of it to exemplify the steps needed to train the model, deploy it to an endpoint, and invoke the endpoint for inference. The notebook also contains code to deploy an endpoint of your own.

To open the second notebook, choose Use Endpoint in Notebook. This opens the inference notebook to use the already deployed example endpoint. In the inference notebook, you can see how to prepare the data to invoke the example endpoint to do inference on a batch of examples. The endpoint returns predicted ratings, as shown in the following screenshot, in the last code block of the inference notebook.


You can use this solution as a template for a text-enhanced credit rating model. It shows how to take a model based on numeric features (in this case, Altman’s famous five variables) and combine it with SEC filings text to achieve a material improvement in the prediction of credit ratings. You’re not restricted to the Altman variables; you can add more variables as needed or replace them entirely. The main objective in this notebook is to show how to enhance Altman’s Z-score model with text so you can use ML techniques to achieve a best-in-class model.

The Altman model is widely used by a range of users and is therefore taught as part of required coursework by the Corporate Finance Institute (CFI). Altman himself offered a 50-year retrospective on the model in parts 1, 2, and 3, discussing its wide use and misuse. To learn more, watch him on video and read this article. For a critique and improvement on the model, see the article by Seeking Alpha, a well-known investor community. The Z-score Plus model is even available as an app on mobile devices.

Therefore, think of this workflow as a well-established starting point for the use of ML for credit scoring.

How to use this solution

To begin, run the notebooks on the example data included with the solution to gain an understanding of how simple it is to use. You can then modify the notebooks for your own model. The modification includes the following steps:

  1. Bring in your tabular data, with one row for each firm’s financial data. This may be the same as that in the notebook (Altman’s variables) or any others you work with for credit modeling. There is no restriction on the number of variables.
  2. Bring in your text data. The example in this post uses the SEC 10-K/Q filings, specifically the Management Discussion and Analysis section of the filings. If you want to download the latest filings and use them, see the SageMaker JumpStart solution for doing this in a single API call on SageMaker, titled SEC Filings Retrieval w/Summarizer and Scoring. This not only downloads the text you want, but also allows you to enhance the data with NLP scores, summaries, and more as additional columns in the DataFrame so that you can use several features of the text, such as readability, positivity, risk, litigiousness, and sentiment.
  3. Join the data from Steps 1 and 2.
  4. Reuse the solution notebooks with your new dataset with minimal changes required. The training notebook shows how to use Apache MXNet with its AutoGluon package for multimodal ML in a few lines of code and deploy an endpoint. The inference notebook shows how to call the endpoint to get predictions.

That’s it! The solution is self-contained and works with a few clicks.
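
As a rough illustration of Steps 3 and 4 above, the following sketch joins a tabular DataFrame of financial features with a DataFrame of filing text and fits an AutoGluon predictor. The file names and column names (ticker, MDNA, Rating) are assumptions for illustration; the solution notebooks contain the authoritative version.

import pandas as pd
from autogluon.tabular import TabularPredictor

# Hypothetical inputs: financials.csv holds one row of features per ticker,
# filings.csv holds the retrieved MD&A text per ticker.
tabular_df = pd.read_csv("financials.csv")    # e.g., ticker, A, B, C, D, E, Rating
text_df = pd.read_csv("filings.csv")          # e.g., ticker, MDNA

# Step 3: join the tabular and text data on the ticker to form a TabText DataFrame
tabtext_df = tabular_df.merge(text_df, on="ticker", how="inner").drop(columns=["ticker"])

# Step 4: fit a multimodal model; AutoGluon handles the text and categorical columns
predictor = TabularPredictor(label="Rating").fit(train_data=tabtext_df)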

Important: This solution is for demonstrative purposes only. It is not financial advice and should not be relied on as financial or investment advice. The associated notebooks, including the trained model, use synthetic data, and are not intended for production.

Deep dive with examples

You may want to explore the solution further. In the appendix, we offer more detail on credit scoring and some additional simple code to show how to add SEC text to standard tabular features to undertake multimodal ML. All these functionalities are made simple using APIs in SageMaker JumpStart models. We cover the following:

  • A review of Altman’s Z-score, a popular credit scoring model that may be used on private and public entities.
  • A discussion of the multimodal dataset.
  • How to retrieve SEC filings and combine them with tabular data, denoted as TabText. We show how to use the API in the SageMaker JumpStart example notebook titled SEC Filings Retrieval w/Summarizer and Scoring.
  • How to read in the data.
  • How to train and test your model on tabular data only and on TabText. You can observe the improvement in the model as we expand the feature set with text.
  • How to add NLP scores to enhance the feature set. The API for this is also demonstrated in the SageMaker JumpStart example notebook titled SEC Filings Retrieval w/Summarizer and Scoring.


Conclusion

We have seen how to enhance tabular ML models for credit scoring with long-form financial text. You can adapt the training notebook and the inference notebook in the JumpStart solution Corporate Credit Rating Prediction with your own data and labels as follows:

  • Bring a dataset of tabular features for each ticker.
  • For each ticker, use JumpStart’s SEC retrieval engine to download the required SEC filings (for example, the most recent 10-K or 10-Q). Then join the text DataFrame with the tabular DataFrame, and further enhance the DataFrame with engineered features using NLP scoring.
  • Add in labels. For credit, these could be any of the following:
    • Ratings
    • Changes in ratings
    • Z-score
    • Probability of default
    • Defaulted or not
    • Credit spreads
  • Use the AutoGluon AutoML code (shown in the appendix) to train a classification or regression model.
  • Deploy the model to an endpoint. You can then call this endpoint as needed for new data.
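
The exact request format expected by the solution’s endpoint is defined in its inference notebook. Purely as a generic illustration, the following hedged sketch invokes a deployed SageMaker endpoint with boto3; the endpoint name, payload file, and CSV content type are assumptions, not the solution’s actual interface.

import boto3

runtime = boto3.client("sagemaker-runtime")

# "ccr-demo-endpoint" and the CSV payload are placeholders for illustration only;
# check the solution's inference notebook for the actual endpoint name and format.
with open("new_firms.csv", "rb") as f:
    response = runtime.invoke_endpoint(
        EndpointName="ccr-demo-endpoint",
        ContentType="text/csv",
        Body=f.read(),
    )
print(response["Body"].read().decode("utf-8"))   # predicted ratings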

SEC filings aren’t the only text that you can use. You can use any text that contains information about the label. For example, the text of internal rating analyses may be even better than SEC filings.

To get started, you can find the Corporate Credit Rating Prediction solution in SageMaker JumpStart in SageMaker Studio. For more information, see SageMaker JumpStart.

Legal Disclaimers: This post is for demonstrative purposes only. It is not financial advice and should not be relied on as financial or investment advice. This post uses data obtained from the SEC EDGAR database. You are responsible for complying with EDGAR’s access terms and conditions.

Thanks to several team members for support with this work: Miyoung Choi, Vinay Hanumaiah, Cuong Nguyen, Xavier Ragot, Derrick Zhang, Li Zhang, Yue Zhao, and Daniel Zhu.


Appendix

In this appendix, we discuss topics related to this solution.

What is Altman’s Z-score?

The model is based on a well-known bankruptcy prediction approach, from the original paper by Ed Altman (1968). For a brief summary, see Measuring the ‘fiscal-fitness’ of a company: The Altman Z-Score.

The original seminal paper by Altman is: Altman, Edward (September 1968). “Financial Ratios, Discriminant Analysis and the Prediction of Corporate Bankruptcy.” Journal of Finance, 23(4): 589–609. doi:10.1111/j.1540-6261.1968.tb00843.x

The model uses eight inputs from a company’s financials:

  • Current Assets (CA)
  • Current Liabilities (CL)
  • Total Liabilities (TL)
  • EBIT (Earnings Before Interest and Taxes)
  • Total Assets (TA)
  • Net Sales (NS)
  • Retained Earnings (RE)
  • Market Value of Equity (MVE)

These eight inputs translate into the following five financial ratios:

  • A: EBIT / Total Assets
  • B: Net Sales / Total Assets
  • C: Market Value of Equity / Total Liabilities
  • D: Working Capital / Total Assets
  • E: Retained Earnings / Total Assets

These ratios are used to fit a binary classification model on companies that go bankrupt and those that do not. Altman fitted the model using Linear Discriminant Analysis, possibly the earliest use of ML in finance. The linear discriminant function is as follows:

Z-score = 3.3A + 0.99B + 0.6C + 1.2D + 1.4E

These scores translate into suggested company credit quality ranges, which may vary by use, as in the following example:

  • Z > 3.0 : safe
  • 2.7 < Z < 2.99 : caution
  • 1.8 < Z < 2.7 : bankruptcy possible in 2 years
  • Z < 1.8 : high chance of bankruptcy
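
To make the formula and the bands concrete, here is a minimal sketch (not part of the solution code) that computes the Z-score from the eight financial inputs and maps it to the ranges above.

def altman_z_score(ca, cl, tl, ebit, ta, ns, re, mve):
    """Compute the Altman Z-score from the eight financial inputs."""
    a = ebit / ta          # EBIT / Total Assets
    b = ns / ta            # Net Sales / Total Assets
    c = mve / tl           # Market Value of Equity / Total Liabilities
    d = (ca - cl) / ta     # Working Capital / Total Assets
    e = re / ta            # Retained Earnings / Total Assets
    return 3.3 * a + 0.99 * b + 0.6 * c + 1.2 * d + 1.4 * e

def z_band(z):
    """Map a Z-score to the suggested credit quality ranges."""
    if z > 3.0:
        return "safe"
    if z > 2.7:
        return "caution"
    if z > 1.8:
        return "bankruptcy possible in 2 years"
    return "high chance of bankruptcy"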

We enhance Altman’s five-feature set (A, B, C, D, E, stated above) with text from SEC filings to get an improved Z-score model.

Multimodal data

We created a synthetic dataset that combined randomly chosen SEC filings with simulated financial data. Briefly, we created the synthetic dataset using the following steps (ticker names have not been included, so as to not cause confusion with real tickers):

  1. We extracted the Management Discussion and Analysis (MDNA) section of the 10-K/Q filings for a sample of 3,286 firm filings. We added a column to this data for the industry code, because it may also be a useful feature given that firms within an industry may all be impacted by the same factors. See the following section for the use of SageMaker JumpStart to retrieve SEC filings and construct a DataFrame.
  2. We then scored the text for five positive attributes (positivity, sentiment, polarity, safety, certainty) and five negative attributes (negativity, litigiousness, fraud, risk, uncertainty). We added the positive scores and subtracted the negative scores to get a net rank score. We correlate the high rank score firms with high ratings and low rank firms with low ratings. This ensures that the text is correlated with the ratings label.
  3. Then, we used official government sites to get a US balance sheet, income statement, and market statistics as anchors to simulate financials for all 3,286 firms. The financials are normalized so that the total assets for each firm are 100. The eight financial values are checked for consistency (for example, the simulated current liabilities can’t exceed the simulated total liabilities). Such cases are discarded and the financials are regenerated. The anchor statistics are taken from the following:
    1. Balance sheet data
    2. Income statement data
    3. Price to book data, which is used to convert book value of equity into market value of equity (MVE).
    4. Altman’s Z-score averages for the US
  4. The eight financial values are converted into the five ratios needed by the Altman Z-score model.
  5. Z-scores are computed for each firm.
  6. The financial values for companies with high (low) Z-scores are concatenated to the text of companies with high (low) rank scores. Now we have a consolidated multimodal (text, numerical, categorical) dataset.
  7. The high (low) rank companies are assigned high (low) ratings, and the number and mean Z-score of companies is adapted to calibrate with the table from Altman’s slides (slide 9). Z-scores and rank scores are then discarded.

The final dataset (stored as CCR_data.csv) comprises the MD&A text, industry code, and eight financial variables. The last column contains the rating, namely the label for classification. The data contains seven categories of labels: AAA, AA, A, BBB, BB, B, CCC. These labels are not reflective of companies’ actual credit ratings because they are based on synthetically generated data. This synthetic dataset is automatically downloaded when you run the training notebook (described earlier) in the JumpStart solution Corporate Credit Rating Prediction.

Curate TabText

The following code is a template for constructing a text-enhanced credit rating model. It shows how to take a model based on numeric features (in this case, Altman’s five variables) and combine it with SEC filings text to achieve a material improvement in the prediction of credit ratings. In this example, we observe an 8% increase in accuracy (on our example test data) when text is added.

SEC filings are retrieved from the SEC’s Electronic Data Gathering, Analysis, and Retrieval (EDGAR) website, which provides open data access (note the disclaimer in this post). EDGAR is the primary system under the US Securities and Exchange Commission (SEC) for companies and others submitting documents under the Securities Act of 1933, the Securities Exchange Act of 1934, the Trust Indenture Act of 1939, and the Investment Company Act of 1940. EDGAR contains millions of company and individual filings. The system processes about 3,000 filings per day, serves up 3,000 terabytes of data to the public annually, and accommodates 40,000 new filers per year on average. In the following code, we provide a simple, single API call that creates a dataset in a few lines of code, for any period of time and for a large number of tickers.

The API contains three parts:

  • The first section specifies the following:
    • The tickers or SEC CIK codes for the companies whose forms are being extracted
    • The SEC forms types (in this case 10-K, 10-Q)
    • Date range of forms by filing date
    • The output CSV file and Amazon Simple Storage Service (Amazon S3) bucket to store the dataset
  • The second section shows how to assign system resources and has default values in place
  • The final section runs the API

This kicks off a processing job running in a SageMaker container and makes sure that even a very large retrieval can run without requiring the notebook to stay connected.

%%time
dataset_config = EDGARDataSetConfig(
    tickers_or_ciks=['amzn', …, 'FB'],      # list of stock tickers or CIKs
    form_types=['10-K', '10-Q'],            # list of SEC form types
    filing_date_start='2019-01-01',         # starting filing date
    filing_date_end='2020-12-31',           # ending filing date
    email_as_user_agent='')                 # user agent email

data_loader = DataLoader(
    role=sagemaker.get_execution_role(),    # loading job execution role
    instance_count=1,                       # instance count; limit varies with instance type
    instance_type='ml.c5.2xlarge',          # instance type
    volume_size_in_gb=30,                   # size in GB of the EBS volume to use
    volume_kms_key=None,                    # KMS key for the processing volume
    output_kms_key=None,                    # KMS key ID for processing job outputs
    max_runtime_in_seconds=None,            # timeout in seconds; default is 24 hours
    sagemaker_session=sagemaker.Session(),  # session object
    tags=None)                              # a list of key-value pairs

data_loader.load(
    dataset_config,
    's3://{}/{}/{}'.format(bucket, sec_processed_folder, 'output'),  # output S3 prefix (bucket and folder names are required)
    'dataset_10k_10q.csv',                  # output file name
    wait=True,
    logs=True)

The data is stored in a file named dataset_10k_10q.csv, as shown in the preceding code. You can examine the file as follows:

import boto3
import pandas as pd

client = boto3.client('s3')
client.download_file(S3_BUCKET_NAME,
                     '{}/{}'.format(S3_FOLDER_NAME, 'dataset_10k_10q.csv'),
                     'dataset_10k_10q.csv')
data_frame_10k_10q = pd.read_csv('dataset_10k_10q.csv')
data_frame_10k_10q.head()

The mdna column of text from this DataFrame is then combined with financial data to create a composite dataset, stored in a file titled CCR_data.csv, which is read in next. We denote the composite of tabular and text data as TabText.

Read in the TabText dataset

We read in this dataset and examine its properties. It has 11 columns: one text column, one categorical column, eight numerical columns, and a label column (Rating). Although the values in this dataset match broad averages in the economy and we trained a model on it, in practice you should train this model on your own real data.

%pylab inline
import pandas as pd
import os

df = pd.read_csv('CCR_data.csv')
print(df.shape)
df.head()

Next, we convert the financial values into Altman’s five ratios, resulting in the final DataFrame we use for multimodal ML:

df["A"] = df.EBIT/df.TotalAssets
df["B"] = df.NetSales/df.TotalAssets
df["C"] = df.MktValueEquity/df.TotalLiabs
df["D"] = (df.CurrentAssets-df.CurrentLiabs)/df.TotalAssets
df["E"] = df.RetainedEarnings/df.TotalAssets
df = df.drop(["TotalAssets", "CurrentLiabs", "TotalLiabs", "RetainedEarnings",
              "CurrentAssets", "NetSales", "EBIT", "MktValueEquity"], axis=1)
df.head()

The resulting dataset has eight columns: one text column, a categorical column, five numerical columns, and a label column. We have the text of the MD&A section, the industry code, and the five ratios (A, B, C, D, E) developed by Altman as described earlier. The label column is Rating.

As a cross-check, we compute the Z-score for each firm and examine the mean score by rating. The scores decline as the rating of firms drops. This confirms that the dataset captures the relationship between Z-scores and ratings. (Of course, we don’t use the Z-score as a feature.)

df_z = df.drop(['MDNA', 'industry_code'], axis=1)
df_z["Zscore"] = 3.3*df_z.A + 0.99*df_z.B + 0.6*df_z.C + 1.2*df_z.D + 1.4*df_z.E
df_z = df_z.groupby('Rating').mean().reset_index()
df_z.index = [2, 1, 0, 5, 4, 3, 6]
df_z[["Rating", "Zscore"]].sort_index()

Train and test ML models

Our dataset is multimodal and contains the following:

  • A column of long text, with documents exceeding the maximum number of words allowed by transformer models such as BERT
  • A categorical column, industry code
  • Five numerical columns containing the features used by the Altman Z-score model
  • A label for the rating of the company

We use the GluonNLP library, which is based on the MXNet framework. Install the required packages. You can update the following example code with newer releases of MXNet; for newer releases of AutoGluon, see GluonNLP: NLP made easy.

%%capture
!pip install --upgrade pip
!pip install --upgrade setuptools
!pip install --upgrade "mxnet_cu110<2.0.0"
!pip install autogluon==0.2.0

Use only tabular data

First, we mimic the original version of the Altman model with just five financial ratios and industry code—this is just tabular data. We later fit an extended model with text and tabular data.

To start, we choose a binary classification problem, where 1 = {AAA, AA, A, BBB} (namely, investment grade firms) and 0 = {BB, B, CCC} (below investment grade firms). We drop the text (MDNA) column from the dataset. In the solution itself, you will see a multi-category classification task, which we briefly highlight towards the end of the post.

df_tabular = df.copy()
df_tabular = df_tabular.drop(['MDNA'], axis=1)

Prepare the binary label based on rating:

trans_func = lambda x: 1 if x in {'AAA', 'AA', 'A', 'BBB'} else 0
df_tabular['Rating'] = df_tabular['Rating'].transform(trans_func)

Implement an 80/20 train/test split on the data:

from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(df_tabular, test_size=0.2, random_state=42)


We use the parsimonious framework from AutoGluon. This library accepts DataFrames containing text, tabular, and computer vision data, and fits models automatically using a set of well-known classifiers, such as k-nearest neighbors, gradient boosted models, random forest models, extra trees models, XGBoost, and neural network models. These models are then stack-ensembled to get the best weighted model. You can also perform hyperparameter tuning. For full details, see AutoGluon: AutoML for Text, Image, and Tabular Data. Complete the following steps:

  1. Instantiate the AutoGluon model.
  2. Indicate where the trained model will be stored.
  3. Fit the model on the DataFrame. The fit requires just a single line of code.

For full reference, see AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data.

%%time
from autogluon.tabular import TabularPredictor

predictor = TabularPredictor(label="Rating").fit(train_data=train_data)

Next, assess metrics to determine the best-performing model on the test data:

best_model = predictor.get_model_best()
print("Best model: " + best_model)
performance = predictor.evaluate(test_data)
results = predictor.leaderboard(test_data)
results

Note that balanced accuracy is the average recall on both classes, and MCC is the Matthews Correlation Coefficient.
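
As a quick reference, both metrics are available in scikit-learn. The following sketch (reusing the predictor and test_data from the preceding code) shows how they can be computed directly; it is illustrative only and not part of the solution notebooks.

from sklearn.metrics import balanced_accuracy_score, matthews_corrcoef

y_true = test_data["Rating"]              # ground-truth binary labels
y_pred = predictor.predict(test_data)     # model predictions on the test set

print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))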

We can also see the leaderboard generated from the preceding code and presented in order of validation score.


Use multimodal data

We then combine the text and tabular data to get a final model to showcase multimodal ML. The steps remain exactly the same as before. You don’t need to perform vectorization of the text or one-hot encoding of the categorical variable. All this is handled by MXNet/AutoGluon. Even the label is auto-detected, so the class of problem doesn’t need to be specified.

Because the text in these sections is very long (thousands of words), we can’t use transformers directly, because they can handle only a restricted number of tokens (usually fewer than 1,000). Therefore, AutoGluon uses TF-IDF with n-grams to transform the text into numerical vectors and then applies ML to the text and tabular data.
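
AutoGluon performs this text vectorization internally. Purely to illustrate the idea, the following sketch shows a comparable TF-IDF n-gram transformation with scikit-learn; the vocabulary cap of 5,000 is an arbitrary choice for the example, not what AutoGluon uses.

from sklearn.feature_extraction.text import TfidfVectorizer

# Unigrams and bigrams with a capped vocabulary: a rough stand-in for the
# text featurization that happens before the tabular models see the data.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
text_features = vectorizer.fit_transform(df["MDNA"])   # sparse matrix, one row per filing
print(text_features.shape)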

We fit a model with very few lines of code. This time, we don’t drop the text column containing the MD&A:

df_tabtext = df.copy()                                               # copy the full DataFrame
trans_func = lambda x: 1 if x in {'AAA', 'AA', 'A', 'BBB'} else 0
df_tabtext['Rating'] = df_tabtext['Rating'].transform(trans_func)    # add the binary label
train_data, test_data = train_test_split(df_tabtext, test_size=0.2, random_state=42)  # re-split on the TabText data
predictor = TabularPredictor(label="Rating", path=model_path).fit(
    train_data=train_data, excluded_model_types=['FASTAI'])          # fit the model
print("Best model: " + predictor.get_model_best())                   # show the best model
performance = predictor.evaluate(test_data)
print(predictor.leaderboard(test_data, silent=True))

Accuracy on the test dataset has increased to 93% (on TabText) versus 85% (on the tabular dataset).

We can also see the leaderboard generated from the preceding code, presented in order of validation score.


Further enhance the feature set with NLP scoring

SageMaker JumpStart has its own SDK with an API to further enhance the feature set with numerical values that score the text (in the MDNA column) of the dataset for various attributes. To see how to use this API, refer to the JumpStart example notebook SEC Filings Retrieval w/Summarizer and Scoring. The API adds columns with values based on the percentage of words in the text that match separate word lists for each attribute, or on an algorithm such as sentiment scoring or readability. There are 11 attributes: negative, certainty, risk, uncertainty, safe, fraud, litigious, positive, polarity, sentiment, and readability.

We use the Gunning fog index to calculate the readability score. Sentiment analysis uses VADER. Polarity calculation uses positive and negative word lists. The other NLP scores deliver the similarity (word frequency) with the default word lists (positive, negative, litigious, risk, fraud, safe, certainty, and uncertainty) provided through the smjsindustry library. You can also provide your own word list to calculate the NLP score for your own scoring types.

These numerical scores are added as new columns to the text DataFrame. This creates a multimodal DataFrame that is a mixture of tabular data and long-form text, called TabText. When you submit this DataFrame for ML, it’s a good idea to normalize the columns of NLP scores (usually with standard normalization or min-max scaling).
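
For example, assuming the added score columns are named after the attributes listed above (the column names here are assumptions for illustration), min-max scaling could be applied as follows.

from sklearn.preprocessing import MinMaxScaler

# Hypothetical names for the NLP score columns added to the DataFrame
nlp_cols = ["positive", "negative", "risk", "litigious", "sentiment", "readability"]
df[nlp_cols] = MinMaxScaler().fit_transform(df[nlp_cols])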

These scoring metrics are simple and report the proportion of words in a document that occur in a specified word list. The word lists aren’t the traditional financial word lists that are human curated, but are word lists that are generated from word embeddings that are close to the concepts that are being scored. Therefore, they may also contain words that don’t obviously relate to a concept (e.g., risk), but their occurrence implies the presence of discussion related to the concept. You can even bring your own word lists to quantify additional concepts (for example, ESG). This API call is shown in the following code:

import sagemaker
from smjsindustry import NLPScoreType, NLPSCORE_NO_WORD_LIST
from smjsindustry import NLPScorer, NLPScorerConfig

score_type_list = list(
    NLPScoreType(score_type, [])
    for score_type in NLPScoreType.DEFAULT_SCORE_TYPES
    if score_type not in NLPSCORE_NO_WORD_LIST
)
score_type_list.extend([NLPScoreType(score_type, None) for score_type in NLPSCORE_NO_WORD_LIST])

nlp_scorer_config = NLPScorerConfig(score_type_list)

nlp_score_processor = NLPScorer(
    ROLE,
    1,
    'ml.c5.18xlarge',
    volume_size_in_gb=30,
    volume_kms_key=None,
    output_kms_key=None,
    max_runtime_in_seconds=None,
    sagemaker_session=sagemaker.Session(),
    tags=None)

nlp_score_processor.calculate(
    nlp_scorer_config,
    "MDNA",
    "CCR_data_input.csv",
    's3://{}/{}'.format(BUCKET, "nlp_score"),
    'ccr_nlp_score_sample.csv')

This generates an extended DataFrame.


Instead of training for binary classification as we did earlier, we can use the seven rating classes in the dataset for multicategory classification. The details of training the model on this extended DataFrame are provided in the training notebook of the Corporate Credit Rating Prediction solution in SageMaker JumpStart. The final performance on a sample of the data is shown in the confusion matrix.


We can observe that the trained model is accurate on the test dataset, even though we trained it on a small subset of the data.

SageMaker makes it simple to deploy the model to an endpoint. As we discussed, you can then use this for inference, and the technical details (a few lines of code) are also shown in the training and inference notebooks that come with this solution.

About the Authors

Dr. Sanjiv Das is an Amazon Scholar and the Terry Professor of Finance and Data Science at Santa Clara University. He holds post-graduate degrees in Finance (M.Phil and Ph.D. from New York University) and Computer Science (M.S. from UC Berkeley), and an MBA from the Indian Institute of Management, Ahmedabad. Prior to being an academic, he worked in the derivatives business in the Asia-Pacific region as a Vice President at Citibank. He works on multimodal machine learning in the area of financial applications.

Dr. John He is a senior software development engineer with Amazon AI, where he focuses on machine learning and distributed computing. He holds a PhD degree from CMU.

Shenghua Yue is a Software Development Engineer at Amazon SageMaker. She focuses on building machine learning tools and products for customers.




Enhance the caller experience with hints in Amazon Lex

We understand speech input better if we have some background on the topic of conversation. Consider a customer service agent at an auto parts wholesaler helping with orders. If the agent knows that the customer is looking for tires, they’re more likely to recognize responses (for example, “Michelin”) on the phone. Agents often pick up such clues or hints based on their domain knowledge and access to business intelligence dashboards. Amazon Lex now supports a hints capability to enhance the recognition of relevant phrases in a conversation. You can programmatically provide phrases as hints during a live interaction to influence the transcription of spoken input. Better recognition drives efficient conversations, reduces agent handling time, and ultimately increases customer satisfaction.

In this post, we review the runtime hints capability and use it to implement verification of callers based on their mother’s maiden name.

Overview of the runtime hints capability

You can provide a list of phrases or words to help your bot with the transcription of speech input. You can use these hints with built-in slot types such as first and last names, street names, city, state, and country. You can also configure these for your custom slot types.

You can use the capability to transcribe names that may be difficult to pronounce or understand. For example, in the following sample conversation, we use it to transcribe the name “Loreck.”

Conversation 1

IVR: Welcome to ACME bank. How can I help you today?

Caller: I want to check my account balance.

IVR: Sure. Which account should I pull up?

Caller: Checking

IVR: What is the account number?

Caller: 1111 2222 3333 4444

IVR: For verification purposes, what is your mother’s maiden name?

Caller: Loreck

IVR: Thank you. The balance on your checking account is 123 dollars.

Words provided as hints are preferred over other similar words. For example, in the second sample conversation, the runtime hint (“Smythe”) is selected over a more common transcription (“Smith”).

Conversation 2

IVR: Welcome to ACME bank. How can I help you today?

Caller: I want to check my account balance.

IVR: Sure. Which account should I pull up?

Caller: Checking

IVR: What is the account number?

Caller: 5555 6666 7777 8888

IVR: For verification purposes, what is your mother’s maiden name?

Caller: Smythe

IVR: Thank you. The balance on your checking account is 456 dollars.

If the name doesn’t match the runtime hint, you can fail the verification and route the call to an agent.

Conversation 3

IVR: Welcome to ACME bank. How can I help you today?

Caller: I want to check my account balance.

IVR: Sure. Which account should I pull up?

Caller: Savings

IVR: What is the account number?

Caller: 5555 6666 7777 8888

IVR: For verification purposes, what is your mother’s maiden name?

Caller: Jane

IVR: There is an issue with your account. For support, you will be forwarded to an agent.

Solution overview

Let’s review the overall architecture for the solution (see the following diagram):

  • We use an Amazon Lex bot integrated with an Amazon Connect contact flow to deliver the conversational experience.
  • We use a dialog codehook in the Amazon Lex bot to invoke an AWS Lambda function that provides the runtime hint at the previous turn of the conversation.
  • For the purposes of this post, the mother’s maiden name data used for authentication is stored in an Amazon DynamoDB table.
  • After the caller is authenticated, control is passed to the bot to perform transactions (for example, check balance).

In addition to the Lambda function, you can also send runtime hints to Amazon Lex V2 using the PutSession, RecognizeText, RecognizeUtterance, or StartConversation operations. The runtime hints can be set at any point in the conversation and are persisted at every turn until cleared.
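
As an example, a hedged sketch of passing a runtime hint with the RecognizeText operation is shown below. The bot IDs, intent name, and slot name are placeholders, and the exact structure of the runtimeHints field should be verified against the Amazon Lex V2 API reference.

import boto3

lex = boto3.client("lexv2-runtime")

response = lex.recognize_text(
    botId="BOT_ID",                  # placeholder
    botAliasId="BOT_ALIAS_ID",       # placeholder
    localeId="en_GB",
    sessionId="caller-session-1",
    text="Smythe",
    sessionState={
        "runtimeHints": {            # hint the expected maiden name for this caller
            "slotHints": {
                "CheckBalance": {                      # intent name (placeholder)
                    "MotherMaidenName": {              # slot name (placeholder)
                        "runtimeHintValues": [{"phrase": "Smythe"}]
                    }
                }
            }
        }
    },
)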

Deploy the sample Amazon Lex bot

To create the sample bot and configure the runtime phrase hints, perform the following steps. This creates an Amazon Lex bot called BankingBot, and one slot type (accountNumber).

  1. Download the Amazon Lex bot.
  2. On the Amazon Lex console, choose Actions, Import.
  3. Choose the file that you downloaded, and choose Import.
  4. Choose the bot BankingBot on the Amazon Lex console.
  5. Choose the language English (GB).
  6. Choose Build.
  7. Download the supporting Lambda code.
  8. On the Lambda console, create a new function and select Author from scratch.
  9. For Function name, enter BankingBotEnglish.
  10. For Runtime, choose Python 3.8.
  11. Choose Create function.
  12. In the Code source section, open and delete the existing code.
  13. Download the function code and open it in a text editor.
  14. Copy the code and enter it into the empty function code field.
  15. Choose Deploy.
  16. On the Amazon Lex console, select the bot BankingBot.
  17. Choose Deployment and then Aliases, then choose the alias TestBotAlias.
  18. On the Aliases page, choose Languages and choose English (GB).
  19. For Source, select the bot BankingBotEnglish.
  20. For Lambda version or alias, enter $LATEST.
  21. On the DynamoDB console, choose Create table.
  22. Provide the name as customerDatabase.
  23. Provide the partition key as accountNumber.
  24. Add an item with accountNumber: “1111222233334444” and mothersMaidenName “Loreck”.
  25. Add item with accountNumber: “5555666677778888” and mothersMaidenName “Smythe”.
  26. Make sure the Lambda function has permissions to read from the DynamoDB table customerDatabase.
  27. On the Amazon Connect console, choose Contact flows.
  28. In the Amazon Lex section, select your Amazon Lex bot and make it available for use in the Amazon Connect contact flow.
  29. Download the contact flow to integrate with the Amazon Lex bot.
  30. Choose the contact flow to load it into the application.
  31. Make sure the right bot is configured in the “Get Customer Input” block.
  32. Choose a queue in the “Set working queue” block.
  33. Add a phone number to the contact flow.
  34. Test the IVR flow by calling in to the phone number.

Test the solution

You can now call in to the Amazon Connect phone number and interact with the bot.


Conclusion

Runtime hints allow you to influence the transcription of words or phrases dynamically in the conversation. You can use business logic to identify the hints as the conversation evolves. Better recognition of the user input allows you to deliver an enhanced experience. You can configure runtime hints via the Lex V2 SDK. The capability is available in all AWS Regions where Amazon Lex operates in the English (Australia), English (UK), and English (US) locales.

To learn more, refer to runtime hints.

About the Authors

Kai Loreck is a professional services Amazon Connect consultant. He works on designing and implementing scalable customer experience solutions. In his spare time, he can be found playing sports, snowboarding, or hiking in the mountains.

Anubhav Mishra is a Product Manager with AWS. He spends his time understanding customers and designing product experiences to address their business challenges.

Sravan Bodapati is an Applied Science Manager at AWS Lex. He focuses on building cutting edge Artificial Intelligence and Machine Learning solutions for AWS customers in ASR and NLP space. In his spare time, he enjoys hiking, learning economics, watching TV shows and spending time with his family.


Continue Reading


The Intel® 3D Athlete Tracking (3DAT) scalable architecture deploys pose estimation models using Amazon Kinesis Data Streams and Amazon EKS


This blog post is co-written by Jonathan Lee, Nelson Leung, Paul Min, and Troy Squillaci from Intel. 

In Part 1 of this post, we discussed how Intel® 3DAT collaborated with AWS Machine Learning Professional Services (MLPS) to build a scalable AI SaaS application. 3DAT uses computer vision and AI to recognize, track, and analyze over 1,000 biomechanics data points from standard video. It allows customers to create rich and powerful biomechanics-driven products, such as web and mobile applications with detailed performance data and three-dimensional visualizations.

In Part 2 of this post, we dive deeper into each stage of the architecture. We explore the AWS services used to meet the 3DAT design requirements, including Amazon Kinesis Data Streams and Amazon Elastic Kubernetes Service (Amazon EKS), in order to scalably deploy the necessary pose estimation models for this software as a service (SaaS) application.

Architecture overview

The primary goal of the MLPS team was to productionalize the 2D and 3D pose estimation model pipelines and create a functional and scalable application. The following diagram illustrates the solution architecture.

The complete architecture is broken down into five major components:

  • User application interface layers
  • Database
  • Workflow orchestration
  • Scalable pose estimation inference generation
  • Operational monitoring

Let’s go into detail on each component, their interactions, and the rationale behind the design choices.

User application interface layers

The following diagram shows the application interface layers that provide user access and control of the application and its resources.

These access points support different use cases based on different customer personas. For example, an application user can submit a job via the CLI, whereas a developer can build an application using the Python SDK and embed pose estimation intelligence into their applications. The CLI and SDK are built as modular components—both layers are wrappers of the API layer, which is built using Amazon API Gateway to resolve the API calls and associated AWS Lambda functions, which take care of the backend logic associated with each API call. These layers were a crucial component for the Intel OTG team because it opens up a broad base of customers that can effectively use this SaaS application.

API layer

The solution has a core set of nine APIs, which correspond to the types of objects that operate on this platform. Each API has a Python file that defines the API actions that can be run. New objects are automatically assigned sequential object IDs upon creation. The attributes of these objects are stored and tracked in the Amazon Aurora Serverless database using this ID. Therefore, the API actions tie back to functions that are defined in a central file that contains the backend logic for querying the Aurora database. This backend logic uses the Boto3 Amazon RDS DataService client to access the database cluster.

The one exception is the /job API, which has a create_job method that handles video submission for creating a new processing job. This method starts the AWS Step Functions workflow logic for running the job. By passing in a job_id, this method uses the Boto3 Step Functions client to call the start_execution method for a specified stateMachineARN (Amazon Resource Name).

The eight object APIs have similar methods and access patterns, as summarized in the following table.

Method Type Function Name Description
GET list_[object_name]s Selects all objects of this type from the database and displays.
POST create_[object] Inserts a new object record with required inputs into the database.
GET get_[object] Selects object attributes based on the object ID from the database and displays.
PUT update_[object] Updates an existing object record with the required inputs.
DELETE delete_[object] Deletes an existing object record from the database based on object ID.

The details of the nine APIs are as follows:

  1. /user – A user is the identity of someone authorized to submit jobs to this application. The creation of a user requires a user name, user email, and group ID that the user belongs to.
  2. /user_group – A user group is a collection of users. Every user group is mapped to one project and one pipeline parameter set. To have different tiers (in terms of infrastructural resources and pipeline parameters), users are divided into user groups. Each user can belong to only one user group. The creation of a user group requires a project ID, pipeline parameter set ID, user group name, and user group description. Note that user groups are different from user roles defined in the AWS account. The latter are used to provide different levels of access based on access roles (for example, admin).
  3. /project – A project is used to group different sets of infrastructural resources together. A project is associated with a single project_cluster_url (Aurora cluster) for recording users, jobs, and other metadata, a project_queue_arn (Kinesis Data Streams ARN), and a compute runtime environment (currently controlled via Cortex) used for running inference on the frame batches or postprocessing on the videos. Each user group is associated to one project, and this mechanism is how different tiers are enabled in terms of latency and compute power for different groups of users. The creation of a project requires a project name, project cluster URL, and project queue ARN.
  4. /pipeline – A pipeline is associated with a single configuration for a sequence of processing containers that perform video processing in the Amazon EKS inference generation cluster coordinated by Cortex (see the section on video processing inference generation for more details). Typically, this consists of three containers: preprocessing and decoding, object detection, and pose estimation. For example, the decode and object detection step are the same for the 2D and 3D pipelines, but swapping out the last container using either HRNet or 3DMPPE results in the parameter set for 2D vs. 3D processing pipelines. You can create new configurations to define possible pipelines that can be used for processing, and it requires as input a new Python file in the Cortex repo that details the sequence of model endpoints call that define that pipeline (see the section on video processing inference generation). The pipeline endpoint is the Cortex endpoint that is called to process a single frame. The creation of a pipeline requires a pipeline name, pipeline description, and pipeline endpoint.
  5. /pipeline_parameter_set – A pipeline parameter set is a flexible JSON collection of multiple parameters (a pipeline configuration runtime) for a particular pipeline, and is added to provide flexibility for future customization when multiple pipeline configuration runtimes are required. User groups can be associated with a particular pipeline parameter set, and its purpose is to have different groups of parameters per user group and per pipeline. This was an important forward-looking addition for Intel OTG to build in customization that supports portability as different customers, particularly ISVs, start using the application.
  6. /pipeline_parameters – A single collection of pipeline parameters is an instantiation of a pipeline parameter set. This makes it a 1:many mapping of a pipeline parameter set to pipeline parameters. This API requires a pipeline ID to associate with the set of pipeline parameters that enables the creation of a pipeline for a 1:1 mapping of pipeline parameters to the pipeline. The other inputs required by this API are a pipeline parameter set ID, pipeline parameters value, and pipeline parameters name.
  7. /video – A video object is used to define individual videos that make up a .zip package submitted during a job. This file is broken up into multiple videos after submission. A video is related to the job_id for the job where the .zip package is submitted, and Amazon Simple Storage Service (Amazon S3) paths for the location of the raw separated videos and the postprocessing results of each video. The video object also contains a video progress percentage, which is consistently updated during processing of individual frame batches of that video, as well as a video status flag for complete or not complete. The creation of a video requires a job ID, video path, video results path, video progress percentage, and video status.
  8. /frame_batch – A frame_batch object is a mini-batch of frames created by sampling a single video. Separating a video into regular-sized frame batches provides a lever to tune latency and increases parallelization and throughput. This is the granular unit that is run through Kinesis Data Streams for inference. A creation of a frame batch requires a video ID, frame batch start number, frame batch end number, frame batch input path, frame batch results path, and frame batch status.
  9. /job – This interaction API is used for file submission to start a processing job. This API has a different function from other object APIs because it’s the direct path to interact with the video processing backend Step Functions workflow coordination and Amazon EKS cluster. This API requires a user ID, project ID, pipeline ID, pipeline parameter set ID, job parameters, and job status. In the job parameters, an input file path is specified, which is the location in Amazon S3 where the .zip package of videos to be processed is located. File upload is handled with the upload_handler method, which generates a presigned S3 URL for the user to place a file. A WORKFLOW_STATEMACHINE_ARN is an environment variable that is passed to the create_job API to specify where a video .zip package with an input file path is submitted to start a job.

The following table summarizes the API’s functions.

Method Type Function Description
GET list_jobs Selects all jobs from the database and displays.
POST create_ job Inserts a new job record with user ID, project ID, pipeline ID, pipeline parameter set ID, job results path, job parameters, and job status.
GET get_ job Selects job attributes based on job ID from the database and displays.
GET upload_handler Generates a presigned S3 URL as the location for the .zip file upload. Requires an S3 bucket name and expects an application/zip file type.
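
Putting the job API together, a hypothetical submission flow might look like the following sketch. The API endpoint, response field names, and request payload are assumptions based on the description above, not the actual interface of the deployed API.

import requests

API = "https://example.execute-api.us-west-2.amazonaws.com/prod"   # placeholder endpoint

# 1. Ask the API for a presigned S3 URL and upload the .zip package of videos
upload_info = requests.get(f"{API}/job/upload_handler").json()     # field names are assumptions
with open("videos.zip", "rb") as f:
    requests.put(upload_info["upload_url"], data=f,
                 headers={"Content-Type": "application/zip"})

# 2. Create the job, pointing the job parameters at the uploaded file
job = requests.post(f"{API}/job", json={
    "user_id": 1,
    "project_id": 1,
    "pipeline_id": 1,
    "pipeline_parameter_set_id": 1,
    "job_parameters": {"input_file_path": upload_info["input_file_path"]},
    "job_status": "SUBMITTED",
}).json()
print(job)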

Python SDK layer

Building upon the APIs, the team created a Python SDK client library as a wrapper to make it easier for developers to access the API methods. They used the open-source Poetry, which handles Python packaging and dependency management. They created a file that contains functions wrapping each of the APIs using the Python requests library to handle API requests and exceptions.

For developers to launch the Intel 3DAT SDK, they need to install and build the Poetry package. Then, they can add a simple Python import of intel_3dat_sdk to any Python code.

To use the client, you can create an instance of the client, specifying the API endpoint:

import intel_3dat_sdk

client = intel_3dat_sdk.client(endpoint_url="API Endpoint Here")

You can then use the client to call the individual methods such as the create_pipeline method (see the following code), taking in the proper arguments such as pipeline name and pipeline description.

client.create_pipeline({
    "pipeline_name": "pipeline_1",
    "pipeline_description": "first instance of a pipeline"
})

CLI layer

Similarly, the team built on the APIs to create a command line interface for users who want to access the API methods with a straightforward interface without needing to write Python code. They used the open-source Python package Click (Command Line Interface Creation Kit). The benefits of this framework are the arbitrary nesting of commands, automatic help page generation, and support of lazy loading of subcommands at runtime. In the same file as in the SDK, each SDK client method was wrapped using Click and the required method arguments were converted to command line flags. The flag inputs are then used when calling the SDK command.

To launch the CLI, you can use the CLI configure command. You’re prompted for the endpoint URL:

$ intel-cli configure
Endpoint url:

Now you can use the CLI to call different commands related to the API methods, for example:

$ intel-cli list-users
[
    {
        "user_id": 1,
        "group_id": 1,
        "user_name": "Intel User",
        "user_email":
    }
]


Database

As a database, this application uses Aurora Serverless to store metadata associated with each of the APIs, with MySQL as the database engine. Choosing the Aurora Serverless database service adheres to the design principle of minimizing infrastructural overhead by utilizing serverless AWS services when possible. The following diagram illustrates this architecture.

The serverless engine mode meets the intermittent usage pattern because this application scales up to new customers and workloads are still uncertain. When launching a database endpoint, a specific DB instance size isn’t required, only a minimum and maximum range for cluster capacity. Aurora Serverless handles the appropriate provisioning of a router fleet and distributes the workload amongst the resources. Aurora Serverless automatically performs backup retention for a minimum of 1 day up to 35 days. The team optimized for safety by setting the default to the maximum value of 35.

In addition, the team used the Data API to handle access to the Aurora Serverless cluster, which doesn’t require a persistent connection, and instead provides a secure HTTP endpoint and integration with AWS SDKs. This feature uses AWS Secrets Manager to store the database credentials so there is no need to explicitly pass credentials. CREATE TABLE scripts were written in .sql files for each of the nine tables that correspond to the nine APIs. Because this database contained all the metadata and state of objects in the system, the API methods were run using the appropriate SQL commands (for example select * from Job for the list_jobs API) and passed to the execute_statement method from the Amazon RDS client in the Data API.
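
As an illustration of this access pattern, a query through the Data API looks roughly like the following sketch; the cluster ARN, secret ARN, and database name are placeholders.

import boto3

rds_data = boto3.client("rds-data")

response = rds_data.execute_statement(
    resourceArn="arn:aws:rds:us-west-2:123456789012:cluster:project-cluster",   # placeholder
    secretArn="arn:aws:secretsmanager:us-west-2:123456789012:secret:db-creds",  # placeholder
    database="project_db",                                                      # placeholder
    sql="SELECT * FROM Job",     # e.g., the query backing the list_jobs API method
)
for record in response["records"]:
    print(record)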

Workflow orchestration

The functional backbone of the application was handled using Step Functions to coordinate the workflow, as shown in the following diagram.

The state machine consisted of a sequence of four Lambda functions, which starts when a job is submitted using the create_job method from the job API. The user ID, project ID, pipeline ID, pipeline parameter set ID, job results path, job parameters, and job status are required for job creation. You can first upload a .zip package of video files using the upload_handler method from the job API to generate a presigned S3 URL. During job submission, the input file path is passed via the job parameters, to specify the location of the file. This starts the run of the workflow state machine, triggering four main steps:

  • Initializer Lambda function
  • Submitter Lambda function
  • Completion Check Lambda function
  • Collector Lambda function

Initializer Lambda function

The main function of the Initializer is to separate the .zip package into individual video files and prepare them for the Submitter. First, the .zip file is downloaded, and then each individual file, including video files, is unzipped and extracted. The videos, preferably in .mp4 format, are uploaded back into an S3 bucket. Using the create_video method in the video API, a video object is created with the video path as input. This inserts data on each video into the Aurora database. Any other file types, such as JSON files, are considered metadata and similarly uploaded, but no video object is created. A list of the names of files and video files extracted is passed to the next step.

Submitter Lambda function

The Submitter function takes the video files that were extracted by the Initializer and creates mini-batches of video frames as images. Most current computer vision models in production have been trained on images so even when video is processed, they’re first separated into image frames before model inference. This current solution using a state-of-the-art pose estimation model is no different—the frame batches from the Submitter are passed to Kinesis Data Streams to initiate the inference generation step.

First, the video file is downloaded by the Lambda function. The frame rate and number of frames are calculated using the FileVideoStream module from the imutils video processing library. The frames are extracted and grouped according to a specified mini-batch size, which is one of the key tunable parameters of this pipeline. Using the Python pickle library, the data is serialized and uploaded to Amazon S3. Subsequently, a frame batch object is created and the metadata entry is created in the Aurora database. This Lambda function was built using a Dockerfile with dependencies on the opencv-python, numpy, and imutils libraries.
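
A simplified sketch of this frame-batching step (omitting the FileVideoStream wrapper, database updates, and error handling, and using placeholder bucket and prefix names) might look like the following.

import pickle
import boto3
import cv2

def submit_frame_batches(video_path, bucket, prefix, batch_size=30):
    """Read a video, group frames into mini-batches, and upload each batch to S3."""
    s3 = boto3.client("s3")
    cap = cv2.VideoCapture(video_path)
    frames, batch_index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
        if len(frames) == batch_size:
            key = f"{prefix}/frame_batch_{batch_index}.pkl"
            s3.put_object(Bucket=bucket, Key=key, Body=pickle.dumps(frames))
            frames, batch_index = [], batch_index + 1
    if frames:   # upload any remaining partial batch
        key = f"{prefix}/frame_batch_{batch_index}.pkl"
        s3.put_object(Bucket=bucket, Key=key, Body=pickle.dumps(frames))
    cap.release()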

Completion Check Lambda function

The Completion Check function continues to query the Aurora database to see, for each video in the .zip package for this current job, how many frame batches are in the COMPLETED status. When all frame batches for all videos are complete, then this check process is complete.

Collector Lambda function

The Collector function takes the outputs of the inferences that were performed on each frame during the Consumer stage and combines them across a frame batch and across a video. The combined, merged data is then uploaded to an S3 bucket. The function then invokes the Cortex postprocessing API for a particular ML pipeline in order to perform any postprocessing computations, and adds the aggregated results by video to the output bucket. Many of these metrics, such as speed, acceleration, and joint angle, are calculated across frames, so this calculation needs to be performed on the aggregated data. The main outputs include body key points data (aggregated into CSV format), BMA calculations (such as acceleration), and a visual overlay of key points added to each frame in an image file.

Scalable pose estimation inference generation

This stage contains the processing engine that powers the scaling of ML inference. It involves three main pieces, each with its own concurrency levers that can be tuned for latency tradeoffs (see the following diagram).

This architecture allows for experimentation when testing latency gains, and provides flexibility for the future, when workloads may change with different mixes of end-user segments accessing the application.

Kinesis Data Streams

The team chose Kinesis Data Streams because it's typically used to handle streaming data, and it's a good fit here because it can handle frame batches in a similar way while providing scalability and parallelization. In the Submitter Lambda function, the Kinesis Boto3 client is used, with the put_record method passing in the metadata associated with a single frame batch, such as the frame batch ID, starting frame, ending frame, image shape, frame rate, and video ID.
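
The following is a minimal sketch of that call; the stream name, payload layout, and partitioning choice are illustrative assumptions:

# Hypothetical sketch of the put_record call made for each frame batch.
import json

import boto3

kinesis = boto3.client("kinesis")

def publish_frame_batch(stream_name, batch_metadata):
    # batch_metadata carries fields such as frame_batch_id, start_frame,
    # end_frame, image_shape, frame_rate, and video_id
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(batch_metadata),
        PartitionKey=str(batch_metadata["video_id"]),  # spread batches across shards by video
    )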

We defined various job queue and Kinesis data stream configurations to set levels of throughput that tie back to the priority level of different user groups. Access to different levels of processing power is assigned by passing a project queue ARN when creating a new project using the project API. Every user group is then linked to a particular project during user group creation. Three default stream configurations are defined in the AWS Serverless Application Model (AWS SAM) infrastructure template:

  • Standard – JobStreamShardCount
  • Priority – PriorityJobStreamShardCount
  • High priority – HighPriorityJobStreamShardCount

The team used a few different levers to differentiate the processing power of each stream or tune the latency of the system, as summarized in the following table.

Lever | Description | Default value
Shard | A shard is native to Kinesis Data Streams; it's the base unit of throughput for ingestion. Each shard supports 1 MB/sec of ingestion, which equates to 1,000 data records per second. | 2
KinesisBatchSize | The maximum number of records that Kinesis Data Streams retrieves in a single batch before invoking the Consumer Lambda function. | 1
KinesisParallelizationFactor | The number of batches to process from each shard concurrently. | 1
Enhanced fan-out | Data consumers with enhanced fan-out activated get a dedicated ingestion throughput per consumer (such as the default 1 MB/sec) instead of sharing throughput across consumers. | Off

Consumer Lambda function

From the perspective of Kinesis Data Streams, a data consumer is an AWS service that retrieves data from a data stream shard as data is generated in a stream. This application uses a Consumer Lambda function, which is invoked when messages are passed from the data streams. Each Consumer function invocation processes one frame batch by performing the following steps. First, a synchronous call is made to the Cortex processor API, which is the endpoint that hosts the model inference pipeline (see the next section regarding Amazon EKS with Cortex for more details). The results are stored in Amazon S3, and the database is updated by changing the status of the processed frame batch to Complete. Error handling is built in to manage the Cortex API call, with a retry loop to handle any 504 errors from the Cortex cluster and the number of retries set to 5.
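
The following is a minimal sketch of the Consumer handler. The endpoint URL and results bucket are placeholders, and update_frame_batch_status is a hypothetical helper for the Aurora update:

# Hypothetical sketch of the Consumer: decode the Kinesis records, call the
# Cortex processor endpoint synchronously, retry on 504s, store the output,
# and mark the frame batch Complete.
import base64
import json
import time

import boto3
import requests

s3 = boto3.client("s3")

CORTEX_PROCESSOR_URL = "https://example-cortex-endpoint/pipeline"  # assumed pipeline API URL
RESULTS_BUCKET = "example-results-bucket"                          # assumed bucket name
MAX_RETRIES = 5

def handler(event, context):
    for record in event["Records"]:
        batch_meta = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Call the Cortex pipeline endpoint, retrying only on gateway timeouts (504)
        for attempt in range(MAX_RETRIES):
            response = requests.post(CORTEX_PROCESSOR_URL, json=batch_meta, timeout=300)
            if response.status_code != 504:
                break
            time.sleep(2 ** attempt)
        response.raise_for_status()
        # Store the inference output and mark the frame batch Complete in the database
        key = "inference/" + str(batch_meta["frame_batch_id"]) + ".json"
        s3.put_object(Bucket=RESULTS_BUCKET, Key=key, Body=json.dumps(response.json()))
        update_frame_batch_status(batch_meta["frame_batch_id"], "Complete")  # assumed DB helper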

Amazon EKS with Cortex for ML inference

The team used Amazon EKS, a managed Kubernetes service on AWS, as the compute engine for ML inference. Hosting the ML endpoints on Amazon EKS provides the flexibility of running upstream Kubernetes, with clusters either fully managed in AWS via AWS Fargate or running on on-premises hardware via Amazon EKS Anywhere. This was a critical piece of functionality for Intel OTG because it provides the option of hooking this application up to specialized on-premises hardware, for example.

The three ML models that served as building blocks for the inference pipelines were a custom Yolo model (for object detection), a custom HRNet model (for 2D pose estimation), and a 3DMPPE model (for 3D pose estimation); see the previous ML section for more details. The team used the open-source Cortex library to handle deployment and management of the ML inference pipeline endpoints, as well as launching and deploying the Amazon EKS clusters. Each of these models was packaged into a Docker container (model files were stored in Amazon S3 and model images in Amazon Elastic Container Registry (Amazon ECR)) and deployed as a Cortex Realtime API. CPU and GPU versions of the model containers were created to provide flexibility in the type of compute hardware. In the future, if additional models or model pipelines are needed, the team can simply create additional Cortex Realtime APIs.

They then constructed inference pipelines by composing the Cortex Realtime model APIs into Cortex Realtime pipeline APIs. A single Realtime pipeline API consisted of calling a sequence of Realtime model APIs. The Consumer Lambda functions treated a pipeline API as a black box, using a single API call to retrieve the final inference output for an image. Two pipelines were created: a 2D pipeline and a 3D pipeline.

The 2D pipeline combines a decoding preprocessing step, object detection using a custom Yolo model to locate the athlete and produce bounding boxes, and finally a custom HRNet model for creating 2D key points for pose estimation.

The 3D pipeline combines a decoding preprocessing step, object detection using a custom Yolo model to locate the athlete and produce bounding boxes, and finally a 3DMPPE model for creating 3D key points for pose estimation.
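
As an illustration of this composition (not the actual Cortex predictor code), the 2D pipeline could be sketched as a pair of HTTP calls against assumed model API URLs:

# Hypothetical sketch of chaining the Realtime model APIs for the 2D pipeline:
# object detection with the Yolo endpoint, then 2D pose estimation with HRNet.
# The Consumer only ever calls the pipeline endpoint and treats this as a black box.
import requests

YOLO_URL = "https://example-cortex/yolo"    # assumed Realtime model API
HRNET_URL = "https://example-cortex/hrnet"  # assumed Realtime model API

def run_2d_pipeline(decoded_frames):
    # Object detection: bounding boxes around the athlete in each frame
    boxes = requests.post(YOLO_URL, json={"frames": decoded_frames}, timeout=60).json()
    # 2D pose estimation on the detected bounding boxes
    keypoints = requests.post(
        HRNET_URL, json={"frames": decoded_frames, "boxes": boxes}, timeout=60
    ).json()
    return keypoints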

Each pipeline also includes a separate postprocessing Cortex Realtime endpoint that runs after inferences have been generated on a batch of frames and produces three main outputs:

  • Body key points data aggregated into a single CSV file
  • BMA calculations (such as acceleration)
  • A visual overlay of key points added to each frame as an image file

The Collector Lambda function submits the appropriate metadata associated with a particular video, such as the frame IDs and S3 locations of the pose estimation inference outputs, to the endpoint to generate these postprocessing outputs.

Cortex is designed to be integrated with Amazon EKS, and only requires a cluster configuration file and a simple command to launch a Kubernetes cluster:

cortex cluster up --config cluster_config_file.yaml

Another lever for performance tuning was the instance configuration for the compute clusters. Three tiers were created with varying mixes of M5 and G4dn instances, codified as .yaml files with specifications such as cluster name, Region, and instance configuration and mix. M5 instances are lower-cost, CPU-based instances and G4dn instances are higher-cost, GPU-based instances, which provides a range of cost/performance tradeoffs.

Operational monitoring

To maintain operational logging standards, all Lambda functions include code to record and ingest logs via Amazon Kinesis Data Firehose. For example, every frame batch processed by the Submitter Lambda function is logged with the timestamp, action name, and Lambda function response JSON, and saved to Amazon S3. The following diagram illustrates this step in the architecture.


Deployment with AWS SAM

Deployment is handled using AWS SAM, an open-source framework for building serverless applications on AWS. AWS SAM enables the infrastructure design, including functions, APIs, databases, and event source mappings, to be codified and easily deployed in new AWS environments. During deployment, the AWS SAM syntax is translated into AWS CloudFormation to handle the infrastructure provisioning.

A template.yaml file contains the infrastructure specifications, along with tunable parameters such as the Kinesis Data Streams latency levers detailed in the preceding sections. A samconfig.toml file contains deployment parameters such as the stack name, the S3 bucket where application files like Lambda function code are stored, and resource tags for tracking cost. A shell script with two simple commands is all that is required to build and deploy the entire template:

sam build -t template.yaml
sam deploy --config-env "env_name" --profile "profile_name"

User workflow

To sum up, after the infrastructure has been deployed, you can follow this workflow to get started (a short code sketch of the final upload and job submission steps follows the list):

  1. Create an Intel 3DAT client using the client library.
  2. Use the API to create a new instance of a pipeline corresponding to the type of processing that is required, such as one for 3D pose estimation.
  3. Create a new instance of a project, passing in the cluster ARN and Kinesis queue ARN.
  4. Create a new instance of a pipeline parameter set.
  5. Create a new instance of pipeline parameters that map to the pipeline parameter set.
  6. Create a new user group that is associated with a project ID and a pipeline parameter set ID.
  7. Create a new user that is associated with the user group.
  8. Upload a .zip file of videos to Amazon S3 using a presigned S3 URL generated by the upload function in the job API.
  9. Submit a create_job API call, with job parameters that specify the location of the video files. This starts the processing job.
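
The following is a minimal sketch of steps 8 and 9, assuming the client object created in step 1 and the IDs created in steps 2–7; the method signatures and response fields are illustrative, based on the job API methods mentioned in this post:

# Hypothetical sketch: upload the .zip via a presigned URL, then submit the job.
import requests

# client, user_id, project_id, pipeline_id, and pipeline_parameter_set_id are
# assumed to come from steps 1-7 above.
presigned = client.upload_handler(file_name="videos.zip")   # job API: returns a presigned S3 URL
with open("videos.zip", "rb") as package:
    requests.put(presigned["url"], data=package)            # step 8: upload the .zip to Amazon S3

job = client.create_job(                                    # step 9: start the processing job
    user_id=user_id,
    project_id=project_id,
    pipeline_id=pipeline_id,
    pipeline_parameter_set_id=pipeline_parameter_set_id,
    job_results_path="results/",
    job_parameters={"input_file_path": presigned["key"]},   # location of the uploaded .zip
)
print(job["job_id"])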


The application is now live and ready to be tested with athletes and coaches alike. Intel OTG is excited to make innovative pose estimation technology using computer vision accessible for a variety of users, from developers to athletes to software vendor partners.

The AWS team is passionate about helping customers like Intel OTG accelerate their ML journey, from the ideation and discovery stage with the ML Solutions Lab to the hardening and deployment stage with AWS ML ProServe. We will all be watching the 2021 Tokyo Olympics closely this summer to envision all the progress that ML can unlock in sports.

Get started today! Explore your use case with the services mentioned in this post and many others on the AWS Management Console.

About the Authors

Han Man is a Senior Manager, Machine Learning & AI, at AWS based in San Diego, CA. He has a PhD in engineering from Northwestern University and has several years of experience as a management consultant advising clients in manufacturing, financial services, and energy. Today he is passionately working with customers from a variety of industries to develop and implement machine learning & AI solutions on AWS. He enjoys following the NBA and playing basketball in his spare time.

Iman Kamyabi is an ML Engineer with AWS Professional Services. He has worked with a wide range of AWS customers to champion best practices in setting up repeatable and reliable ML pipelines.

Jonathan Lee is the Director of Sports Performance Technology, Olympic Technology Group at Intel. He studied the application of machine learning to health as an undergrad at UCLA and during his graduate work at University of Oxford. His career has focused on algorithm and sensor development for health and human performance. He now leads the 3D Athlete Tracking project at Intel.

Nelson Leung is the Platform Architect in the Sports Performance CoE at Intel, where he defines end-to-end architecture for cutting-edge products that enhance athlete performance. He also leads the implementation, deployment and productization of these machine learning solutions at scale to different Intel partners.

Troy Squillaci is a DevSecOps engineer at Intel, where he delivers professional software solutions to customers through DevOps best practices. He enjoys integrating AI solutions into scalable platforms in a variety of domains.

Paul Min is an Associate Solutions Architect Intern at Amazon Web Services (AWS), where he helps customers across different industry verticals advance their mission and accelerate their cloud adoption. Previously at Intel, he worked as a Software Engineering Intern to help develop the 3D Athlete Tracking Cloud SDK. Outside of work, Paul enjoys playing golf and can be heard singing.


Intelligently search your Jira projects with Amazon Kendra Jira cloud connector

Organizations use agile project management platforms such as Atlassian Jira to enable teams to collaborate to plan, track, and ship deliverables. Jira captures organizational knowledge about the workings of the deliverables in the issues and comments logged during project implementation. However, making this knowledge easily and securely available to users is challenging due to it being fragmented across issues belonging to different projects and sprints. Additionally, because different stakeholders such as developers, test engineers, and project managers contribute to the same issue by logging it and then adding attachments and comments, traditional keyword-based search is rendered ineffective when searching for information in Jira projects.

You can now use the Amazon Kendra Jira cloud connector to index issues, comments, and attachments in your Jira projects, and search this content using Amazon Kendra intelligent search, powered by machine learning (ML).

This post shows how to use the Amazon Kendra Jira cloud connector to configure a Jira cloud instance as a data source for an Amazon Kendra index, and intelligently search the contents of the projects in it. We use an example of Jira projects where team members collaborate by creating issues and adding information to them in the form of descriptions, comments, and attachments throughout the issue lifecycle.

Solution overview

A Jira instance has one or more projects, and each project has team members working on issues in that project. Each team member has a set of permissions that determine the operations they can perform on issues in the project they belong to. Team members can create new issues or add more information to existing issues in the form of attachments and comments, as well as change the status of an issue, from opening to closure, throughout the issue lifecycle defined for that project. A project manager creates sprints, assigns issues to specific sprints, and assigns owners to issues. During the course of the project, the knowledge captured in these issues keeps evolving.

In our solution, we configure a Jira cloud instance as a data source to an Amazon Kendra search index using the Amazon Kendra Jira connector. Based on the configuration, when the data source is synchronized, the connector crawls and indexes the content from the projects in the Jira instance. Optionally, you can configure it to index the content based on the change log. The connector also collects and ingests access control list (ACL) information for each issue, comment, and attachment. The ACL information is used for user context filtering, where search results for a query are filtered by what a user has authorized access to.
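
The following is a minimal sketch of how user context filtering looks from the Query API side, assuming token-based access control (as configured later in this post); the index ID and token value are placeholders:

# Minimal sketch of an Amazon Kendra query with a user context token, so that
# results are filtered to the Jira projects this user is allowed to access.
import boto3

kendra = boto3.client("kendra")

response = kendra.query(
    IndexId="<index-id>",
    QueryText="where does boto3 store security tokens?",
    UserContext={"Token": "<jwt-with-user-email>"},  # ACL-based user context filtering
)
for item in response["ResultItems"]:
    print(item["Type"], item["DocumentTitle"]["Text"])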


To try out the Amazon Kendra connector for Jira using this post as a reference, you need the following:

Jira instance configuration

This section describes the Jira configuration used to demonstrate how to configure an Amazon Kendra data source using the Jira connector, ingest the data from the Jira projects into the Amazon Kendra index, and make search queries. You can use your own Jira instance for which you have admin access or create a new project and carry out the steps to try out the Amazon Kendra connector for Jira.

In our example Jira instance, we created two projects to demonstrate that the search queries made by users return results from only the projects to which they have access. We used data from the following public domain projects to simulate the use case of real-life software development projects:

The following is a screenshot of our Kanban-style board for project 1.

Create an API token for the Jira instance

To get the API token needed to configure the Amazon Kendra Jira connector, complete the following steps:

  1. Log in to
  2. Choose Create API token.
  3. In the dialog box that appears, enter a label for your token and choose Create.
  4. Choose Copy and paste the token into a temporary notepad.

You can’t copy this token again, and you need it to configure the Amazon Kendra Jira connector.

Configure the data source using the Amazon Kendra connector for Jira

To add a data source to your Amazon Kendra index using the Jira connector, you can use an existing index or create a new index. Then complete the following steps. For more information on this topic, refer to the Amazon Kendra Developer Guide.

  1. On the Amazon Kendra console, open your index and choose Data sources in the navigation pane.
  2. Choose Add data source.
  3. Under Jira, choose Add connector.
  4. In the Specify data source details section, enter the details of your data source and choose Next.
  5. In the Define access and security section, for Jira Account URL, enter the URL of your Jira cloud instance.
  6. Under Authentication, you have two options:
    1. Choose Create to add a new secret using the Jira API token you copied from your Jira instance, and use the email address you use to log in to Jira as the Jira ID. (This is the option we chose for this post.)
    2. Use an existing AWS Secrets Manager secret that has the API token for the Jira instance you want the connector to access.
  7. For IAM role, choose Create a new role or choose an existing IAM role configured with appropriate IAM policies to access the Secrets Manager secret, Amazon Kendra index, and data source.
  8. Choose Next.
  9. In the Configure sync settings section, provide information about your sync scope and run schedule.
  10. Choose Next.
  11. In the Set field mappings section, you can optionally configure the field mappings, or how the Jira field names are mapped to Amazon Kendra attributes or facets.
  12. Choose Next.
  13. Review your settings and confirm to add the data source.
  14. After the data source is added, choose Data sources in the navigation pane, select the newly added data source, and choose Sync now to start data source synchronization with the Amazon Kendra index.
    The sync process can take about 10–15 minutes. Let’s now enable access control for the Amazon Kendra index.
  15. In the navigation pane, choose your index.
  16. In the middle pane, choose the User access control tab.
  17. Choose Edit settings and change the settings to look like the following screenshot.
  18. Choose Next and then choose Update.

Perform intelligent search with Amazon Kendra

Before you try searching on the Amazon Kendra console or using the API, make sure that the data source sync is complete. To check, view the data sources and verify if the last sync was successful.

  1. To start your search, on the Amazon Kendra console, choose Search indexed content in the navigation pane.
    You’re redirected to the Amazon Kendra Search console.
  2. Expand Test query with an access token and choose Apply token.
  3. For Username, enter the email address associated with your Jira account.
  4. Choose Apply.

Now we’re ready to search our index. Let’s use the query “where does boto3 store security tokens?”

In this case, Amazon Kendra provides a suggested answer from one of the cards in our Kanban project on Jira.

Note that this is also a suggested answer pointing to an issue discussing AWS security tokens and Boto3. You can also build a search experience with multiple data sources, including SDK documentation and wikis, with Amazon Kendra, and present results and related links accordingly. The following screenshot shows another search query made against the same index.

Note that when we apply a different access token (associate the search with a different user), the search results are restricted to projects that this user has access to.

Lastly, we can also use filters relevant to Jira in our search. First, we navigate to our index’s Facet definition page and check Facetable for j_status, j_assignee, and j_project_name. For every search, we can then filter by these fields, as shown in the following screenshot.
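
The following is a minimal sketch of the same filtering from the Query API, assuming j_status and j_project_name were marked facetable as described above; the index ID and attribute values are placeholders:

# Minimal sketch of an Amazon Kendra query filtered by a Jira facet.
import boto3

kendra = boto3.client("kendra")

response = kendra.query(
    IndexId="<index-id>",
    QueryText="security token issues",
    AttributeFilter={
        "EqualsTo": {
            "Key": "j_status",
            "Value": {"StringValue": "Done"},       # only return issues with this status
        }
    },
    Facets=[{"DocumentAttributeKey": "j_project_name"}],  # return per-project facet counts
)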

Clean up

To avoid incurring future costs, clean up the resources you created as part of this solution. If you created a new Amazon Kendra index while testing this solution, delete it. If you only added a new data source using the Amazon Kendra connector for Jira, delete that data source.


With the Amazon Kendra Jira connector, your organization can make invaluable knowledge in your Jira projects available to your users securely using intelligent search powered by Amazon Kendra.

To learn more about the Amazon Kendra Jira connector, refer to the Amazon Kendra Jira connector section of the Amazon Kendra Developer Guide.

For more information on other Amazon Kendra built-in connectors to popular data sources, refer to Unravel the knowledge in Slack workspaces with intelligent search using the Amazon Kendra Slack connector and Search for knowledge in Quip documents with intelligent search using the Quip connector for Amazon Kendra.

About the Authors

Shreyas Subramanian is an AI/ML specialist Solutions Architect, and helps customers by using Machine Learning to solve their business challenges on the AWS Cloud.

Abhinav Jawadekar is a Principal Solutions Architect focused on Amazon Kendra in the AI/ML language services team at AWS. Abhinav works with AWS customers and partners to help them build intelligent search solutions on AWS.

