Define and run Machine Learning pipelines on Step Functions using Python, Workflow Studio, or States Language
You can use various tools to define and run machine learning (ML) pipelines or DAGs (Directed Acyclic Graphs). Some popular options include AWS Step Functions, Apache Airflow, KubeFlow Pipelines (KFP), TensorFlow Extended (TFX), Argo, Luigi, and Amazon SageMaker Pipelines. All these tools help you compose pipelines in various languages (JSON, YAML, Python, and more), followed…
You can use various tools to define and run machine learning (ML) pipelines or DAGs (Directed Acyclic Graphs). Some popular options include AWS Step Functions, Apache Airflow, KubeFlow Pipelines (KFP), TensorFlow Extended (TFX), Argo, Luigi, and Amazon SageMaker Pipelines. All these tools help you compose pipelines in various languages (JSON, YAML, Python, and more), followed by viewing and running them using a workflow orchestrator. A deep comparison of each of these options is out of scope for this post, and involves appropriately selecting and benchmarking tools for your specific use case.
In this post, we discuss how to author end-to-end ML pipelines in Step Functions using three different methods:
Solution overview
In this post, we create a simple workflow that involves a training step, creating a model, configuring an endpoint, and deploying the model
In this post, we use the MNIST dataset, which is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28×28 pixel grayscale images of hand-written digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). The code used here closely follows a similar use case where the task is to classify each input image as one of the 10 digits (0–9).
for epoch in range(1, args.epochs + 1): model.train()
Finally, we save the model:
def save_model(model, model_dir): logger.info(“Saving the model.”) path = os.path.join(model_dir, ‘model.pth’) # recommended way from http://pytorch.org/docs/master/notes/serialization.html torch.save(model.cpu().state_dict(), path)
This code is stored in a file called mnist.py and used in later steps (see the full code on GitHub). The following are two important takeaways in connection to the pipeline:
You can use the data input location on Amazon Simple Storage Service (Amazon S3) as a parameter for the training step in a pipeline. This data is delivered to the training container, the local path of which is stored in an environment variable (for example, SAGEMAKER_CHANNEL_TRAINING).
The model is shown here as being saved locally in model_dir; the local path of the model directory (/opt/ml/model) is stored in an environment variable (SM_MODEL_DIR). At the end of the SageMaker training job, the model is copied to an Amazon S3 location so that model and endpoint related pipeline steps can access the model.
Now let’s look at our three methods to author end-to-end ML pipelines in Step Functions.
Use the Step Functions Data Science SDK
The Step Functions Data Science SDK is an open-source library that lets you create workflows entirely in Python. Installing this SDK is as simple as entering the following code:
pip install stepfunctions
The SDK allows you to do the following:
Create steps that accomplish tasks
Chain those steps together into workflows
Branch out to run steps in parallel or based on conditions
Include retry, succeed, or fail steps
Review a graphical representation and definition for your workflow
Create a workflow in Step Functions
Start and review runs in Step Functions
Although we don’t use many of these functions, the Step Functions Data Science SDK can include the following:
To get started, we first create a PyTorch estimator with the mnist.py file. We configure the estimator with the training script, an AWS Identity and Access Management (IAM) role, the number of training instances, the training instance type, and hyperparameters:
The Data Science SDK provides two ways of defining a pipeline. Firstly, you can use individual steps. For example, you can define a training step with the following code:
The workflow execution role allows you to create and run workflows in Step Functions. The following code creates the desired workflow and lets you render the same:
Finally, you can create and run the workflow using pipeline.create() and pipeline.execute().
An example output from the execute() statement looks as follows, and provides you with a link to Step Functions where you can view and monitor your execution:
Choose Next, review the generated code, and choose Next.
Step Functions can look at the resources you use and create a role. However, you may see the following message:
“Step Functions cannot generate an IAM policy if the RoleArn for SageMaker is from a Path. Hardcode the SageMaker RoleArn in your state machine definition, or choose an existing role with the proper permissions for Step Functions to call SageMaker.”
We use a role that we created in the Data Science SDK section instead.
Select Use existing role and use the role StepFunctionsWorkflowExecutionRole.
Choose Create state machine.
When you receive the message that the machine was successfully created, run it with the following input:
Both the methods we just discussed are great ways to quickly prototype a state machine on Step Functions. When you need to edit the Step Functions definition directly, you can use the States language. See the following code:
You can create a new Step Functions state machine on the Step Functions console by selecting Write your workflow in code.
A successful run shows each state in green.
Each state points to resources in SageMaker. In the following screenshot, the link under Resource points to the model created as a result of the CreateModel step.
When to use what?
The following table summarizes the supported service integrations.
Although most of what you typically need for your pipelines is included in the Step Functions Data Science SDK, you may need to integrate with other supported services that are supported by other choices shown in the preceding table.
In addition, consider the skillsets in your existing team—teams that are used to working with a particular tool may prefer sticking to the same for maximizing productivity. This is true when considering the options within Step Functions explored here, but also others such as the AWS Cloud Development Kit (AWS CDK), AWS Serverless Application Model (AWS SAM), Airflow, KubeFlow, and SageMaker Pipelines. Specifically around Pipelines, consider that data scientists and ML engineers may benefit from working on a single platform that includes the ability to not only maintain and run pipelines, but also manage models, endpoints, notebooks and other features.
Lastly, consider a hybrid set of services for using the right tool for the right job. For example:
You can use Pipelines for automating feature engineering pipelines using SageMaker Data Wrangler and SageMaker Feature Store. Pipelines is a purpose-built CI/CD tool for ML model building and deployment that not only includes workflow orchestration, but is also related to concepts such as model registry, lineage tracking, and projects. When upstream processes are already using Step Functions, for example, to prepare data using AWS services, consider using a hybrid architecture where both Step Functions and Pipelines are used. For more information, see Automate feature engineering pipelines with Amazon SageMaker.
When developers and data scientists need to write infrastructure as code with unit testing, consider using tools like the AWS Data Science SDK and Apache Airflow, because you can use Python to define end-to-end architectures and pipelines.
Summary
In this post, we looked at three different ways of authoring and running Step Functions pipelines, specifically for end-to-end ML workflows. Choosing a pipelining tool for ML is an important step for a team, and is a decision that needs to consider the context, existing skillsets, connections to various other teams with these skillsets in an organization, available service integration, service limits, and applicable quotas. Contact your AWS team to help guide you through these decisions; we are eager to help!
For further reading, check out the following:
About the Author
Shreyas Subramanian is a AI/ML specialist Solutions Architect, and helps customers by using Machine Learning to solve their business challenges on the AWS Cloud.
Reducing hallucinations in LLM agents with a verified semantic cache using Amazon Bedrock Knowledge Bases
This post introduces a solution to reduce hallucinations in Large Language Models (LLMs) by implementing a verified semantic cache using Amazon Bedrock Knowledge Bases, which checks if user questions match curated and verified responses before generating new answers. The solution combines the flexibility of LLMs with reliable, verified answers to improve response accuracy, reduce latency,…
This post introduces a solution to reduce hallucinations in Large Language Models (LLMs) by implementing a verified semantic cache using Amazon Bedrock Knowledge Bases, which checks if user questions match curated and verified responses before generating new answers. The solution combines the flexibility of LLMs with reliable, verified answers to improve response accuracy, reduce latency, and lower costs while preventing potential misinformation in critical domains such as healthcare, finance, and legal services.
Orchestrate an intelligent document processing workflow using tools in Amazon Bedrock
This intelligent document processing solution uses Amazon Bedrock FMs to orchestrate a sophisticated workflow for handling multi-page healthcare documents with mixed content types. The solution uses the FM’s tool use capabilities, accessed through the Amazon Bedrock Converse API. This enables the FMs to not just process text, but to actively engage with various external tools…
This intelligent document processing solution uses Amazon Bedrock FMs to orchestrate a sophisticated workflow for handling multi-page healthcare documents with mixed content types. The solution uses the FM’s tool use capabilities, accessed through the Amazon Bedrock Converse API. This enables the FMs to not just process text, but to actively engage with various external tools and APIs to perform complex document analysis tasks.