Bring Your Amazon SageMaker model into Amazon Redshift for remote inference
Amazon Redshift, a fast, fully managed, widely used cloud data warehouse, natively integrates with Amazon SageMaker for machine learning (ML). Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads. Data analysts and database developers want to use this data to train ML models, which can then be used to generate insights for use cases such as forecasting revenue, predicting customer churn, and detecting anomalies.
Amazon Redshift ML makes it easy for SQL users to create, train, and deploy ML models using familiar SQL commands. In a previous post, we covered how Amazon Redshift ML allows you to use your data in Amazon Redshift with SageMaker, a fully managed ML service, without requiring you to become an expert in ML. In an earlier post, we also discussed how Amazon Redshift ML enables ML experts to create XGBoost or MLP models. Additionally, Amazon Redshift ML allows data scientists to either import existing SageMaker models into Amazon Redshift for in-database inference or remotely invoke a SageMaker endpoint.
This post shows how you can enable your data warehouse users to use SQL to invoke a remote SageMaker endpoint for prediction. We first train and deploy a Random Cut Forest model in SageMaker, and demonstrate how you can use SQL to create a model that invokes the SageMaker endpoint remotely for predictions. Then, we show how end users can invoke the model.
Before you begin, make sure that the SageMaker model is deployed and you have the endpoint name. You can use the following AWS CloudFormation template to automatically provision all the required resources in your AWS account.
Solution overview
Amazon Redshift ML supports text and CSV inference formats. For more information about SageMaker algorithms and their inference formats, see Random Cut Forest (RCF) Algorithm.
Amazon SageMaker Random Cut Forest (RCF) is an algorithm designed to detect anomalous data points within a dataset. Examples of anomalies that are important to detect include when website activity uncharacteristically spikes, when temperature data diverges from a periodic behavior, or when changes to public transit ridership reflect the occurrence of a special event.
In this post, we use the SageMaker RCF algorithm to train an RCF model on the Numenta Anomaly Benchmark (NAB) NYC Taxi dataset, using the notebook created by the CloudFormation template.
We downloaded the data and stored it in an Amazon Simple Storage Service (Amazon S3) bucket. The data consists of the number of New York City taxi passengers over the course of 6 months aggregated into 30-minute buckets. We naturally expect to find anomalous events occurring during the NYC marathon, Thanksgiving, Christmas, New Year’s Day, and on the day of a snowstorm.
We then use this model to predict anomalous events by generating an anomaly score for each data point.
The following figure illustrates how we use Amazon Redshift ML to create a model using the SageMaker endpoint.
Deploy the model
To deploy the model, go to the SageMaker console and open the notebook that was created by the CloudFormation template.
Then choose bring-your-own-model-remote-inference.ipynb.
Set up parameters as shown in the following screenshot and then run all cells.
Get the SageMaker model endpoint
On the Amazon SageMaker console, under Inference in the navigation pane, choose Endpoints to find your model name. You use this when you create the remote inference model in Amazon Redshift.
Prepare data to create a remote inference model using Amazon Redshift ML
Create the schema and load the data in Amazon Redshift using the following SQL:
DROP TABLE IF EXISTS public.rcf_taxi_data CASCADE;

CREATE TABLE public.rcf_taxi_data (
    ride_timestamp timestamp,
    nbr_passengers int
);

COPY public.rcf_taxi_data
FROM 's3://sagemaker-sample-files/datasets/tabular/anomaly_benchmark_taxi/NAB_nyc_taxi.csv'
IAM_ROLE 'arn:aws:iam::<account-id>:role/RedshiftML'
IGNOREHEADER 1 CSV DELIMITER ',';
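As a quick sanity check (optional, and not part of the original walkthrough), you can confirm that the rows loaded:

-- Verify the taxi data loaded into the table
select count(*) from public.rcf_taxi_data;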
Amazon Redshift now supports attaching the default IAM role. If you have enabled the default IAM role in your cluster, you can use the default IAM role as follows.
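For example, the same COPY command can run without an explicit role ARN (a minimal sketch, assuming the default IAM role is attached to your cluster):

COPY public.rcf_taxi_data
FROM 's3://sagemaker-sample-files/datasets/tabular/anomaly_benchmark_taxi/NAB_nyc_taxi.csv'
IAM_ROLE default  -- resolves to the cluster's default IAM role
IGNOREHEADER 1 CSV DELIMITER ',';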
Create a model in Amazon Redshift ML using the SageMaker endpoint you previously captured:
CREATE MODEL public.remote_random_cut_forest
FUNCTION remote_fn_rcf(int)
RETURNS decimal(10,6)
SAGEMAKER 'randomcutforest-xxxxxxxxx'
IAM_ROLE 'arn:aws:iam::<account-id>:role/RedshiftML';
You can also use the default IAM role with your CREATE MODEL command as follows:
CREATE MODEL public.remote_random_cut_forest
FUNCTION remote_fn_rcf(int)
RETURNS decimal(10,6)
SAGEMAKER 'randomcutforest-xxxxxxxxx'
IAM_ROLE default;
Check model status
You can use the show model command to view the status of the model:
show model public.remote_random_cut_forest;
You get output like the following screenshot, which shows the endpoint and function name.
Compute anomaly scores across the entire taxi dataset
Now, run the inference query using the function name from the create model statement:
select ride_timestamp, nbr_passengers,
       public.remote_fn_rcf(nbr_passengers) as score
from public.rcf_taxi_data;
The following screenshot shows our results.
Now that we have our anomaly scores, we need to identify the data points with unusually high scores.
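Before settling on a cutoff, it can help to inspect the overall score distribution. The following illustrative query (not from the original post) computes summary statistics over the scores using the same remote_fn_rcf function:

-- Summary statistics of anomaly scores across the dataset
select avg(public.remote_fn_rcf(nbr_passengers))    as mean_score,
       stddev(public.remote_fn_rcf(nbr_passengers)) as std_score,
       min(public.remote_fn_rcf(nbr_passengers))    as min_score,
       max(public.remote_fn_rcf(nbr_passengers))    as max_score
from public.rcf_taxi_data;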
Amazon Redshift ML has batching optimizations to minimize the communication cost with SageMaker and offers high-performance remote inference.
Check for high anomalies
The following code runs a query for any data points with scores greater than three standard deviations (approximately 99.9th percentile) from the mean score:
with score_cutoff as (
    select stddev(public.remote_fn_rcf(nbr_passengers)) as std,
           avg(public.remote_fn_rcf(nbr_passengers)) as mean,
           (mean + 3 * std) as score_cutoff_value
    from public.rcf_taxi_data
)
select ride_timestamp, nbr_passengers,
       public.remote_fn_rcf(nbr_passengers) as score
from public.rcf_taxi_data
where score > (select score_cutoff_value from score_cutoff)
order by 2 desc;
The data in the following screenshot shows that the biggest spike in ridership occurs on November 2, 2014, the day of the annual NYC Marathon. We also see spikes on Labor Day weekend, New Year’s Day, and the July 4th holiday weekend.
Conclusion
In this post, we used SageMaker Random Cut Forest to detect anomalous data points in a taxi ridership dataset. In this data, the anomalies occurred when ridership was uncharacteristically high or low. However, the RCF algorithm is also capable of detecting when, for example, data breaks periodicity or uncharacteristically changes global behavior.
We then used Amazon Redshift ML to demonstrate how you can make inferences on unsupervised algorithms (such as Random Cut Forest). This allows you to democratize ML by making predictions with Amazon Redshift SQL commands.
For more information about building different models with Amazon Redshift ML, see the Amazon Redshift ML documentation.
About the Authors
Phil Bates is a Senior Analytics Specialist Solutions Architect at AWS with over 25 years of data warehouse experience.
Debu Panda, a principal product manager at AWS, is an industry leader in analytics, application platform, and database technologies and has more than 25 years of experience in the IT world.
Nikos Koulouris is a Software Development Engineer at AWS. He received his PhD from University of California, San Diego and he has been working in the areas of databases and analytics.
Murali Narayanaswamy is a principal machine learning scientist in AWS. He received his PhD from Carnegie Mellon University and works at the intersection of ML, AI, optimization, learning and inference to combat uncertainty in real-world applications including personalization, forecasting, supply chains and large scale systems.
Control data access to Amazon S3 from Amazon SageMaker Studio with Amazon S3 Access Grants
In this post, we demonstrate how to simplify data access to Amazon S3 from SageMaker Studio using S3 Access Grants, specifically for different user personas using IAM principals.
Scalable training platform with Amazon SageMaker HyperPod for innovation: a video generation case study
In this post, we share an ML infrastructure architecture that uses SageMaker HyperPod to support research team innovation in video generation. We discuss the advantages and pain points addressed by SageMaker HyperPod, provide a step-by-step setup guide, and demonstrate how to run a video generation algorithm on the cluster.
GenAI for Aerospace: Empowering the workforce with expert knowledge on Amazon Q and Amazon Bedrock
In this post, we show how you can quickly launch generative AI-enabled expert chatbots, trained on your proprietary document sets, to empower your workforce across specific aerospace roles with Amazon Q and Amazon Bedrock.