With the rapid growth of data, many organizations are finding it difficult to analyze their large datasets to gain insights. As businesses rely more and more on automation algorithms, machine learning (ML) has become a necessity to stay ahead of the competition. Amazon Redshift, a fast, fully managed, widely used cloud data warehouse, natively integrates…
With the rapid growth of data, many organizations are finding it difficult to analyze their large datasets to gain insights. As businesses rely more and more on automation algorithms, machine learning (ML) has become a necessity to stay ahead of the competition.
Amazon Redshift, a fast, fully managed, widely used cloud data warehouse, natively integrates with Amazon SageMaker for ML. With Amazon Redshift ML, you can use simple SQL statements to create and train ML models from your data in Amazon Redshift and then use these models for a variety of use cases, such as classification of a binary or multiclass outcome or predicting a numeric value through regression. Amazon SageMaker Autopilot provides all the benefits of automatic model creation, but as an advanced user, you can also influence the model training by providing different parameters such as model type, objective, and so on.
Amazon Redshift ML allows you to address several ML challenges, such as the following:
Multiclass classification – Identify the class of an input value within a discrete number of classes. For example, you can identify which will be the best-selling product.
Regression – Predict a numerical outcome, like the price of a house or how many people will use a city’s bike rental service.
You can use Amazon Redshift ML to automate data preparation, pre-processing, and selection of problem type as depicted in this blog post. In this post, we assume that you have a good understanding of your data and what problem type you want to use for your use case. We demonstrate how to use Amazon Redshift ML to solve a regression problem predicting bike rental counts. We also provide some best practices for creating test data, validating your model, and using it for inference. We also show you how you can use the SageMaker console to troubleshoot the training process as an advanced user.
In this post, we use Amazon Redshift ML to build a regression model that predicts the number of people that may use the city of Toronto’s bike sharing service at any given hour of a day. The model accounts for various aspects, including holidays and weather conditions. Because we need to predict a numerical outcome, we create a regression model.
We walk you through the following high-level steps:
Input the raw data.
Prepare the input data.
Create the model.
Validate the predictions.
Input the raw data
To prepare the raw data for this model, we populated three tables in Amazon Redshift using different public datasets.
You can use the following script to create and load the data in these tables in Amazon Redshift, and use the Amazon Redshift Query Editor to run these SQL scripts:
To load the data, use the following COPY commands. Replace the AWS Identity and Access Management (IAM) role with the IAM role that you created as part of the prerequisite steps earlier.
COPY ridership FROM ‘s3://redshift-ml-bikesharing-data/bike-sharing-data/ridership/’ IAM_ROLE ‘arn:aws:iam:::role/RedshiftML’ FORMAT csv IGNOREHEADER 1 DATEFORMAT ‘auto’ TIMEFORMAT ‘auto’ REGION ‘us-west-2’ gzip; COPY weather FROM ‘s3://redshift-ml-bikesharing-data/bike-sharing-data/weather/’ IAM_ROLE ‘arn:aws:iam:::role/RedshiftML’ FORMAT csv IGNOREHEADER 1 DATEFORMAT ‘auto’ TIMEFORMAT ‘auto’ REGION ‘us-west-2’ gzip; COPY holiday FROM ‘s3://redshift-ml-bikesharing-data/bike-sharing-data/holiday/’ IAM_ROLE ‘arn:aws:iam:::role/RedshiftML’ FORMAT csv IGNOREHEADER 1 DATEFORMAT ‘auto’ TIMEFORMAT ‘auto’ REGION ‘us-west-2’ gzip;
Data preparation
Let’s discuss about how the data can be biased and how selecting the right distribution of data impacts accuracy. For most ML problems, data preparation is the most time-consuming process; it involves preparing the data, finding relevant attributes, and cleaning and curating it to be used as input to the ML model. Bias or anomalies in the input data distribution also play a key role in model accuracy, therefore it’s very important to curate that as much as possible. Let’s explore and prepare our input dataset.
Ridership
The ridership table contains Bike Share Toronto’s ridership information for 2017 and 2018. We performed the following data preparation steps to make it more meaningful for our ML model:
Transform the records to all be in the local time zone. The data before July 2017 was provided in UTC, whereas all data after that was in Toronto’s local time zone, EST.
Discard trips with a duration under 60 seconds.
Discard trips with a duration longer than a day.
Aggregate data to an hourly basis (rather than minute-level granularity) for better generalization.
Break down the trip date column into day, month, year, quarter, month_week, and week_day for better correlation with the number of rides.
We carried out all these transformations using the following simple view:
CREATE OR REPLACE VIEW ridership_view AS SELECT trip_time , trip_count , TO_CHAR(trip_time,’hh24′) ::INT trip_hour , TO_CHAR(trip_time, ‘dd’) :: INT trip_day , TO_CHAR(trip_time, ‘mm’) :: INT trip_month , TO_CHAR(trip_time, ‘yy’) :: INT trip_year , TO_CHAR(trip_time, ‘q’) :: INT trip_quarter , TO_CHAR(trip_time, ‘w’) :: INT trip_month_week , TO_CHAR(trip_time, ‘d’) :: INT trip_week_day FROM (SELECT CASE WHEN TRUNC(r.trip_start_time) < '2017-07-01'::DATE THEN CONVERT_TIMEZONE('US/Eastern', DATE_TRUNC('hour',r.trip_start_time)) ELSE DATE_TRUNC('hour',r.trip_start_time) END trip_time , COUNT(1) trip_count FROM ridership r WHERE r.trip_duration_seconds BETWEEN 60 AND 60 * 60 * 24 GROUP BY 1);
Weather
The weather table contains Government of Canada historical weather data for 2017 and 2018 in UTC format. We performed the following data preparation activities:
Transform the records from UTC to US/Eastern time zone.
Aggregate the data to an hourly basis for better generalization.
Because the majority of the attributes like visibility_km and wind_chill are empty in this dataset, we only use temp_c and precip_amount_mm columns for our model.
We use the following view in Amazon Redshift to apply these transformations:
CREATE OR REPLACE VIEW weather_view AS SELECT CONVERT_TIMEZONE(‘US/Eastern’, DATE_TRUNC(‘hour’,datetime_utc)) daytime , ROUND(AVG(temp_c)) temp_c , ROUND(AVG(precip_amount_mm)) precip_amount_mm FROM weather GROUP BY 1;
Trip data
In this step, we create a new table, trip_data, in Amazon Redshift to combine all relevant attributes from the input tables. We also create a random ordered attribute named serial_number to denote the serial number of each record:
CREATE TABLE trip_data AS SELECT r.trip_time ,r.trip_count ,r.trip_hour ,r.trip_day ,r.trip_month ,r.trip_year ,r.trip_quarter ,r.trip_month_week ,r.trip_week_day ,w.temp_c ,w.precip_amount_mm ,CASE WHEN h.holiday_date IS NOT NULL THEN 1 WHEN TO_CHAR(r.trip_time,’D’)::INT IN (1,7) THEN 1 ELSE 0 END is_holiday , ROW_NUMBER() OVER (ORDER BY RANDOM()) serial_number FROM ridership_view r JOIN weather_view w ON ( r.trip_time = w.daytime ) LEFT OUTER JOIN holiday h ON ( TRUNC(r.trip_time) = h.holiday_date );
The following is an example SELECT query:
select * from trip_data limit 5;
The following table summarizes our output.
Trip time
trip count
trip hour
trip day
trip month
trip year
trip quarter
trip month week
trip week day
temp c
precip amount mm
is holiday
serial number
2017-07-04 23:00
105
23
4
7
17
3
1
3
20
0
0
1
2018-09-03 9:00
233
9
3
9
18
3
1
2
25
0
1
2
2017-09-29 20:00
185
20
29
9
17
3
5
6
12
0
0
3
2017-04-09 23:00
28
23
9
4
17
2
2
1
12
0
1
4
2017-02-01 16:00
175
16
1
2
17
1
1
4
2
0
0
5
We can create an ML model in Amazon Redshift to predict the trip_count column in the preceding table for any given trip time. But before that, we may want to view the relationship of different attributes of this table with our target attribute, trip_count. The Pearson correlation coefficient is a popular method to find how strong a relationship is between two attributes. It returns a value between -1 and 1, in which 1 indicates a strong positive relationship and -1 indicates a strong negative relationship. A result of zero indicates no relationship at all.
We use the stored procedure sp_correlation, available in the Amazon Redshift Utilities GitHub repo, to view the relationship between our target column trip_count and the other numeric attributes in our dataset. This stored procedure allows us to get correlation of an attribute with other attributes in a table in Amazon Redshift.
call sp_correlation(‘public’,’trip_data’,’trip_count’,’tmp_corr_table’); select * from tmp_corr_table;
The following table summarizes our output.
trip count
trip hour
trip day
trip month
trip year
trip quarter
trip month week
trip week day
temp c
precip amount mm
is holiday
serial number
1
0.32
0.02
0.19
0.12
0.20
0.02
0.02
0.52
-0.07
-0.12
0.00
The output shows that temperature and hour of the day have a strong positive correlation with the number of bike rentals. Similarly, precipitation and holidays have a negative correlation with the number of bike rentals. But serial number data has zero correlation because it’s a random number and we can safely exclude such columns in our ML model.
Create a model in Amazon Redshift
Now that our data preparation steps are complete, we can use the create model statement in Amazon Redshift to create our ML regression model. But before that, we want to divide our data into two sets by splitting 80% of the dataset for training and 20% for validation, which a common practice in ML. The training data is input to the ML model to identify the best possible algorithm for the model. After the model is created, we use the validation data to validate the model accuracy. We create the tables training_data and validation_data using the 20% threshold for our random ordered column serial_number with the following code:
CREATE TABLE training_data AS SELECT trip_count,trip_hour,trip_day,trip_month,trip_year,trip_quarter, trip_month_week, trip_week_day, temp_c, precip_amount_mm, is_holiday FROM trip_data WHERE serial_number > (SELECT COUNT(1) * 0.2 FROM trip_data); CREATE TABLE validation_data AS SELECT trip_count,trip_hour,trip_day,trip_month,trip_year,trip_quarter, trip_month_week, trip_week_day, temp_c, precip_amount_mm, is_holiday, trip_time FROM trip_data WHERE serial_number <= (SELECT COUNT(1) * 0.2 FROM trip_data);
After that, we run the create model command in Amazon Redshift, which creates a regression model to predict the trip_count value for any input date and time:
CREATE MODEL predict_rental_count FROM training_data TARGET trip_count FUNCTION predict_rental_count IAM_ROLE ‘arn:aws:iam:::role/RedshiftML’ PROBLEM_TYPE regression OBJECTIVE ‘mse’ SETTINGS (s3_bucket ‘redshiftml-‘, s3_garbage_collect off, max_runtime 5000);
We define the following parameters in our create model statement:
Problem type – Because we’re predicting a numeric outcome, we provide the problem type, but it’s an optional parameter.
Objective – We specified MSE (mean square error) as our objective, which is a common metric for evaluation of regression problems.
S3 garbage collect – Setting this parameter to OFF retains the intermediate dataset used to train the models by SageMaker, which may help us troubleshoot any issues.
Max runtime – This parameter denotes how long the model training can run. Specifying a larger value may help create a better tuned model. The default value for this parameter is 5,400 (90 minutes).
The preceding statement takes a few seconds to complete. It initiates an Autopilot process in the background to automatically build, train, and tune the best ML model for the input data. It then uses Amazon SageMaker Neo to deploy that model locally in the Amazon Redshift cluster as a user-defined function (UDF). You can use the SHOW MODEL command in Amazon Redshift to track the progress of your model creation, which should be in the READY state within the max_runtime parameter you defined while creating the model.
Validate predictions with mean square error
In this step, we evaluate the accuracy of our ML model against our validation data.
For regression problems, typical accuracy metrics are mean square error (MSE) and root mean square error (RMSE), which measure the distance between the predicted numeric target and the actual numeric answer, also known as ground truth. We square the differences between the actual and predicted values and then find the average to calculate the MSE. The square root of MSE computes the RMSE. A model with a low MSE and RMSE score indicates a good model.
While creating the model, Autopilot automatically splits the input data into train and validation sets and selects the model with the best objective metric, which is deployed in the Amazon Redshift cluster. We use the show model statement to view various metrics, including the calculated MSE by SageMaker while creating the model:
The RMSE value indicates that our model is accurate enough to the actual values from our validation dataset.
Troubleshooting
Although the Create Model statement in Amazon Redshift automatically takes care of initiating Autopilot process to build, train, and tune the best ML model and deploy that model in Amazon Redshift, you can also view the intermediate steps performed in this process, which may help you with troubleshooting if something goes wrong.
Amazon S3 bucket
When you create the model, you need to enter an Amazon Simple Storage Service (Amazon S3) bucket name as the value for the parameter s3_bucket. This bucket is used to share training data and artifacts between Amazon Redshift and SageMaker. Amazon Redshift creates a subfolder in this bucket prior to unloading the training data. When training is complete, it deletes this subfolder and its contents unless you set the parameter s3_garbage_collect to off, which may be used for troubleshooting purposes. For more information, see CREATE MODEL.
SageMaker console
If you navigate to the SageMaker console, you can view the training jobs and hyperparameter tuning jobs initiated by the Create Model process. You need appropriate access permissions, like the AmazonSageMakerReadOnly policy, to view these details. For more information, see AWS Managed (Predefined) Policies for Amazon SageMaker.
In the hyperparameter tuning jobs section on the SageMaker console, you can view the best training job selected by SageMaker based on the MSE metrics, along with different hyperparameter values. The following screenshot shows that our best training job had an MSE value of 1960.272217, similar to what we saw in the Amazon Redshift show model statement.
Amazon SageMaker Studio
Amazon SageMaker Studio is a web-based integrated development environment (IDE) for ML. You can view your model training activities on the Experiments and trials page. You can also view job metrics by choosing the option Describe AutoML Job, as shown in the following screenshot.
Studio also allows you to view two useful notebooks:
Data exploration notebook – Describes what Autopilot learned about the input data from Amazon Redshift
Candidate generation notebook – Shows how Autopilot used this information about the data to generate candidates
Amazon Redshift ML makes it easy for users of all skill levels to use ML technology. With no prior ML knowledge, you can use Amazon Redshift ML to gain business insights for your data. You provide the data that you want to train a model. Then Amazon Redshift ML creates a model that capture patterns in the input data. You can then use that model to make predictions using familiar SQL commands. To learn more about Amazon RedShift ML, visit the webpage.
About the Authors
Manash Deb is a Senior Analytics Specialist Solutions Architect at AWS. He has worked on building end-to-end data driven solutions in different database and data warehousing technologies for over fifteen years. He loves to learn new technologies and solving, automating and simplifying customer problems with easy-to-use cloud data solutions on AWS.
Debu Panda, a principal product manager at AWS, is an industry leader in analytics, application platform, and database technologies and has more than 25 years of experience in the IT world.
Gokul Soundararajan is a principal engineer at AWS and received a PhD from University of Toronto and has been working in the areas of storage, databases, and analytics.
Piali Das is a Senior Software Engineer in the Amazon SageMaker Autopilot team. She previously contributed to building SageMaker Algorithms. She enjoys scientific programming in general and has developed an interest in machine learning and distributed systems.
How Vidmob is using generative AI to transform its creative data landscape
In this post, we illustrate how Vidmob, a creative data company, worked with the AWS Generative AI Innovation Center (GenAIIC) team to uncover meaningful insights at scale within creative data using Amazon Bedrock. Source
In this post, we illustrate how Vidmob, a creative data company, worked with the AWS Generative AI Innovation Center (GenAIIC) team to uncover meaningful insights at scale within creative data using Amazon Bedrock.
Evaluating prompts at scale with Prompt Management and Prompt Flows for Amazon Bedrock
In this post, we demonstrate how to implement an automated prompt evaluation system using Amazon Bedrock so you can streamline your prompt development process and improve the overall quality of your AI-generated content. Source
In this post, we demonstrate how to implement an automated prompt evaluation system using Amazon Bedrock so you can streamline your prompt development process and improve the overall quality of your AI-generated content.
Build powerful RAG pipelines with LlamaIndex and Amazon Bedrock
In this post, we show you how to use LlamaIndex with Amazon Bedrock to build robust and sophisticated RAG pipelines that unlock the full potential of LLMs for knowledge-intensive tasks. Source
In this post, we show you how to use LlamaIndex with Amazon Bedrock to build robust and sophisticated RAG pipelines that unlock the full potential of LLMs for knowledge-intensive tasks.