Reduce computer vision inference latency using gRPC with TensorFlow serving on Amazon SageMaker
AWS customers are increasingly using computer vision (CV) models for improved efficiency and an enhanced user experience. For example, a live sports broadcast can be processed in real time to detect specific events automatically and provide additional insights to viewers at low latency. Inventory inspection systems at large warehouses capture and process millions of images across their network to identify misplaced inventory.
CV models can be built with multiple deep learning frameworks like TensorFlow, PyTorch, and Apache MXNet. These models typically have a large input payload of images or videos of varying size. Advanced deep learning models for use cases like object detection return large response payloads ranging from tens of MBs to hundreds of MBs in size. Large request and response payloads can increase model serving latency and subsequently negatively impact application performance. You can further optimize model serving stacks for each of these frameworks for low latency and high throughput.
Amazon SageMaker helps data scientists and developers prepare, build, train, and deploy high-quality machine learning (ML) models quickly by bringing together a broad set of capabilities purpose-built for ML. SageMaker provides state-of-the-art open-source serving containers for XGBoost (container, SDK), Scikit-Learn (container, SDK), PyTorch (container, SDK), TensorFlow (container, SDK) and Apache MXNet (container, SDK).
In this post, we show you how to serve TensorFlow CV models with SageMaker’s pre-built container to easily deliver high-performance endpoints using TensorFlow Serving (TFS). As with all SageMaker endpoints, requests arrive using REST, as shown in the following diagram. Inside of the endpoint, you can add preprocessing and postprocessing steps and dispatch the prediction to TFS using either RESTful APIs or gRPC APIs. For small payloads, either API yields similar performance. We demonstrate that for CV tasks like image classification and object detection, using gRPC inside of a SageMaker endpoint reduces overall latency by 75% or more. The code for these use cases is available in the following GitHub repo.
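To make the payload effect concrete, the following minimal sketch compares the size of an image tensor encoded as JSON text, as a REST call to TFS would carry it, with the same data as raw 8-bit values, which roughly approximates the binary tensor content of a gRPC request. The function name payload_sizes and the synthetic pixel values are illustrative assumptions and are not part of the post's code. Real sizes depend on the image content, the tensor data type, and protobuf overhead.

```python
import json


def payload_sizes(height, width, channels=3):
    """Return (json_bytes, raw_bytes) for a synthetic uint8 image tensor.

    Pixel values are synthetic (they cycle through 0..255); this is an
    assumption made only so that the example is deterministic.
    """
    n = height * width * channels
    flat = [i % 256 for i in range(n)]

    # Rebuild the nested [height][width][channels] layout; wrapping it in
    # one more list below gives the batch dimension of a predict request.
    row_len = width * channels
    rows = [
        [flat[r * row_len + c * channels: r * row_len + (c + 1) * channels]
         for c in range(width)]
        for r in range(height)
    ]
    json_bytes = len(json.dumps({"instances": [rows]}).encode("utf-8"))

    # One byte per uint8 value, ignoring the small protobuf framing overhead.
    raw_bytes = n
    return json_bytes, raw_bytes


if __name__ == "__main__":
    json_size, raw_size = payload_sizes(224, 224)
    print(f"JSON: {json_size} bytes, raw: {raw_size} bytes, "
          f"ratio: {json_size / raw_size:.1f}x")
```

The JSON text is several times larger than the raw values because every pixel becomes one to three digits plus separators, and it must also be parsed back into numbers on the serving side.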
Models
For image classification, we use the Keras model MobileNetV2, pre-trained with 1,000 classes from the ImageNet dataset. The default input image resolution is 224×224×3, and the output is a dense vector of probabilities for each of the 1,000 classes. For object detection, we use the TensorFlow 2 model EfficientDet D1 [alternative URL: https://tfhub.dev/tensorflow/efficientdet/d2/1], pre-trained with 91 classes from the COCO 2017 dataset. The default input image resolution is 640×640×3, and the output is a dictionary containing the number of detections, bounding box coordinates, detection classes, detection scores, raw detection boxes, raw detection scores, detection anchor indexes, and detection multiclass scores. You can fine-tune both models through transfer learning on a custom dataset with SageMaker, and use SageMaker to deploy and serve the models.
The following is an example of image classification.
Class ID: 281, probability = 0.76, class label = Tabby cat
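A result line like the one above can be derived from the model's probability vector by picking the highest-scoring entries. The following is a minimal sketch of that step. The helper name top_k and the toy five-class vector are assumptions made for illustration, and mapping a class ID to a label such as "Tabby cat" requires the ImageNet label list, which is not shown here.

```python
def top_k(probabilities, k=1):
    """Return the k most likely (class_id, probability) pairs, best first."""
    ranked = sorted(enumerate(probabilities),
                    key=lambda pair: pair[1],
                    reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Toy 5-class vector; a real MobileNetV2 output has 1,000 entries.
    probs = [0.05, 0.10, 0.70, 0.10, 0.05]
    for class_id, prob in top_k(probs, k=2):
        print(f"Class ID: {class_id}, probability = {prob:.2f}")
```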
The following is an example of object detection.
The code to deploy the preceding pre-trained models is in the following GitHub repo. SageMaker provides a managed TensorFlow Serving environment that makes it easy to deploy TensorFlow models. The SageMaker TensorFlow Serving container works with any model stored in TensorFlow’s SavedModel format and allows you to add customized Python code to process input and output data.
We download the pre-trained models and extract them with the following code:

    # Image classification
    from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input
    model = MobileNetV2()
    model.save('model/1/')

    # Object detection
    !wget http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d1_coco17_tpu-32.tar.gz
    !tar -xvf efficientdet_d1_coco17_tpu-32.tar.gz --no-same-owner
    !mv efficientdet_d1_coco17_tpu-32/saved_model/* model/1/

SageMaker models need to be packaged in .tar.gz format. We archive the TensorFlow SavedModel bundle and upload it to Amazon Simple Storage Service (Amazon S3):

    model
    └── 1
        ├── saved_model.pb
        └── variables
            └── …

We can add customized Python code to process input and output data via input_handler and output_handler methods. The customized Python code must be named inference.py and specified through the entry_point parameter. We add preprocessing to accept an image byte stream as input and read and transform the byte stream with tensorflow.keras.preprocessing:

    # Pre-processing
    import io
    import numpy as np
    from tensorflow.keras.preprocessing import image
    from PIL import Image

    if context.request_content_type == 'application/x-image':
        stream = io.BytesIO(data.read())
        img = Image.open(stream).convert('RGB')
        img = img.resize((WIDTH, HEIGHT))
        img_array = image.img_to_array(img)
        # Add additional model-specific preprocessing
        img_array = img_array.reshape((HEIGHT, WIDTH, 3)).astype(np.uint8)  # "channels_last"
        x = np.expand_dims(img_array, axis=0)  # (1, HEIGHT, WIDTH, 3)

After we have the S3 model artifact path, we can use the following code to deploy a SageMaker endpoint:

    from sagemaker.tensorflow.serving import TensorFlowModel

    model_data = ''  # set to the S3 path of the model artifact
    model = TensorFlowModel(source_dir='code',
                            entry_point='inference.py',
                            model_data=model_data,
                            role=sagemaker_role,
                            framework_version='2.2.0')
    predictor = model.deploy(initial_instance_count=1,
                             instance_type='ml.g4dn.xlarge')

Calling deploy starts the process of creating a SageMaker endpoint. This process includes the following steps:
Starts a TensorFlow Serving process configured to run your model.
Starts an HTTP server that provides access to TensorFlow Serving through the SageMaker InvokeEndpoint API.
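The input_handler and output_handler methods mentioned earlier could be laid out in inference.py roughly as follows. This is a minimal sketch, assuming the handler convention of the SageMaker TensorFlow Serving container: input_handler(data, context) returns the request body for TFS, and output_handler receives the TFS response and returns a (body, content type) pair. It only passes JSON through, and the stand-in objects in the demo are hypothetical test doubles, not SageMaker classes.

```python
import io
import json
from types import SimpleNamespace


def input_handler(data, context):
    """Turn the incoming request body into a JSON body for TFS.

    data: file-like object holding the raw request body.
    context: object exposing request_content_type, as in the post's snippet.
    """
    if context.request_content_type == "application/json":
        # JSON requests are passed through unchanged.
        return data.read().decode("utf-8")
    raise ValueError(
        f"Unsupported content type: {context.request_content_type}")


def output_handler(response, context):
    """Turn the TFS response into (body, content_type) for the client."""
    if response.status_code != 200:
        raise ValueError(response.content.decode("utf-8"))
    return response.content, context.accept_header


if __name__ == "__main__":
    # Hypothetical stand-ins for the objects the container passes in.
    ctx = SimpleNamespace(request_content_type="application/json",
                          accept_header="application/json")
    body = input_handler(io.BytesIO(b'{"instances": [[1, 2]]}'), ctx)
    fake_response = SimpleNamespace(status_code=200,
                                    content=b'{"predictions": [[0.5]]}')
    print(body, output_handler(fake_response, ctx))
```

An image-specific input_handler would replace the JSON branch with the byte-stream preprocessing shown in the post.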
REST communication with TensorFlow Serving
We have complete control over the inference request by implementing the handler method in the entry point inference script. The Python service creates a context object. We convert the preprocessed image NumPy array to JSON and retrieve the REST URI from the context object to trigger a TFS invocation via REST.

    # REST communication with TFS
    import json
    import requests

    # Convert input image to JSON
    inst_json = json.dumps({'instances': instance.tolist()})
    print('rest call')
    # Use context object to retrieve the REST URI
    response = requests.post(context.rest_uri, data=inst_json)

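For reference, the TFS REST predict API takes a JSON body with an "instances" field and replies with a "predictions" field, or an "error" field on failure. The following sketch shows both directions without a running server. The helper names build_predict_body and parse_predict_response are assumptions for illustration and do not appear in the post's repository.

```python
import json


def build_predict_body(instances):
    """Serialize a batch of inputs into a TFS REST predict request body."""
    return json.dumps({"instances": instances})


def parse_predict_response(body):
    """Extract the predictions from a TFS REST predict response body."""
    payload = json.loads(body)
    if "error" in payload:
        raise RuntimeError(payload["error"])
    return payload["predictions"]


if __name__ == "__main__":
    # A batch holding one tiny 1x2x3 "image" instead of a real photo.
    request_body = build_predict_body([[[[0, 1, 2], [3, 4, 5]]]])
    print(request_body)
    print(parse_predict_response('{"predictions": [[0.1, 0.9]]}'))
```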
gRPC communication with TensorFlow Serving
Alternatively, we can use gRPC for in-server communication with TFS via the handler method. We import the gRPC libraries, retrieve the gRPC port from the context object, and trigger a TFS invocation via gRPC:

    import grpc
    import numpy as np
    from tensorflow.compat.v1 import make_tensor_proto
    from tensorflow_serving.apis import predict_pb2
    from tensorflow_serving.apis import prediction_service_pb2_grpc

    request = predict_pb2.PredictRequest()
    request.model_spec.name = 'model'
    # specify the serving signature from the model
    request.model_spec.signature_name = 'serving_default'
    request.inputs['input_tensor'].CopyFrom(make_tensor_proto(instance))

    options = [
        ('grpc.max_send_message_length', MAX_GRPC_MESSAGE_LENGTH),
        ('grpc.max_receive_message_length', MAX_GRPC_MESSAGE_LENGTH)
    ]
    # retrieve the gRPC port from the context object
    channel = grpc.insecure_channel(f'0.0.0.0:{context.grpc_port}', options=options)
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    # make a call that returns immediately, without blocking, with a
    # gRPC future for the request running in the background
    result_future = stub.Predict.future(request, 30)  # 30-second timeout

    # retrieve the output based on the model output types
    output_tensor_proto = result_future.result().outputs['predictions']
    output_shape = [dim.size for dim in output_tensor_proto.tensor_shape.dim]

    # convert the float values to a NumPy array
    output_np = np.array(output_tensor_proto.float_val).reshape(output_shape)

    # create JSON response
    prediction_json = {'predictions': output_np.tolist()}

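The channel options above raise the gRPC message size limits, which matters because gRPC's default receive limit is, to our knowledge, 4 MiB. The following sketch estimates whether a tensor of a given shape would exceed that default. The helper names and the example shapes are illustrative assumptions, and the estimate ignores protobuf framing overhead, so treat it as a rough guide for choosing a value such as MAX_GRPC_MESSAGE_LENGTH rather than an exact rule.

```python
# Assumed default receive limit of gRPC (4 MiB).
DEFAULT_GRPC_MAX_RECEIVE_BYTES = 4 * 1024 * 1024


def tensor_nbytes(shape, bytes_per_element):
    """Approximate in-memory size of a dense tensor in bytes."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


def exceeds_default_limit(shape, bytes_per_element):
    """Return True if the tensor alone is larger than the default limit."""
    return tensor_nbytes(shape, bytes_per_element) > DEFAULT_GRPC_MAX_RECEIVE_BYTES


if __name__ == "__main__":
    # A 640x640 RGB image as uint8 (1 byte) and as float32 (4 bytes).
    for name, width in (("uint8", 1), ("float32", 4)):
        shape = (1, 640, 640, 3)
        print(name, tensor_nbytes(shape, width),
              exceeds_default_limit(shape, width))
```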
Prediction invocation comparison
We can invoke the deployed model with an input image to retrieve image classification or object detection outputs:

    import boto3

    input_image = open('image.jpg', 'rb').read()
    runtime_client = boto3.client('runtime.sagemaker')
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='application/x-image',
        Body=input_image)
    res = response['Body'].read().decode('ascii')

We then trigger 100 invocations to generate latency statistics for comparison:

    import time
    import numpy as np

    results = []
    for i in range(100):
        start = time.time()
        response = runtime_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType='application/x-image',
            Body=input_image)
        results.append((time.time() - start) * 1000)

    print("\nPredictions for TF2 serving : \n")
    print('\nP95: ' + str(np.percentile(results, 95)) + ' ms\n')
    print('P90: ' + str(np.percentile(results, 90)) + ' ms\n')
    print('Average: ' + str(np.average(results)) + ' ms\n')

The following table summarizes our results from the invocation tests. The results show a 75% improvement in latency with gRPC compared to REST calls to TFS for image classification, and 85% improvement for object detection models. We observe that the performance improvement depends on the size of the request payload and response payload from the model.
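The percentage improvements quoted above relate two latency figures to each other. The following is a minimal sketch of that arithmetic, together with a dependency-free percentile that uses linear interpolation, which is the default method of numpy.percentile. The function names are assumptions for illustration, and any numbers you pass in are your own measurements, not results from this post.

```python
import math


def percentile(values, p):
    """Percentile with linear interpolation between the closest ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (p / 100) * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def improvement_pct(baseline_ms, new_ms):
    """Relative latency reduction of new_ms versus baseline_ms, in percent."""
    return (baseline_ms - new_ms) / baseline_ms * 100


if __name__ == "__main__":
    # Hypothetical latencies in milliseconds, for illustration only.
    rest_ms = [100.0, 110.0, 120.0, 130.0]
    grpc_ms = [25.0, 27.5, 30.0, 32.5]
    print(improvement_pct(percentile(rest_ms, 95), percentile(grpc_ms, 95)))
```

Under this definition, a hypothetical drop from 100 ms to 25 ms corresponds to a 75% improvement.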
In this post, we demonstrated how to reduce model serving latency for TensorFlow computer vision models on SageMaker via in-server gRPC communication. We walked through a step-by-step process of in-server communication with TensorFlow Serving via REST and gRPC and compared the performance using two different models and payload sizes. For more information, see Maximize TensorFlow performance on Amazon SageMaker endpoints for real-time inference to understand the throughput and latency gains you can achieve from tuning endpoint configuration parameters such as the number of threads and workers.
SageMaker provides a powerful and configurable platform for hosting real-time computer vision inference in the cloud with low latency. In addition to using gRPC, we suggest other techniques to further reduce latency and improve throughput, such as model compilation, model server tuning, and hardware and software acceleration technologies. Amazon SageMaker Neo lets you compile and optimize ML models for various ML frameworks to a wide variety of target hardware. Select the most appropriate SageMaker compute instance for your specific use case, including g4dn featuring NVIDIA T4 GPUs, a CPU instance type coupled with Amazon Elastic Inference, or inf1 featuring AWS Inferentia.
About the Authors
Hasan Poonawala is a Machine Learning Specialist Solutions Architect at AWS, based in London, UK. Hasan helps customers design and deploy machine learning applications in production on AWS. He is passionate about the use of machine learning to solve business problems across various industries. In his spare time, Hasan loves to explore nature outdoors and spend time with friends and family.
Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build AI/ML solutions. Mark’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Mark holds six AWS certifications, including the ML Specialty Certification. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services.