Automating machine learning lifecycle with AWS
The machine learning and data science lifecycle involves several phases. Each phase requires complex tasks executed by different teams, as explained by Microsoft in this article. To reduce this complexity, cloud providers such as Amazon, Microsoft, and Google offer services that automate these tasks and speed up the end-to-end machine learning lifecycle. This article explains the Amazon Web Services (AWS) cloud services used for different tasks in the machine learning lifecycle. To help you understand each service, I give a brief description, a use case, and a link to the documentation. In this article, "machine learning lifecycle" and "data science lifecycle" are used interchangeably.
Streaming data is acquired continuously in small chunks and in real time. Examples are data from social media, purchase activity on eCommerce platforms, and data from IoT devices. The Amazon service used for streaming data is Amazon Kinesis.
Amazon Kinesis is used for capturing, processing, and analysing real-time streaming data. Amazon Kinesis is divided into several services, including:
Amazon Kinesis Video Streams: captures, processes, and stores video data for analytics and machine learning. For example, it can capture video data during a video consultation that a doctor holds in a web browser, as explained in this article.
Amazon Kinesis Data Firehose: delivers real-time streaming data to Amazon S3, endpoints, and other destinations. For an example, refer to this article on automatically detecting sports highlights.
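As a rough illustration of delivering a record to Kinesis Data Firehose, here is a minimal sketch. It assumes a boto3 Firehose client is passed in; the stream name `clickstream-to-s3` and the event fields are hypothetical and only serve the example.

```python
import json


def send_event(firehose_client, stream_name, event):
    """Send one event (a dict) to a Firehose delivery stream as a JSON line."""
    # Firehose does not insert delimiters between records, so a trailing
    # newline is added here to keep records separable at the destination.
    payload = (json.dumps(event) + "\n").encode("utf-8")
    response = firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": payload},
    )
    return response["RecordId"]


# Usage (needs AWS credentials and an existing delivery stream; the
# stream name below is a hypothetical placeholder):
#   import boto3
#   client = boto3.client("firehose")
#   send_event(client, "clickstream-to-s3", {"user": "u1", "action": "purchase"})
```

Passing the client in as an argument keeps the function easy to try out with a stub before pointing it at a real stream.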
Batch data is historical data, in other words, data stored over a period of time. An example is customer transaction data stored over one year. Since batch data is about stored data, we next discuss the data lake and databases in AWS that are used to store it.
A data lake is used to store large amounts of data in its native format. Amazon S3 is used as a data lake.
Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance. It is similar to the hard drive on your computer, with files and folders, but in the cloud. This is an example.
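To make the "folders in the cloud" comparison concrete, the following sketch stores an object under a key and lists everything under a key prefix, which is how S3 mimics folders. It assumes a boto3 S3 client is passed in; the bucket and key names in the usage comment are hypothetical.

```python
def upload_text(s3_client, bucket, key, text):
    """Store a small text object and return its S3 URI."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=text.encode("utf-8"))
    return f"s3://{bucket}/{key}"


def list_folder(s3_client, bucket, prefix):
    """List object keys under a prefix (S3's stand-in for a folder).

    A single list_objects_v2 call returns at most 1,000 keys; this sketch
    does not paginate.
    """
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [obj["Key"] for obj in response.get("Contents", [])]


# Usage (needs AWS credentials; bucket and keys are hypothetical):
#   import boto3
#   s3 = boto3.client("s3")
#   upload_text(s3, "my-data-lake", "raw/transactions/2021.csv", "id,amount\n1,9.99\n")
#   list_folder(s3, "my-data-lake", "raw/transactions/")
```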
I wrote an article on selecting the right database in AWS.
Data processing involves converting raw data to a format that can be used for machine learning and other processes. Amazon EMR and Amazon MSK are well-known services for processing data.
According to Amazon, Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark. Deployment options include Amazon EMR on EC2, Amazon EMR on EKS, and Amazon EMR Serverless. It processes and analyzes vast amounts of data. In layman's terms, it is used for processing big data in parallel, for example to create training data for machine learning.
Amazon MSK enables you to build and run applications that use Apache Kafka to process streaming data. Essential components of Amazon MSK are broker nodes, ZooKeeper nodes, producers, consumers, topic creators, and cluster operations. This can be used by a company that just became a unicorn and needs to process unicorn requests.
Data Cleaning and Wrangling
Data wrangling is the process of cleaning messy and complex data and converting it to a usable format.
SageMaker Data Wrangler is a feature of SageMaker Studio that provides an end-to-end solution to import, prepare, transform, featurize, analyze, and export data. This reduces the time needed to clean, aggregate, and prepare data for machine learning. This article explains how car charging station data was cleaned and aggregated with SageMaker Data Wrangler.
Data labelling involves adding informative labels for identifying raw data.
Amazon SageMaker Data Labeling can be divided into SageMaker Ground Truth and SageMaker Ground Truth Plus.
Amazon SageMaker Ground Truth Plus is a turnkey service that allows you to build high-quality training datasets at scale without using your own resources. An example is creating image data with SageMaker Ground Truth Plus.
Amazon SageMaker Ground Truth
Amazon SageMaker Ground Truth provides the flexibility to build and manage your data labeling workflows and manage your own data labeling workforce. An example is text labeling for aspect-based sentiment analysis.
Data visualization is the graphical representation of data. If a picture is worth a thousand words, data visualization is worth thousands of data points. There are a lot of visualization tools like Tableau and Power BI. The good thing about Amazon QuickSight for visualization is the easy integration with other Amazon services.
Amazon QuickSight is a cloud-scale business intelligence (BI) service that you can use to deliver easy-to-understand insights to the people who you work with, wherever they are.
Feature engineering is the process of converting raw data into features used to train machine learning models.
Amazon SageMaker Feature Store is used for creating, storing, and sharing machine learning (ML) features. Feature stores can be online or offline. This blog explains how GoDaddy uses Feature Store in their machine learning lifecycle.
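To give a feel for how features are written to a feature store, here is a minimal sketch that converts a plain dict into the name/value list that the Feature Store runtime `put_record` call takes. It assumes a boto3 `sagemaker-featurestore-runtime` client and an existing feature group; the group name and feature names are hypothetical.

```python
def to_feature_record(features):
    """Convert {'name': value} into the list-of-dicts shape put_record expects."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in features.items()
    ]


def put_features(runtime_client, feature_group_name, features):
    """Write one record to a feature group."""
    runtime_client.put_record(
        FeatureGroupName=feature_group_name,
        Record=to_feature_record(features),
    )


# Usage (needs AWS credentials and an existing feature group; the names
# below are hypothetical):
#   import boto3
#   runtime = boto3.client("sagemaker-featurestore-runtime")
#   put_features(runtime, "customers", {"customer_id": 42, "avg_basket": 17.5})
```

All values are sent as strings; the feature group's own definition determines how each feature is typed.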
An Amazon SageMaker notebook is a Jupyter notebook that can be connected to Amazon EMR for interactively exploring, visualizing, and preparing petabyte-scale data for machine learning (ML). It supports sharing your notebook with colleagues for collaboration through the UI.
Model training in data science requires provisioning compute instances for machine learning.
Amazon EC2 provides scalable computing capacity. Amazon EC2 eliminates your need to invest in hardware up front, so you can develop and deploy applications faster. The compute is scalable and resizable according to your workload.
AWS Batch is used for planning, scheduling, and executing jobs on AWS compute services such as Amazon EC2, AWS Fargate, and Spot Instances. It provisions compute resources based on the jobs submitted.
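Submitting a job to AWS Batch might look like the sketch below. It assumes a boto3 Batch client plus a job queue and job definition that already exist; the names in the usage comment are hypothetical.

```python
def submit_training_job(batch_client, job_name, job_queue, job_definition, command):
    """Submit one job to an existing queue and return its job ID."""
    response = batch_client.submit_job(
        jobName=job_name,
        jobQueue=job_queue,
        jobDefinition=job_definition,
        # Override the container command defined in the job definition.
        containerOverrides={"command": command},
    )
    return response["jobId"]


# Usage (needs AWS credentials; queue and definition names are hypothetical):
#   import boto3
#   batch = boto3.client("batch")
#   submit_training_job(batch, "train-2022-01", "ml-queue", "train-job-def",
#                       ["python", "train.py", "--epochs", "10"])
```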
SageMaker Training Compiler is a SageMaker capability for training deep learning (DL) models more efficiently.
Model tuning is the process of selecting the best configuration and model for a machine learning task.
SageMaker automatic model tuning automates the process of running several training jobs and selecting the best-performing model.
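The idea of running several training jobs and keeping the best one can be sketched locally with a plain random search. This is only an analogy, not the SageMaker API: `train_and_score` is a made-up stand-in for a real training job, and the hyperparameter names and ranges are assumptions chosen for illustration.

```python
import random


def train_and_score(learning_rate, depth):
    """Hypothetical stand-in for a training job; returns a validation score.

    The made-up score peaks (at 0) when learning_rate=0.1 and depth=6.
    """
    return -((learning_rate - 0.1) ** 2) * 100 - ((depth - 6) ** 2) * 0.01


def random_search(n_jobs, seed=0):
    """Run n_jobs 'training jobs' with random settings and keep the best."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_jobs):
        params = {
            "learning_rate": rng.uniform(0.001, 0.3),
            "depth": rng.randint(2, 10),
        }
        score = train_and_score(**params)
        if best is None or score > best["score"]:
            best = {"params": params, "score": score}
    return best


if __name__ == "__main__":
    print(random_search(n_jobs=20))
```

A managed tuning service runs the equivalent of this loop on real training jobs, so the selection logic does not have to be maintained by hand.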
Amazon SageMaker Autopilot automatically builds, trains, and tunes the best machine learning models based on your data while allowing you to maintain full control and visibility. An example is the use of Amazon SageMaker Autopilot to tackle regression and classification tasks on large datasets up to 100 GB.
Amazon SageMaker Experiments is a capability of Amazon SageMaker that lets you organize, track, compare, and evaluate your machine learning experiments.
Amazon SageMaker ML Lineage Tracking creates and stores information about the steps of a machine learning (ML) workflow from data preparation to model deployment. This is an example of how model lineage is used.
Amazon SageMaker Debugger profiles and debugs training jobs to help resolve problems such as system bottlenecks, overfitting, saturated activation functions, and vanishing gradients, which can compromise model performance. It also helps improve your ML model’s compute resource utilization and performance. An example is building churn prediction with SageMaker Debugger.
Amazon SageMaker Model Monitor continuously monitors the quality of Amazon SageMaker machine learning models in production. An example is detecting NLP data drift using a custom Amazon SageMaker Model Monitor.
Amazon SageMaker Clarify provides machine learning developers with greater visibility into their training data and models so they can identify and limit bias and explain predictions.
Features of SageMaker Model Registry:
- Catalog models for production.
- Manage model versions.
- Associate metadata, such as training metrics, with a model.
- Manage the approval status of a model.
- Deploy models to production.
- Automate model deployment with CI/CD.
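Managing the approval status of a model version, one of the features listed above, could look roughly like this. The sketch assumes a boto3 SageMaker client and an existing model package group; the group name in the usage comment is hypothetical.

```python
def approve_latest_model(sm_client, model_package_group_name):
    """Mark the newest model version in a group as Approved.

    Returns the approved model package ARN, or None if the group is empty.
    """
    response = sm_client.list_model_packages(
        ModelPackageGroupName=model_package_group_name,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    packages = response["ModelPackageSummaryList"]
    if not packages:
        return None
    arn = packages[0]["ModelPackageArn"]
    sm_client.update_model_package(
        ModelPackageArn=arn,
        ModelApprovalStatus="Approved",
    )
    return arn


# Usage (needs AWS credentials; the group name is hypothetical):
#   import boto3
#   sm = boto3.client("sagemaker")
#   approve_latest_model(sm, "churn-models")
```

In a CI/CD setup, a change of approval status is a common trigger for the deployment stage.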
Amazon SageMaker Serverless Inference is a purpose-built inference option that makes it easy for you to deploy and scale ML models. Serverless Inference is ideal for workloads which have idle periods between traffic spurts and can tolerate cold starts.
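A serverless endpoint is configured through a `ServerlessConfig` block on the production variant of an endpoint configuration. The sketch below only builds that block. The allowed memory sizes are the values I understand the AWS documentation to list at the time of writing, so treat them as an assumption and check the current limits; the model and config names are hypothetical.

```python
# Assumed from the AWS documentation at the time of writing; verify before use.
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def serverless_variant(model_name, memory_mb=2048, max_concurrency=5):
    """Build one production variant that uses Serverless Inference."""
    if memory_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(f"memory_mb must be one of {ALLOWED_MEMORY_MB}")
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,
            "MaxConcurrency": max_concurrency,
        },
    }


# Usage (needs AWS credentials and an existing SageMaker model; names are
# hypothetical):
#   import boto3
#   sm = boto3.client("sagemaker")
#   sm.create_endpoint_config(
#       EndpointConfigName="churn-serverless",
#       ProductionVariants=[serverless_variant("churn-model")],
#   )
```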
Amazon ECR is an AWS managed container image registry service that is secure, scalable, and reliable. An example is how ReliaQuest uses Amazon SageMaker to accelerate its AI innovation by 35x.
Amazon Elastic Kubernetes Service is a managed service that you can use to run Kubernetes on AWS without needing to install, operate, and maintain your own Kubernetes control plane or nodes. An example is the evolution of Cresta’s machine learning architecture: migration to AWS and PyTorch.
With SageMaker Projects, teams of data scientists and developers can work on machine learning business problems by creating a SageMaker project with a SageMaker-provided MLOps template that automates the model building and deployment pipelines using continuous integration and continuous delivery (CI/CD). For an example, see Build Custom SageMaker Project Templates: Best Practices.
SageMaker Neo helps data scientists and machine learning engineers to train models once and run them anywhere. It optimizes deep learning performance for multiple frameworks and hardware. An example is How Deloitte is Improving Animal Welfare with AI at the Edge Using AWS Panorama and AWS Neo.
AWS Step Functions is used for creating and managing workflows. It is an orchestrator for building applications: a low-code, visual workflow service that developers use to build distributed applications, automate IT and business processes, and build data and machine learning pipelines using AWS services. An example is orchestrating a forecasting pipeline using AWS Step Functions.
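A Step Functions workflow is described in the Amazon States Language, a JSON document. The sketch below builds a small three-step pipeline definition. The state names, the retry settings, and the resource ARNs passed in are hypothetical placeholders, not taken from the forecasting example mentioned above.

```python
import json


def forecasting_pipeline_definition(prepare_arn, train_arn, forecast_arn):
    """Return an Amazon States Language definition as a JSON string."""
    definition = {
        "Comment": "Illustrative three-step ML pipeline",
        "StartAt": "PrepareData",
        "States": {
            "PrepareData": {
                "Type": "Task",
                "Resource": prepare_arn,
                "Next": "TrainModel",
            },
            "TrainModel": {
                "Type": "Task",
                "Resource": train_arn,
                # Retry the training task twice if it fails.
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 30,
                        "MaxAttempts": 2,
                    }
                ],
                "Next": "CreateForecast",
            },
            "CreateForecast": {
                "Type": "Task",
                "Resource": forecast_arn,
                "End": True,
            },
        },
    }
    return json.dumps(definition, indent=2)


# Usage (needs AWS credentials and an IAM role; all names are hypothetical):
#   import boto3
#   sfn = boto3.client("stepfunctions")
#   sfn.create_state_machine(
#       name="forecast-pipeline",
#       definition=forecasting_pipeline_definition(prep_arn, train_arn, fc_arn),
#       roleArn=role_arn,
#   )
```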
AWS CodeCommit is a source control service for private Git repositories. It is used to privately store and manage assets (such as documents, source code, and binary files) in the cloud. An example is using CodeCommit for a continuous data journey towards developmental agility and faster data delivery.
AWS CodeBuild is a fully managed continuous integration service that compiles source code, runs tests, and produces software packages that are ready to deploy. CodeBuild was used in this blog to optimize budget and time by submitting Amazon Polly voice synthesis tasks in bulk.
AWS CodePipeline is a fully managed continuous delivery service that helps you automate your release pipelines for fast and reliable application and infrastructure updates. A good use case is using CodePipeline to improve your data science workflow with a multi-branch training MLOps pipeline on AWS.
CodeDeploy automates application deployments to Amazon EC2 instances, on-premises instances, serverless Lambda functions, or Amazon ECS services. It maximizes application availability.
Amazon CodeGuru automates code reviews and application profiling. It helps to improve code quality and reduce application performance issues. It helps to identify and estimate expensive lines of code. It uses program analysis and machine learning to improve code.
CodeArtifact is a managed package repository service that makes it easy for organizations to securely store and share software packages used for application development. Features of CodeArtifact:
- securely store packages
- share packages during application development
- ingest packages from third-party repositories
A use case is using CodeArtifact for developing serverless applications.
This article discussed the Amazon Web Services used in the different stages of the data science lifecycle, with a brief description of each service.
Originally published at https://trojrobert.github.io on January 30, 2022.