Selecting the right database in Amazon Web Services (AWS)
Deciding on the database to use for our application or workload can be tricky. Since I joined the AWS Community Builders program, I have spent at least one hour every day exploring AWS through use cases. Amazon Web Services (AWS) provides so many database options that choosing the right one can be confusing. This article documents what I learned, and the resources I used, while studying the various AWS databases and learning when to use each of them. I hope it will be of value to you. I would love feedback on what you think I should add, remove, or improve as I continue exploring AWS and other cloud services.
Many criteria can help us select the right database in AWS. To make it easier, I have summarized them into the following four:
- Type of Data
- Size of Data
- Structure or Shape of Data
- Activities that will be done on the Data
Now that we have an idea of the criteria we can use to select the right database, let us dive into each of these databases. The AWS databases covered here share the following properties:
- fully managed by AWS
- scalable; that is, increase and decrease based on demand
- highly available; that is, the databases are designed to stay up even when underlying infrastructure fails
Amazon Relational Database Service (RDS)
Amazon RDS is not a database itself; it is a service for setting up, operating, and scaling relational databases in the cloud. It lets us provision Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, and SQL Server, acting like an administrator for these databases. It automates failover, backups, restores, disaster recovery, access management, encryption, security, monitoring, and performance optimization. It offers two major backup solutions: automated backups and manual snapshots. An instance supports up to five read replicas. Deployments can use Multi-AZ (a standby instance in another Availability Zone for failover), cross-Region replicas, or in-Region read replicas. Resources are not replicated across AWS Regions by default; you must configure that explicitly.
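To make the provisioning step concrete, here is a minimal sketch of creating an RDS PostgreSQL instance with the boto3 SDK, with Multi-AZ failover, automated backups, and encryption at rest enabled. The identifier, instance class, and sizes are illustrative assumptions, not recommendations.

```python
# Sketch: provisioning an RDS PostgreSQL instance with boto3.
# All names and sizes below are hypothetical, for illustration only.

db_params = {
    "DBInstanceIdentifier": "demo-postgres",  # hypothetical instance name
    "Engine": "postgres",
    "DBInstanceClass": "db.t3.micro",         # small burstable instance
    "AllocatedStorage": 20,                   # GiB
    "MasterUsername": "admin_user",
    "MasterUserPassword": "replace-me",       # placeholder; prefer Secrets Manager
    "MultiAZ": True,                          # standby in another AZ for failover
    "BackupRetentionPeriod": 7,               # days of automated backups
    "StorageEncrypted": True,                 # encryption at rest
}

def launch(rds_client):
    """Create the instance; `rds_client` is a boto3 RDS client."""
    return rds_client.create_db_instance(**db_params)

# To actually run it (requires boto3 and AWS credentials):
# import boto3
# launch(boto3.client("rds", region_name="us-east-1"))
```

The actual API call is left commented out so the sketch can be read without an AWS account; the parameter names match the `create_db_instance` API.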
When to use Amazon Relational Database Service (RDS)
- when we need any of these six engines: Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, or SQL Server
Pricing
Amazon Relational Database Service (RDS) pricing depends on whether we use On-Demand or Reserved Instances.
Official Resources
Other Resources
- AWS Certified Solutions Architect — Associate 2020 (6:23:16–7:00:11)
- Amazon Relational Database Service (Amazon RDS) by Vlad Vlasceanu
- Amazon Relational Database Service (RDS) (DAT302) by Brian Welcker
Amazon Aurora
Amazon Aurora is MySQL- and PostgreSQL-compatible; AWS states it is up to five times faster than standard MySQL and three times faster than standard PostgreSQL, at about one-tenth the cost of commercial databases. It supports a maximum database size of 128 TB and a scaling policy of up to 15 Aurora Replicas. Backup and failover are automatic, and cross-Region replication is supported. Aurora MySQL and Aurora PostgreSQL DB clusters are created from the Amazon RDS console. Aurora Serverless gives Amazon Aurora the ability to automatically scale up, scale down, start up, and shut down (auto-scaling).
When to use Amazon Aurora
Aurora Serverless is best used when:
- building an application that is used infrequently
- building a new application whose load is not yet known
- building an application with variable and unpredictable workloads
Pricing
Amazon Aurora pricing depends on whether we select the MySQL-compatible or the PostgreSQL-compatible edition. Aurora Serverless is charged in Aurora Capacity Units (ACUs).
Official Resources
Other Resources
- Amazon Aurora Introduction
- AWS re:Invent 2019: [REPEAT] Amazon Aurora storage demystified: How it all works (DAT309-R) by Murali Brahmadesa and Tobias Ternström
- Amazon Aurora Global Database Deep Dive by Aditya Samant
- AWS Certified Solutions Architect — Associate 2020 (7:02:14–7:06:56)
- Amazon Aurora ascendant: How we designed a cloud-native relational database
Amazon Redshift
Amazon Redshift is a columnar data warehouse. It is used to analyse large, mostly historical data at rest and to get insights quickly by executing complex queries on it. A Redshift deployment is a cluster of nodes, running in either single-node or multi-node mode. There are two types of nodes: the leader node, which serves as the SQL endpoint, stores metadata, and coordinates parallel SQL processing; and the compute nodes, which store the data and execute the queries. Amazon Redshift stores data in a single Availability Zone. Amazon Redshift Spectrum is used to query Amazon Simple Storage Service (Amazon S3) directly, and Amazon Redshift federated queries let us query and analyze live data across databases, data warehouses, and data lakes.
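To see why columnar storage speeds up analytics, here is a toy sketch in plain Python (not actual Redshift internals, and the order data is made up): in a row layout, an aggregate must walk every field of every row, while in a columnar layout SUM over one column only touches that column's contiguous values.

```python
# Toy illustration of row vs. columnar layout — not Redshift's real storage engine.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 75.5},
    {"order_id": 3, "region": "EU", "amount": 30.0},
]

# Row store: SUM(amount) walks whole rows, touching every field.
total_row_store = sum(r["amount"] for r in rows)

# Column store: each column is its own contiguous array,
# so SUM(amount) reads only the "amount" column.
columns = {
    "order_id": [1, 2, 3],
    "region":   ["EU", "US", "EU"],
    "amount":   [120.0, 75.5, 30.0],
}
total_column_store = sum(columns["amount"])
```

Both totals are equal; the difference is how much data each layout has to scan, which is what makes columnar storage fast for analytical queries over a few columns of a wide table.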
When to use Amazon Redshift
- for Online Analytical Processing
- if we need to run queries across multiple data sources; for instance, we can copy data from sources such as Amazon EMR and Amazon S3 into Amazon Redshift
- Amazon Redshift is suitable for generating reports for business intelligence
Pricing
Amazon Redshift pricing starts from $0.25 per hour. Several other features can influence the price, such as Amazon Redshift Spectrum pricing, Concurrency Scaling pricing, Redshift managed storage pricing, and Redshift ML pricing.
Official Resources
Other Resources
- AWS re:Invent 2017: Best Practices for Data Warehousing with Amazon Redshift & Redsh (ABD304) by Tony Gibbs
- Getting Started with Amazon Redshift — AWS Online Tech Talks by Greg Khairallah and Harshida Patel
- Amazon Redshift Tutorial | Amazon Redshift Architecture | AWS Tutorial For Beginners | Simplilearn
- Amazon Redshift Tutorial | AWS Tutorial for Beginners | AWS Certification Training | Edureka
- AWS Certified Solutions Architect — Associate 2020 (7:07:58)
- What is Amazon Redshift? By Kevin Goldberg
DynamoDB
DynamoDB is a NoSQL key/value and document database; that is, it supports both document and key/value data structures. DynamoDB's major components are tables, items, attributes, keys, and values. A table is a collection of items, and an item is a collection of attributes. Items are similar to rows, and attributes are similar to columns in a traditional database. A key identifies an attribute, and the value is the data itself. The major API components in DynamoDB are the control plane, the data plane, DynamoDB Streams, and transactions. On-Demand and Provisioned are the read/write capacity modes; in provisioned mode, we specify capacity in Read Capacity Units (RCU) and Write Capacity Units (WCU). Amazon DynamoDB creates partitions based on size, Read Capacity Units, and Write Capacity Units: a partition is split when it exceeds 10 GB of data, 3,000 RCU, or 1,000 WCU. DynamoDB encrypts data at rest, that is, inactive data that is not moving from one device to another or from one network to another. It also has a point-in-time recovery feature, so we can restore our data to any point in time. Amazon DynamoDB Accelerator (DAX) is a managed write-through cache for DynamoDB; it reduces response time from milliseconds to microseconds. Amazon DynamoDB uses SSD storage and stores its data across three Availability Zones.
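The capacity-unit and partitioning numbers above can be turned into a small back-of-the-envelope calculator. This sketch uses the publicly documented rules (1 RCU = one strongly consistent read per second of an item up to 4 KB, with eventually consistent reads costing half; 1 WCU = one write per second of an item up to 1 KB) plus the 10 GB / 3,000 RCU / 1,000 WCU thresholds mentioned above; treat the exact figures as illustrative.

```python
import math

def required_rcu(reads_per_sec, item_kb, strongly_consistent=True):
    """RCUs needed: 1 RCU = one strongly consistent read/sec of up to 4 KB.
    Eventually consistent reads cost half as much."""
    units = reads_per_sec * math.ceil(item_kb / 4)
    return units if strongly_consistent else math.ceil(units / 2)

def required_wcu(writes_per_sec, item_kb):
    """WCUs needed: 1 WCU = one write/sec of up to 1 KB."""
    return writes_per_sec * math.ceil(item_kb / 1)

def estimated_partitions(size_gb, rcu, wcu):
    """Rough partition count from the 10 GB / 3,000 RCU / 1,000 WCU thresholds."""
    return max(math.ceil(size_gb / 10),
               math.ceil(rcu / 3000 + wcu / 1000))

# Example: 500 strongly consistent reads/sec and 100 writes/sec of 6 KB items
# in a 25 GB table.
rcu = required_rcu(500, 6)                 # 500 * ceil(6/4) = 1000
wcu = required_wcu(100, 6)                 # 100 * 6 = 600
parts = estimated_partitions(25, rcu, wcu) # size dominates here: 3 partitions
```

Calculations like this are useful for spotting when table size, rather than throughput, is what drives partitioning.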
When to use Amazon DynamoDB
- for Online Transaction Processing (OLTP)
- to store real-time data from IoT devices
- to store activities and events on a web application, such as clicks
- to store items in a web application, like the user profiles and user events used by advertising, gaming, retail, finance, and media companies
- for data that requires a high request rate (millions of requests per second)
- for situations that require strong consistency
Pricing
Amazon DynamoDB pricing depends on whether we use on-demand or provisioned capacity mode.
Hands-On
Creating Tables and Loading Data
Sample Code
Create a ToDo Web App Storing your data in Amazon DynamoDB
Official Resources
Other Resources
- AWS DynamoDB Tutorial | AWS Services | AWS Tutorial For Beginners | AWS Training Video | Simplilearn
- AWS Certified Developer — Associate 2020 (5:05:21)
- AWS re:Invent 2018: Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database by Jaso Sorenson
- AWS re:Invent 2018: Amazon DynamoDB Deep Dive: Advanced Design Patterns for DynamoDB (DAT401) by Rick Houlihan
- Building a Mars Rover Application with DynamoDB
Amazon DocumentDB
Amazon DocumentDB is a document database with MongoDB compatibility. It can easily store, query, and index JSON data. It supports up to 15 read replicas and scales with very low impact on the workload. It has a flexible schema and ad hoc query capability, it is easy to index, and it can be used for both operational and analytics workloads.
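To give a flavor of the flexible-schema, ad hoc queries described above, this sketch filters a small list of JSON-style documents in plain Python; in DocumentDB you would express the same filter as a MongoDB-style find() query. The product data is made up.

```python
# JSON-style documents with a flexible schema — not every document
# has every field (the "desk" document has no "tags").
products = [
    {"_id": 1, "name": "lamp", "price": 30, "tags": ["home"]},
    {"_id": 2, "name": "desk", "price": 120},
    {"_id": 3, "name": "mug",  "price": 8,  "tags": ["home", "kitchen"]},
]

# Ad hoc query: products under 50 tagged "home".
# Roughly the MongoDB-style filter {"price": {"$lt": 50}, "tags": "home"}.
cheap_home = [p["name"] for p in products
              if p["price"] < 50 and "home" in p.get("tags", [])]
```

Because the schema is flexible, the query has to tolerate missing fields (`p.get("tags", [])`), which is exactly the kind of shape-variation a document database is built to handle.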
Pricing
Amazon DocumentDB pricing is based on On-demand instances, database I/O, database storage and backup storage.
When to use Amazon DocumentDB
- it is Amazon's MongoDB-compatible service; use it when you need to run MongoDB workloads at scale
- best for profile management and content and catalog management
Official Resources
Other Resources
- AWS re:Invent 2019: Amazon DocumentDB deep dive (DAT326) by Joseph Idziorek and Antra Grover
- Building with Amazon DocumentDB (with MongoDB compatibility) — AWS Virtual Workshop by Meet Bhagdev
DynamoDB vs AWS DocumentDB vs MongoDB
- MongoDB vs. DocumentDB: Which Is Right for You?
- Difference between AWS DynamoDB vs AWS DocumentDB vs MongoDB?
Amazon Neptune
Amazon Neptune is a graph database; it works with highly connected datasets and looks for relationships or similarities in data, for instance, the similarity between the movies a user watches on Netflix. Its components are nodes (data entities), edges (relationships), and properties. Amazon Neptune supports the property graph model with the open-source Apache TinkerPop Gremlin query language, and the Resource Description Framework (RDF) with SPARQL. Amazon Neptune replicates data six times across three Availability Zones. Amazon Neptune Streams can be used to capture changes in a graph.
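Here is a minimal sketch of the node/edge model above in plain Python: users and movies are nodes, "watched" edges connect them, and a recommendation is a movie watched by someone who shares a movie with you. In Neptune itself this traversal would be a Gremlin or SPARQL query; the viewing data is made up.

```python
# Edges: (user, movie) "watched" relationships — made-up data.
watched = [
    ("ada", "Inception"), ("ada", "Arrival"),
    ("ben", "Inception"), ("ben", "Dune"),
    ("cia", "Dune"),
]

def recommend(user):
    """Movies watched by users who share at least one movie with `user`,
    excluding what `user` has already seen."""
    mine = {m for u, m in watched if u == user}
    neighbors = {u for u, m in watched if m in mine and u != user}
    return sorted({m for u, m in watched if u in neighbors} - mine)

recs = recommend("ada")  # ben shares "Inception" with ada, so "Dune" is suggested
```

This two-hop traversal (user → shared movie → other user → their movies) is the shape of query that graph databases are optimized for.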
Pricing
Amazon Neptune pricing is influenced by On-demand instances, database I/O, database storage, backup storage and data transfer.
When to use Amazon Neptune
- Amazon Neptune is best used when we have relationships in the data.
- for recommendation engines, fraud detection, and drug discovery.
- for knowledge base applications such as Wikidata.
- for identity graphs to show unified view of customers and prospects based on their interactions with a product or a website.
- for social Networking applications to store user profiles and interactions.
Official Resources
Other Resources
- AWS re:Invent 2019: Deep dive on Amazon Neptune (DAT361) by Bradley Bebee, Karthik Bharathy, and Ora Lassila
- Nike: A Social Graph at Scale with Amazon Neptune
- Homesite: Event-Driven Data Analytics Platform Using Amazon Neptune
- AWS on Air 2020: AWS What’s Next ft. Amazon Neptune ML
Amazon ElastiCache
Amazon ElastiCache is used to manage in-memory caching. Caching means storing data in a temporary storage area; the data lives in RAM, which is volatile (the data can be lost easily) but very fast to access. ElastiCache stores frequently accessed data to improve performance, reducing application latency and increasing throughput. It caches data from the database, which makes it different from Amazon CloudFront (a content delivery network): ElastiCache keeps important application data in memory, while CloudFront stores static files such as the HTML, audio, video, and media files required by a web app. Amazon ElastiCache can only be accessed by resources in the same VPC.
Amazon ElastiCache has two engines:
- Amazon ElastiCache for Redis
- Amazon ElastiCache for Memcached.
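The caching described above typically follows the "cache-aside" pattern: check the cache first and fall back to the database only on a miss. This sketch uses a plain dict in place of Redis or Memcached; with ElastiCache for Redis you would swap the dict for a Redis client with the same get/set shape.

```python
cache = {}      # stand-in for Redis/Memcached
db_reads = 0    # counts how often we hit the (slow) database

def slow_db_lookup(key):
    """Pretend database query — the expensive path we want to avoid."""
    global db_reads
    db_reads += 1
    return f"value-for-{key}"

def get(key):
    """Cache-aside read: try the cache, fall back to the DB, then populate."""
    if key in cache:
        return cache[key]          # cache hit: served from memory
    value = slow_db_lookup(key)    # cache miss: go to the database
    cache[key] = value             # populate the cache for next time
    return value

get("user:42")   # first call: miss, one database read
get("user:42")   # second call: hit, no database read
```

After the two calls, the database has been queried only once; every later request for the same key is served from memory, which is where the latency win comes from.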
Pricing
Amazon ElastiCache pricing is based on the node types, backup storage, and data transfer.
When to use Amazon ElastiCache
- it is best used when you need microsecond latency, key-based queries, and specialized data structures
- for use cases like leaderboards and real-time caching
- when the same data is needed on every page load or request
Official Resources
- Amazon ElastiCache
- Amazon ElastiCache for Redis Documentation
- Amazon ElastiCache for Memcached Documentation
- Amazon ElastiCache Blog
Other Resources
- AWS re:Invent 2019: Supercharge your real-time apps with Amazon ElastiCache (DAT208) by Pratibha Suryadevara
- AWS re:Invent 2018: ElastiCache Deep Dive: Design Patterns for In-Memory Data Stores (DAT302-R1) by Michael Labib
- AWS Certified Solutions Architect — Associate 2020 (8:38:40)
Amazon Timestream
Amazon Timestream is a serverless time-series database for IoT and operational applications. Time-series data is data recorded over a period of time, such as stock prices or the temperature of a device. Amazon Timestream can store and analyze trillions of events per day, up to 1,000 times faster and at as little as 1/10th the cost of relational databases. One major advantage of Amazon Timestream is its ability to move historical data to low-cost storage (the magnetic store) while retaining recent, hot data in memory (the SSD store); queries can run on both historical and recent data. In addition, Amazon Timestream has built-in time-series analytics functions such as smoothing, approximation, and interpolation, which help in detecting patterns in data. The major concepts in Amazon Timestream are record, dimension, measure, timestamp, table, and database. Records cannot be deleted or updated.
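One of the analytics functions mentioned above, smoothing, is easy to illustrate. The sketch below applies a simple moving average to a made-up series of device temperatures; Timestream exposes smoothing and interpolation as built-in query functions, and this plain-Python version just shows the idea.

```python
def moving_average(series, window):
    """Smooth a series: average each value with up to `window - 1` predecessors."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

temps = [20.0, 22.0, 30.0, 24.0]            # made-up device temperatures
smoothed = moving_average(temps, window=2)  # the 30.0 spike is damped to 26.0
```

Smoothing like this damps one-off spikes so that genuine trends in the sensor data stand out, which is the pattern-detection use case the section describes.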
Pricing
Amazon Timestream pricing is based on writes, SSD store, magnetic store, data transfer and queries.
When to use Amazon Timestream
- for time-series data from IoT devices
- for collecting and analysing operational metrics
- for analytical applications
Sample code
Getting started with Amazon Timestream with Python
Official Resources
Other Resources
- Getting Started with Amazon Timestream by Tony Gibbs
- Deep Dive on Amazon Timestream by Tony Gibbs
Amazon Quantum Ledger Database (QLDB)
Amazon QLDB is a fully managed ledger database that provides a transparent, immutable, and cryptographically verifiable transaction log owned by a central trusted authority. It is used to track data changes in applications and can store audit logs. It uses a SQL-like query language called PartiQL, and it is serverless.
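The "cryptographically verifiable" property comes from hash chaining: each entry's hash covers the previous entry's hash, so editing any past record breaks every hash after it. The sketch below is a toy version of that idea in plain Python, not QLDB's actual journal format, and the transactions are made up.

```python
import hashlib
import json

def append(ledger, record):
    """Append a record whose hash chains to the previous entry."""
    prev = ledger[-1]["hash"] if ledger else "0" * 64
    payload = json.dumps(record, sort_keys=True) + prev
    ledger.append({"record": record,
                   "hash": hashlib.sha256(payload.encode()).hexdigest()})

def verify(ledger):
    """Recompute every hash in order; any edited record breaks the chain."""
    prev = "0" * 64
    for entry in ledger:
        payload = json.dumps(entry["record"], sort_keys=True) + prev
        if hashlib.sha256(payload.encode()).hexdigest() != entry["hash"]:
            return False
        prev = entry["hash"]
    return True

ledger = []
append(ledger, {"debit": 100, "account": "A"})
append(ledger, {"credit": 100, "account": "B"})
ok_before = verify(ledger)           # chain is intact
ledger[0]["record"]["debit"] = 999   # tamper with history
ok_after = verify(ledger)            # verification now fails
```

This is why a ledger database can prove that history has not been rewritten: tampering is not prevented at the storage level, it is made detectable.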
Pricing
Amazon Quantum Ledger Database (QLDB) pricing is based on data transfer, data storage, and I/O.
When to use Amazon QLDB
- best suited for economic and financial records
- for application data
- in finance, for tracking credit and debit transactions
- in HR systems, to track employee payroll, bonuses, benefits, and other details
- in manufacturing, to track product manufacturing history
Official Resources
- Amazon Quantum Ledger Database (QLDB)
- Amazon Quantum Ledger Database Documentation
- Amazon Quantum Ledger Database Blog
Other Resources
- AWS re:Invent 2019: Amazon QLDB: An engineer’s deep dive on why this is a game changer (DAT380) by Andrew Certain
- Building System of Record Applications with Amazon QLDB — AWS Online Tech Talks by Philip Simko and Michael Labib
Other resources on selecting the right database
- AWS re:Invent 2017: [REPEAT] Which Database to Use When? (DAT310-R) by Tony Petrossian and Ian Meyers
- Selecting the Right Database for Your Application by Joseph Idziorek
- Choosing The Right Database by Randall Hunt
Whoa, so many databases and terminologies; I am sure you need a break. I hope you now understand the different databases in AWS and when to use them, and that the linked resources give you a deeper dive.
Originally published at https://trojrobert.github.io on January 18, 2021.