Selecting the right database in Amazon Web Services (AWS)

Robert John
11 min read · Jan 18, 2021


Databases in AWS

Deciding on the database to use for our application or workload can be very tricky. Since I joined AWS Community Builders, I have spent at least an hour every day exploring AWS through use cases. Amazon Web Services (AWS) provides several options for databases, so it is easy to be confused about the right one to choose. This article documents what I learned and the resources I used to understand the various databases in AWS and how to decide when to use them. I hope it will be of value to you. I would like to have feedback on what you think I should add, remove, or improve as I continue exploring AWS and other cloud services.

There are many criteria that could help us select the right database in AWS. To make it easier, I summarize them into the following four criteria:

  1. Type of Data
  2. Size of Data
  3. Structure or Shape of Data
  4. Activities that will be done on the Data

Now that we have an idea of the criteria we can use to select the right database, let us dive into each of these databases. All databases in AWS share the following properties:

  • fully managed by AWS
  • scalable; that is, capacity can increase and decrease based on demand
  • highly available; that is, the databases are designed to keep running with minimal downtime

Amazon Relational Database Service (RDS)

Relational Databases in Amazon

Amazon RDS is not a database itself but a service used to set up, operate, and scale relational databases in the cloud. It enables us to provision Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, and SQL Server. It acts like an administrator for these databases: it automates failover, backups, restore, disaster recovery, access management, encryption, security, monitoring, and performance optimization. It has two major backup solutions, automated backups and manual snapshots. It supports a maximum of five replicas, which can be Multi-AZ (multi-Availability Zone) standby replicas, cross-Region replicas, or read replicas. However, resources aren't replicated across AWS Regions by default unless you configure this explicitly.
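As an illustration, here is a minimal sketch of provisioning an RDS instance programmatically with boto3, the AWS SDK for Python. The instance identifier, instance class, and credentials below are hypothetical placeholders.

```python
# A minimal sketch of provisioning an RDS MySQL instance with boto3.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

response = rds.create_db_instance(
    DBInstanceIdentifier="demo-mysql-db",    # hypothetical name
    Engine="mysql",                          # any of the six supported engines
    DBInstanceClass="db.t3.micro",
    AllocatedStorage=20,                     # storage in GiB
    MasterUsername="admin",
    MasterUserPassword="change-me-please",   # use AWS Secrets Manager in practice
    MultiAZ=True,                            # standby replica in another AZ
    BackupRetentionPeriod=7,                 # keep automated backups for 7 days
)
print(response["DBInstance"]["DBInstanceStatus"])
```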

When to use Amazon Relational Database Service (RDS)

  • If we need to use any of these six database engines: Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, or SQL Server.

Pricing

Amazon Relational Database Service (RDS) pricing depends on whether we use On-Demand or Reserved Instances.

Official Resources

Other Resources

Amazon Aurora

Amazon Aurora is compatible with MySQL and PostgreSQL; AWS states that it offers up to five times the throughput of standard MySQL and three times the throughput of standard PostgreSQL, at about one-tenth the cost of commercial databases. It has a maximum size of 128 TB. Amazon Aurora supports up to 15 Aurora Replicas. Aurora backup and failover are automatic. Amazon Aurora supports cross-Region replication. Aurora MySQL and Aurora PostgreSQL DB clusters are created using the Amazon Relational Database Service (RDS) console. Aurora Serverless gives Amazon Aurora the ability to automatically scale up, scale down, start up, and shut down (auto scaling).

When to use Amazon Aurora

Aurora Serverless is best used when you are building an application that is used infrequently, a new application, or an application with variable and unpredictable workloads.
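Below is a minimal sketch of what creating an Aurora Serverless (v1) cluster with boto3 could look like; the cluster identifier, credentials, and capacity range are hypothetical placeholders.

```python
# A minimal sketch of creating an Aurora Serverless (v1) cluster with boto3.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_cluster(
    DBClusterIdentifier="demo-aurora-serverless",
    Engine="aurora-mysql",                 # or "aurora-postgresql"
    EngineMode="serverless",
    MasterUsername="admin",
    MasterUserPassword="change-me-please",
    ScalingConfiguration={
        "MinCapacity": 1,                  # minimum Aurora Capacity Units (ACUs)
        "MaxCapacity": 8,                  # maximum ACUs
        "AutoPause": True,                 # pause the cluster when it is idle
        "SecondsUntilAutoPause": 300,
    },
)
```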

Parallel Query for Amazon Aurora

Pricing

Amazon Aurora pricing depends on whether we select the MySQL-compatible or the PostgreSQL-compatible edition. Aurora Serverless is charged per Aurora Capacity Unit (ACU).

Official Resources

Other Resources

Amazon Redshift

Amazon Redshift is a columnar data store used for data warehousing. It is used to analyze and quickly get insight from large datasets by executing complex queries on them. This data is usually historical data at rest. A Redshift cluster contains nodes and can run in single-node or multi-node mode. There are two types of nodes in Amazon Redshift, namely the leader node and compute nodes. The leader node exposes the SQL endpoint, stores metadata, and coordinates parallel SQL processing. Compute nodes store the data and execute the queries. Amazon Redshift stores data in a single Availability Zone. Amazon Redshift Spectrum is used to query Amazon Simple Storage Service (Amazon S3) directly. Amazon Redshift federated queries enable us to query and analyze live data across databases, data warehouses, and data lakes.
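As a sketch, here is how a query could be submitted through the Amazon Redshift Data API with boto3; the cluster identifier, database, user, and SQL statement are hypothetical.

```python
# A minimal sketch of running a query via the Amazon Redshift Data API.
import time
import boto3

client = boto3.client("redshift-data", region_name="us-east-1")

# Submit the query; the Data API runs it asynchronously.
submitted = client.execute_statement(
    ClusterIdentifier="demo-redshift-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="SELECT region, SUM(revenue) FROM sales GROUP BY region;",
)

# Poll until the statement finishes, then fetch the result set.
while True:
    status = client.describe_statement(Id=submitted["Id"])["Status"]
    if status in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if status == "FINISHED":
    rows = client.get_statement_result(Id=submitted["Id"])["Records"]
    print(rows)
```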

Amazon Redshift Architecture

When to use Amazon Redshift

  • for Online Analytical Processing (OLAP)
  • if we need to run queries across multiple data sources. For instance, we can copy data from sources like Amazon EMR and Amazon S3 into Amazon Redshift.
  • Amazon Redshift is suitable for generating reports for business intelligence

Pricing

Amazon Redshift pricing — the basic price for Amazon Redshift starts from $0.25 per hour. Several other features can influence the price, such as Amazon Redshift Spectrum pricing, Concurrency Scaling pricing, Redshift managed storage pricing, and Redshift ML pricing.

AWS Redshift Data lake Integration

Official Resources

Other Resources

DynamoDB

DynamoDB is a NoSQL key/value and document database; that is, it supports both document and key/value data structures. DynamoDB's major components are tables, items, attributes, keys, and values. A table is a collection of items, and an item is a collection of attributes. Items are similar to rows, while attributes are similar to columns in a traditional database. A key is used to identify an item, and the value is the data itself. The major API components in DynamoDB are the control plane, the data plane, DynamoDB Streams, and transactions. On-Demand and Provisioned are the read/write capacity modes in DynamoDB. Amazon DynamoDB lets us specify provisioned capacity in Read Capacity Units (RCUs) and Write Capacity Units (WCUs). Amazon DynamoDB creates partitions based on size, Read Capacity Units, and Write Capacity Units: a partition holds up to 10 GB of data, 3,000 RCUs, and 1,000 WCUs.

DynamoDB encrypts data at rest, that is, inactive data that is not moving from one device to another or from one network to another. DynamoDB has a point-in-time recovery feature, meaning we can restore our data to any point in time. Amazon DynamoDB Accelerator (DAX) provides a managed write-through cache for DynamoDB, reducing response times from milliseconds to microseconds. Amazon DynamoDB uses SSD storage and stores its data across three different Availability Zones.
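To make the table/item/attribute vocabulary concrete, here is a minimal boto3 sketch that creates a table and writes and reads one item; the table name and attributes are hypothetical.

```python
# A minimal sketch of creating a DynamoDB table and reading/writing an item.
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

# Create a table with a partition key; on-demand mode avoids provisioning RCUs/WCUs.
table = dynamodb.create_table(
    TableName="UserProfiles",
    KeySchema=[{"AttributeName": "user_id", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "user_id", "AttributeType": "S"}],
    BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()

# An item is a collection of attributes identified by its key.
table.put_item(Item={"user_id": "u-123", "name": "Ada", "clicks": 42})
item = table.get_item(Key={"user_id": "u-123"})["Item"]
print(item)
```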

Data in Amazon DynamoDB

When to use Amazon DynamoDB

  • for Online Transaction Processing (OLTP).
  • to store real-time data from IoT devices.
  • to store activities and events on a web application, such as clicks.
  • to store items in a web application, like user profiles and user events, used by advertising, gaming, retail, finance, and media companies.
  • for data that requires a high request rate (millions of requests per second).
  • for situations that require high consistency.

Pricing

Amazon DynamoDB pricing depends on whether we use on-demand capacity mode or provisioned capacity mode.

Hands-On

Creating Tables and Loading Data

Sample Code

Create a ToDo Web App Storing your data in Amazon DynamoDB

Official Resources

Other Resources

Amazon DocumentDB

This is a document database with MongoDB compatibility. It can easily store, query, and index JSON data. It supports up to 15 read replicas and scales vertically with very low impact. It has a flexible schema and ad hoc query capability. It is easy to index and can be used for both operational and analytical workloads.
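Because Amazon DocumentDB is MongoDB-compatible, standard MongoDB drivers can be used against it. Below is a minimal sketch using pymongo; the cluster endpoint, credentials, and CA bundle path are hypothetical placeholders.

```python
# A minimal sketch of connecting to an Amazon DocumentDB cluster with pymongo.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://myuser:mypassword@"
    "demo-docdb.cluster-xxxxxxxx.us-east-1.docdb.amazonaws.com:27017/"
    "?tls=true&tlsCAFile=global-bundle.pem&retryWrites=false"
)

db = client["catalog"]

# Store and query flexible JSON-like documents.
db.products.insert_one({"sku": "A-100", "name": "Keyboard", "tags": ["office", "usb"]})
print(db.products.find_one({"sku": "A-100"}))
```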

Use case of Amazon DocumentDB

Pricing

Amazon DocumentDB pricing is based on On-demand instances, database I/O, database storage and backup storage.

When to use Amazon DocumentDB

  • it is Amazon's MongoDB-compatible service; use it when you need to run MongoDB workloads at scale
  • best for profile management, content management, and catalog management.

Official Resources

Other Resources

DynamoDB vs AWS DocumentDB vs MongoDB

Amazon Neptune

Amazon Neptune is a graph database; it works with highly connected datasets. It checks for relations or similarities in data, for instance, the similarity between the movies a user watches on Netflix. Its components are nodes (data entities), edges (relationships), and properties. Amazon Neptune supports the property graph model with the open-source Apache TinkerPop Gremlin query language, as well as the Resource Description Framework (RDF) with SPARQL. Amazon Neptune replicates data six times across three Availability Zones. Amazon Neptune Streams can be used to capture changes in a graph.
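As a sketch of the property graph model, here is how vertices, an edge, and a traversal could look with the open-source gremlinpython driver; the Neptune endpoint is a hypothetical placeholder.

```python
# A minimal sketch of traversing a Neptune property graph with gremlinpython.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection(
    "wss://demo-neptune.cluster-xxxxxxxx.us-east-1.neptune.amazonaws.com:8182/gremlin",
    "g",
)
g = traversal().withRemote(conn)

# Nodes (vertices) are data entities, edges are relationships.
user = g.addV("user").property("name", "alice").next()
movie = g.addV("movie").property("title", "Inception").next()
g.V(user).addE("watched").to(__.V(movie)).next()

# Find everything alice has watched.
titles = g.V().has("user", "name", "alice").out("watched").values("title").toList()
print(titles)

conn.close()
```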

Knowledge graph use case diagram

Pricing

Amazon Neptune pricing is influenced by On-demand instances, database I/O, database storage, backup storage and data transfer.

When to use Amazon Neptune

  • Amazon Neptune is best used when we have relationships in the data.
  • for recommendation engines, fraud detection, and drug discovery.
  • for knowledge base applications such as Wikidata.
  • for identity graphs that show a unified view of customers and prospects based on their interactions with a product or a website.
  • for social networking applications to store user profiles and interactions.

Official Resources

Other Resources

Amazon ElastiCache

Amazon ElastiCache is used to manage in-memory caches. Caching is storing data in a temporary storage area. The data is stored in RAM, which is volatile, meaning the data can be lost easily but can be accessed very fast. ElastiCache stores frequently accessed data to improve performance, which reduces application latency and increases throughput. It caches data from the database, which is different from Amazon CloudFront (a content delivery network): Amazon ElastiCache keeps important, frequently used data in memory, whereas Amazon CloudFront stores static files, for example, HTML, audio, video, and media files required by a web app. Amazon ElastiCache can only be accessed by resources in the same VPC.


Amazon ElastiCache has two engines:

  1. Amazon ElastiCache for Redis
  2. Amazon ElastiCache for Memcached.
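As an illustration of how an application typically uses ElastiCache, here is a minimal cache-aside sketch with the Redis engine and the redis-py client; the endpoint and the load_user_from_db helper are hypothetical placeholders.

```python
# A minimal cache-aside sketch against an ElastiCache for Redis endpoint.
import json
import redis

# Hypothetical primary endpoint of an ElastiCache for Redis cluster.
cache = redis.Redis(host="demo-cache.xxxxxx.0001.use1.cache.amazonaws.com", port=6379)

def load_user_from_db(user_id):
    # Placeholder for the real database lookup (e.g. RDS or DynamoDB).
    return {"user_id": user_id, "name": "Ada"}

def get_user(user_id):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)              # cache hit: served from memory
    user = load_user_from_db(user_id)          # cache miss: go to the database
    cache.set(key, json.dumps(user), ex=300)   # keep the entry for 5 minutes
    return user

print(get_user("u-123"))
```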

Pricing

Amazon ElastiCache pricing is based on the node types, backup storage, and data transfer.

When to use Amazon ElastiCache

  • it is best used when you need microsecond latency, key-based queries, and specialized data structures.
  • for situations like leaderboards and real-time caching
  • if the same data is needed on every page load or every request.

Official Resources

Other Resources

Amazon Timestream

Amazon Timestream is a serverless time-series database for IoT and operational applications. Time-series data is recorded over a period of time, such as stock prices or the temperature of a device. Amazon Timestream can be used to store and analyze trillions of events per day, up to 1,000 times faster and at as little as 1/10th the cost of relational databases. One major advantage of Amazon Timestream is its ability to move historical data to low-cost storage (the magnetic store) while retaining recent data (hot data) in the memory store, and queries can be run on both historical and recent data. In addition, Amazon Timestream has built-in time-series analytics functions such as smoothing, approximation, and interpolation, which help in detecting patterns in data. The major concepts in Amazon Timestream are records, dimensions, measures, timestamps, tables, and databases. Records cannot be deleted or updated.
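To make the record/dimension/measure vocabulary concrete, here is a minimal sketch of writing one record with boto3's timestream-write client; the database, table, and dimension values are hypothetical.

```python
# A minimal sketch of writing one time-series record to Amazon Timestream.
import time
import boto3

writer = boto3.client("timestream-write", region_name="us-east-1")

writer.write_records(
    DatabaseName="iot",
    TableName="sensor_readings",
    Records=[
        {
            "Dimensions": [{"Name": "device_id", "Value": "sensor-42"}],
            "MeasureName": "temperature",
            "MeasureValue": "22.5",
            "MeasureValueType": "DOUBLE",
            "Time": str(int(time.time() * 1000)),  # timestamp in milliseconds
        }
    ],
)
```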

Pricing

Amazon Timestream pricing is based on writes, memory store, magnetic store, data transfer, and queries.

When to use Amazon Timestream

  • for time-series data from IoT devices
  • for collecting and analyzing operational metrics
  • for analytical applications

Sample code

Getting started with Amazon Timestream with Python

Official Resources

Other Resources

Amazon Quantum Ledger Database (QLDB)

Amazon QLDB is a fully managed ledger database that provides a transparent, immutable, and cryptographically verifiable transaction log owned by a central trusted authority. It is used to track data changes in applications and can be used for storing audit logs. It uses a SQL-like query language called PartiQL. It is immutable, transparent, cryptographically verifiable, and serverless.
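As a sketch of working with PartiQL transactions, here is what creating a table, inserting a document, and querying it could look like with the pyqldb driver for Python. The ledger and table names are hypothetical, and this assumes the driver accepts native Python values as statement parameters.

```python
# A minimal sketch of writing to and reading from a QLDB ledger with pyqldb.
from pyqldb.driver.qldb_driver import QldbDriver

driver = QldbDriver(ledger_name="demo-ledger")  # hypothetical ledger

# Create a table in one transaction, then insert a document in another.
driver.execute_lambda(lambda txn: txn.execute_statement("CREATE TABLE Payments"))
driver.execute_lambda(
    lambda txn: txn.execute_statement(
        "INSERT INTO Payments ?", {"employee": "e-7", "amount": 1200, "type": "bonus"}
    )
)

# Query the current state with PartiQL; materialize the cursor inside the transaction.
rows = driver.execute_lambda(
    lambda txn: list(
        txn.execute_statement("SELECT * FROM Payments WHERE employee = ?", "e-7")
    )
)
print(rows)
```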

Pricing

Amazon Quantum Ledger Database (QLDB) pricing is based on data transfer, data storage, and I/O.

When to use Amazon QLDB

  • best suited for economic and financial records
  • for application data
  • used in finance to track credit and debit transactions
  • for HR systems to track employee payroll, bonuses, benefits, and other details
  • for manufacturing to track product manufacturing history

Official Resources

Other Resources

Other resources on selecting the right database

Whoa, so many databases and terminologies. I am sure you need a break. I hope you now understand the different databases in AWS and when to use them, and that the linked resources give you a deeper dive.

Originally published at https://trojrobert.github.io on January 18, 2021.
