Comparison of Various Amazon Data Processing Services – Amazon Athena vs Amazon EMR vs Amazon Redshift vs Amazon Kinesis vs Amazon SageMaker vs Amazon Elasticsearch
Service | Cost | Performance | Scalability and elasticity |
Amazon Athena | Pay only for the queries you run. Pricing is based on the amount of data each query scans from Amazon S3, billed per TB scanned. You can save significantly on per-query costs and get better performance by compressing, partitioning, and converting your data into columnar formats, which lets Athena read only the columns it needs for the query (see the query sketch after the table). | You can improve query performance by compressing, partitioning, and converting your data into columnar formats, so Athena scans less data from Amazon S3 when executing your query. Amazon Athena supports open-source columnar data formats such as Apache Parquet and Apache ORC. | Athena is serverless, so there is no infrastructure to set up or manage, and it scales automatically as needed. |
Amazon Elasticsearch | Pay only for what you use. You are charged for Amazon ES instance hours, Amazon EBS storage (if you choose this option), and standard data transfer fees. | Performance depends on multiple factors, including: instance type, workload, index, number of shards used, read replica configuration, and storage configuration (instance or EBS). Indexes are made up of shards of data, which can be distributed across instances in multiple Availability Zones. If zone awareness is enabled, Amazon ES maintains the read replicas of the shards in a different Availability Zone. A search engine makes heavy use of storage devices, so faster disks yield faster query and search performance. Amazon ES can store indexes on fast SSD instance storage or on multiple EBS volumes. | You can add or remove instances, and modify Amazon EBS volumes, to accommodate data growth. The Sizing Amazon ES Domains page of the developer guide provides general recommendations for calculating what you need. You can write code that monitors your domain through CloudWatch metrics and calls the Amazon ES service API to scale up and down based on thresholds you set (see the scaling sketch after the table). The service executes the scaling without any downtime. |
Amazon EMR | You pay only for the hours the cluster is up. You can launch a persistent cluster that stays up indefinitely or a temporary cluster that terminates after the analysis is complete. Amazon EMR supports a variety of Amazon EC2 instance types and all Amazon EC2 pricing options. | Amazon EMR performance is driven by the type and number of EC2 instances you choose to run your analytics. Consider your processing requirements: sufficient memory, storage, and processing power. For best performance, launch the cluster in the same region as your data, and use that region for all of the AWS resources that will be used with the cluster. For low-latency workloads that need to run in close proximity to on-premises resources, consider Amazon EMR on AWS Outposts. Scale back debugging once development is finished and your data processing application is in full production, to save on log costs and reduce the processing load on the cluster. | You can resize your cluster, adding instances for peak workloads and removing them to control costs when peaks subside (see the cluster-resize sketch after the table). You can add core nodes, which hold the Hadoop Distributed File System (HDFS), at any time to increase processing power as well as HDFS storage capacity and throughput. You can also add and remove task nodes at any time; these process Hadoop jobs but do not maintain HDFS. You can decouple memory and compute from storage by using Amazon S3 via EMRFS along with, or instead of, local HDFS, which provides greater flexibility and cost efficiency. |
Amazon Kinesis | Kinesis Data Streams pricing: There are just two core pricing components, an hourly charge per shard and a charge per 1 million PUT transactions. Enhanced fan-out and extended retention periods incur additional costs. Kinesis Data Firehose pricing: Pricing is based on the volume of data ingested, calculated as (the number of data records you send to the service) × (the size of each record, rounded up to the nearest 5 KB). If you configure your delivery stream to convert the incoming data into Apache Parquet or Apache ORC format before delivery to destinations, format conversion charges apply based on the volume of incoming data. Kinesis Data Analytics pricing: You are charged an hourly rate based on the average number of Kinesis Processing Units (KPUs) used to run your stream processing application. A single KPU is a unit of stream processing capacity comprising 1 vCPU of compute and 4 GB of memory. Java applications are charged one additional KPU per application for application orchestration, plus charges for running application storage and durable application backups. Kinesis Video Streams pricing: You pay only for the volume of data you ingest, store, and consume through the service. Optional capabilities incur additional charges. | Kinesis Data Streams performance: Choose throughput capacity in terms of shards. The enhanced fan-out option can improve performance by increasing the throughput available to each individual consumer. Kinesis Data Firehose performance: Specify a batch size or batch interval, and data compression, to control how quickly data is uploaded to destinations. Amazon Kinesis Data Firehose exposes several metrics through the console and Amazon CloudWatch, including the volume of data submitted, the volume of data uploaded to a destination, the time from source to destination, and the upload success rate. Kinesis Data Analytics performance: Amazon Kinesis Data Analytics elastically scales your application to accommodate the data throughput of your source stream and your query complexity for most scenarios. It provisions capacity in the form of KPUs; a single KPU provides memory with corresponding compute and networking. If your source stream's throughput exceeds that of a single in-application input stream, use the InputParallelism parameter to explicitly increase the number of in-application input streams your application uses. | All of the Amazon Kinesis services are designed to handle any amount of streaming data and to process data from hundreds of thousands of sources with very low latency. Kinesis Data Streams: The initial scale is set by the number of shards you select for the stream. You can increase or decrease stream capacity at any time, and use API calls or development tools to automate scaling (see the shard-count sketch after the table). Kinesis Data Firehose: Streams automatically scale up and down based on the data rate you specify for the stream. Kinesis Data Analytics: Set up your application for future scaling needs by proactively increasing the number of in-application input streams from the default (one). Use multiple streams and Kinesis Data Analytics for SQL Applications if your application has scaling needs beyond 100 MB/second; use Kinesis Data Analytics for Java Applications if you want to use a single stream and application. Kinesis Video Streams: Automatically provisions and elastically scales to millions of devices, and scales down when devices are not transmitting video. |
Amazon Redshift | No long-term commitments or upfront costs. Charges are based on the size and number of nodes in your cluster. There is no additional charge for backup storage up to 100% of your provisioned storage; backup storage beyond the provisioned size, and backups retained after your cluster is terminated, are billed at standard Amazon S3 rates. There is no data transfer charge for communication between Amazon S3 and Amazon Redshift. | Amazon Redshift uses a variety of innovations to obtain very high performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. Factors that can affect query performance include: data characteristics, cluster configuration, and database operations. | You can easily change the number or type of nodes in your data warehouse as your performance or capacity needs change. Use elastic resize to scale the cluster by changing the number of nodes, or classic resize to scale it by specifying a different node type (see the elastic-resize sketch after the table). While resizing, Amazon Redshift places your existing cluster into read-only mode, provisions a new cluster of your chosen size, and copies data from the old cluster to the new one in parallel. During this process, you pay only for the active Amazon Redshift cluster, and you can continue running read queries against your old cluster while the new one is being provisioned. After your data has been copied, Amazon Redshift automatically redirects queries to the new cluster and removes the old one. |
Amazon SageMaker | Pay only for what you use. Building, training, and deploying ML models is billed by the second, with no minimums. Pricing is broken down by on-demand ML instances, ML storage, and fees for data processing in hosting instances. For training your ML models, you can use Amazon EC2 Spot instances via Managed Spot Training, which can reduce the cost of training by up to 90%. Once a Managed Spot Training job completes, the cost savings are the percentage difference between the duration the training job ran and the duration you were billed for. | Amazon SageMaker is a fully managed service designed for high availability, with no maintenance windows or scheduled downtimes. The Amazon SageMaker API runs in Amazon's proven, high-availability data centers, with the service stack replicated across three facilities in each AWS region to provide fault tolerance in the event of a server failure or Availability Zone outage. You can deploy your model with one click onto auto-scaling ML instances across multiple Availability Zones for high redundancy. | Amazon SageMaker hosting automatically scales to the performance your application needs using Application Auto Scaling (see the autoscaling sketch after the table). |
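
To make the Athena row concrete, here is a minimal sketch of running a query with the AWS SDK for Python (boto3). The database, table, and results bucket are hypothetical placeholders. Because Athena is serverless, the client only starts the execution and polls for its status; DataScannedInBytes in the result statistics is the quantity Athena's per-TB pricing applies to.

```python
import time
import boto3

athena = boto3.client("athena")

# Start a query; Athena bills per query based on bytes scanned in S3.
# "sales_db", the "sales" table, and the results bucket are placeholders.
query = athena.start_query_execution(
    QueryString="SELECT region, SUM(amount) FROM sales GROUP BY region",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)

# Poll until the serverless execution finishes; there is no cluster to manage.
while True:
    status = athena.get_query_execution(QueryExecutionId=query["QueryExecutionId"])
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# The scanned volume is what the per-TB pricing is applied to.
print(state, status["QueryExecution"]["Statistics"]["DataScannedInBytes"])
```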
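For the Amazon Elasticsearch row, a sketch of the monitor-and-scale loop the table describes, assuming a hypothetical domain name, account ID, and free-storage threshold: read a CloudWatch metric published by Amazon ES, and add a data node through the service API when it crosses the threshold.

```python
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")
es = boto3.client("es")

DOMAIN = "my-domain"         # hypothetical domain name
ACCOUNT_ID = "123456789012"  # hypothetical AWS account ID (the ClientId dimension)

# Amazon ES publishes domain metrics to CloudWatch under the AWS/ES namespace.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ES",
    MetricName="FreeStorageSpace",  # reported in megabytes
    Dimensions=[{"Name": "DomainName", "Value": DOMAIN},
                {"Name": "ClientId", "Value": ACCOUNT_ID}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Minimum"],
)
datapoints = stats["Datapoints"]

# Assumed threshold: scale out when free storage drops below ~10 GB.
if datapoints and min(p["Minimum"] for p in datapoints) < 10_000:
    current = es.describe_elasticsearch_domain(DomainName=DOMAIN)
    count = current["DomainStatus"]["ElasticsearchClusterConfig"]["InstanceCount"]
    # The service applies the new configuration without downtime.
    es.update_elasticsearch_domain_config(
        DomainName=DOMAIN,
        ElasticsearchClusterConfig={"InstanceCount": count + 1},
    )
```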
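For Amazon EMR, a sketch of the cluster resize described in the table, assuming a hypothetical cluster ID. It targets the TASK instance group because, as the table notes, task nodes do not maintain HDFS and can be added or removed freely around peak workloads.

```python
import boto3

emr = boto3.client("emr")

CLUSTER_ID = "j-XXXXXXXXXXXXX"  # hypothetical cluster ID

# Find the TASK instance group (processes jobs but holds no HDFS data).
groups = emr.list_instance_groups(ClusterId=CLUSTER_ID)["InstanceGroups"]
task_group = next((g for g in groups if g["InstanceGroupType"] == "TASK"), None)

# Resize the group; EMR provisions or releases EC2 instances accordingly.
if task_group:
    emr.modify_instance_groups(
        ClusterId=CLUSTER_ID,
        InstanceGroups=[{"InstanceGroupId": task_group["Id"],
                         "InstanceCount": 10}],  # assumed target size
    )
```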
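For Kinesis Data Streams, a sketch of the shard-count scaling the table mentions, with a hypothetical stream name and target size: shards are both the throughput unit and the hourly pricing unit, so scaling a stream means changing its shard count.

```python
import boto3

kinesis = boto3.client("kinesis")

STREAM = "clickstream"  # hypothetical stream name

# Each shard provides fixed capacity and is billed per shard-hour,
# so scaling the stream means changing the shard count.
kinesis.update_shard_count(
    StreamName=STREAM,
    TargetShardCount=4,  # assumed target
    ScalingType="UNIFORM_SCALING",
)

# Producers write records with PUT calls, the other pricing component.
kinesis.put_record(
    StreamName=STREAM,
    Data=b'{"page": "/home"}',
    PartitionKey="user-42",
)
```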
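For Amazon Redshift, a sketch of an elastic resize, assuming a hypothetical cluster identifier and node count; setting Classic=True (together with a node type) would request a classic resize instead, matching the two options described in the table.

```python
import boto3

redshift = boto3.client("redshift")

# Elastic resize: change the node count of a running cluster.
# "analytics-cluster" and the node count are hypothetical placeholders.
redshift.resize_cluster(
    ClusterIdentifier="analytics-cluster",
    NumberOfNodes=4,
    Classic=False,  # True (with NodeType/ClusterType) requests a classic resize
)

# The cluster status reflects the operation, e.g. "resizing".
status = redshift.describe_clusters(ClusterIdentifier="analytics-cluster")
print(status["Clusters"][0]["ClusterStatus"])
```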
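Finally, for Amazon SageMaker hosting, a sketch of the Application Auto Scaling setup the table refers to, with hypothetical endpoint and variant names and an assumed target of 1,000 invocations per instance: register the endpoint variant as a scalable target, then attach a target-tracking policy.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical endpoint and production variant names.
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

# Register the variant's instance count as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Track invocations per instance; Application Auto Scaling adds or
# removes ML instances to hold the metric near the target value.
autoscaling.put_scaling_policy(
    PolicyName="invocations-target",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,  # assumed target
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```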