Cloud Streaming Analytics Platforms Comparison across AWS, Azure, and GCP
In this blog post, we compare the cloud streaming analytics platforms available on AWS, Azure, and GCP.
What is Streaming Data and Analytics?
Streaming data is data generated continuously by thousands of data sources, which typically send records simultaneously and in small sizes (on the order of kilobytes). Streaming analytics is the process of ingesting, analyzing, and acting on that data in real time to quickly identify patterns and automate actions.
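To make the "ingest, analyze, act" loop concrete, here is a minimal sketch in plain Python. It does not use any cloud service: the `detect_spikes` function, the window size, the threshold, and the sample sensor readings are all hypothetical and chosen only for illustration. Each record is handled as it arrives, a rolling mean over the most recent readings stands in for the "pattern", and recording an alert stands in for the automated action.

```python
from collections import deque


def detect_spikes(readings, window_size=3, threshold=30.0):
    """Return (index, rolling_mean) pairs where the rolling mean exceeds threshold.

    Hypothetical helper: the window size and threshold are illustrative values.
    """
    window = deque(maxlen=window_size)  # keeps only the most recent readings
    alerts = []
    for index, value in enumerate(readings):
        window.append(value)                    # ingest: accept the new record
        rolling_mean = sum(window) / len(window)  # analyze: summarize recent data
        if len(window) == window_size and rolling_mean > threshold:
            alerts.append((index, rolling_mean))  # act: record an alert
    return alerts


# Simulated sensor readings arriving one at a time
readings = [20.0, 22.0, 21.0, 35.0, 40.0, 42.0, 25.0]
alerts = detect_spikes(readings)
print(alerts)
```

In a real deployment the loop body would run inside a managed stream processor rather than a local `for` loop, but the per-record structure is the same.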
Streaming Analytics Workflow
The following is a high-level streaming analytics workflow:
- Data sources can be mobile apps, application logs, clickstream data, IoT sensors, or smart devices
- Streaming ingestion can be done through cloud vendor product clients or third-party tools
- Stream storage and processing can be handled by multiple cloud vendor products
- Destinations can be data lakes, data warehouses, or databases
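The four stages above can be sketched as a chain of Python generators. This is only an in-memory illustration under stated assumptions: the function names (`data_source`, `ingest`, `process`, `deliver`), the sample clickstream events, and the count-based window of two events are all hypothetical, and a plain list stands in for the data lake. No vendor SDK is involved.

```python
import json
from collections import Counter


def data_source():
    """Stage 1: hypothetical clickstream events emitted as small JSON records."""
    events = [
        {"user": "u1", "page": "/home"},
        {"user": "u2", "page": "/cart"},
        {"user": "u1", "page": "/cart"},
        {"user": "u3", "page": "/home"},
        {"user": "u2", "page": "/home"},
    ]
    for event in events:
        yield json.dumps(event)


def ingest(raw_records):
    """Stage 2: parse each record as it arrives."""
    for raw in raw_records:
        yield json.loads(raw)


def process(events, window_size=2):
    """Stage 3: count page views per tumbling window of window_size events."""
    window = []
    for event in events:
        window.append(event["page"])
        if len(window) == window_size:
            yield dict(Counter(window))
            window = []
    if window:  # flush the final, partially filled window
        yield dict(Counter(window))


def deliver(results, destination):
    """Stage 4: append each result to the destination (a list stands in for a data lake)."""
    for result in results:
        destination.append(result)
    return destination


data_lake = deliver(process(ingest(data_source())), [])
print(data_lake)
```

Because each stage is a generator, records flow through one at a time instead of being collected into a batch first, which is the main property that separates stream processing from batch processing.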
Cloud Streaming Analytics Platforms Comparison
The following table shows the cloud streaming analytics platforms available from each vendor:
Vendor | Stream Ingestion & Storage | Stream Processing & Analytics | Stream Destination |
AWS | Amazon Kinesis Data Streams; Amazon Kinesis Data Firehose; Amazon Managed Streaming for Apache Kafka (Amazon MSK) | Amazon EMR (Spark Streaming); AWS Lambda; AWS Glue streaming; Amazon Kinesis Data Analytics for SQL; Amazon Kinesis Data Analytics for Apache Flink | Amazon S3; Amazon Redshift; Amazon OpenSearch Service (formerly Amazon Elasticsearch Service) |
Azure | Azure Event Hubs; Azure HDInsight (Apache Kafka); Azure IoT Hub | Azure Stream Analytics; Azure HDInsight (Apache Kafka, Apache Spark Streaming); Azure Databricks (Spark Streaming); Azure Functions | Azure Synapse Analytics; Azure Blob Storage; Azure SQL |
GCP | Datastream; Google Cloud Pub/Sub | Dataflow; Dataproc; Cloud Functions | BigQuery |
The following is a high-level overview of selected platform services:
- Amazon Kinesis Data Streams is a fully managed, serverless data streaming service that ingests and stores streaming data in real time at any scale.
- Amazon Kinesis Data Firehose is an extract, transform, and load (ETL) service that reliably captures, transforms, and delivers streaming data to data lakes, data stores, and analytics services.
- Amazon MSK is a fully managed, secure, and highly available Apache Kafka service that makes it easy to ingest and process streaming data in real time.
- Azure Event Hubs is a fully managed, real-time data ingestion service that’s simple, trusted, and scalable.
- Azure HDInsight is a cloud distribution of Hadoop components, including Apache Kafka and Apache Spark.
- Azure Stream Analytics is a fully managed, real-time analytics service designed to help you analyze and process fast moving streams of data that can be used to get insights, build reports or trigger alerts and actions.
- GCP Dataflow is a unified stream and batch data processing service that’s serverless, fast, and cost-effective.
- GCP Dataproc is a fully managed and highly scalable service for running Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks.