AWS Presto


Overview of AWS Presto on Amazon EMR

What is Amazon EMR?

Amazon EMR is a cloud big data platform for large-scale data processing, interactive SQL queries, and machine learning (ML) applications. It runs widely used open-source frameworks such as Apache Spark, Presto, Hadoop, Hive, Trino, HBase, and Flink.

To find out more about Amazon EMR, check out our post: http://www.cloudinfonow.com/amazon-emr-eks-serverless/

EMR Presto

Presto is a fast SQL query engine designed for interactive analytic queries over large datasets from multiple sources. Presto originated at Facebook and was later open-sourced. It is an in-memory SQL engine for ad hoc analysis across multiple data sources.

Running Presto on Amazon EMR is a popular choice because Amazon EMR provides the latest, stable, open-source community Presto innovations and Amazon EMR platform-level optimizations for Presto workloads.

Presto is highly SQL compatible and supports standard ANSI SQL, including complex queries, aggregations, joins, and window functions. Presto is not a database; it is a query engine. It can integrate with a wide variety of data stores and data formats, such as ORC, Parquet, Avro, and CSV files stored in HDFS or in object stores such as Amazon S3, as well as NoSQL data stores.

Following are some of the key features of Presto:

  1. Presto is a query-only engine: it separates compute from storage and relies on connectors to reach various data sources.
  2. Presto has an efficient cost-based optimizer to improve query performance.
  3. Presto supports the majority of ANSI SQL features, such as joins, subqueries, and scalar functions.
  4. Presto integrates with most BI tools (Tableau, Alteryx, Business Objects, Looker) through JDBC/ODBC drivers.
  5. Presto can be accessed from SQL tools such as SQL Workbench, DBeaver, and Airpal.
  6. Presto can use the Hive Metastore as its data catalog.
  7. Presto supports LDAP-based user authentication.
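BI and SQL tools that connect over JDBC need a connection URL. The sketch below builds one in the `jdbc:presto://host:port/catalog/schema` form used by the Presto JDBC driver. The host name is hypothetical, and the default port of 8889 is an assumption about how Presto is exposed on an EMR cluster; check the port your cluster actually uses.

```python
def presto_jdbc_url(host, port=8889, catalog=None, schema=None):
    """Build a Presto JDBC URL of the form jdbc:presto://host:port/catalog/schema.

    The default port (8889) is an assumption for Presto on EMR; open-source
    Presto installs often listen on a different port.
    """
    if schema and not catalog:
        # A schema only makes sense inside a catalog.
        raise ValueError("a schema requires a catalog")
    url = f"jdbc:presto://{host}:{port}"
    if catalog:
        url += f"/{catalog}"
    if schema:
        url += f"/{schema}"
    return url


if __name__ == "__main__":
    # Hypothetical master-node host name, for illustration only.
    print(presto_jdbc_url("emr-master.example.internal", catalog="hive", schema="default"))
```

The catalog and schema are optional in the URL; when they are omitted, queries need fully qualified table names.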

Following are some of the limitations of Presto:

  1. Presto does not support INSERT OVERWRITE statements. Make sure to delete the table before running INSERT INTO.
  2. Presto does not support TRUNCATE on a table. Use a DELETE statement instead.
  3. Presto does not support views defined in Hive. Define the same view in Presto and use that instead.
  4. Presto performance degrades as the number of joins increases.
  5. Presto should not be used for batch analytics. It is recommended for interactive query analytics.
  6. Presto is not recommended for data warehouses with dimensional modeling schemas.
  7. Presto frequently throws JVM errors on inefficient queries and needs heavy JVM tuning to resolve them.
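The first two limitations suggest a simple workaround: clear the target with DELETE, then load it with INSERT INTO. The sketch below only generates the two statements. The function name and the table names are hypothetical, and whether an unqualified DELETE is accepted depends on the connector and table layout, so treat it as an illustration and not a guaranteed recipe.

```python
import re

# Accept plain or dotted identifiers (table, schema.table, catalog.schema.table).
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*){0,2}$")


def overwrite_statements(table, select_sql):
    """Return the statements that emulate INSERT OVERWRITE: DELETE, then INSERT INTO."""
    if not _IDENTIFIER.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    # Drop surrounding whitespace and any trailing semicolon from the SELECT.
    select_sql = select_sql.strip().rstrip(";").strip()
    return [
        f"DELETE FROM {table}",
        f"INSERT INTO {table} {select_sql}",
    ]


if __name__ == "__main__":
    # Hypothetical table names, for illustration only.
    for statement in overwrite_statements(
        "hive.sales.daily_totals",
        "SELECT order_date, sum(amount) FROM hive.sales.orders GROUP BY order_date;",
    ):
        print(statement)
```

The two statements are not atomic, so readers of the table may briefly see it empty between the DELETE and the INSERT.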

EMR Presto Architecture

The following diagram shows the high-level architecture of EMR Presto.

[Figure: EMR Presto Architecture]

Amazon EMR now includes EMR runtime for Presto, a performance-optimized runtime environment for Presto that includes custom performance improvements. With EMR runtime for Presto, your queries run up to 2.6 times faster. EMR runtime for Presto is 100% API compatible with open-source Presto.

EMR Presto Best practices

Following are some best practices and performance tuning tips for PrestoDB on Amazon EMR:

  1. Select appropriate EMR node types. An EMR cluster consists of master, core, and task nodes. Since Presto doesn't use the Hadoop framework, minimize the number of core nodes, which host HDFS storage. An ideal configuration is a cluster with one master node, 3 to 5 core nodes, and the remainder as task nodes, sized according to usage and workload.
  2. Use Spot Instances where possible for cost optimization. They are recommended for non-critical workloads that can withstand node failures.
  3. Select appropriate EC2 instance types. Presto is an in-memory SQL engine and needs memory-optimized instances; the R series (R4, R5, R6) is recommended. Also enable Presto's spill-to-disk feature to avoid query failures when memory runs out.
  4. As of December 2021, EMR still doesn't support Presto-based scaling. It is recommended to publish custom Presto metrics and use them to drive custom scaling.
  5. Fine-tune the Presto server configuration properties for your workloads. Parameters that commonly need tuning include query.max-memory, query.max-memory-per-node, query.max-total-memory-per-node, and task.concurrency.
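As a rough illustration of the last tip, the sketch below assembles an EMR configuration object that sets the four properties named above through the `presto-config` classification. The function name, the memory figures, and the rule of capping `query.max-memory` at workers times per-node memory are all assumptions for illustration, not tuning advice; appropriate values depend on your instance types and queries.

```python
import json


def presto_emr_configuration(worker_count, per_node_gb, total_per_node_gb, task_concurrency=16):
    """Build an EMR configuration list that overrides Presto memory and concurrency settings.

    The sizing rule used here (cluster-wide cap = workers * per-node cap) is an
    illustrative heuristic, not an official recommendation.
    """
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    if total_per_node_gb < per_node_gb:
        raise ValueError("total per-node memory must be >= user per-node memory")
    # Presto expects task.concurrency to be a power of two.
    if task_concurrency < 1 or task_concurrency & (task_concurrency - 1):
        raise ValueError("task_concurrency must be a power of two")

    properties = {
        "query.max-memory": f"{worker_count * per_node_gb}GB",
        "query.max-memory-per-node": f"{per_node_gb}GB",
        "query.max-total-memory-per-node": f"{total_per_node_gb}GB",
        "task.concurrency": str(task_concurrency),
    }
    # EMR expects every property value to be a string.
    return [{"Classification": "presto-config", "Properties": properties}]


if __name__ == "__main__":
    # Hypothetical sizing: 4 workers, 20 GB user memory and 24 GB total per node.
    print(json.dumps(presto_emr_configuration(4, 20, 24), indent=2))
```

The printed JSON can be supplied as the software configuration when creating a cluster, which keeps the tuning values under version control instead of editing config.properties by hand on each node.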

EMR Presto Pricing

Presto is one of the frameworks available on the Amazon EMR service, so EMR Presto pricing depends on EMR cluster usage. More details on EMR pricing are available in our post: http://www.cloudinfonow.com/aws-emr-pricing/
