Trino is an open source distributed SQL query engine designed for interactive analytic workloads. On AWS, you can run Trino on Amazon EMR, where you have the flexibility to run your preferred version of open source Trino on Amazon Elastic Compute Cloud (Amazon EC2) instances that you manage, or on Amazon Athena for a serverless experience. When you use Trino on Amazon EMR or Athena, you get the latest open source community innovations along with proprietary, AWS-developed optimizations.
Starting from Amazon EMR 6.8.0 and Athena engine version 2, AWS has been developing query plan and engine behavior optimizations that improve query performance on Trino. In this post, we compare Amazon EMR 6.15.0 with open source Trino 426 and show that TPC-DS queries ran up to 2.7 times faster on Amazon EMR 6.15.0 Trino 426 compared to open source Trino 426. Later, we explain a few of the AWS-developed performance optimizations that contribute to these results.
Benchmark setup
In our testing, we used the 3 TB dataset stored in Amazon S3 in compressed Parquet format, with metadata for databases and tables stored in the AWS Glue Data Catalog. This benchmark uses the unmodified TPC-DS data schema and table relationships. Fact tables are partitioned on the date column and contain 200-2100 partitions. Table and column statistics were not present for any of the tables. We used TPC-DS queries from the open source Trino GitHub repository without modification. Benchmark queries were run sequentially on two different Amazon EMR 6.15.0 clusters: one with Amazon EMR Trino 426 and the other with open source Trino 426. Both clusters used 1 r5.4xlarge coordinator and 20 r5.4xlarge worker instances.
Results observed
Our benchmarks show consistently better performance with Trino on Amazon EMR 6.15.0 compared to open source Trino. The total query runtime of Trino on Amazon EMR was 2.7 times faster compared to open source. The following graph shows performance improvements measured by the total query runtime (in seconds) for the benchmark queries.
Many of the TPC-DS queries ran more than 5 times faster compared to open source Trino. Some queries showed even greater gains, like query 72, which improved by 160 times. The following graph shows the top 10 TPC-DS queries with the largest improvement in runtime. For succinct representation and to avoid skewing the performance improvements in the graph, we've excluded q72.
Performance enhancements
Now that we understand the performance gains with Trino on Amazon EMR, let's delve deeper into some of the key innovations developed by AWS engineering that contribute to these improvements.
Choosing a better join order and join type is critical to better query performance because it can affect how much data is read from a particular table, how much data is transferred to the intermediate stages through the network, and how much memory is needed to build up a hash table to facilitate a join. Join order and join algorithm decisions are typically made by cost-based optimizers, which use statistics to improve query plans by deciding how tables and subqueries are joined.
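To make the memory point concrete, here is a minimal sketch of an in-memory hash join in Python. It is a toy illustration, not Trino's implementation: the function name `hash_join`, the row layout, and the sample data are all assumptions made up for this example. It shows that the hash table grows with the build side, so which table the planner places on the build side directly affects memory use.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Toy hash join over lists of dicts (hypothetical helper, not a Trino API).

    Returns the joined rows and the number of rows held in the hash table.
    """
    # Build phase: every row of the build side is kept in memory.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)

    # Probe phase: the probe side is streamed and never stored in the table.
    joined = []
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            joined.append({**match, **row})

    entries = sum(len(rows) for rows in table.values())
    return joined, entries


# Made-up data: a tiny dimension table and a larger fact table.
dims = [{"d_id": i} for i in range(3)]
facts = [{"f_id": i, "d_ref": i % 5} for i in range(1000)]

rows_small_build, mem_small_build = hash_join(dims, facts, "d_id", "d_ref")
rows_large_build, mem_large_build = hash_join(facts, dims, "d_ref", "d_id")

print(len(rows_small_build), mem_small_build)
print(len(rows_large_build), mem_large_build)
```

Both calls return the same number of joined rows, but building on the small table keeps 3 rows in the hash table while building on the large table keeps 1,000.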
However, table statistics are often not available, out of date, or too expensive to collect on large tables. When statistics aren't available, Amazon EMR and Athena use S3 file metadata to optimize query plans. S3 file metadata is used to infer small subqueries and tables in the query while determining the join order or join type. For example, consider the following query:
The syntactical join order is store_sales joins store_returns joins call_center. With the Amazon EMR join type and order selection optimization rules, the optimal join order is determined even if these tables don't have statistics. For the preceding query, if call_center is considered a small table after estimating its approximate size through S3 file metadata, EMR's join optimization rules will join store_sales with call_center first and convert the join to a broadcast join, speeding up the query and reducing memory consumption. Join reordering minimizes the intermediate result size, which helps to further reduce the overall query runtime.
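The idea can be sketched in a few lines of Python. This is only an illustration of the general technique under stated assumptions, not the actual Amazon EMR rules: the threshold `SMALL_TABLE_BYTES`, the `TableFiles` class, the `plan_joins` function, and all file sizes below are hypothetical. The sketch sums the object sizes that an S3 listing would report for each table, treats tables under the threshold as small, and moves them to the front of the join order as broadcast joins.

```python
from dataclasses import dataclass
from typing import List, Tuple

MB = 1024 * 1024

# Hypothetical cutoff; the value Amazon EMR actually uses is not stated in this post.
SMALL_TABLE_BYTES = 100 * MB


@dataclass
class TableFiles:
    """A table described only by the sizes of its files (hypothetical type)."""

    name: str
    file_sizes: List[int]  # bytes per object, as an S3 listing would report

    def estimated_bytes(self) -> int:
        # Without statistics, total file size stands in for table size.
        return sum(self.file_sizes)


def plan_joins(join_targets: List[TableFiles],
               threshold: int = SMALL_TABLE_BYTES) -> List[Tuple[str, str]]:
    """Order the tables joined onto a base table (toy rule, not EMR's algorithm).

    Small tables come first as broadcast joins, smallest first; the
    remaining tables keep their syntactic order as partitioned joins.
    """
    small = [t for t in join_targets if t.estimated_bytes() <= threshold]
    large = [t for t in join_targets if t.estimated_bytes() > threshold]
    small.sort(key=lambda t: t.estimated_bytes())
    return ([(t.name, "BROADCAST") for t in small]
            + [(t.name, "PARTITIONED") for t in large])


# Syntactic order: store_sales joins store_returns joins call_center.
# The file sizes are invented for the example.
targets = [
    TableFiles("store_returns", [256 * MB] * 10),
    TableFiles("call_center", [64 * 1024]),
]
for name, distribution in plan_joins(targets):
    print(f"store_sales JOIN {name} ({distribution})")
```

With these made-up sizes, call_center moves ahead of store_returns and is joined as a broadcast join. One caveat of any size-based rule like this is that compressed Parquet file sizes understate the in-memory size of the data, so a real threshold has to be conservative.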
With Amazon EMR 6.10.0 and later, S3 file metadata-based join optimizations are turned on by default. If you are using Amazon EMR 6.8.0 or 6.9.0, you can turn on these optimizations by setting the session properties from Trino clients or adding the following properties to the trino-config classification when creating your cluster. Refer to Configure applications for details on how to override the default configurations for an application.
Configuration for join type selection:
Configuration for join reorder:
Conclusion
With Amazon EMR 6.8.0 and later, you can run queries on Trino significantly faster than open source Trino. As shown in this blog post, our TPC-DS benchmark showed a 2.7 times improvement in total query runtime with Trino on Amazon EMR 6.15.0. The optimizations discussed in this post, and many others, are also available when running Trino queries on Athena, where similar performance improvements are observed. To learn more, refer to Run queries 3x faster with up to 70% cost savings on the latest Amazon Athena engine.
In our mission to innovate on behalf of customers, Amazon EMR and Athena frequently release performance and reliability enhancements on their latest versions. Check the Amazon EMR and Amazon Athena release pages to learn about new features and enhancements.
About the Authors
Bhargavi Sagi is a Software Development Engineer on Amazon Athena. She joined AWS in 2020 and has been working on different areas of Amazon EMR and Athena engine V3, including engine upgrade, engine reliability, and engine performance.
Sushil Kumar Shivashankar is the Engineering Manager for the EMR Trino and Athena Query Engine team. He has been focusing on the big data analytics space since 2014.