Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake


This post is co-written with Andries Engelbrecht and Scott Teal from Snowflake.

Businesses are constantly evolving, and data leaders are challenged every day to meet new requirements. For many enterprises and large organizations, it is not possible to have one processing engine or tool to deal with the various business requirements. They understand that a one-size-fits-all approach no longer works, and recognize the value in adopting scalable, flexible tools and open data formats to support interoperability in a modern data architecture to accelerate the delivery of new solutions.

Customers are using AWS and Snowflake to develop purpose-built data architectures that provide the performance required for modern analytics and artificial intelligence (AI) use cases. Implementing these solutions requires data sharing between purpose-built data stores. This is why Snowflake and AWS are delivering enhanced support for Apache Iceberg to enable and facilitate data interoperability between data services.

Apache Iceberg is an open-source table format that provides reliability, simplicity, and high performance for large datasets with transactional integrity between various processing engines. In this post, we discuss the following:

  • Advantages of Iceberg tables for data lakes
  • Two architectural patterns for sharing Iceberg tables between AWS and Snowflake:
    • Manage your Iceberg tables with AWS Glue Data Catalog
    • Manage your Iceberg tables with Snowflake
  • The process of converting existing data lake tables to Iceberg tables without copying the data

Now that you have a high-level understanding of the topics, let's dive into each of them in detail.

Advantages of Apache Iceberg

Apache Iceberg is a distributed, community-driven, Apache 2.0-licensed, 100% open-source data table format that helps simplify data processing on large datasets stored in data lakes. Data engineers use Apache Iceberg because it's fast, efficient, and reliable at any scale, and it keeps records of how datasets change over time. Apache Iceberg offers integrations with popular data processing frameworks such as Apache Spark, Apache Flink, Apache Hive, Presto, and more.
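To make this concrete, the following is a minimal PySpark sketch of creating and writing an Iceberg table. It assumes the iceberg-spark-runtime package is on the classpath and uses a Hadoop-style catalog; the catalog name, S3 bucket, and table are hypothetical.

```python
from pyspark.sql import SparkSession

# A minimal sketch: a Spark session with an Iceberg catalog named "demo"
# backed by a hypothetical S3 warehouse path.
spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3://my-data-lake/warehouse/")  # hypothetical bucket
    .getOrCreate()
)

# Create an Iceberg table and append rows; each commit becomes a snapshot
# that Iceberg tracks in the table metadata.
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.orders (id BIGINT, amount DOUBLE) USING iceberg")
spark.sql("INSERT INTO demo.db.orders VALUES (1, 19.99), (2, 5.49)")
```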

Iceberg tables maintain metadata to abstract large collections of files, providing data management features including time travel, rollback, data compaction, and full schema evolution, reducing management overhead. Originally developed at Netflix before being open sourced to the Apache Software Foundation, Apache Iceberg was a blank-slate design to solve common data lake challenges like user experience, reliability, and performance, and is now supported by a robust community of developers focused on continually improving and adding new features to the project, serving real user needs and providing them with optionality.
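Continuing the sketch above, these features surface as plain SQL; the snapshot ID and timestamp below are hypothetical, and the rollback procedure relies on the Iceberg SQL extensions enabled on the session.

```python
# Time travel: query the table as of a point in time or a specific snapshot.
spark.sql("SELECT * FROM demo.db.orders TIMESTAMP AS OF '2024-01-01 00:00:00'").show()
spark.sql("SELECT * FROM demo.db.orders VERSION AS OF 123456789012345").show()  # hypothetical snapshot ID

# Schema evolution happens in place; no data files are rewritten.
spark.sql("ALTER TABLE demo.db.orders ADD COLUMN customer_id BIGINT")

# Rollback to an earlier snapshot via an Iceberg stored procedure.
spark.sql("CALL demo.system.rollback_to_snapshot('db.orders', 123456789012345)")
```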

Transactional data lakes built on AWS and Snowflake

Snowflake offers various integrations for Iceberg tables with multiple storage options, including Amazon S3, and multiple catalog options, including AWS Glue Data Catalog and Snowflake. AWS offers integrations for various AWS services with Iceberg tables as well, including AWS Glue Data Catalog for tracking table metadata. Combining Snowflake and AWS gives you multiple options to build out a transactional data lake for analytical and other use cases such as data sharing and collaboration. By adding a metadata layer to data lakes, you get a better user experience, simplified management, and improved performance and reliability on very large datasets.

Manage your Iceberg tables with AWS Glue

You can use AWS Glue to ingest, catalog, transform, and manage the data on Amazon Simple Storage Service (Amazon S3). AWS Glue is a serverless data integration service that allows you to visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes in Iceberg format. With AWS Glue, you can discover and connect to more than 70 diverse data sources and manage your data in a centralized data catalog. Snowflake integrates with AWS Glue Data Catalog to access the Iceberg table catalog and the data on Amazon S3 for analytical queries. This greatly improves performance and compute cost in comparison to external tables on Snowflake, because the additional metadata improves pruning in query plans.
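The following is a hedged sketch of an AWS Glue PySpark job that writes an Iceberg table to S3 and registers it in AWS Glue Data Catalog. It assumes a Glue 4.0 job with the --datalake-formats job parameter set to iceberg; the bucket, database, and table names are hypothetical.

```python
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Configure an Iceberg catalog named "glue_catalog" that stores table metadata
# in AWS Glue Data Catalog and data files on S3 (paths are hypothetical).
conf = SparkConf()
conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
conf.set("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
conf.set("spark.sql.catalog.glue_catalog.warehouse", "s3://my-data-lake/warehouse/")
conf.set("spark.sql.extensions",
         "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")

sc = SparkContext(conf=conf)
spark = GlueContext(sc).spark_session

# Read raw source data (hypothetical path) and write it as an Iceberg table;
# the GlueCatalog implementation records the table in AWS Glue Data Catalog.
df = spark.read.json("s3://my-source-bucket/raw/orders/")
df.writeTo("glue_catalog.analytics.orders").using("iceberg").createOrReplace()
```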

You can use this same integration to take advantage of the data sharing and collaboration capabilities in Snowflake. This can be very powerful if you have data in Amazon S3 and need to enable Snowflake data sharing with other business units, partners, suppliers, or customers.

The following architecture diagram provides a high-level overview of this pattern.

The workflow includes the following steps:

  1. AWS Glue extracts data from applications, databases, and streaming sources. AWS Glue then transforms it and loads it into the data lake in Amazon S3 in Iceberg table format, while inserting and updating the metadata about the Iceberg table in AWS Glue Data Catalog.
  2. The AWS Glue crawler generates and updates Iceberg table metadata and stores it in AWS Glue Data Catalog for existing Iceberg tables on an S3 data lake.
  3. Snowflake integrates with AWS Glue Data Catalog to retrieve the snapshot location (see the sketch after this list).
  4. When a query runs, Snowflake uses the snapshot location from AWS Glue Data Catalog to read Iceberg table data in Amazon S3.
  5. Snowflake can query across Iceberg and Snowflake table formats. You can share data for collaboration with one or more accounts in the same Snowflake region. You can also use data in Snowflake for visualization with Amazon QuickSight, or use it for machine learning (ML) and artificial intelligence (AI) purposes with Amazon SageMaker.
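To make steps 3 and 4 concrete, here is a hedged sketch of wiring Snowflake to AWS Glue Data Catalog with a catalog integration, issued through snowflake-connector-python. The account, role ARN, external volume, and table names are hypothetical, and the exact parameters should be checked against the Snowflake documentation (the external volume is assumed to already exist).

```python
import os
import snowflake.connector

# A minimal sketch, assuming snowflake-connector-python is installed and the
# IAM role and external volume were set up per Snowflake's documentation.
conn = snowflake.connector.connect(
    account="myorg-myaccount",                  # hypothetical account identifier
    user="ANALYST",
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="ANALYTICS_WH",
    database="LAKE_DB",
    schema="PUBLIC",
)
cur = conn.cursor()

# Point Snowflake at the Glue Data Catalog (names and ARNs are hypothetical).
cur.execute("""
    CREATE CATALOG INTEGRATION IF NOT EXISTS glue_cat
      CATALOG_SOURCE = GLUE
      CATALOG_NAMESPACE = 'analytics'
      TABLE_FORMAT = ICEBERG
      GLUE_AWS_ROLE_ARN = 'arn:aws:iam::111122223333:role/snowflake-glue-access'
      GLUE_CATALOG_ID = '111122223333'
      ENABLED = TRUE
""")

# Register the Glue-managed Iceberg table in Snowflake, then query it; Snowflake
# reads the snapshot location from Glue and scans the data files on S3.
cur.execute("""
    CREATE ICEBERG TABLE IF NOT EXISTS orders
      EXTERNAL_VOLUME = 'iceberg_s3_volume'
      CATALOG = 'glue_cat'
      CATALOG_TABLE_NAME = 'orders'
""")
print(cur.execute("SELECT COUNT(*) FROM orders").fetchone())
```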

Manage your Iceberg tables with Snowflake

A second pattern also provides interoperability across AWS and Snowflake, but implements data engineering pipelines for ingestion and transformation to Snowflake. In this pattern, data is loaded to Iceberg tables by Snowflake through integrations with AWS services like AWS Glue or through other sources like Snowpipe. Snowflake then writes the data directly to Amazon S3 in Iceberg format for downstream access by Snowflake and various AWS services, and Snowflake manages the Iceberg catalog that tracks snapshot locations across tables for AWS services to access.
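As a hedged sketch of this pattern, the following creates a Snowflake-managed Iceberg table backed by an external volume on Amazon S3, reusing the cursor from the earlier connector sketch. The volume name, bucket, role ARN, and table are hypothetical, and the syntax should be verified against current Snowflake documentation.

```python
# Reusing the `cur` cursor from the previous sketch; all identifiers are hypothetical.

# An external volume tells Snowflake where on S3 to write Iceberg data and metadata.
cur.execute("""
    CREATE EXTERNAL VOLUME IF NOT EXISTS iceberg_vol
      STORAGE_LOCATIONS = ((
        NAME = 'us-east-1-lake'
        STORAGE_PROVIDER = 'S3'
        STORAGE_BASE_URL = 's3://my-data-lake/iceberg/'
        STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::111122223333:role/snowflake-s3-access'
      ))
""")

# CATALOG = 'SNOWFLAKE' makes this a Snowflake-managed Iceberg table: Snowflake
# writes the Parquet data and Iceberg metadata to S3 with every transaction.
cur.execute("""
    CREATE ICEBERG TABLE IF NOT EXISTS customers (id NUMBER, name STRING)
      CATALOG = 'SNOWFLAKE'
      EXTERNAL_VOLUME = 'iceberg_vol'
      BASE_LOCATION = 'customers/'
""")
cur.execute("INSERT INTO customers VALUES (1, 'Jane Doe')")
```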

Similar to the previous pattern, you can use Snowflake-managed Iceberg tables with Snowflake data sharing, but you can also use S3 to share datasets in cases where one party doesn't have access to Snowflake.

The following architecture diagram provides an overview of this pattern with Snowflake-managed Iceberg tables.

This workflow consists of the following steps:

  1. In addition to loading data via the COPY command, Snowpipe, and the native Snowflake connector for AWS Glue, you can integrate data via Snowflake Data Sharing.
  2. Snowflake writes Iceberg tables to Amazon S3 and updates the metadata automatically with every transaction.
  3. Iceberg tables in Amazon S3 are queried by Snowflake for analytical and ML workloads using services like QuickSight and SageMaker.
  4. Apache Spark services on AWS can access snapshot locations from Snowflake via the Snowflake Iceberg Catalog SDK and directly scan the Iceberg table files in Amazon S3 (see the sketch after this list).
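The following is a hedged sketch of step 4: Spark on AWS (for example, on Amazon EMR) reading a Snowflake-managed Iceberg table through the Snowflake Iceberg Catalog SDK. It assumes the SDK and Snowflake JDBC jars are on the classpath; the account URL, credentials, and table identifiers are hypothetical, and the exact property names should be checked against Snowflake's SDK documentation.

```python
import os
from pyspark.sql import SparkSession

# Spark session whose "snowflake_catalog" delegates snapshot lookup to Snowflake.
spark = (
    SparkSession.builder
    .appName("read-snowflake-managed-iceberg")
    .config("spark.sql.catalog.snowflake_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.snowflake_catalog.catalog-impl",
            "org.apache.iceberg.snowflake.SnowflakeCatalog")
    .config("spark.sql.catalog.snowflake_catalog.uri",
            "jdbc:snowflake://myorg-myaccount.snowflakecomputing.com")  # hypothetical account
    .config("spark.sql.catalog.snowflake_catalog.jdbc.user", "SPARK_READER")
    .config("spark.sql.catalog.snowflake_catalog.jdbc.password",
            os.environ["SNOWFLAKE_PASSWORD"])
    .getOrCreate()
)

# The catalog SDK asks Snowflake for the table's current snapshot location, then
# Spark scans the Parquet and metadata files directly in Amazon S3 (read-only).
spark.table("snowflake_catalog.LAKE_DB.PUBLIC.customers").show()
```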

Comparing solutions

These two patterns highlight options available to data personas today to maximize their data interoperability between Snowflake and AWS using Apache Iceberg. But which pattern is ideal for your use case? If you're already using AWS Glue Data Catalog and only require Snowflake for read queries, then the first pattern can integrate Snowflake with AWS Glue and Amazon S3 to query Iceberg tables. If you're not already using AWS Glue Data Catalog and require Snowflake to perform reads and writes, then the second pattern is likely a good solution that allows for storing and accessing data from AWS.

Considering that reads and writes will probably operate on a per-table basis rather than across the entire data architecture, it is advisable to use a combination of both patterns.

Migrate existing data lakes to a transactional data lake using Apache Iceberg

You can convert existing Parquet, ORC, and Avro-based data lake tables on Amazon S3 to Iceberg format to reap the benefits of transactional integrity while improving performance and user experience. There are several Iceberg table migration options (SNAPSHOT, MIGRATE, and ADD_FILES) for migrating existing data lake tables in place to Iceberg format, which is preferable to rewriting all of the underlying data files, a costly and time-consuming effort with large datasets. In this section, we focus on ADD_FILES, because it's useful for custom migrations.
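For orientation, SNAPSHOT and MIGRATE are exposed as Iceberg Spark stored procedures for tables already registered in a Hive-compatible catalog; the following hedged sketch shows both, reusing the Glue-configured Spark session from the earlier sketch (table names are hypothetical).

```python
# SNAPSHOT: create a new Iceberg table that references the existing table's data
# files without modifying the source table (a reversible trial migration).
spark.sql("CALL glue_catalog.system.snapshot('analytics.orders', 'analytics.orders_snap')")

# MIGRATE: convert the existing catalog table to Iceberg in place. Alternative
# to SNAPSHOT; shown commented out because it replaces the source table.
# spark.sql("CALL glue_catalog.system.migrate('analytics.orders')")
```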

For the ADD_FILES option, you can use AWS Glue to generate Iceberg metadata and statistics for an existing data lake table and create new Iceberg tables in AWS Glue Data Catalog for future use without needing to rewrite the underlying data. For instructions on generating Iceberg metadata and statistics using AWS Glue, refer to Migrate an existing data lake to a transactional data lake using Apache Iceberg or Convert existing Amazon S3 data lake tables to Snowflake Unmanaged Iceberg tables using AWS Glue.
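As a minimal sketch of the ADD_FILES route, the following uses Iceberg's add_files Spark procedure in a session configured as in the AWS Glue sketch earlier (Iceberg SQL extensions enabled); the table names and S3 path are hypothetical.

```python
# Create an empty Iceberg table with the target schema (hypothetical names),
# then attach the existing Parquet files in place; no data files are rewritten.
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_catalog.analytics.orders_iceberg (
        id BIGINT, amount DOUBLE, order_date DATE
    ) USING iceberg
""")
spark.sql("""
    CALL glue_catalog.system.add_files(
        table => 'analytics.orders_iceberg',
        source_table => '`parquet`.`s3://my-data-lake/raw/orders/`'
    )
""")
```

Unlike SNAPSHOT and MIGRATE, add_files can take a bare file path as its source, which is what makes it a good fit for custom migrations of tables that aren't registered in a catalog.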

This option requires that you pause data pipelines while converting the data to Iceberg tables, which is a straightforward process in AWS Glue because the destination just needs to be changed to an Iceberg table.

Conclusion

In this post, you saw the two architecture patterns for implementing Apache Iceberg in a data lake for better interoperability across AWS and Snowflake. We also provided guidance on migrating existing data lake tables to Iceberg format.

Join AWS Dev Day on April 10 to get hands-on not only with Apache Iceberg, but also with streaming data pipelines using Amazon Data Firehose and Snowpipe Streaming, and generative AI applications with Streamlit in Snowflake and Amazon Bedrock.


About the Authors

Andries Engelbrecht is a Principal Partner Solutions Architect at Snowflake and works with strategic partners. He is actively engaged with strategic partners like AWS, supporting product and service integrations as well as the development of joint solutions with partners. Andries has over 20 years of experience in the field of data and analytics.

Deenbandhu Prasad is a Senior Analytics Specialist at AWS, specializing in big data services. He is passionate about helping customers build modern data architectures on the AWS Cloud. He has helped customers of all sizes implement data management, data warehouse, and data lake solutions.

Brian Dolan joined Amazon as a Military Relations Manager in 2012 after his first career as a Naval Aviator. In 2014, Brian joined Amazon Web Services, where he helped Canadian customers from startups to enterprises explore the AWS Cloud. Most recently, Brian was a member of the Non-Relational Business Development team as a Go-To-Market Specialist for Amazon DynamoDB and Amazon Keyspaces before joining the Analytics Worldwide Specialist Organization in 2022 as a Go-To-Market Specialist for AWS Glue.

Nidhi Gupta is a Sr. Partner Solutions Architect at AWS. She spends her days working with customers and partners, solving architectural challenges. She is passionate about data integration and orchestration, serverless and big data processing, and machine learning. Nidhi has extensive experience leading the architecture design and production launch and deployments for data workloads.

Scott Teal is a Product Marketing Lead at Snowflake and focuses on data lakes, storage, and governance.
