Data Catalogs Vs. Metadata Catalogs: What’s the Difference?


Data catalogs and metadata catalogs share some similarities, particularly in their nearly identical names. And while they have some common functions, there are also important differences between the two entities that big data practitioners should know about.

Metadata catalogs, which are sometimes called metastores or technical data catalogs, have been in the news lately. If you’re a regular Datanami reader (and we certainly hope you are!), you probably read a lot about metadata catalogs during the Snowflake and Databricks conferences last month, when the two competitors committed to open sourcing their respective metadata catalogs, Polaris and Unity Catalog.

So what is a metadata catalog, and why does it matter? (We’re glad you asked!) Read on to learn more.

Metadata Catalogs

A metadata catalog is defined as the place where one stores the technical metadata describing the data you have stored as a tabular structure in a data lake or a lakehouse.
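
To make “technical metadata” concrete, here is a minimal Python sketch of the kind of record a metadata catalog might hold for one table. The class names, the fields, and the storage path are illustrative assumptions, not the actual schema of the Hive Metastore, Polaris, or Unity Catalog.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Column:
    name: str
    data_type: str


@dataclass
class TableEntry:
    """Hypothetical technical metadata for one table in a lake or lakehouse."""
    namespace: str
    name: str
    location: str       # where the data files live (made-up path below)
    table_format: str   # for example "iceberg", "hudi", or "delta"
    columns: list[Column] = field(default_factory=list)


class MetadataCatalog:
    """Toy in-memory catalog mapping "namespace.table" to its metadata."""

    def __init__(self) -> None:
        self._tables: dict[str, TableEntry] = {}

    def register(self, entry: TableEntry) -> None:
        self._tables[f"{entry.namespace}.{entry.name}"] = entry

    def lookup(self, identifier: str) -> TableEntry:
        return self._tables[identifier]


catalog = MetadataCatalog()
catalog.register(
    TableEntry(
        namespace="sales",
        name="orders",
        location="s3://example-bucket/warehouse/sales/orders",
        table_format="iceberg",
        columns=[Column("order_id", "long"), Column("amount", "double")],
    )
)
entry = catalog.lookup("sales.orders")
print(entry.table_format, [c.name for c in entry.columns])
```

Note that nothing in this record says what the data means to the business or who is allowed to see it; it only describes where the table lives and how it is laid out.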

The most commonly used metadata catalog is the Hive Metastore, which was the central repository for metadata describing the contents of Apache Hive tables. Hive, of course, was the relational framework that allowed Hadoop users to query HDFS-based data using good old SQL, as opposed to MapReduce.

Hive and the Hive Metastore are still around, but they’re in the process of being replaced by a newer generation of technology. Table formats, such as Apache Iceberg, Apache Hudi, and Databricks Delta Table, bring many advantages over Hive tables, including support for transactions, which boosts the accuracy of data.

These table formats also require a technical layer–the metadata catalog–to help users know what data exists in the tables and to grant or deny access to that data. Databricks supports this function in its Unity Catalog. For Iceberg, products such as Project Nessie, which was developed by engineers at Dremio, sought to be the “transactional catalog” brokering data access to various open and commercial data engines, including Hive, Dremio, Spark, and AWS Athena (based on Presto), among others.

Snowflake developed and released (or pledged to release, anyway) Polaris to be the standard metadata catalog for the Apache Iceberg ecosystem. Like Nessie, Polaris uses Iceberg’s open REST-based API to get access to the descriptive metadata of the Parquet files that Iceberg stores. This REST API then serves as the interface between the data stored in Iceberg tables and data processing engines, such as Snowflake’s native SQL engine as well as a variety of open-source engines.
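
As a rough illustration of what talking to a REST-based catalog looks like from an engine’s side, the Python sketch below builds the request URLs for listing tables in a namespace and loading one table’s metadata. The route shapes follow the Iceberg REST catalog specification as we understand it; the base URL, namespace, and table names are made up, the helper function names are our own, and no request is actually sent.

```python
from urllib.parse import quote

# Multi-level namespaces are joined with the 0x1F unit separator in URL paths.
NAMESPACE_SEPARATOR = "\x1f"


def _root(base_url: str, prefix: str = "") -> str:
    """Return "<base>/v1" or "<base>/v1/<prefix>" when the catalog uses a prefix."""
    parts = [base_url.rstrip("/"), "v1"]
    if prefix:
        parts.append(quote(prefix, safe=""))
    return "/".join(parts)


def namespace_path(levels: list[str]) -> str:
    """Percent-encode a namespace such as ["sales", "emea"] for use in a path."""
    return quote(NAMESPACE_SEPARATOR.join(levels), safe="")


def list_tables_url(base_url: str, namespace: list[str], prefix: str = "") -> str:
    """URL an engine would GET to list the tables in a namespace."""
    return f"{_root(base_url, prefix)}/namespaces/{namespace_path(namespace)}/tables"


def load_table_url(
    base_url: str, namespace: list[str], table: str, prefix: str = ""
) -> str:
    """URL an engine would GET to load one table's metadata."""
    return f"{list_tables_url(base_url, namespace, prefix)}/{quote(table, safe='')}"


# Hypothetical catalog endpoint, used only to show the URL shapes.
BASE_URL = "https://catalog.example.com/api"
print(list_tables_url(BASE_URL, ["sales", "emea"]))
print(load_table_url(BASE_URL, ["sales", "emea"], "orders"))
```

Because every engine speaks the same routes, the catalog can sit between many engines and one copy of the data, which is the interface role described above.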

Data Catalogs

Data catalogs are typically third-party tools that companies use to organize all the data they have stored across their organizations. They usually include some facility that allows users to search for data their organization may own, which means data catalogs often have some data discovery component.

Many data catalogs, such as Alation’s catalog, have also evolved to include access control functionality, as well as data lineage tracking and governance capabilities. In some cases, data management tool vendors that started out providing data governance and access control, such as Collibra, have evolved the other way, to also include data catalogs and data discovery capabilities.

And like metadata catalogs, regular data catalogs–or what some in the industry term “enterprise” data catalogs–are also heavily involved in gobbling up metadata to help them track various data assets. One enterprise data catalog vendor, Atlan, focuses its efforts on unifying the metadata generated by different datasets and synchronizing them via a metadata “control plane,” thereby ensuring that the business metrics don’t get too out of whack.

By now, you’re probably wondering, “So what the heck is the difference?!” They both track metadata, and they both have “data catalog” in their name. So what is the difference between a metadata catalog and a data catalog?

So What’s The Difference?!

To help us decode the differences between these two catalog types, Datanami recently talked to Felix Van de Maele, the CEO and co-founder of Collibra, one of the leading data catalog vendors in the big data space.

“They’re very different things,” Van de Maele said. “If you think about Polaris catalog and Unity Catalog from Databricks–and AWS and Google and Microsoft all have their catalogs–it’s really this idea that you’re able to store your data anywhere, on any clouds…And I can use any kind of data engine like a Databricks, like a Snowflake, like a Google, AWS, and so on, to consume that data.”

But what Collibra and other enterprise data catalogs do is quite different, Van de Maele said.

“What we do is we provide much more of the business context,” he said. “We provide what we call that knowledge graph, that business context where you’re actually defining and managing your policies. Policies such as what’s the quality of my data? What business rules does my data need to comply to? What privacy policies does my data need to comply to? Who needs to approve it? How do we capture attestations? How do we do certification? How do I build a business glossary with business terms and clear definitions?

“That’s very different than a Polaris catalog on top of Iceberg that’s the physical metadata. And that’s a real differentiation,” he said.

Van de Maele supports the open data lakehouse architecture that has emerged, which gives customers the freedom to store their data in open table formats, such as Iceberg, Delta, and Hudi, and query it with any engine. His customers, many of which are Fortune 500 enterprises, store data across many data platforms and use the Collibra Data Intelligence platform to help control and govern access to that data.

Different Roles

Customers should understand that, while the names are similar, metadata catalogs and data catalogs play very different roles.

“The way I differentiate between the two is we do policy definition and management, they do policy enforcement,” Van de Maele said. “And actually I think that’s the right architecture.”

The metadata catalogs typically do not have functionality to allow users to set up business policies around data access. For instance, they won’t let you set up access controls to enable a marketing team to access all customer data except for anything that’s been marked “classified,” in which case it must be masked, Van de Maele said.
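
The split between defining a policy and enforcing it can be sketched in a few lines of Python. In this toy version, the policy object stands in for what an enterprise data catalog would define, and the `enforce` function stands in for what a data platform or metadata catalog would do at read time. All names, tags, and sample rows are hypothetical; this is not Collibra’s API or any vendor’s implementation.

```python
from dataclasses import dataclass

MASK = "****"


@dataclass(frozen=True)
class MaskingPolicy:
    """Business-level rule: which team may read a dataset, and which tag forces masking."""
    team: str
    dataset: str
    masked_tag: str


def enforce(policy, team, dataset, column_tags, rows):
    """Apply the policy to rows; column_tags maps column name -> set of tags."""
    if team != policy.team or dataset != policy.dataset:
        raise PermissionError(f"{team} has no access to {dataset}")
    # Every column carrying the policy's tag is masked; the rest pass through.
    masked_columns = {
        col for col, tags in column_tags.items() if policy.masked_tag in tags
    }
    return [
        {col: (MASK if col in masked_columns else value) for col, value in row.items()}
        for row in rows
    ]


# Definition side: marketing may read customer data, but classified columns are masked.
policy = MaskingPolicy(team="marketing", dataset="customers", masked_tag="classified")

# Enforcement side: the platform knows its columns, their tags, and the data.
column_tags = {"name": set(), "email": set(), "ssn": {"classified"}}
rows = [{"name": "Ada", "email": "ada@example.com", "ssn": "123-45-6789"}]

masked_rows = enforce(policy, "marketing", "customers", column_tags, rows)
print(masked_rows)
```

The policy is written once against a tag rather than against a specific column, so the same definition could in principle be handed to several platforms, each of which enforces it against its own columns.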

“We can have marketing data in Databricks, we have marketing data in Salesforce, we have marketing data in Google, and anywhere people are using marketing data, I need to make sure that the right data is classified and masked,” he said. “So we push that down in Databricks, in Snowflake, in Google, in Amazon and in Microsoft.”

Customers could define their own data access policies without a tool like Collibra’s, Van de Maele said. After all, it’s just SQL at the end of the day. But then they would need some other way to keep track of the millions of columns spread across various data platforms. Providing insight into what data exists and where, and then ensuring customers are accessing it in accordance with the company’s governance rules, is the role Collibra serves.

At the same time, Collibra relies upon metadata catalogs for the enforcement mechanisms. Other enforcement mechanisms have been tried, such as proxies and drivers, Van de Maele said, but none of it works.

“We think the metadata catalog approach with open table format is actually the right approach,” he said. “We want to have those data platforms be able to do that natively, otherwise scalability and performance always become a problem.”

Databricks Unity Catalog appears to be the exception here. Unity Catalog, which Databricks just open sourced last month, provides the low-level control over technical metadata as well as higher-level functions, such as data governance, access control, auditing, and lineage. In that respect, Unity Catalog appears to compete with the enterprise data catalog vendors.

Related Items:

What the Big Fuss Over Table Formats and Metadata Catalogs Is All About

Databricks to Open Source Unity Catalog

What to Look for in a Data Catalog
