The Lakehouse was proposed in this paper. It argues that we are ready for a new platform where our data access no longer goes through the data warehouse but rather through what is called a Lakehouse.

The basics: you have all of your data stored in blob storage, with metadata and a catalog providing the semantics of the structures. The open source “data lake” table technologies we have at our disposal are Delta Lake, Apache Hudi, and Apache Iceberg. The metadata, caching, and indexing layer should be all you need for all access to the data.
Now, I am not going to do your research for you, but let me say a few things. Be careful not to take what you find and over-index on one or two metrics. Some important metrics, like the number of integrations in the open source community and the ease of implementation and support within the ecosystem, aren’t always discussed. Also, sometimes entire sets of features are missing from all of them, which requires you to get them from something else, and where and how you get that should influence your decision.
The origins of the Open Source communities are part of the story. Delta Lake was started by Databricks, Apache Hudi by Uber Engineering, and Apache Iceberg by Netflix Engineering. What each origin means is subjective, so I am not going to get into it, but reflect on it as a metric.
Now, digging in a little more, all of them require you to run a metastore database to keep track of tables and such. Each technology supports different metastores and, as a result, gives you different tradeoffs to make.
First, let’s talk about Delta Lake. The documentation barely mentions catalogs at all. Digging around the internet and the source code, it looks like the Hive metastore and AWS Glue are options, and probably anything that spark.sql.catalog.spark_catalog supports will do. Now, while the Hive metastore is technically Open Source, it comes with a steep learning curve pulling in Hadoop ecosystem components, or you can opt to go with a vendor, so you need to evaluate that. And as far as AWS Glue goes, you may or may not be in with Amazon enough to utilize it for your implementation.
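To make that concrete, here is a minimal sketch of pointing Delta Lake at a Hive metastore from PySpark, using Delta’s documented Spark integration settings. It assumes the delta-spark package is on your classpath; the metastore URI, bucket, and table name are placeholders.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-hive-metastore")
    # Delta Lake's documented Spark integration settings
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Placeholder: your Hive metastore thrift endpoint (for AWS Glue you
    # would instead configure Glue as the Hive metastore implementation)
    .config("hive.metastore.uris", "thrift://metastore-host:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# The table's schema and location get tracked in the metastore; the data
# files and the Delta transaction log stay in object storage.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (id BIGINT, value STRING)
    USING delta
    LOCATION 's3a://my-bucket/tables/events'
""")
```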
Apache Hudi is clear about what it supports and what to do about it. It supports AWS Glue, DataHub, Hive, and Google BigQuery.
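Hudi’s catalog integration happens through its “sync” options: each write can register or update the table in the metastore as a side effect. A hedged sketch, assuming the Hudi Spark bundle is on the classpath; hosts, paths, and field names are placeholders.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hudi-hive-sync")
    # Kryo serialization is recommended in the Hudi docs
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a", "2024-01-01")], ["id", "value", "ts"])

(df.write.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    # Sync table metadata to the Hive metastore on every write
    .option("hoodie.datasource.hive_sync.enable", "true")
    .option("hoodie.datasource.hive_sync.mode", "hms")
    .option("hoodie.datasource.hive_sync.metastore.uris",
            "thrift://metastore-host:9083")
    .option("hoodie.datasource.hive_sync.database", "default")
    .option("hoodie.datasource.hive_sync.table", "events")
    .mode("append")
    .save("s3a://my-bucket/tables/events"))
```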
Apache Iceberg is also clear about which catalogs it supports (there is a configuration sketch after the list):
- REST – a server-side catalog that’s exposed through a REST API. Here is an example so you can build your own implementation.
- Hive Metastore – tracks namespaces and tables using a Hive metastore
- JDBC – tracks namespaces and tables in a simple JDBC database
- Nessie – a transactional catalog that tracks namespaces and tables in a database with git-like version control
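The sketch promised above: with Iceberg, the catalog choice is just Spark configuration, and the same table API works no matter which catalog sits behind it. This assumes the Iceberg Spark runtime jar is on the classpath; the catalog names, URIs, and warehouse path are placeholders.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-catalogs")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # A REST catalog...
    .config("spark.sql.catalog.rest_cat", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.rest_cat.type", "rest")
    .config("spark.sql.catalog.rest_cat.uri", "http://rest-catalog-host:8181")
    # ...and a JDBC catalog, side by side in the same session
    .config("spark.sql.catalog.jdbc_cat", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.jdbc_cat.type", "jdbc")
    .config("spark.sql.catalog.jdbc_cat.uri",
            "jdbc:postgresql://db-host:5432/iceberg")
    .config("spark.sql.catalog.jdbc_cat.warehouse", "s3a://my-bucket/warehouse")
    .getOrCreate()
)

# Tables are addressed by catalog name; swapping catalogs does not change
# the SQL you write against the tables themselves.
spark.sql(
    "CREATE TABLE IF NOT EXISTS rest_cat.db.events (id BIGINT, value STRING) "
    "USING iceberg"
)
```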
<nessie-side-note>
Nessie is an interesting Open Source project.
- Changes to the contents of the data lake are recorded in Nessie as commits without copying the actual data.
- You can add meaning to the changes to your data lake.
- An always-consistent view of all the data.
- Sets of changes, like the whole work of a distributed Spark job or the experiments of data engineers, are isolated in Nessie via branches. Failed jobs do no additional harm to the data.
- Known, fixed versions of all data can be tagged.
- Automatic removal of unused data files (garbage collection).
All of these features are controls within a risk-aware environment where compliance and regulations are a factor. While you will often get this from vendors, Nessie makes for a nice integration to get auditability of all of your data changes. In addition, it supports a lot of backend systems, including but not limited to Apache Cassandra, Google BigTable, AWS DynamoDB, MongoDB, and RocksDB.
</nessie-side-note>
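Since Nessie plugs in as an Iceberg catalog, trying it from Spark is mostly configuration. A minimal sketch, assuming the Iceberg Spark runtime and the iceberg-nessie module are on the classpath; the host, branch name, and warehouse path are placeholders, and the API path depends on your Nessie version.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("nessie-catalog")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl",
            "org.apache.iceberg.nessie.NessieCatalog")
    # Placeholder endpoint; the API path depends on your Nessie version
    .config("spark.sql.catalog.nessie.uri", "http://nessie-host:19120/api/v1")
    # Work against an isolated branch; merge to main once the job succeeds
    .config("spark.sql.catalog.nessie.ref", "etl-experiment")
    .config("spark.sql.catalog.nessie.warehouse", "s3a://my-bucket/warehouse")
    .getOrCreate()
)

spark.sql(
    "CREATE TABLE IF NOT EXISTS nessie.db.events (id BIGINT, value STRING) "
    "USING iceberg"
)
```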
Next up, I am going to be writing about Getting Started with Apache Spark, Apache Kafka, and Apache Iceberg.
Thanx =8^) Joe Stein
http://www.twitter.com/charmalloc
http://www.linkedin.com/in/charmalloc