Databricks lakehouse

The term 'data lakehouse' entered the data and analytics lexicon over the last few years. Often uttered flippantly to describe the result of the theoretical combination of a data warehouse with data lake functionality, usage of the term became more serious and more widespread in early 2020 as Databricks adopted it to describe its approach of marrying the data structure and data management features of the data warehouse with the low-cost storage used for data lakes.

Databricks was not the first to start using the data lakehouse terminology, however. Amazon Web Services (AWS) previously used the term (or in its case, 'lake house') in late 2019 in relation to Amazon Redshift Spectrum, its service that enables users of its Amazon Redshift data warehouse service to apply queries to data stored in its Amazon S3 cloud service. In fact, the first use of the term by a vendor we have found can be attributed to Snowflake, which in late 2017 promoted that its customer, Jellyvision, was using Snowflake to combine schemaless and structured data processing in what Jellyvision described as a data lakehouse.

While Snowflake's marketing has not run with the lakehouse terminology, preferring the term 'data cloud' to describe its ability to support multiple data processing and analytics workloads, AWS has very much picked up on it as a term to describe its combined portfolio of data and analytics services, placing its 'lake house architecture' front and center of its data and analytics announcements at re:Invent 2020. Either way, it is worth exploring the term, and the products and services it is being applied to, in more detail. Since a quick internet search returns nearly twice as many results for 'data lakehouse' as for 'data lake house,' we will continue to use the former from this point on, unless specifically referring to AWS's 'lake house architecture.'

So what exactly is a data lakehouse? As noted above, the simplest description of a data lakehouse is that it is an environment designed to combine the data structure and data management features of a data warehouse with the low-cost storage of a data lake.

As we noted in July when we examined Databricks' evolving strategy, we see wisdom in the desire to bring the structured analytics advantages of data warehousing to data stored in low-cost cloud-based data lakes, especially for data types and workloads that do not lend themselves naturally to relational databases. While the data lake concept promises a more agile environment than a traditional data warehouse approach, one that is better suited to changing data and evolving business use cases, early initiatives suffered from a lack of appropriate data engineering processes and data management and governance functionality, making them relatively inaccessible for general business or self-service users.

The data lakehouse blurs the lines between data lakes and data warehousing by maintaining the cost and flexibility advantages of persisting data in cloud storage while enabling schema to be enforced for curated subsets of data in specific conceptual zones of the data lake, or an associated analytic database, in order to accelerate analysis and business decision-making. One of the key enablers of the lakehouse concept is a structured transactional layer. Databricks added this capability to its Unified Analytics Platform (which provides Spark-based data processing for data in AWS or Microsoft Azure cloud storage) in April 2019 with the launch of Delta Lake.
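To make the idea of a structured transactional layer concrete, here is a minimal sketch using the open source Delta Lake library with PySpark: it writes a curated table as an ACID commit and shows schema enforcement rejecting a mismatched append. The table path, column names, and sample rows are illustrative assumptions, not anything taken from Databricks' platform.

```python
# A minimal sketch, assuming the open source delta-spark package is installed
# (pip install delta-spark pyspark). Table path, columns, and rows are
# illustrative examples only.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("lakehouse-sketch")
    # Register Delta Lake's transactional commit protocol with Spark.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# A local path keeps the example self-contained; in a lakehouse this would be
# cloud object storage such as an s3:// or abfss:// location.
table_path = "/tmp/lakehouse/orders"

# Write a curated table; each write is an ACID commit recorded in the table's
# transaction log, so concurrent readers never see partial results.
orders = spark.createDataFrame(
    [(1, "2024-01-05", 42.50), (2, "2024-01-06", 19.99)],
    ["order_id", "order_date", "amount"],
)
orders.write.format("delta").mode("overwrite").save(table_path)

# Schema enforcement: appending rows whose columns do not match the table's
# schema is rejected instead of silently corrupting the data lake.
bad_rows = spark.createDataFrame([(3, "oops")], ["order_id", "amount_text"])
try:
    bad_rows.write.format("delta").mode("append").save(table_path)
except Exception as err:  # raised as a schema-mismatch AnalysisException
    print(f"Rejected by schema enforcement: {err}")
```

Without such a transactional layer, the same files in object storage would simply accept the mismatched append, leaving downstream queries to discover the problem.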