Databricks Medallion Architecture: Data Modeling Guide, Best practices, Standards, Examples

- December 07, 2024

Hello Data Pros, and welcome back to another exciting blog in our Databricks learning series!

In our last blog, we explored various platform architectures, specifically focusing on the modern Data Lakehouse architecture and its implementation using the Delta Lake framework.

But today, we're moving to the next big question: With this powerful platform architecture in place, how do we organize and model our data effectively?

And that’s where the Medallion Architecture comes in!

So, what exactly is Medallion Architecture?

It’s a data design pattern, developed to logically structure and organize your data within a Lakehouse!

Its main purpose is to progressively improve the quality and usability of your data, as it moves through different stages, such as Bronze, Silver, and Gold. Think of it as a transformation journey, where raw data is refined step by step into a polished and analysis-ready state!

Some people call it a multi-hop architecture because the data flows through several transformation stages, and with each step, it is further refined—adding more value and structure along the way!

Let’s break down the typical layers of the Medallion Architecture!

First up, we have the Bronze layer!

This is where we bring in raw data from external systems in its original form. The goal is simple: capture everything as it was received.

By doing so, you preserve your business event history, and enable lineage tracking from the very start of the data transformation journey.

Additionally, this helps meet auditing and compliance requirements, and is useful if you ever need to reprocess historical data.

The data stored can vary based on your specific business or project.

For example, you might store transaction records from point-of-sale systems, clickstream data from websites, system log events from application servers, or sensor readings from IoT devices—along with useful metadata like timestamps or process IDs.

Typically, data in this layer is not directly used for analytics. Therefore, access to this data is often restricted to data engineers, who work with it and transform it further to build the subsequent layers.

Please take a look at these examples to get a high-level understanding of how the data in this layer looks. If your source is relational, you can replicate the same relational structure, and optionally add metadata columns, such as process ID and a timestamp for when the data was processed in the Bronze layer.

For non-relational data, such as XML, JSON, plain texts, images, and videos, it’s recommended to store the data in its unaltered form, potentially as a string or binary column in a Delta table.

Up next, we have the Silver layer!

This is where things start to take shape. The raw data from the Bronze layer undergoes initial transformations to become more structured and usable. At this stage, we perform tasks like removing duplicates, cleansing the data, standardizing formats, splitting data into distinct tables, and more.

The goal is to create an enterprise-wide well-structured data repository; that is maintainable, optimized for frequent write operations, and capable of supporting machine learning and AI workloads, as well as on-demand ad hoc analytical queries.

Here’s an example to illustrate the structure of the data in the Silver layer.

In this layer, data is often transformed into a 3NF normalized schema, and organized into multiple well-structured tables.

Each table represents a distinct aspect of your business, making it easier to understand, maintain, and scale over time. This approach ensures data consistency, eliminates data redundancy, and provides the flexibility required to meet both current and future business needs.

And finally, the Gold layer—where the data undergoes its final transformations.

The Gold layer contains business-ready datasets, tailored to support specific organizational needs. These include tables for dashboards, visualizations, and financial reports, all designed to enable quick, data-driven decision-making.

Queries in the Gold layer run faster than in the Silver layer because data here is typically aggregated and denormalized.

Modelling techniques like star schema are applied to organize the data into fact and dimension tables. This structure makes it easy for analysts, executives, and decision-makers to quickly access insights without dealing with complex joins or transformations.

In many projects, analysts first create ad hoc reports using the Silver layer. Once these reports are finalized and need to be transitioned into regular BAU reports, summarized and denormalized tables are setup in the Gold layer to support faster and more efficient report generation.

Please take a look at these examples to get a high-level understanding of how the data in this layer is structured.

Before I conclude, I’d like to share my thoughts on the topic we’ve been discussing!

Many of you with extensive experience in the data industry might feel that Medallion Architecture is not something new, and that we've been doing this for years, but may be with slightly different names.

Yes, you are absolutely correct! You can easily relate the bronze layer to the data lake, the silver layer to the data warehouse, and the gold layer to the data mart.

However, the key distinction is that while we used different platforms and tools for each layer in the past, in Databricks, all three layers sit on a single unified platform built on top of highly affordable and scalable cloud object storage!

That's all for today! Please stay tuned for our next video where we’ll explore more Databricks topics!

Search This Blog

SleekData

Databricks Medallion Architecture: Data Modeling Guide, Best practices, Standards, Examples

Comments

Post a Comment

Popular posts from this blog

How to Install Airflow on Windows

How to Install DBT and Set Up a Project, Create Your First dbt Model

Airflow DAGs, Operators, Tasks & Providers