Posts

Showing posts from December, 2023

Databricks Medallion Architecture: Data Modeling Guide, Best practices, Standards, Examples

  Hello Data Pros, and welcome back to another exciting blog in our Databricks learning series! In our last blog, we explored various platform architectures, focusing on the modern Data Lakehouse architecture and its implementation using the Delta Lake framework. Today we move to the next big question: with this powerful platform architecture in place, how do we organize and model our data effectively? That's where the Medallion Architecture comes in! So, what exactly is the Medallion Architecture? It's a data design pattern developed to logically structure and organize your data within a Lakehouse. Its main purpose is to progressively improve the quality and usability of your data as it moves through different stages, such as Bronze, Silver, and Gold. Think of it as a transformation journey, where raw data is refined step by step into a polished, analysis-ready state! Some people call it a multi-hop architecture because the data flows through several tr...
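
The Bronze-to-Silver-to-Gold progression described above can be illustrated with a toy sketch in plain Python. This is not Databricks or Delta Lake code (there, each layer would be a Delta table transformed with Spark); the records, cleansing rules, and aggregation here are invented purely to show how each hop improves data quality and usability.

```python
# Bronze: raw, as-ingested records -- duplicates and bad rows are kept as-is.
bronze = [
    {"order_id": 1, "amount": "100.0", "country": "US"},
    {"order_id": 1, "amount": "100.0", "country": "US"},  # duplicate ingest
    {"order_id": 2, "amount": None,    "country": "UK"},  # missing amount
    {"order_id": 3, "amount": "250.5", "country": "US"},
]

# Silver: validated, de-duplicated, correctly typed records.
seen = set()
silver = []
for row in bronze:
    if row["amount"] is None or row["order_id"] in seen:
        continue  # drop bad rows and duplicates
    seen.add(row["order_id"])
    silver.append({**row, "amount": float(row["amount"])})

# Gold: business-level aggregate, ready for reporting and analytics.
gold = {}
for row in silver:
    gold[row["country"]] = gold.get(row["country"], 0.0) + row["amount"]

print(silver)  # two clean rows
print(gold)    # {'US': 350.5}
```

Each layer is derived from the one before it, so the raw Bronze data is never lost and any downstream layer can be rebuilt if the cleansing or aggregation logic changes.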

Datasets and data-aware scheduling - Apache Airflow

  Hello Data Pros! In our last blog, we explored Airflow hooks and discussed how they differ from Airflow operators. Today, we'll focus on Airflow datasets and demonstrate the workings of data-aware scheduling and cross-DAG dependencies! Data-aware scheduling is a feature of Apache Airflow that enables the execution of DAGs based on the availability of files or datasets. This stands in contrast to the time-based scheduling that we're already familiar with from our previous examples. A dataset is a logical representation of the underlying data. You can create a Dataset in Airflow simply by instantiating the Dataset class. Within the dataset, you can use the complete URI of your physical dataset, or just a descriptive string. Even if you give the complete URI, Airflow does not try to access or validate the data represented by the dataset URI. Instead, it's just treated as a string or identifier, and b...
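
The pattern described above can be sketched as a pair of DAGs, assuming Airflow 2.4 or later (where Datasets were introduced). The DAG IDs, task names, and the S3 URI below are hypothetical; as the excerpt notes, the URI is only an identifier, and Airflow never reads the data behind it. A producer task declares the dataset in its `outlets`, and a consumer DAG passes the same dataset object as its `schedule`:

```python
import pendulum
from airflow import DAG, Dataset
from airflow.operators.python import PythonOperator

# The URI is just an identifier; Airflow does not access or validate it.
orders_dataset = Dataset("s3://my-bucket/orders/")  # hypothetical URI

with DAG(
    dag_id="producer_dag",
    start_date=pendulum.datetime(2023, 12, 1, tz="UTC"),
    schedule="@daily",  # ordinary time-based schedule
) as producer:
    # When this task succeeds, Airflow records the dataset as updated.
    PythonOperator(
        task_id="write_orders",
        python_callable=lambda: print("orders written"),
        outlets=[orders_dataset],
    )

with DAG(
    dag_id="consumer_dag",
    start_date=pendulum.datetime(2023, 12, 1, tz="UTC"),
    schedule=[orders_dataset],  # data-aware: runs when the dataset updates
) as consumer:
    PythonOperator(
        task_id="process_orders",
        python_callable=lambda: print("orders processed"),
    )
```

This gives you a cross-DAG dependency without any sensors or fixed schedule offsets: the consumer runs whenever the producer reports a successful update to the dataset.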