Posts

Showing posts from November, 2023

Databricks Medallion Architecture: Data Modeling Guide, Best Practices, Standards, Examples

  Hello Data Pros, and welcome back to another exciting blog in our Databricks learning series! In our last blog, we explored various platform architectures, focusing on the modern Data Lakehouse architecture and its implementation using the Delta Lake framework. Today, we're moving to the next big question: with this powerful platform architecture in place, how do we organize and model our data effectively? That's where the Medallion Architecture comes in! So, what exactly is the Medallion Architecture? It's a data design pattern developed to logically structure and organize your data within a Lakehouse. Its main purpose is to progressively improve the quality and usability of your data as it moves through successive stages: Bronze, Silver, and Gold. Think of it as a transformation journey, where raw data is refined step by step into a polished, analysis-ready state! Some people call it a multi-hop architecture because the data flows through several tr...
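
To make the Bronze-to-Silver-to-Gold flow concrete, here's a minimal PySpark sketch of the pattern, assuming a Databricks notebook where `spark` is predefined and Delta Lake is available; the landing path, column names, and table names are hypothetical:

```python
from pyspark.sql import functions as F

# Bronze: ingest raw source data as-is (hypothetical landing path)
raw_df = spark.read.json("/mnt/landing/orders/")
raw_df.write.format("delta").mode("append").saveAsTable("bronze.orders")

# Silver: clean and conform the bronze data
silver_df = (
    spark.read.table("bronze.orders")
    .dropDuplicates(["order_id"])                     # remove duplicate records
    .filter(F.col("order_amount").isNotNull())        # drop incomplete rows
    .withColumn("order_date", F.to_date("order_ts"))  # standardize types
)
silver_df.write.format("delta").mode("overwrite").saveAsTable("silver.orders")

# Gold: business-level aggregates, ready for analytics
gold_df = (
    spark.read.table("silver.orders")
    .groupBy("order_date")
    .agg(F.sum("order_amount").alias("daily_revenue"))
)
gold_df.write.format("delta").mode("overwrite").saveAsTable("gold.daily_revenue")
```

Each hop lands in its own Delta table, so every stage remains independently queryable and auditable as data quality improves.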

Airflow Tutorial - Hooks | Hooks vs Operators | Airflow Hooks Example | When and How to Use

  Hello Data Pros! In our last blog, we uncovered the need for Airflow XComs and demonstrated how to leverage them effectively in your DAGs. Today, we're shifting our focus to Airflow hooks! We're going to cover what hooks are, how they differ from Airflow operators, and lastly, when and how to use hooks in your DAGs. Let's dive right in! Technically, hooks are pre-built Python classes that simplify our interactions with external systems and services. For instance, the popular S3Hook, which is part of the AWS provider package, offers various methods to interact with S3 storage. For example, the create_bucket method creates an Amazon S3 bucket; the load_string method loads a string value as a file in S3; and the delete_objects method can be used to delete S3 files. Now, let's dive into the source code of this hook! As you can see, it's low-level Python code, and if AWS had not provided this hook, you might find yourself having to write all this complex...
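
As a quick illustration of those three methods, here's a hedged sketch using S3Hook from the Amazon provider package; the connection ID, bucket, and key names are hypothetical, and in a real DAG you'd call this inside a task so the Airflow connection can resolve:

```python
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

# Hypothetical connection ID and bucket name, for illustration only
hook = S3Hook(aws_conn_id="aws_default")

# create_bucket: creates an Amazon S3 bucket
hook.create_bucket(bucket_name="my-demo-bucket", region_name="us-east-1")

# load_string: writes a string value as an object (file) in S3
hook.load_string(
    string_data="hello data pros",
    key="demo/hello.txt",
    bucket_name="my-demo-bucket",
    replace=True,
)

# delete_objects: removes one or more objects from the bucket
hook.delete_objects(bucket="my-demo-bucket", keys=["demo/hello.txt"])
```

Each call wraps the lower-level boto3 plumbing (client setup, credentials, retries) that you would otherwise have to write yourself.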

Airflow Tutorial - XCom | How to Pass Data Between Tasks

  Hello Data Pros! In our last blog, we covered deferrable operators and triggers. Now, it's time to explore Airflow's XCom feature! Let's dive right in! By design, Airflow tasks are isolated, which means they cannot exchange data with each other at run time. However, we frequently come across situations that require sharing data between tasks. For instance, you might need to extract a value from a table and, based on that value, perform something in the next task; or you may need to create a file with a dynamic name, such as one with a timestamp, and process the same file in the next task. This is where XCom comes into play. XCom, short for 'cross-communication,' provides a mechanism that allows tasks to exchange data with each other. Let's consider this example DAG. In the first task, we create a file, and in the second task, we upload the same file to S3. With the current setup, this process works well, because we have the 'replace' parameter s...
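
Here's a minimal sketch of that two-task pattern using the TaskFlow API, where returning a value pushes it to XCom and a downstream task argument pulls it back; it assumes a recent Airflow 2.x with the Amazon provider installed, and the connection ID and bucket name are hypothetical:

```python
from datetime import datetime

import pendulum
from airflow.decorators import dag, task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


@dag(schedule=None, start_date=pendulum.datetime(2023, 11, 1), catchup=False)
def xcom_demo():
    @task
    def create_file() -> str:
        # Build a dynamic, timestamped file name and write the file locally
        file_path = f"/tmp/report_{datetime.now():%Y%m%d_%H%M%S}.txt"
        with open(file_path, "w") as f:
            f.write("sample report contents")
        # The return value is pushed to XCom automatically
        return file_path

    @task
    def upload_file(file_path: str) -> None:
        # The argument is pulled from XCom automatically
        S3Hook(aws_conn_id="aws_default").load_file(
            filename=file_path,
            key=file_path.split("/")[-1],
            bucket_name="my-demo-bucket",
            replace=True,
        )

    upload_file(create_file())


xcom_demo()
```

Because the dynamic file name travels through XCom, the upload task always receives exactly the path the first task created, no matter when the run happens.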