Datasets and data-aware scheduling - Apache Airflow

Hello Data Pros,

In our last blog, we explored Airflow hooks and discussed how they differ from Airflow operators! Today, we'll focus on Airflow datasets and demonstrate how data-aware scheduling and cross-DAG dependencies work!

Data-aware scheduling is a feature of Apache Airflow that enables the execution of DAGs based on the availability of files or datasets. This stands in contrast to the time-based scheduling that we're already familiar with from our previous examples.

A dataset is a logical representation of the underlying data. You can create a Dataset in Airflow simply by instantiating the Dataset class. For the dataset, you can use the complete URI of your physical data set, or just a descriptive string. Even if you give the complete URI, Airflow does not try to access or validate the data represented by the dataset URI. Instead, it is treated purely as a string identifier, and it is used to establish producer and consumer relationships between DAGs, as shown in the sketch below.
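Here is a minimal sketch of what that looks like in code, assuming Airflow 2.4 or later. The S3 URI, DAG IDs, and task callables are illustrative placeholders, not values from this post.

```python
# A minimal sketch of data-aware scheduling (assumes Airflow 2.4+).
# The URI, DAG IDs, and callables below are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.python import PythonOperator

# The Dataset is just an identifier; Airflow never reads or validates this path itself.
orders_dataset = Dataset("s3://my-bucket/orders/daily.csv")

# Producer DAG: any task that lists the dataset in `outlets`
# marks it as updated when the task succeeds.
with DAG(
    dag_id="producer_dag",
    start_date=datetime(2023, 12, 1),
    schedule="@daily",
    catchup=False,
):
    PythonOperator(
        task_id="write_orders_file",
        python_callable=lambda: print("pretend we wrote the orders file"),
        outlets=[orders_dataset],
    )

# Consumer DAG: scheduled on the dataset instead of a time-based cron,
# so it runs whenever the producer task above completes successfully.
with DAG(
    dag_id="consumer_dag",
    start_date=datetime(2023, 12, 1),
    schedule=[orders_dataset],
    catchup=False,
):
    PythonOperator(
        task_id="process_orders_file",
        python_callable=lambda: print("pretend we processed the orders file"),
    )
```

In this sketch, the producer DAG keeps its normal time-based schedule, while the consumer DAG has no cron expression at all; its runs are triggered purely by updates to the dataset, which is what gives us the cross-DAG dependency.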