Datasets and data-aware scheduling - Apache Airflow
Hello Data Pros! In our last blog, we explored Airflow hooks and discussed how they differ from Airflow operators. Today, we'll focus on Airflow datasets and demonstrate how data-aware scheduling and cross-DAG dependencies work!

Data-aware scheduling is a feature of Apache Airflow that enables the execution of DAGs based on the availability of files or datasets. This stands in contrast to the time-based scheduling that we're already familiar with from our previous examples.

A dataset is a logical representation of the underlying data. You can create a dataset in Airflow simply by instantiating the Dataset class. Within the dataset, you can use the complete URI of your physical data, or just a descriptive string. Even if you provide the complete URI, Airflow does not try to access or validate the data the URI represents. Instead, the URI is treated as a plain string identifier and is used to establish producer-consumer relationships between DAGs.
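Here's a minimal sketch of how this fits together, assuming Airflow 2.4 or later (where dataset scheduling was introduced). The bucket path, DAG IDs, and task callables below are hypothetical placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.python import PythonOperator

# The URI is treated purely as an identifier; Airflow never reads
# or validates the file at this (hypothetical) path.
orders_dataset = Dataset("s3://example-bucket/orders/orders.csv")

# Producer DAG: runs on a time-based schedule and declares, via the
# `outlets` argument, that one of its tasks updates the dataset.
with DAG(
    dag_id="producer_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    PythonOperator(
        task_id="write_orders",
        python_callable=lambda: print("writing orders file"),
        outlets=[orders_dataset],  # marks the dataset as updated on success
    )

# Consumer DAG: instead of a cron expression, its `schedule` is a
# list of datasets; it runs whenever the producer task succeeds.
with DAG(
    dag_id="consumer_dag",
    start_date=datetime(2024, 1, 1),
    schedule=[orders_dataset],
    catchup=False,
):
    PythonOperator(
        task_id="process_orders",
        python_callable=lambda: print("processing orders file"),
    )
```

When the write_orders task succeeds, Airflow records an update to orders_dataset, which in turn triggers a run of consumer_dag, with no cron schedule required on the consumer side.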