Time concepts in Airflow


In the last post, I showed how to use Airflow on Windows with Docker.

So, you're ready to dive into the world of Airflow. Basically, it gives you a better way to manage schedules (enabling more sophisticated scheduling than crontab).

The caveat is that it does its job well only as long as you understand how it handles time. Even before you consider complex schedules, the one thing that discourages would-be users is its concept of time.

In fact, users who just wish to set up a schedule have been confused by its concept of time from the start.

Essential dates in Airflow

Although Airflow has several time concepts, two of them are both the most confusing and the most important to know: DataInterval and run_after.

DataInterval

Throughout the whole Airflow system, DataInterval is the time concept mentioned the most (and the most confusing). As the name implies, it's not just a single date or time. Rather, it is composed of two times (startdate and enddate).

Hearing the term startdate, you might think that it's the time when your schedule starts working (on your machine).

Unfortunately not. It's the start of the period the data is gathered from. What the heck? Well, as Airflow is best suited to ETL processes, this concept fits that purpose. In other words, DataInterval means the range of data the run should cover.

However, I'm not so sure that it really matters in practice. For instance, if I run a web scraping script with a DataInterval starting yesterday, does it ignore anything dated earlier than startdate? I don't think so.

That is why this was the trickiest concept for me.
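To make "the range of data the run should cover" a bit more concrete, here is a minimal sketch using only the standard library. The `Interval` class is a hypothetical stand-in for DataInterval, and `rows_in_interval` and the sample rows are made up for illustration. The assumption behind it is that the interval is only a pair of timestamps handed to your task, so any filtering has to be done by your own code.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical stand-in for DataInterval: just a start and an end."""
    start: datetime
    end: datetime


def rows_in_interval(rows, interval):
    """Keep rows whose timestamp falls in the half-open range [start, end)."""
    return [r for r in rows if interval.start <= r["ts"] < interval.end]


day = datetime(2024, 1, 1, tzinfo=timezone.utc)
interval = Interval(start=day, end=day + timedelta(days=1))

# Made-up records: one before the interval, one inside, one exactly at the end.
rows = [
    {"id": 1, "ts": day - timedelta(hours=1)},
    {"id": 2, "ts": day + timedelta(hours=9)},
    {"id": 3, "ts": day + timedelta(days=1)},
]

selected = rows_in_interval(rows, interval)
print([r["id"] for r in selected])  # [2]
```

If the task never looks at the interval (like a scraper that simply grabs whatever is on the page), nothing here would stop it from picking up older data, which matches the doubt above.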

run_after

If you're like me, this is the concept you're probably most interested in. It's the time when the run is allowed to start (period). While it's the only concept I care about, the confusing DataInterval is closely related to it.

The time of run_after is the same as the enddate of DataInterval.

I'm not sure whether run_after can be set later than the enddate of DataInterval, so I'm just sticking to the standard (enddate = run_after).

If you understand the concept of DataInterval, it makes sense that run_after is set to the enddate of DataInterval.

The time range we want to gather data for has passed, so now we start the process.
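The relationship can be sketched in a few lines of plain Python. This is only an illustration of the convention described above (run_after equal to the end of the interval) for a one-day interval. `daily_run` is a hypothetical helper, not an Airflow function.

```python
from datetime import datetime, timedelta, timezone


def daily_run(start):
    """Return (startdate, enddate, run_after) for a one-day interval."""
    end = start + timedelta(days=1)
    run_after = end  # the run starts only once the whole interval has passed
    return start, end, run_after


start, end, run_after = daily_run(datetime(2024, 1, 1, tzinfo=timezone.utc))
print(run_after.isoformat())  # 2024-01-02T00:00:00+00:00
```

So the run that covers January 1 does not start on January 1. It starts at midnight on January 2, when that day's data is complete.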

Let's summarize

To sum up, these concepts can be drawn as follows.

Extending these concepts, when you see last_start and last_end, or next_start and next_end, keep in mind that they are all parts of a DataInterval.

Specifically, next_end == run_after and last_end == next_start
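As a rough check of these two equalities, the sketch below chains a few back-to-back intervals. `next_run` is a hypothetical helper and the six-hour period is an arbitrary choice. It assumes the intervals follow each other without gaps, as in the summary above.

```python
from datetime import datetime, timedelta, timezone


def next_run(last_end, period):
    """Derive the next interval and run_after from the previous interval's end."""
    next_start = last_end           # last_end == next_start
    next_end = next_start + period
    run_after = next_end            # next_end == run_after
    return next_start, next_end, run_after


period = timedelta(hours=6)
last_end = datetime(2024, 1, 1, tzinfo=timezone.utc)

runs = []
for _ in range(3):
    next_start, next_end, run_after = next_run(last_end, period)
    runs.append((next_start, next_end, run_after))
    last_end = next_end  # this interval becomes the "last" one for the next run

for start, end, after in runs:
    print(f"{start:%m-%d %H:%M} -> {end:%m-%d %H:%M} | run_after {after:%m-%d %H:%M}")
```

Each printed line is one run: its interval on the left, and the moment it may start on the right.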

These concepts are crucial if you plan to customize the schedule using a timetable.

Be sure to have a concrete understanding of them before diving in.

CC BY-NC 4.0 © min park.