Replies: 6 comments
-
Feel free to explain it better. You are probably the best person in the world to know how to phrase it better as you just struggled with it.
-
Assigned you to it.
-
https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html#data-interval
-
Looking at the documentation, I had the same interpretation as @jceresini. However, I agree with @uranusjr that the ideal would be to link to the section on the data interval.
-
Reading the documentation (linked in my previous comment), one additional thing I would add to clarify against the confusion is that, A But there are discrete schedules as well, where data intervals are not continuous. In those cases,
If anyone is planning on improving the docs, it'd be most awesome if you come in with a clear head, remember what you knew before and after you entered this conversation, and document exactly that. This would be most beneficial to regular Airflow users, since they would likely be in the same place you were, and the best way to help them figure this out is to take them through the exact thought process you went through.
-
This is still unclear in the docs. The link proposed by @uranusjr does not mention the template variables
And maybe some discrete schedule examples.
-
What do you see as an issue?
The documentation for the template variable data_interval_end states simply that it's the "End of the data interval". Initially I took that to mean the final second/microsecond (depending on precision) of the data interval, but it's actually the start of the next interval.
For example, given a schedule that runs every 5 minutes, the variables are set as follows (for a simple test I just ran):
As opposed to:
It makes sense as implemented, but I'd like to see the documentation state the behavior explicitly.
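To make the behavior concrete, here is a minimal sketch in plain Python (no Airflow imports) that builds consecutive 5-minute intervals. The timestamps and the helper name five_minute_intervals are made up for illustration and are not the output of the test mentioned above; the only point is that each interval's end equals the next interval's start.

```python
from datetime import datetime, timedelta, timezone


def five_minute_intervals(first_start, count):
    """Yield (start, end) pairs for consecutive 5-minute intervals.

    Hypothetical helper for illustration; it is not part of Airflow.
    """
    step = timedelta(minutes=5)
    start = first_start
    for _ in range(count):
        end = start + step
        yield start, end
        # The next interval starts exactly where this one ended.
        start = end


# Hypothetical starting point, chosen only for illustration.
first = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
intervals = list(five_minute_intervals(first, 3))

for start, end in intervals:
    print(f"data_interval_start={start.isoformat()}  data_interval_end={end.isoformat()}")
```

Under this model no instant is missed between runs, but the end value itself already belongs to the following interval.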
Solving the problem
The documentation I read when using the template variables is https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html
I'm not sure how to word it, but it would be helpful to indicate that data_interval_end is effectively the data_interval_start of the next interval. Or that the interval the DAG is operating on, using mathematical interval notation, is [data_interval_start, data_interval_end).
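A minimal sketch of what the half-open notation means in practice, in plain Python with made-up record timestamps and hypothetical helpers (in_interval, fetch). A record stamped exactly on a boundary belongs to one interval under [start, end), but to two adjacent intervals if both comparisons are inclusive.

```python
from datetime import datetime, timedelta, timezone


def in_interval(ts, start, end, closed_end=False):
    """Return True if ts is in [start, end), or in [start, end] when closed_end is True."""
    if closed_end:
        return start <= ts <= end
    return start <= ts < end


# Two adjacent, hypothetical 5-minute intervals: [t0, t1) and [t1, t2).
t0 = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
t1 = t0 + timedelta(minutes=5)
t2 = t1 + timedelta(minutes=5)

# Made-up record timestamps; the second one sits exactly on the shared boundary.
records = [t0 + timedelta(minutes=1), t1, t1 + timedelta(minutes=2)]


def fetch(start, end, closed_end):
    """Stand-in for an API query filtered by timestamp (assumption, not a real API)."""
    return [r for r in records if in_interval(r, start, end, closed_end)]


# Inclusive on both sides: the boundary record is returned by both runs.
closed = fetch(t0, t1, True) + fetch(t1, t2, True)
# Half-open: every record is returned exactly once.
half_open = fetch(t0, t1, False) + fetch(t1, t2, False)

print(len(closed), len(half_open))
```

With three records, the inclusive filter returns four rows across the two runs because the boundary record appears twice, while the half-open filter returns three.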
Anything else
Just an explanation of how we ran into this:
We are running DAGs periodically that pull time-series data from some API. The jobs query the API with filters like this:
We noticed we were getting some duplicated data, specifically data that happened exactly on the data_interval_start value. The simple fix (once we saw the behavior of the variables) was to remove the = from the second filter:
Are you willing to submit PR?
Code of Conduct