
External triggers #503

Closed · wants to merge 6 commits

Conversation

ctrebing (Contributor)

I am aware that this pull request is not finished (tests, error handling, documentation). I would like to start a more concrete discussion on the externally triggered DAGs as mentioned in issue #417.

Within my company (http://blue-yonder.com) we are evaluating whether we could use Airflow, and I would really love to do so. I especially liked the model you have chosen in the APIs and the possibility to define DAGs in Python.

What we really need is the possibility to trigger DAGs externally. I read the discussion in the roadmap issue #417 and liked the ideas expressed there. I did a first prototype for the DagRun object and using this in the scheduler. Before investing further work in stabilizing this, I would like to get your feedback on whether this approach fits with the existing concepts. Does it make sense from your point of view to further work on that, or do you already have different plans/implementations?

@mistercrunch (Member)

Nice. This is a very good start. Much in line with what I was thinking.

dag_id = Column(String(ID_LEN), primary_key=True)
execution_date = Column(DateTime, primary_key=True)
run_id = Column(String(ID_LEN))
external_trigger = Column(Boolean, default=False)
Member

we may want to add a state here, so that the scheduler can completely disregard DAG runs that are fully processed. I'm not sure whether it should just be a boolean or whether having more states would help.

Contributor Author

I would use a string, so that failed DAG runs can be marked as well.
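A minimal sketch of what a string-valued state might look like. The state names, and the use of a dataclass instead of the SQLAlchemy model quoted above, are assumptions for illustration, not the PR's actual code:

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical state values; real names would need to match the rest of Airflow.
RUNNING, SUCCESS, FAILED = "running", "success", "failed"

@dataclass
class DagRun:
    dag_id: str
    execution_date: datetime
    run_id: str
    external_trigger: bool = False
    # A string rather than a boolean, so failed runs can be told apart
    # from successfully finished ones.
    state: str = RUNNING

run = DagRun("example_dag", datetime(2015, 10, 1), "manual__2015-10-01")
run.state = FAILED  # a boolean "done" flag could not express this
```

With a string column the scheduler can skip anything whose state is terminal, whatever the terminal reason was.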

@mistercrunch (Member)

We may need a property DAG.active_runs that returns the list of active runs according to DagRun.state, and maintains the state by checking whether len(tasks) equals the number of successful task instances for that date.
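The suggestion above could be sketched like this; the class shapes, state names, and the `ti_states` mapping are assumptions made to keep the example self-contained:

```python
from types import SimpleNamespace

RUNNING, SUCCESS = "running", "success"

class DAG:
    def __init__(self, task_ids, runs):
        self.task_ids = task_ids  # tasks defined in the DAG
        self.runs = runs          # DagRun-like objects with .run_id and .state

    @property
    def active_runs(self):
        # only runs whose DagRun.state is still active
        return [r for r in self.runs if r.state == RUNNING]

    def refresh_run_states(self, ti_states):
        # ti_states: {(run_id, task_id): state} for this DAG's task instances
        for run in self.active_runs:
            succeeded = sum(
                1 for task_id in self.task_ids
                if ti_states.get((run.run_id, task_id)) == SUCCESS
            )
            # the check proposed above: the run is done once the number of
            # successful task instances matches len(tasks)
            if succeeded == len(self.task_ids):
                run.state = SUCCESS

r1 = SimpleNamespace(run_id="r1", state=RUNNING)
dag = DAG(["a", "b"], [r1])
dag.refresh_run_states({("r1", "a"): SUCCESS, ("r1", "b"): SUCCESS})
```

After the refresh, `r1` has moved to success and drops out of `active_runs`.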

@mistercrunch (Member)

Somewhat unrelated: I've been planning to allow the scheduler to be distributable (many scheduler instances running concurrently). It would be a matter of taking locks in DagModel, adding a DAG.schedule_frequency param, and looking at DagModel.last_run to sort which DAG should be processed first. We may also want a way to identify stale locks and auto-unlock a DAG that has been locked for more than, say, 10 minutes.
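The locking idea could look roughly like this; the `locked_by`/`locked_at` fields and the 10-minute timeout are illustrative assumptions, not an actual schema:

```python
from datetime import datetime, timedelta
from types import SimpleNamespace

LOCK_TIMEOUT = timedelta(minutes=10)

def try_lock(dag_row, scheduler_id, now):
    """Try to acquire a DAG for processing, breaking stale locks.

    A lock older than LOCK_TIMEOUT is assumed to belong to a dead
    scheduler and is taken over.
    """
    stale = (dag_row.locked_at is not None
             and now - dag_row.locked_at > LOCK_TIMEOUT)
    if dag_row.locked_by is None or stale:
        dag_row.locked_by = scheduler_id
        dag_row.locked_at = now
        return True
    return False

row = SimpleNamespace(locked_by=None, locked_at=None)
t0 = datetime(2015, 10, 1, 12, 0)
assert try_lock(row, "scheduler-1", now=t0)                              # free
assert not try_lock(row, "scheduler-2", now=t0 + timedelta(minutes=5))   # held
assert try_lock(row, "scheduler-2", now=t0 + timedelta(minutes=11))      # stale
```

In a real multi-scheduler deployment the check-and-set would have to happen atomically in the database (e.g. a conditional UPDATE), not in Python as sketched here.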

TI.dag_id == run.dag_id,
TI.execution_date == run.execution_date
).all()
if len(task_instances) == len(dag.tasks):
Contributor Author

Can a DAG have branches with tasks that are never executed within one run? If so, this check would not be sufficient.
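To illustrate the concern: with branching, some tasks may never get a task instance, so requiring len(task_instances) == len(dag.tasks) could leave the run marked active forever. A branch-tolerant sketch (state names assumed, and deliberately simplified) asks instead whether any existing instance is still active:

```python
ACTIVE_STATES = {"running", "queued", "up_for_retry"}

def run_is_complete(dag_task_ids, ti_states):
    """ti_states maps task_id -> state for one execution_date.

    Tasks absent from ti_states may sit on a branch that was never
    taken, so they don't block completion; only genuinely active
    instances do. A real check would also need to confirm the run
    actually started. This is a sketch, not the PR's code.
    """
    return all(state not in ACTIVE_STATES for state in ti_states.values())

# Task "b" was never executed: the naive length check says incomplete,
# while this check considers the run finished.
assert run_is_complete(["a", "b", "c"], {"a": "success", "c": "success"})
assert not run_is_complete(["a", "b"], {"a": "running"})
```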

@mistercrunch (Member)

#540
