Add Python 3+ type-annotations #4

Merged
merged 4 commits into from
Feb 26, 2019

Conversation

y2k-shubham
Contributor

Use typing module to add Type-Annotations as per Python 3+

Contributor

@ajbosco ajbosco left a comment

@y2k-shubham Thanks for the PR. This looks great! Are you using dag-factory at Zomato?

@ajbosco ajbosco merged commit b897dfe into astronomer:master Feb 26, 2019
@y2k-shubham
Contributor Author

@ajbosco thanks a ton! (that was quick)
Hats off for this excellent project; it elegantly sheds light on the right way of constructing DAGs.

At @Zomato we aren't using dag-factory at the moment, but we are actively exploring it and other similar solutions (like etsy/boundary-layer) for inspiration.

@Jaikant

Jaikant commented Dec 30, 2020

@y2k-shubham Which one did you choose at Zomato - dag-factory or boundary-layer?

@y2k-shubham
Contributor Author

y2k-shubham commented Dec 30, 2020

@Jaikant we didn't pick etsy/boundary-layer because

  1. adoption seemed risky since it hadn't gained community traction (the last thing one wants is to end up running a platform built on an abandoned library / framework)
  2. merely scratching the surface revealed several shortcomings (#31, #33, #43) that would've required us to customize it (since community support was non-existent)

And as far as dag-factory is concerned, it is more of an idea (for a general approach) than a full-fledged solution.

So, taking inspiration from dag-factory and also analyzing our requirements, we ended up implementing our own custom solution (it took us ~1 year to evolve and mature the platform) where
  3. we have JSON configs to generate DAGs (1 JSON per DAG)
  4. generated DAGs can't be arbitrary; rather, they have a well-defined structure (in terms of operators and their connections)
  5. generated DAGs belong to one of a small number of types or categories, each having a specific structure in terms of operators and their connections
  6. because of point 3. above, we now have JSONs with a pre-defined structure (each category of DAGs has a specific JSON structure) that can be validated beforehand. Also, since the structure of DAGs is known ahead of time (it need not be inferred from the JSON), we have a DagBuilder class per category of DAG that generates DAGs of that category while passing appropriate args (read from the JSON config) to operators
  7. much like dag-factory (and unlike boundary-layer), we don't perform any code generation; rather, DAGs are materialized in memory only
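The points above (one JSON per DAG, a fixed set of DAG categories, one builder class per category, DAGs built in memory) can be sketched roughly as follows. This is only an illustration, not the actual Zomato code: the category names, config keys, task names and the `DagSpec` stand-in are all hypothetical, and plain Python objects are used in place of Airflow DAGs and operators so the snippet runs on its own.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DagSpec:
    """Plain-Python stand-in for an in-memory DAG (not an Airflow object)."""
    dag_id: str
    tasks: List[str] = field(default_factory=list)
    edges: List[Tuple[str, str]] = field(default_factory=list)  # (upstream, downstream)


class SqlReportDagBuilder:
    """Hypothetical 'sql_report' category: run a query, then send a report."""

    def build(self, config: Dict) -> DagSpec:
        return DagSpec(
            dag_id=config["dag_id"],
            tasks=["run_query", "send_report"],
            edges=[("run_query", "send_report")],
        )


class TableSyncDagBuilder:
    """Hypothetical 'table_sync' category: one load task per table, then a check."""

    def build(self, config: Dict) -> DagSpec:
        loads = [f"load_{table}" for table in config["tables"]]
        return DagSpec(
            dag_id=config["dag_id"],
            tasks=loads + ["verify"],
            edges=[(task, "verify") for task in loads],
        )


# The DAG structure comes from the category's builder, never from the JSON itself.
BUILDERS = {
    "sql_report": SqlReportDagBuilder(),
    "table_sync": TableSyncDagBuilder(),
}


def build_dag(raw_json: str) -> DagSpec:
    """Parse one JSON config and build its DAG in memory (no code generation)."""
    config = json.loads(raw_json)
    return BUILDERS[config["category"]].build(config)


dag = build_dag(
    '{"dag_id": "daily_sales", "category": "table_sync", "tables": ["orders", "users"]}'
)
print(dag.tasks)
```

In a real Airflow deployment the builders would return `airflow.DAG` objects, which have to be exposed at module level in a file under the DAGs folder for the scheduler to discover them.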
Why we picked this approach:
  • We had to support a large number of (150+) internal users who had little or no knowledge of the underlying Airflow platform (over 90% of users aren't even developers; they are very good business analysts who know SQL like the back of their hands). So asking them to specify operators for their DAGs was dumb (and error-prone). Plus, by not putting much Airflow-specific info in the JSON configs (and abstracting Airflow details away from end users), we have the liberty to move to a different platform in future with minimal effort
  • We wanted to have tight control over the logic of the DAGs themselves, partly because we were the owners of the platform and also because our users weren't well versed with Airflow. So by writing the DAG generation code ourselves, we made sure that there were no surprises (other than those where the JSON configs themselves were broken)
  • Since the JSONs had a well-defined structure, we had the option to run validations on them (so that our DAG generation code doesn't error out). Validations could be checks like: all necessary keys are present in the JSON, all values are of the expected types, etc.
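The kind of up-front validation mentioned in the last bullet (required keys present, values of the expected types) might look something like this. The per-category schemas and key names here are made up for illustration; the comment does not say how the real validations were implemented.

```python
from typing import Any, Dict, List

# Hypothetical per-category schema: required key -> expected Python type.
SCHEMAS: Dict[str, Dict[str, type]] = {
    "sql_report": {"dag_id": str, "schedule": str, "query": str},
    "table_sync": {"dag_id": str, "schedule": str, "tables": list},
}


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in the config; an empty list means it is valid."""
    category = config.get("category")
    if category not in SCHEMAS:
        return [f"unknown category: {category!r}"]

    errors = []
    for key, expected in SCHEMAS[category].items():
        if key not in config:
            errors.append(f"missing key: {key}")
        elif not isinstance(config[key], expected):
            actual = type(config[key]).__name__
            errors.append(f"{key} should be {expected.__name__}, got {actual}")
    return errors
```

Returning all problems at once, rather than raising on the first one, is a design choice that suits non-developer users: they can fix their JSON in a single pass before any DAG generation code runs.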
One thing to point out: by employing this approach, we were also able to achieve dependent DAGs (one DAG triggers another set of DAGs upon completion) simply by having users specify "depends_on": ["dag_a", "dag_b"] in their JSONs (now this DAG will be triggered by either dag_a or dag_b, whichever finishes later).
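One possible reading of the "depends_on" behaviour is that a dependent DAG is triggered by whichever of its upstream DAGs finishes last, i.e. only once all of them have completed. Under that assumption, the trigger decision could be sketched as below. The config contents and function names are hypothetical, and how completion state is tracked per run is not described in the comment, so it is simply passed in as a set.

```python
from typing import Dict, List, Set

# Hypothetical configs, reduced to the one key that matters here.
CONFIGS: Dict[str, dict] = {
    "dag_a": {},
    "dag_b": {},
    "dag_c": {"depends_on": ["dag_a", "dag_b"]},
}


def downstream_map(configs: Dict[str, dict]) -> Dict[str, List[str]]:
    """Invert depends_on: upstream dag_id -> DAGs that wait on it."""
    result: Dict[str, List[str]] = {}
    for dag_id, config in configs.items():
        for upstream in config.get("depends_on", []):
            result.setdefault(upstream, []).append(dag_id)
    return result


def dags_to_trigger(finished: str, completed: Set[str], configs: Dict[str, dict]) -> List[str]:
    """DAGs that become runnable once `finished` completes.

    `completed` holds the upstream DAGs that had already finished before this one.
    """
    done = completed | {finished}
    return [
        dag_id
        for dag_id in downstream_map(configs).get(finished, [])
        if set(configs[dag_id]["depends_on"]) <= done
    ]
```

With these configs, `dags_to_trigger("dag_a", set(), CONFIGS)` returns an empty list because dag_b is still pending, while `dags_to_trigger("dag_b", {"dag_a"}, CONFIGS)` returns `["dag_c"]`.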
I haven't worked with the data-platform team at my org for the past ~1 year, but for more info you can reach out to Ayush Chauhan, Rajat Taya & Palash Goel

@Jaikant

Jaikant commented Dec 30, 2020

Awesome. Sounds like you did all the right things. I am still in two minds about adopting airflow and somewhere I feel dagster-io has better abstractions built in.

3 participants