CONTRIBUTING.rst


Contributions are welcome and are greatly appreciated! Every little bit helps, and credit will always be given.

If you are new to the project, you may need help understanding how the community works, and you may want mentorship from other community members, mostly committers. Mentoring new members is part of a committer's job, so do not be afraid to ask committers for help. You can do so via comments in your Pull Request, on the devlist, or via Slack. For your convenience, we have a dedicated #newbie-questions Slack channel where you can ask any question you want. It is a safe space where people asking questions are not expected to know a lot about Airflow (yet!).

If you are looking for a more structured mentoring experience, you can apply to the Apache Software Foundation's Official Mentoring Programme. Feel free to apply to the programme and follow up with the community.

Report bugs through Apache JIRA.

Please report relevant information and preferably code that exhibits the problem.

Look through the JIRA issues for bugs. Anything is open to whoever wants to implement it.

Look through the Apache JIRA for features.

Any unassigned "Improvement" issue is open to whoever wants to implement it.

We've created the operators, hooks, macros and executors we needed, but we've made sure that this part of Airflow is extensible. New operators, hooks, macros and executors are very welcome!

Airflow could always use better documentation, whether as part of the official Airflow docs, in docstrings, docs/*.rst or even on the web as blog posts or articles.

The best way to send feedback is to open an issue on Apache JIRA.

If you are proposing a new feature:

  • Explain in detail how it would work.
  • Keep the scope as narrow as possible to make it easier to implement.
  • Remember that this is a volunteer-driven project, and that contributions are welcome :)

The latest API documentation is usually available here.

To generate a local version:

  1. Set up an Airflow development environment.
  2. Install the doc extra.
pip install -e '.[doc]'
  3. Generate and serve the documentation as follows:
cd docs
./build.sh
./start_doc_server.sh

Note

The docs build script build.sh requires bash 4.0 or greater. If you are building on macOS, you can install the latest version of bash with Homebrew.

Known issues:

If you are creating a new directory for a new integration in the airflow.providers package, you should also update the docs/autoapi_templates/index.rst file.

If you are creating a hooks, sensors, or operators directory in the airflow.providers package, you should also update the docs/operators-and-hooks-ref.rst file.

If you are creating an example_dags directory, you need to create example_dags/__init__.py with the Apache license, or copy another __init__.py file that contains the necessary license.
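The last point can be sketched in a few lines of Python. This is only an illustration: `write_license_init` is a hypothetical helper, not part of Airflow, and copying an existing `__init__.py` from the repository achieves the same result. The header text is the standard ASF source header.

```python
from pathlib import Path

# Standard ASF source header used by Apache projects.
APACHE_LICENSE_HEADER = """\
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
"""


def write_license_init(package_dir):
    """Create <package_dir>/__init__.py containing only the license header.

    Hypothetical helper for illustration. An existing __init__.py is
    left untouched so that nothing is overwritten by accident.
    """
    init_file = Path(package_dir) / "__init__.py"
    init_file.parent.mkdir(parents=True, exist_ok=True)
    if not init_file.exists():
        init_file.write_text(APACHE_LICENSE_HEADER)
    return init_file
```

Calling `write_license_init("path/to/example_dags")` with the directory of your new integration would create the file; the path here is a placeholder.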

Before you submit a pull request (PR) from your forked repo, check that it meets these guidelines:

  • Include tests, either as doctests, unit tests, or both, in your pull request.

    The airflow repo uses Travis CI to run the tests and codecov to track coverage. You can set up both for free on your fork (see Travis CI Testing Framework usage guidelines). It will help you make sure you do not break the build with your PR and that you help increase coverage.

  • Follow our project's Coding style and best practices.

    These are things that aren't currently enforced programmatically (either because they are too hard or just not yet done).

  • Rebase your fork, squash commits, and resolve all conflicts.

  • When merging PRs, wherever possible try to use Squash and Merge instead of Rebase and Merge.

  • Make sure every pull request introducing code changes has an associated JIRA ticket. The JIRA link should also be added to the PR description. For documentation-only changes, a JIRA ticket is not necessary.

  • Preface your commit's subject & PR title with [AIRFLOW-NNNN] COMMIT_MSG where NNNN is the JIRA number. For example: [AIRFLOW-5574] Fix Google Analytics script loading. For documentation-only changes, use "[AIRFLOW-XXXX]" instead. We compose Airflow release notes from all commit titles in a release. By placing the JIRA number in the commit title and hence in the release notes, we let Airflow users look into JIRA and GitHub PRs for more details about a particular change.

  • Add an Apache License header to all new files.

    If you have pre-commit hooks enabled, they automatically add license headers during commit.

  • If your pull request adds functionality, make sure to update the docs as part of the same PR. A docstring is often sufficient. Make sure to follow the Sphinx-compatible standards.

  • Make sure your code passes all the static code checks we have in our code. The easiest way to make sure of that is to use pre-commit hooks.

  • Run tests locally before opening a PR.

  • Make sure the pull request works for Python 3.6 and 3.7.

  • Adhere to guidelines for commit messages described in this article. This makes the lives of those who come after you a lot easier.
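The title convention from the guidelines above ([AIRFLOW-NNNN] for code changes, [AIRFLOW-XXXX] for documentation-only changes) can be checked mechanically. The following is a minimal sketch; the function name `is_valid_title` and the exact regular expression are assumptions made for illustration and are not part of the project's tooling.

```python
import re

# "[AIRFLOW-<digits>] message" for code changes,
# "[AIRFLOW-XXXX] message" for documentation-only changes.
TITLE_PATTERN = re.compile(r"^\[AIRFLOW-(\d+|XXXX)\] \S.*$")


def is_valid_title(title):
    """Return True if a commit subject or PR title follows the convention."""
    return TITLE_PATTERN.match(title) is not None
```

For example, `is_valid_title("[AIRFLOW-5574] Fix Google Analytics script loading")` is accepted, while a title without the bracketed prefix is rejected.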

All new development in Airflow happens in the master branch. All PRs should target that branch. We also have a v1-10-test branch that is used to test 1.10.x series of Airflow and where committers cherry-pick selected commits from the master branch. Cherry-picking is done with the -x flag.

The v1-10-test branch might be broken at times during testing. Expect force-pushes there so committers should coordinate between themselves on who is working on the v1-10-test branch - usually these are developers with the release manager permissions.

Once the branch is stable, the v1-10-stable branch is synchronized with v1-10-test. The v1-10-stable branch is used to release 1.10.x releases.

There are two environments, available on Linux and macOS, that you can use to develop Apache Airflow: a local virtualenv development environment and the Docker-based Breeze development environment.

The table below summarizes differences between the two environments:

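+--------------------------+--------------------------------+-------------------------------------+
| Property                 | Local virtualenv               | Breeze environment                  |
+==========================+================================+=====================================+
| Test coverage            | (-) unit tests only            | (+) integration and unit tests      |
+--------------------------+--------------------------------+-------------------------------------+
| Setup                    | (+) automated with breeze cmd  | (+) automated with breeze cmd       |
+--------------------------+--------------------------------+-------------------------------------+
| Installation difficulty  | (-) depends on the OS setup    | (+) works whenever Docker works     |
+--------------------------+--------------------------------+-------------------------------------+
| Team synchronization     | (-) difficult to achieve       | (+) reproducible within team        |
+--------------------------+--------------------------------+-------------------------------------+
| Reproducing CI failures  | (-) not possible in many cases | (+) fully reproducible              |
+--------------------------+--------------------------------+-------------------------------------+
| Ability to update        | (-) requires manual updates    | (+) automated update via breeze cmd |
+--------------------------+--------------------------------+-------------------------------------+
| Disk space and CPU usage | (+) relatively lightweight     | (-) uses GBs of disk and many CPUs  |
+--------------------------+--------------------------------+-------------------------------------+
| IDE integration          | (+) straightforward            | (-) via remote debugging only       |
+--------------------------+--------------------------------+-------------------------------------+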

Typically, we recommend using both of these environments, depending on your needs.

All details about using and running the local virtualenv environment for Airflow can be found in LOCAL_VIRTUALENV.rst.

Benefits:

  • Packages are installed locally. No container environment is required.
  • You can benefit from local debugging within your IDE.
  • With the virtualenv in your IDE, you can benefit from autocompletion and running tests directly from the IDE.

Limitations:

  • You have to keep your dependencies and local environment consistent with the other development environments that you have on your local machine.

  • You cannot run tests that require external components, such as mysql, postgres database, hadoop, mongo, cassandra, redis, etc.

    The tests in Airflow are a mixture of unit and integration tests, and some of them require these components to be set up. The local virtualenv supports only real unit tests. Technically, to run integration tests, you can configure and install the dependencies on your own, but this is usually complex. Instead, we recommend using the Breeze development environment, which has all required packages pre-installed.

  • You need to make sure that your local environment is consistent with other developer environments. This often leads to a "works for me" syndrome. The Breeze container-based solution provides a reproducible environment that is consistent with other developers.

  • You are STRONGLY encouraged to also install and use pre-commit hooks for your local virtualenv development environment. Pre-commit hooks can speed up your development cycle a lot.

All details about using and running Airflow Breeze can be found in BREEZE.rst.

The Airflow Breeze solution is intended to ease your local development as "It's a Breeze to develop Airflow".

Benefits:

  • Breeze is a complete environment that includes external components, such as mysql database, hadoop, mongo, cassandra, redis, etc., required by some Airflow tests. Breeze provides a preconfigured Docker Compose environment where all these services are available and can be used by tests automatically.
  • The Breeze environment is almost the same as the one used in Travis CI automated builds. So, if the tests run in your Breeze environment, they will work in Travis CI as well.

Limitations:

  • The Breeze environment takes significant space in your local Docker cache. There are separate environments for different Python and Airflow versions, and each of the images takes around 3GB in total.
  • Though the Airflow Breeze setup is automated, it takes time. The Breeze environment uses pre-built images from DockerHub, and it takes time to download and extract those images. Building the environment for a particular Python version takes less than 10 minutes.
  • The Breeze environment runs in the background, taking precious resources such as disk space and CPU. You can stop the environment manually after you use it, or even use a bare environment to decrease resource usage.

NOTE: Breeze CI images are not supposed to be used in production environments. They are optimized for repeatability of tests, maintainability and speed of building rather than production performance. The production images are not yet officially published.

We check our code quality via static code checks. See STATIC_CODE_CHECKS.rst for details.

Your code must pass all the static code checks in Travis CI in order to be eligible for Code Review. The easiest way to make sure your code is good before pushing is to use pre-commit checks locally as described in the static code checks documentation.

Most of our coding style rules are enforced programmatically by flake8 and pylint (which are run automatically on every pull request), but there are some rules that are not yet automated and are more Airflow-specific or semantic than stylistic.

Explicit is better than implicit. If a function accepts a session parameter it should not commit the transaction itself. Session management is up to the caller.

To make this easier there is the create_session helper:

from airflow.utils.session import create_session

def my_call(*args, session):
    ...
    # You MUST NOT commit the session here.

with create_session() as session:
    my_call(*args, session=session)

If this function is designed to be called by "end-users" (i.e. DAG authors) then using the @provide_session wrapper is okay:

from airflow.utils.session import provide_session

...

@provide_session
def my_method(arg1, arg2, session=None):
    ...
    # You SHOULD NOT commit the session here. The wrapper takes care of commit(), or rollback() if an exception is raised.
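To illustrate the contract described above, here is a simplified, self-contained sketch of what a provide_session-style wrapper does. It is not Airflow's implementation: `FakeSession`, `provide_fake_session` and `record_event` are hypothetical stand-ins so that the example runs without a database, and for brevity the sketch only handles a session passed as a keyword argument.

```python
import functools


class FakeSession:
    """Hypothetical stand-in for a database session; records what happens."""

    def __init__(self):
        self.events = []

    def commit(self):
        self.events.append("commit")

    def rollback(self):
        self.events.append("rollback")

    def close(self):
        self.events.append("close")


def provide_fake_session(func):
    """Create a session only when the caller did not pass one.

    If the wrapper creates the session, it owns commit/rollback/close.
    If the caller passes a session, the caller keeps control of it.
    """

    @functools.wraps(func)
    def wrapper(*args, session=None, **kwargs):
        if session is not None:
            # Caller-managed session: never commit on the caller's behalf.
            return func(*args, session=session, **kwargs)
        session = FakeSession()
        try:
            result = func(*args, session=session, **kwargs)
            session.commit()
            return result
        except Exception:
            session.rollback()
            raise
        finally:
            session.close()

    return wrapper


@provide_fake_session
def record_event(name, session=None):
    # The function uses the session but does not commit it itself.
    session.events.append(f"work:{name}")
    return session
```

Calling `record_event("a")` lets the wrapper create, commit and close the session, whereas `record_event("b", session=own_session)` leaves the commit decision to the caller, which is the behavior the rule above asks for.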

We support the following types of tests:

  • Unit tests are Python tests launched with pytest. Unit tests are available both in the Breeze environment and local virtualenv.
  • Integration tests are available in the Breeze development environment that is also used for Airflow Travis CI tests. Integration tests are special tests that require additional services running, such as Postgres, MySQL, Kerberos, etc.
  • System tests are automatic tests that use external systems like Google Cloud Platform. These tests are intended for an end-to-end DAG execution.

For details on running different types of Airflow tests, see TESTING.rst.

When developing features, you may need to persist information to the metadata database. Airflow uses Alembic to handle all schema changes. Alembic must be installed on your development machine before you continue with the migration.

# starting at the root of the project
$ pwd
~/airflow
# change to the airflow directory
$ cd airflow
$ alembic revision -m "add new field to db"
   Generating ~/airflow/airflow/migrations/versions/12341123_add_new_field_to_db.py

airflow/www/ contains all yarn-managed, front-end assets. Flask-Appbuilder itself comes bundled with jQuery and bootstrap. While they may be phased out over time, these packages are currently not managed with yarn.

Make sure you are using recent versions of node and yarn. No problems have been found with node>=8.11.3 and yarn>=1.19.1.
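As a quick way to compare your installed tools against the versions mentioned above, here is a minimal sketch. The helper names are made up for this example, and it assumes that `node --version` and `yarn --version` print plain numeric versions such as `v8.11.3` or `1.19.1` (pre-release suffixes are not handled).

```python
import subprocess

# Versions with which no problems have been found, per the text above.
MINIMUMS = {"node": "8.11.3", "yarn": "1.19.1"}


def parse_version(text):
    """Turn 'v8.11.3' or '1.19.1' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def meets_minimum(tool, installed):
    """Return True if the installed version is at least the listed minimum."""
    return parse_version(installed) >= parse_version(MINIMUMS[tool])


def installed_version(tool):
    """Ask the tool itself for its version (requires it to be on PATH)."""
    completed = subprocess.run(
        [tool, "--version"],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return completed.stdout.strip()
```

A typical check would be `meets_minimum("node", installed_version("node"))`. Comparing tuples of integers avoids the pitfall of string comparison, where "1.9.4" would sort after "1.19.1".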

Make sure yarn is available in your environment.

To install yarn on macOS:

  1. Run the following commands (taken from this source):
brew install node --without-npm
brew install yarn
yarn config set prefix ~/.yarn
  2. Add ~/.yarn/bin to your PATH so that commands you are installing could be used globally.
  3. Set up your .bashrc file and then source ~/.bashrc to reflect the change.
export PATH="$HOME/.yarn/bin:$PATH"
  4. Install third-party libraries defined in package.json by running the following commands within the airflow/www/ directory:
# from the root of the repository, move to where our JS package.json lives
cd airflow/www/
# run yarn install to fetch all the dependencies
yarn install

These commands install the libraries in a new node_modules/ folder within www/.

Should you add or upgrade a node package, run yarn add --dev <package> for packages needed in development or yarn add <package> for packages used by the code. Then push the newly generated package.json and yarn.lock files so that we get a reproducible build. See the Yarn docs for more details.

To parse and generate bundled files for Airflow, run either of the following commands:

# Compiles the production / optimized js & css
yarn run prod

# Starts a web server that manages and updates your assets as you modify them
yarn run dev

We try to enforce a more consistent style and follow the JS community guidelines.

Once you add or modify any JavaScript code in the project, please make sure it follows the guidelines defined in the Airbnb JavaScript Style Guide.

Apache Airflow uses ESLint as a tool for identifying and reporting on patterns in JavaScript. To use it, run any of the following commands:

# Check JS code in .js and .html files, and report any errors/warnings
yarn run lint

# Check JS code in .js and .html files, report any errors/warnings and fix them if possible
yarn run lint:fix

Typically, you start your first contribution by reviewing open tickets at Apache JIRA.

For example, you want to have the following sample ticket assigned to you: AIRFLOW-5934: Add extra CC: to the emails sent by Airflow.

In general, your contribution includes the following stages:

Contribution Workflow

  1. Make your own fork of the Apache Airflow main repository.
  2. Create a local virtualenv, initialize the Breeze environment, and install the pre-commit framework. If you want to add more changes in the future, set up your own Travis CI fork.
  3. Join devlist and set up a Slack account.
  4. Make the change and create a Pull Request from your fork.
  5. Ping @ #development slack, comment @people. Be annoying. Be considerate.

From the apache/airflow repo, create a fork:

Creating a fork

Configure the Docker-based Breeze development environment and run tests.

You can use the default Breeze configuration as follows:

  1. Install the latest versions of the Docker Community Edition and Docker Compose and add them to the PATH.

  2. Enter Breeze: ./breeze

    Breeze starts with downloading the Airflow CI image from the Docker Hub and installing all required dependencies.

  3. Enter the Docker environment and mount your local sources to make them immediately visible in the environment.

  4. Create a local virtualenv, for example:

mkvirtualenv myenv --python=python3.6
  5. Initialize the created environment:
./breeze --initialize-local-virtualenv
  6. Open your IDE (for example, PyCharm) and select the virtualenv you created as the project's default virtualenv in your IDE.

For effective collaboration, make sure to join the following Airflow groups:

  1. Update the local sources to address the JIRA ticket.

    For example, to address this example JIRA ticket, do the following:

    • Read about email configuration in Airflow.
    • Find the class you should modify. For the example ticket, this is email.py.
    • Find the test class where you should add tests. For the example ticket, this is test_email.py.
    • Create a local branch for your development. Make sure to use the latest apache/master as the base for the branch. See How to Rebase PR for details on setting up the apache remote. Note: some people develop their changes directly in their own master branches. This is OK, and you can make a PR from your master to apache/master, but we recommend always creating a local branch for your development. This lets you easily compare changes, work on several changes at the same time, and more. If you have apache set as a remote, you can make sure that you have the latest changes in your master by running git pull apache master while on the local master branch. If you have conflicts and want to discard your locally changed master, you can override your local changes with git fetch apache; git reset --hard apache/master.
    • Modify the class and add necessary code and unit tests.
    • Run the unit tests from the IDE or local virtualenv as you see fit.
    • Run the tests in Breeze.
    • Run and fix all the static checks. If you have pre-commits installed, this step is automatically run while you are committing your code. If not, you can do it manually via git add and then pre-commit run.
  2. Rebase your fork, squash commits, and resolve all conflicts. See How to rebase PR if you need help with rebasing your change. Remember to rebase often if your PR takes a lot of time to review/fix. This will make rebase process much easier and less painful - and the more often you do it, the more comfortable you will feel doing it.

  3. Re-run the static code checks.

  4. Create a pull request with the following title for the sample ticket: [AIRFLOW-5934] Added extra CC: field to the Airflow emails.

Make sure to follow other PR guidelines described in this document.

PR Review

Note that committers will use Squash and Merge instead of Rebase and Merge when merging PRs, and your commits will be squashed into a single commit.

Many people are unfamiliar with the rebase workflow in Git, but we think it is an excellent workflow, much better than the merge workflow, so here is a short guide for those who would like to learn it. It is well worth spending a few minutes learning it. As opposed to the merge workflow, the rebase workflow clearly separates your changes from the changes of others and puts the responsibility for a proper rebase on the author of the change. It also produces a "single-line" series of commits in the master branch, which makes it much easier to understand what was going on and to find the reasons for problems (it is especially useful for "bisecting" when looking for a commit that introduced a bug).

First of all, you can read about the rebase workflow here: Merging vs. rebasing. This is an excellent article that describes all the ins and outs of rebase. We recommend reading it and keeping it as a reference.

The goal of rebasing your PR on top of apache/master is to "transplant" your change on top of the latest changes that are merged by others. It also allows you to fix all the conflicts that result from other people changing the same files as you and merging the changes to apache/master.

Here is how rebase looks in practice:

  1. You need to add Apache remote to your git repository. You can add it as "apache" remote so that you can refer to it easily:

git remote add apache git@github.com:apache/airflow.git if you use ssh or git remote add apache https://github.com/apache/airflow.git if you use https.

Later on

  2. You need to make sure that you have the latest master fetched from the apache repository. You can do it by running git fetch apache for the apache remote or git fetch --all to fetch all remotes.
  3. Assuming that your feature is in a branch in your repository called my-branch, you can easily check which base commit you should rebase from by running: git merge-base my-branch apache/master. This will print the HASH of the base commit which you should use to rebase your feature from, for example: 5abce471e0690c6b8d06ca25685b0845c5fd270f. You can also find this commit hash manually if you want better control. Run git log and find the first commit that you DO NOT want to "transplant". git rebase HASH will "transplant" all commits after the commit with the HASH.
  4. Make sure you checked out your branch locally:

git checkout my-branch

  5. Rebase: Run: git rebase HASH --onto apache/master for example: git rebase 5abce471e0690c6b8d06ca25685b0845c5fd270f --onto apache/master
  6. If you have no conflicts, that's cool. You rebased. You can now run git push --force-with-lease to push your changes to your repository. That should trigger the build in CI if you have a Pull Request opened already.
  7. While rebasing you might have conflicts. Read carefully what git tells you when it prints information about the conflicts. You need to solve the conflicts manually. This is sometimes the most difficult part and requires deliberately correcting your code while checking what has changed since you developed your changes. There are various tools that can help you with that. You can use git mergetool (and you can configure different merge tools with it). You can also use the excellent IntelliJ/PyCharm merge tool. When you open a project with conflicts in PyCharm, you can go to VCS->Git->Resolve Conflicts, where you will find a very intuitive and helpful merge tool. You can see more information about it in Resolve conflicts.
  8. After you have solved the conflicts, simply run git rebase --continue and go to either point 6 or 7 above, depending on whether more commits in your PR cause conflicts (rebasing applies each commit from your PR one by one).
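The fetch, merge-base, checkout and rebase steps above can be sketched as a small script. This is an illustration only: the helper names are made up for this example, it assumes the remote is called apache and the target branch is master as in the text, and it deliberately stops when git reports conflicts so that you can resolve them by hand.

```python
import subprocess


def run_git(*args):
    """Run a git command and return its stripped stdout; raises on failure."""
    completed = subprocess.run(
        ["git", *args],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return completed.stdout.strip()


def rebase_command(base_hash, remote="apache", target="master"):
    """Build the `git rebase HASH --onto apache/master` command."""
    return ["git", "rebase", base_hash, "--onto", f"{remote}/{target}"]


def rebase_branch(branch, remote="apache", target="master"):
    """Fetch the remote, find the base commit, check out and rebase the branch.

    Raises subprocess.CalledProcessError if git stops on conflicts; resolve
    them manually and then run `git rebase --continue`.
    """
    run_git("fetch", remote)
    base_hash = run_git("merge-base", branch, f"{remote}/{target}")
    run_git("checkout", branch)
    subprocess.run(rebase_command(base_hash, remote, target), check=True)
    return base_hash
```

After a successful `rebase_branch("my-branch")`, pushing with git push --force-with-lease remains a separate, manual step, so nothing is published before you have reviewed the result.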

Apache Airflow is a Community within Apache Software Foundation. As the motto of the Apache Software Foundation states "Community over Code" - people in the community are far more important than their contribution.

This means that communication plays a big role in it, and this chapter is all about it.

We have various channels of communication - starting from the official devlist, comments in the Pull Requests, Slack, wiki.

All those channels can be used for different purposes. You can join the channels via links at the Airflow Community page.

  • The Apache Airflow devlist for:
    • official communication
    • general issues, asking community for opinion
    • discussing proposals
    • voting
  • The Airflow CWiki for:
    • detailed discussions on big proposals (Airflow Improvement Proposals, also known as AIPs)
    • helpful, shared resources (for example Apache Airflow logos)
    • information that can be re-used by others (for example instructions on preparing workshops)
  • GitHub Pull Requests (PRs) for:
    • discussing implementation details of PRs
    • not for architectural discussions (use the devlist for that)
  • The Apache Airflow Slack for:
    • ad-hoc questions related to development (#development channel)
    • asking for review (#development channel)
    • asking for help with PRs (#how-to-pr channel)
    • troubleshooting (#troubleshooting channel)
    • group talks (including SIG - special interest groups) (#sig-* channels)
    • notifications (#announcements channel)
    • random queries (#random channel)
    • regional announcements (#users-* channels)
    • newbie questions (#newbie-questions channel)
    • occasional discussions (wherever appropriate including group and 1-1 discussions)

The devlist is the most important and official communication channel. In Apache projects you can often hear "if it is not in the devlist - it did not happen". If you discuss and agree with someone from the community on something important for the community (including with a committer or PMC member), the discussion must be captured and reshared on the devlist in order to give other members of the community a chance to participate in it.

We are using certain prefixes for email subjects for different purposes. Start your email with one of those:
  • [DISCUSS] - if you want to discuss something but you have no concrete proposal yet
  • [PROPOSAL] - if, usually after a "[DISCUSS]" thread, you want to propose something and see what other members of the community think about it
  • [AIP-NN] - if the mail is about one of the Airflow Improvement Proposals
  • [VOTE] - if you would like to start voting on a proposal discussed before in a "[PROPOSAL]" thread

Voting is governed by the rules described in Voting.

We all devote our time to the community as individuals who, apart from being active in Apache Airflow, have families, daily jobs, and a right to vacation. Sometimes we are in different time zones or are simply busy with day-to-day duties, so our response time might be delayed. For us it is crucial to remember to respect each other in a project with no formal structure. There are no managers or departments, and most of us are autonomous in our opinions and decisions. All of this makes the Apache Airflow community a great space for open discussion and mutual respect for various opinions.

Disagreements are expected, and discussions might include strong opinions and contradicting statements. Sometimes you might get two committers asking you to do things differently. This has all happened in the past and will continue to happen. As a community we have some mechanisms to facilitate discussion and come to a consensus or conclusion, or we end up voting to make important decisions. It is important that these decisions are not treated as personal wins or losses. In the end, it is the community that we all care about, and what is good for the community should be accepted even if you have a different opinion. There is a nice motto that you should follow in case you disagree with a community decision: "Disagree but engage". Even if you do not agree with a community decision, you should follow it and embrace it (but you are free to express your opinion that you don't agree with it).

As a community, we have high requirements for code quality. This is mainly because we are a distributed and loosely organised team. We have both contributors who make only a single commit and people who add more commits. Sometimes people assume informal "stewardship" over parts of the code for some time, but at any time we should make sure that the code can be taken over by others without excessive communication. Setting high requirements for the code (fairly strict code review, static code checks, requirements for automated tests, pre-commit checks) is the best way to achieve that, by only accepting good-quality code. Thanks to full test coverage we can make sure that we will be able to work with the code in the future. So do not be surprised if you are asked to add more tests or make the code cleaner. This is for the sake of maintainability.

Here are a few rules that are important to keep in mind when you enter our community:

  • Do not be afraid to ask questions
  • The communication is asynchronous - do not expect immediate answers, ping others on Slack (#development channel) if blocked
  • There is a #newbie-questions channel in Slack as a safe place to ask questions
  • You can ask one of the committers to be a mentor for you; committers can guide you within the community
  • You can apply to the more structured Apache Mentoring Programme
  • It’s your responsibility as an author to take your PR from start-to-end including leading communication in the PR
  • It’s your responsibility as an author to ping committers to review your PR - be mildly annoying sometimes, it’s OK to be slightly annoying with your change - it is also a sign for committers that you care
  • Be mindful of the high code quality/test coverage requirements for Apache Airflow
  • If in doubt - ask the community for their opinion or propose to vote at the devlist
  • Discussions should concern subject matters - judge or criticise the merit but never criticise people
  • It’s OK to express your own emotions while communicating - it helps other people to understand you
  • Be considerate of the feelings of others. Talk about how you feel, not what you think of others

As part of the preparation for Airflow 2.0, we decided to prepare backports of the providers packages that can be installed in an Airflow 1.10.*, Python 3.6+ environment. Some of those packages will soon (after testing) be officially released via PyPI, but you can easily build and prepare such packages on your own.

  • The setuptools.py script only works with Python 3.6+. This is also the minimum Python version supported for using the packages.
  • Make sure you have setuptools and wheel installed in your Python environment. The easiest way to do it is to run pip install setuptools wheel
  • Enter the backport_packages directory
  • Usually you only build some of the providers packages. The providers directory is separated into separate providers. You can see the list of all available providers by running python setup_backport_packages.py list-backport-packages. You can build the backport package by running python setup.py <PROVIDER_NAME> bdist_wheel. Note that there might be (and are) dependencies between some packages that might prevent a subset of the packages from being used without installing the packages they depend on. This will be solved soon by adding cross-dependencies between packages.
  • You can build the 'all providers' package by running python setup_backport_packages.py providers bdist_wheel. This package contains all providers, so it does not have issues with cross-dependencies.
  • This creates a wheel package in your dist folder with a name similar to: apache_airflow_providers-0.0.1-py2.py3-none-any.whl
  • You can install this package with pip install <PACKAGE_FILE>
  • You can also build sdist (source distribution packages) by running python setup.py <PROVIDER_NAME> sdist, but this is only needed if you distribute the packages.

Note that these packages are still unofficial. They are not yet released on PyPI, but you can use them to test the master versions of operators/hooks/sensors in a 1.10.* Airflow environment with Python 3.6+.