Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bind log server on worker to IPv6 address (#24755) #24846

Merged
merged 1 commit into from
Jul 7, 2022
Merged

Conversation

zartstrom
Copy link
Contributor

@zartstrom zartstrom commented Jul 5, 2022

Bind log server on worker to IPv6 address (#24755)

The worker(s) run a gunicorn web server that serves task logs,
which are displayed on the web-ui (the Logs).
Until now the log server did bind the address 0.0.0.0:<port>, see airflow/utils/serve_logs.py.

In a Kubernetes cluster that allows IPv6 traffic only, the worker logs are not reachable,
because the log server does not listen on [::]:<port>.
Therefore we bind another address and listen to both IPv4 and IPv6 traffic.

closes: #24755

@boring-cyborg
Copy link

boring-cyborg bot commented Jul 5, 2022

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it’s a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

# Gunicorn can bind multiple addresses, see https://docs.gunicorn.org/en/stable/settings.html#bind.
options = [
GunicornOption("bind", f"0.0.0.0:{worker_log_server_port}"),
GunicornOption("bind", f"[::]:{worker_log_server_port}"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this cause any problems if ipv6 isn't enabled?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it breaks the gunicorn startup with a failure. Below more details #24846 (comment)

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. looked up some issues and seems that this benoitc/gunicorn#1138 could lead to some problems in the past (but likely it's solved already as it is very old). So worth double checking if this works in different combos of v4/v6.

@eladkal eladkal mentioned this pull request Jul 5, 2022
@potiuk potiuk merged commit 2f29bfe into apache:main Jul 7, 2022
@boring-cyborg
Copy link

boring-cyborg bot commented Jul 7, 2022

Awesome work, congrats on your first merged pull request!

@ephraimbuddy ephraimbuddy added the type:bug-fix Changelog: Bug Fixes label Aug 12, 2022
@ephraimbuddy ephraimbuddy added this to the Airflow 2.3.4 milestone Aug 12, 2022
ephraimbuddy pushed a commit that referenced this pull request Aug 15, 2022
@saimon46
Copy link

Hello! We have Airflow version 2.3.4 and this is not working if in the machine IPv6 is not available or disabled. We end up having an error during worker startup and the IPv4 bind doesn't even start because the entire gunicorn process fails. The bind should be configurable and not hardcoded in the code.

[2022-08-30 12:09:00 +0000] [1754859] [INFO] Starting gunicorn 20.1.0
[2022-08-30 12:09:00 +0000] [1754859] [ERROR] Retrying in 1 second.
[2022-08-30 12:09:01 +0000] [1754859] [ERROR] Retrying in 1 second.
[2022-08-30 12:09:02 +0000] [1754859] [ERROR] Retrying in 1 second.
[2022-08-30 12:09:03 +0000] [1754859] [ERROR] Retrying in 1 second.
[2022-08-30 12:09:04,388: INFO/MainProcess] Connected to amqp://**:**@**:**//
[2022-08-30 12:09:04,403: INFO/MainProcess] mingle: searching for neighbors
[2022-08-30 12:09:04 +0000] [1754859] [ERROR] Retrying in 1 second.
[2022-08-30 12:09:05,457: INFO/MainProcess] mingle: sync with 3 nodes
[2022-08-30 12:09:05,457: INFO/MainProcess] mingle: sync complete
[2022-08-30 12:09:05,486: INFO/MainProcess] celery@** ready.
[2022-08-30 12:09:05,543: INFO/MainProcess] Events of group {task} enabled by remote.
[2022-08-30 12:09:05 +0000] [1754859] [ERROR] Can't connect to ('::', 8793)

@potiuk
Copy link
Member

potiuk commented Aug 30, 2022

Feel free to make a PR to change it @saimon46. It is in serve_logs.py. You are likely the best person to test it and provide a fix as you will be able to test if the fix fixes your problem (And you can apply it locally before Airlfow release so there are mutual benefits).

Looking forward to your PR - happy to review it.

@saimon46
Copy link

saimon46 commented Aug 30, 2022

@potiuk I perfectly agree on this. I'm going to work on a PR but before I start I'm writing here 2 possible solutions for this problem:

  • airflow logging configuration to define the bind address, same as the port. Maybe bind it to localhost in some cases can be useful.
  • simple airflow logging configuration that enables or disables IPv6 compatibility on gunicorn.
  • automatically discover if the machine has IPv6 interface and if it's enabled and eventually bind gunicorn on it.

What do you thing would be the best? The first gives more flexibility to the sysadmin same as the second, the third is automatic yes, but still you loose in configuration flexibility.

@potiuk
Copy link
Member

potiuk commented Aug 30, 2022

The best will be handle error gracefully simply rather than fail (this was the original intention and we believed it worked this way).. Adding new configuration should only be neeeded if absolutely necessary.

@potiuk
Copy link
Member

potiuk commented Aug 30, 2022

There is no particular reason why you should configure it - you can configure it by enabling/disabling the IPV6/IPV4 in the deployment of yours and Airflow should use whatever is available. There is no point in having two places where you can configure it (deployment and airflow).

@notatallshaw-gts
Copy link
Contributor

FYI also affected by this.

@potiuk Are you suggesting a preferred PR would be to catch this exception and try binding to IPv4 only?

To me it looks like Gunicorn only throws a sys.exit(1) on this error: https://github.com/benoitc/gunicorn/blob/master/gunicorn/sock.py#L198

Possible to catch SystemExit but it's not the best behavior as you can't tell if that happened for some other reason.

@potiuk
Copy link
Member

potiuk commented Aug 30, 2022

To me it looks like Gunicorn only throws a sys.exit(1) on this error: https://github.com/benoitc/gunicorn/blob/master/gunicorn/sock.py#L198

Then it needs proper fix and detection. Just make it work automatically (whatever it takes)

@notatallshaw-gts
Copy link
Contributor

notatallshaw-gts commented Aug 30, 2022

One approach I guess is to test if it's possible to bind to IPv6?

I played around with the socket library and came up with this test (tested on both machines that can bind and can not bind to IPv6):

import socket

def ipv6_bindable(port=8000):
    # Couldn't find a machine where "socket.has_ipv6" returns False
    # but thought it worth adding as if it did return False it may throw
    # a different exception when trying to open a socket with family=socket.AF_INET6
    if not socket.has_ipv6: 
        return False

    try:
        socket.socket(family=socket.AF_INET6).bind(('::', port))
    except OSError as e:
        if e.args[0] == 97:  # Address family not supported by protocol
            return False
        raise
    return True

@potiuk
Copy link
Member

potiuk commented Aug 30, 2022

Just make a PR :) discusing over the code where we can comment it directly and you can iterate is way better than disucssing copy&pasted code without context. The worst that is going to happen, we will agree to change the approach, but by doing "real code" change you will see immediately if it works, how it fits in etc.

@saimon46
Copy link

I confirm that the test to check that IPv6 is bindable is the best approach. In our case the function that @notatallshaw-work posted works fine. Around that function we should add or not the IPv6 bind in the gunicorn options.
@notatallshaw-work are you already working on a PR?

@notatallshaw-gts
Copy link
Contributor

notatallshaw-gts commented Aug 31, 2022

I confirm that the test to check that IPv6 is bindable is the best approach. In our case the function that @notatallshaw-work posted works fine. Around that function we should add or not the IPv6 bind in the gunicorn options. @notatallshaw-work are you already working on a PR?

I am discussing with my company on the process to contribute to open source projects, we would like to but we just need to make sure we get the policy right on our end. It may be a week or so until we have it ironed out, feel free to use any code I've written here in any way to submit a PR yourself if you would like it to land quicker than I can provide one.

Also I was playing around with a more generic version, with the idea that someone in the future might have only IPv6 enabled and IPv4 disabled:

import socket

def get_bindable_addresses(port=8000):
    address_list = []
    addrinfo = socket.getaddrinfo(None, port, 0, socket.SOCK_STREAM, 0, socket.AI_PASSIVE)
    for addr in addrinfo:
        family = addr[0]
        test_address = addr[-1][:2]
        try:
            socket.socket(family=family).bind(test_address)
        except OSError as e:
            print(f'WARN: Failed to bind {test_address} using IP family {family}: {e!s}')
            continue
        
        if family == socket.AF_INET:
            address_list.append(f'{test_address[0]}:{test_address[1]}')
        elif family == socket.AF_INET6:
            address_list.append(f'[{test_address[0]}]:{test_address[1]}')
        else:
            print(f'WARN: Unrecognized IP address family {family!r}')
    return address_list


bindable_addresses = get_bindable_addresses()
if not bindable_addresses:
    raise RuntimeError(...)
...

But maybe that's overkill, whatever you think is best if you submit a PR.

@potiuk
Copy link
Member

potiuk commented Aug 31, 2022

No worries. It can wait :). One benefit of the "auto-configuration" is that it might be treated as a bug-fix so it might be added any time (and I anticipate one-two bugfix releases following 2.4.0 in a few weeks time.

I believe when you bind to IPv6 address and IPV4 is there, the OS will automatically map IPV4 port to IPV6 port (unless you specify IPV6 bind only) - this is at least what https://www.reddit.com/r/codehunter/comments/tgnalu/dual_ipv4_and_ipv6_support_in_flask_applications/ suggest.

So I think we can simplify it all by simply binding to IPV6 if it is bindable and IPV4 if not. I think we cannot "just" bind to IPV6 because if - for whatever reason - IPV6 support is not enabled at all, it will fail.

@notatallshaw-gts
Copy link
Contributor

notatallshaw-gts commented Aug 31, 2022

This is all new to me, but yes I believe that's the case as long as the OS supports dualstack and IPV6_V6ONLY has not been set on the socket options (which searching through Gunicorns repo I never see this option):

Thus it would be actually simpler to change the code to:

if socket.has_dualstack_ipv6():
    bind_option = GunicornOption("bind", f"[::]:{worker_log_server_port}")
else:
    bind_option = GunicornOption("bind", f"0.0.0.0:{worker_log_server_port}")

options = [
    bind_option,
    GunicornOption("workers", 2),
]

Note though that socket.has_dualstack_ipv6 was introduced in Python 3.8, if Airflow still supports Python 3.7 then this code may need to be vendored: https://github.com/python/cpython/blob/v3.10.6/Lib/socket.py#L853 (or just not serve IPv6 for Python 3.7 and lower)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:bug-fix Changelog: Bug Fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Log server on celery worker does not work in IPv6-only setup
8 participants