
Allow custom proxy settings with requests sessions #501

Open
maawoo opened this issue Mar 26, 2024 · 7 comments
Labels
enhancement New feature or request

Comments


maawoo commented Mar 26, 2024

I'm trying to download GEDI data on my university's HPC system. The following sample code results in a ConnectionError:

results = earthaccess.search_data(
    short_name='GEDI02_A',
    bounding_box=(31.52,-25.08,31.64,-24.99),
    temporal=("2019-01-01", "2024-01-01"),
    count=-1
)
ConnectionError: HTTPSConnectionPool(host='cmr.earthdata.nasa.gov', port=443): Max retries exceeded with url: [/search/granules.umm_json](https://vscode-remote+ssh-002dremote-002bdraco2.vscode-resource.vscode-cdn.net/search/granules.umm_json)?short_name=GEDI02_A&bounding_box=31.52,-25.08,31.64,-24.99&temporal%5B%5D=2019-01-01T00:00:00Z,2024-01-01T00:00:00Z&page_size=0 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f69ef2d40b0>: Failed to establish a new connection: [Errno 111] Connection refused'))

My initial thought was that the API is not whitelisted in our HTTP/HTTPS proxies, which are set via environment variables. However, according to our sysadmin, this should not be an issue. I was able to confirm this by requesting the same URL via curl:

>> curl "https://cmr.earthdata.nasa.gov/search/granules.umm_json?short_name=GEDI02_A&bounding_box=31.52,-25.08,31.64,-24.99&temporal%5B%5D=2019-01-01T00:00:00Z,2024-01-01T00:00:00Z&page_size=0"
{"hits":92,"took":394,"items":[]}

Any ideas / workarounds would be appreciated!

@mfisher87
Collaborator

It's really hard to say what's going on here without knowing more about your university HPC system. Based on the error, it looks like VSCode is somehow involved?

https://vscode-remote+ssh-002dremote-002bdraco2.vscode-resource.vscode-cdn.net/search/granules.umm_json

Can you provide some more detail on how VSCode is involved in your workflow? host='cmr.earthdata.nasa.gov' indicates that earthaccess is at least attempting to talk to the correct host, and the Requests library seems to agree!

@mfisher87 mfisher87 added the feedback requested We requested feedback from the reporter; if we don't hear back in X days the issue may be closed label Mar 26, 2024

maawoo commented Mar 27, 2024

Hi @mfisher87,
I overlooked that, so thanks for pointing it out. However, I still get an error when executing the code outside of VSCode.

Here is the complete error traceback:
---------------------------------------------------------------------------
ConnectionRefusedError                    Traceback (most recent call last)
File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/connection.py:203, in HTTPConnection._new_conn(self)
    202 try:
--> 203     sock = connection.create_connection(
    204         (self._dns_host, self.port),
    205         self.timeout,
    206         source_address=self.source_address,
    207         socket_options=self.socket_options,
    208     )
    209 except socket.gaierror as e:

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/util/connection.py:85, in create_connection(address, timeout, source_address, socket_options)
     84 try:
---> 85     raise err
     86 finally:
     87     # Break explicitly a reference cycle

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/util/connection.py:73, in create_connection(address, timeout, source_address, socket_options)
     72     sock.bind(source_address)
---> 73 sock.connect(sa)
     74 # Break explicitly a reference cycle

ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

NewConnectionError                        Traceback (most recent call last)
File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/connectionpool.py:791, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, preload_content, decode_content, **response_kw)
    790 # Make the request on the HTTPConnection object
--> 791 response = self._make_request(
    792     conn,
    793     method,
    794     url,
    795     timeout=timeout_obj,
    796     body=body,
    797     headers=headers,
    798     chunked=chunked,
    799     retries=retries,
    800     response_conn=response_conn,
    801     preload_content=preload_content,
    802     decode_content=decode_content,
    803     **response_kw,
    804 )
    806 # Everything went great!

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/connectionpool.py:492, in HTTPConnectionPool._make_request(self, conn, method, url, body, headers, retries, timeout, chunked, response_conn, preload_content, decode_content, enforce_content_length)
    491         new_e = _wrap_proxy_error(new_e, conn.proxy.scheme)
--> 492     raise new_e
    494 # conn.request() calls http.client.*.request, not the method in
    495 # urllib3.request. It also calls makefile (recv) on the socket.

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/connectionpool.py:468, in HTTPConnectionPool._make_request(self, conn, method, url, body, headers, retries, timeout, chunked, response_conn, preload_content, decode_content, enforce_content_length)
    467 try:
--> 468     self._validate_conn(conn)
    469 except (SocketTimeout, BaseSSLError) as e:

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/connectionpool.py:1097, in HTTPSConnectionPool._validate_conn(self, conn)
   1096 if conn.is_closed:
-> 1097     conn.connect()
   1099 if not conn.is_verified:

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/connection.py:611, in HTTPSConnection.connect(self)
    610 sock: socket.socket | ssl.SSLSocket
--> 611 self.sock = sock = self._new_conn()
    612 server_hostname: str = self.host

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/connection.py:218, in HTTPConnection._new_conn(self)
    217 except OSError as e:
--> 218     raise NewConnectionError(
    219         self, f"Failed to establish a new connection: {e}"
    220     ) from e
    222 # Audit hooks are only available in Python 3.8+

NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f81075a01d0>: Failed to establish a new connection: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

MaxRetryError                             Traceback (most recent call last)
File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/requests/adapters.py:486, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    485 try:
--> 486     resp = conn.urlopen(
    487         method=request.method,
    488         url=url,
    489         body=request.body,
    490         headers=request.headers,
    491         redirect=False,
    492         assert_same_host=False,
    493         preload_content=False,
    494         decode_content=False,
    495         retries=self.max_retries,
    496         timeout=timeout,
    497         chunked=chunked,
    498     )
    500 except (ProtocolError, OSError) as err:

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/connectionpool.py:845, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, preload_content, decode_content, **response_kw)
    843     new_e = ProtocolError("Connection aborted.", new_e)
--> 845 retries = retries.increment(
    846     method, url, error=new_e, _pool=self, _stacktrace=sys.exc_info()[2]
    847 )
    848 retries.sleep()

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/util/retry.py:515, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
    514     reason = error or ResponseError(cause)
--> 515     raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    517 log.debug("Incremented Retry for (url='%s'): %r", url, new_retry)

MaxRetryError: HTTPSConnectionPool(host='cmr.earthdata.nasa.gov', port=443): Max retries exceeded with url: /search/granules.umm_json?short_name=GEDI02_A&bounding_box=31.52,-25.08,31.64,-24.99&temporal%5B%5D=2019-01-01T00:00:00Z,2024-01-01T00:00:00Z&page_size=0 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f81075a01d0>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
Cell In[3], line 1
----> 1 results = earthaccess.search_data(
      2     short_name='GEDI02_A',
      3     bounding_box=(31.52,-25.08,31.64,-24.99),
      4     temporal=("2019-01-01", "2024-01-01"),
      5     count=-1
      6 )

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/earthaccess/api.py:120, in search_data(count, **kwargs)
    118 else:
    119     query = DataGranules().parameters(**kwargs)
--> 120 granules_found = query.hits()
    121 print(f"Granules found: {granules_found}")
    122 if count > 0:

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/earthaccess/search.py:388, in DataGranules.hits(self)
    379 """Returns the number of hits the current query will return.
    380 This is done by making a lightweight query to CMR and inspecting the returned headers.
    381
    382 Returns:
    383     The number of results reported by CMR.
    384 """
    386 url = self._build_url()
--> 388 response = self.session.get(url, headers=self.headers, params={"page_size": 0})
    390 try:
    391     response.raise_for_status()

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/requests/sessions.py:602, in Session.get(self, url, **kwargs)
    594 r"""Sends a GET request. Returns :class:`Response` object.
    595
    596 :param url: URL for the new :class:`Request` object.
    597 :param \*\*kwargs: Optional arguments that ``request`` takes.
    598 :rtype: requests.Response
    599 """
    601 kwargs.setdefault("allow_redirects", True)
--> 602 return self.request("GET", url, **kwargs)

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/requests/sessions.py:589, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    584 send_kwargs = {
    585     "timeout": timeout,
    586     "allow_redirects": allow_redirects,
    587 }
    588 send_kwargs.update(settings)
--> 589 resp = self.send(prep, **send_kwargs)
    591 return resp

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/requests/sessions.py:703, in Session.send(self, request, **kwargs)
    700 start = preferred_clock()
    702 # Send the request
--> 703 r = adapter.send(request, **kwargs)
    705 # Total elapsed time of the request (approximately)
    706 elapsed = preferred_clock() - start

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/requests/adapters.py:519, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    515     if isinstance(e.reason, _SSLError):
    516         # This branch is for urllib3 v1.22 and later.
    517         raise SSLError(e, request=request)
--> 519     raise ConnectionError(e, request=request)
    521 except ClosedPoolError as e:
    522     raise ConnectionError(e, request=request)

ConnectionError: HTTPSConnectionPool(host='cmr.earthdata.nasa.gov', port=443): Max retries exceeded with url: /search/granules.umm_json?short_name=GEDI02_A&bounding_box=31.52,-25.08,31.64,-24.99&temporal%5B%5D=2019-01-01T00:00:00Z,2024-01-01T00:00:00Z&page_size=0 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f81075a01d0>: Failed to establish a new connection: [Errno 111] Connection refused'))

I get the same error in a clean environment with Python 3.11.8 instead of 3.12.2.

I also tried downgrading the package (to 0.7.0) and noticed that it prints out the number of granules found before the error:

>>> earthaccess.search_data(
...     short_name='GEDI02_A',
...     bounding_box=(31.52,-25.08,31.64,-24.99),
...     temporal=("2019-01-01", "2024-01-01"),
...     count=-1
... )
Granules found: 92
Traceback (most recent call last):
  File "/home/du23yow/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/du23yow/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/home/du23yow/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:
...

Any other ideas of what I could do?

@github-actions github-actions bot removed the feedback requested We requested feedback from the reporter; if we don't hear back in X days the issue may be closed label Mar 27, 2024

maawoo commented Mar 27, 2024

Okay, I found the explanation in this icepyx discussion. Ping @betolink 🙂 Any suggestions for using earthaccess.search_data and earthaccess.download with an updated requests session?

@maawoo maawoo changed the title ConnectionError - Proxy env variables overridden? Allow custom proxy settings with requests sessions Mar 27, 2024
@betolink
Member

Hi @maawoo, I think this could be resolved if we let users pass proxy settings through to requests. In the meantime, you can manually get a session, modify it, and fetch the files yourself, although that somewhat defeats the purpose!

import os
import shutil  # needed for shutil.copyfileobj below
from itertools import chain  # to flatten the results

import earthaccess

earthaccess.login()

# Define your proxy
proxy = {
    'http': 'http://your_proxy_address:port',
    'https': 'https://your_proxy_address:port'
}

results = earthaccess.search_data(
    short_name='GEDI02_A',
    bounding_box=(31.52,-25.08,31.64,-24.99),
    temporal=("2019-01-01", "2024-01-01"),
    count=-1
)

links = list(chain.from_iterable(r.data_links() for r in results))

# Get the authenticated download session and attach the proxy to it
session = earthaccess.get_requests_https_session()
session.proxies.update(proxy)

os.makedirs("temp_dir", exist_ok=True)

for url in links:
    local_filename = url.split("/")[-1]
    path = f"temp_dir/{local_filename}"
    with session.get(url, stream=True, allow_redirects=True) as r:
        r.raise_for_status()
        with open(path, "wb") as f:
            shutil.copyfileobj(r.raw, f, length=1024 * 1024)

This is not concurrent, so there is room for improvement. As I said, we should implement the proxy handling here, but my guess is that it won't be ready in the next week.
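For what it's worth, the sequential loop above could be parallelized with a thread pool. This is only a sketch under the same assumptions as the snippet above (an authenticated session with proxies already set, and a list of download links); the download_all helper and the max_workers=4 value are my own names and choices, not earthaccess API:

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import requests


def download_all(
    session: requests.Session,
    links: list,
    out_dir: str = "temp_dir",
    max_workers: int = 4,
) -> list:
    """Stream every URL in `links` to `out_dir`, a few at a time."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    def download_one(url: str) -> str:
        path = out / url.split("/")[-1]
        with session.get(url, stream=True, allow_redirects=True) as r:
            r.raise_for_status()
            with open(path, "wb") as f:
                shutil.copyfileobj(r.raw, f, length=1024 * 1024)
        return str(path)

    # pool.map preserves the order of `links` in the returned paths.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(download_one, links))
```

Called as download_all(session, links), this reuses the proxied session across worker threads. Sharing a requests.Session between threads is common practice but not officially guaranteed thread-safe, so it's sensible to keep max_workers modest.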

@betolink betolink added the enhancement New feature or request label Mar 27, 2024

maawoo commented Mar 27, 2024

Thank you for the possible workaround!

> my guess is that it won't be ready in the next week

No worries! I already have the data I need. My plan was to implement earthaccess into some scripts but that can wait for now.

@chuckwondo
Collaborator

The requests library detects proxies via the standard library's urllib, which recognizes environment variables of the form <scheme>_proxy (either uppercase or lowercase). Therefore, you should be able to simply set the environment variable https_proxy or HTTPS_PROXY to the appropriate value.

However, whether or not those env vars are used is determined by the boolean value of trust_env on the requests.Session object used for making requests. By default, trust_env is True and the proxy env vars are honored; if trust_env is False, they are ignored. There might therefore be situations in which earthaccess will not use the env vars, because it sets trust_env to False in some code paths.

I suggest attempting to export your https_proxy env var appropriately, and retrying your example to see if that works for you.
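For example (the proxy address below is a placeholder; substitute whatever your sysadmin provides):

```shell
# Placeholder proxy address -- replace with your university's actual proxy.
export https_proxy="http://proxy.example.edu:3128"
export http_proxy="http://proxy.example.edu:3128"

# Quick check that Python's stdlib sees the variables; requests performs the
# same lookup (urllib.request.getproxies) whenever trust_env is True.
python -c 'import urllib.request; print(urllib.request.getproxies())'
```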

@chuckwondo
Collaborator

@maawoo, if you want a workaround until we can come up with a robust and secure solution, here's something based on the thread from #823. It combines code from a few comments in that PR with some minor renaming/refactoring.

First, define a set_proxies function:

import os
from functools import cache, wraps
from typing import Callable
from typing_extensions import ParamSpec

import earthaccess
import requests


P = ParamSpec("P")


def set_proxies(f: Callable[P, requests.Session]) -> Callable[P, requests.Session]:
    @wraps(f)
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> requests.Session:
        session = f(*args, **kwargs)
        # For each scheme, look up "<scheme>_proxy" in the environment,
        # falling back to the uppercase variant, and only add entries
        # for variables that are actually set.
        session.proxies.update(
            {
                scheme: v
                for scheme in ("http", "https")
                if (
                    v := os.environ.get(
                        k := f"{scheme}_proxy", os.environ.get(k.upper())
                    )
                )
            }
        )

        return session

    return wrapper

Now you can use set_proxies to decorate the earthaccess.Auth.get_session method after you log in (so that you have an authenticated Auth instance to get a session from):

earthaccess.login()
auth: earthaccess.Auth = earthaccess.__store__.auth
auth.get_session = cache(set_proxies(auth.get_session))

From here, any further earthaccess calls to open or download files will use the same requests session with the proxies set on the session by set_proxies.
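If you want to sanity-check the decorator itself without touching earthaccess internals, you can apply it to a plain session factory. The snippet below repeats set_proxies (minus the ParamSpec typing) so it runs standalone, and the proxy address is purely illustrative:

```python
import os
from functools import wraps
from typing import Callable

import requests


def set_proxies(f: Callable[..., requests.Session]) -> Callable[..., requests.Session]:
    # Same decorator as above, kept self-contained for this check.
    @wraps(f)
    def wrapper(*args, **kwargs) -> requests.Session:
        session = f(*args, **kwargs)
        session.proxies.update(
            {
                scheme: v
                for scheme in ("http", "https")
                if (v := os.environ.get(k := f"{scheme}_proxy", os.environ.get(k.upper())))
            }
        )
        return session

    return wrapper


# Make the check deterministic: clear any pre-existing proxy variables,
# then set a purely hypothetical https proxy.
for var in ("http_proxy", "HTTP_PROXY", "https_proxy", "HTTPS_PROXY"):
    os.environ.pop(var, None)
os.environ["https_proxy"] = "http://proxy.example.edu:3128"


@set_proxies
def make_session() -> requests.Session:
    return requests.Session()


session = make_session()
print(session.proxies)  # {'https': 'http://proxy.example.edu:3128'}
```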
