ENH: Change Pandas User-Agent and add possibility to set custom http_headers to pd.read_* functions #36688

Closed
astromatt opened this issue Sep 27, 2020 · 14 comments · Fixed by #37966
Labels
Docs Enhancement good first issue IO Data IO issues that don't fit into a more specific label

@astromatt

astromatt commented Sep 27, 2020

Currently Pandas makes HTTP requests using "Python-urllib/3.8" as the User-Agent.
This prevents downloading some resources and static files from various places.
What if Pandas made requests with a "Pandas/1.1.0" User-Agent instead?
There should also be a way to add custom headers (auth, CSRF tokens, API versions and so on).

Use Case:

I am writing a book on Pandas:

I published data in CSV and JSON to use in code listings:

You can access those resources via a browser, curl, or even requests, but not using Pandas.
The only change needed is to set the User-Agent header.
This is because readthedocs.io blocks the "Python-urllib/3.8" User-Agent, for whatever reason.
The same problem affects many other places where you can get data (not only readthedocs.io).

Currently I get those resources with requests and then pass response.text to one of:

  • pd.read_csv
  • pd.read_json
  • pd.read_html

Unfortunately, this makes even the simplest code listings quite complex, because I have to explain the requests library and why I use it this way.

Pandas uses urllib.request.urlopen, which does not allow setting HTTP headers when it is called with a plain URL:
https://github.com/pandas-dev/pandas/blob/master/pandas/io/common.py#L146

However, urllib.request.urlopen can also take a urllib.request.Request object as its argument,
and a urllib.request.Request object can carry custom HTTP headers:
https://docs.python.org/3/library/urllib.request.html#urllib.request.Request
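
As a minimal sketch of that idea, the snippet below builds a `urllib.request.Request` with custom headers. The URL and the header values are placeholders chosen for illustration, and nothing is actually fetched.

```python
from urllib.request import Request

# Hypothetical URL, used only for illustration; no network request is made here.
url = "https://example.com/data.csv"

# Custom headers are passed as a dict when constructing the Request.
request = Request(
    url,
    headers={"User-Agent": "Pandas/1.1.0", "Accept": "text/csv"},
)

# urllib stores header names via str.capitalize(), so the key becomes "User-agent".
print(request.get_header("User-agent"))

# Passing the Request (instead of the bare URL) to urlopen would send these headers:
# from urllib.request import urlopen
# with urlopen(request) as response:
#     body = response.read()
```

The `urlopen` call is left commented out so the sketch stays runnable offline.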

The option to pass custom http_headers should be available in the pd.read_csv, pd.read_json and pd.read_html functions.

From what I see, the read_* call stack is three to four functions deep.
There are only 6 references, in 4 files, to the urlopen(*args, **kwargs) function.
So the change shouldn't be hard to implement.

The http_headers parameter can be Optional[List], which would be fully backward compatible and would not require any changes to others' code.

@astromatt astromatt added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 27, 2020
@astromatt astromatt changed the title ENH: Change Pandas User-Agent and add possibility to set custom http_headers to pd.read_* functions ENH: Change Pandas User-Agent and add possibility to set custom http_headers to pd.read_* functions Sep 27, 2020
@jreback
Contributor

jreback commented Sep 27, 2020

we have had this request before

pls search for these issues

@astromatt
Author

Related to #10526

@jreback jreback added IO Data IO issues that don't fit into a more specific label and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 2, 2020
@jreback
Contributor

jreback commented Oct 2, 2020

@martindurant can we pass these thru using StorageOptions?

@martindurant
Contributor

HTTP is the only one of the "protocol://" URLs which is not handled by fsspec, because it already had its own code (whereas s3fs and gcs were already using fsspec second-hand).

For HTTPFileSystem, you can include headers as a key in client_kwargs, which could contain your custom user agent or anything else you want. That would look a little bit untidy, but OK:

storage_options={"client_kwargs": {"headers": {"User-Agent": "pandas"}}}

@jreback
Contributor

jreback commented Oct 6, 2020

ok, I think a PR to add an example in read_csv / io.rst would be sufficient then

@astromatt if u are interested

@jreback jreback added the Docs label Oct 6, 2020
@jreback jreback added this to the Contributions Welcome milestone Oct 6, 2020
@cdknox
Member

cdknox commented Nov 11, 2020

take

@cdknox
Member

cdknox commented Nov 11, 2020

If a URL and storage_options are passed into read_csv, a ValueError is raised, as shown by this current excerpt from pandas/io/common.py(326):

if isinstance(filepath_or_buffer, str) and is_url(filepath_or_buffer):
    # TODO: fsspec can also handle HTTP via requests, but leaving this unchanged
    if storage_options:
        raise ValueError(
            "storage_options passed with file object or non-fsspec file path"
        )

By using a context manager from fsspec, I was able to change the User-Agent successfully, as shown by the code below.

import fsspec
import pandas as pd

url = 'http://localhost:8000/temp.csv'
# Headers for the HTTP client go under client_kwargs
client_kwargs = {'headers': {'User-Agent': 'pandas'}}
with fsspec.open(url, client_kwargs=client_kwargs) as f:
    df = pd.read_csv(f)

Are we okay with the additional top level import being part of the documentation, or should we modify the code as opposed to the documentation?

@martindurant
Contributor

HTTP is the only legacy non-fsspec remote IO in pandas (because s3 and gcs were already using fsspec on the backend at the time of transition). You could change the HTTP implementation in pandas, which is simple, to accept storage_options, or you could use the fsspec variant (which may result in a change of behaviour).

@cdknox
Member

cdknox commented Nov 11, 2020

I'm assuming there is appetite to have all remote IO switched over for consistency's sake. It looks like it could almost be a drop-in replacement for HTTP. Obviously I'd continue to test, but the biggest hurdle I see is around a server sending back gzipped content. Currently pandas checks what the server sends back in the Content-Encoding header and decompresses accordingly. Snippet from pandas/io/common.py(332):

req = urlopen(filepath_or_buffer)
content_encoding = req.headers.get("Content-Encoding", None)
if content_encoding == "gzip":
    # Override compression based on Content-Encoding header
    compression = {"method": "gzip"}
reader = BytesIO(req.read())
req.close()
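
To make that header check concrete without a network call, here is a small standalone sketch of the same decision. The helper name `body_to_buffer` is hypothetical, and unlike the pandas excerpt (which only records the compression method and lets the reader decompress later), this version decompresses eagerly.

```python
import gzip
from io import BytesIO
from typing import Optional


def body_to_buffer(body: bytes, content_encoding: Optional[str]) -> BytesIO:
    """Wrap a response body in a buffer, gunzipping only if the server said gzip."""
    if content_encoding == "gzip":
        # The Content-Encoding header, not the URL or file extension, drives this.
        body = gzip.decompress(body)
    return BytesIO(body)


# Simulated server responses: one gzipped, one sent as plain bytes.
raw = b"a,b\n1,2\n"
compressed = gzip.compress(raw)

plain_buffer = body_to_buffer(raw, None)
gzip_buffer = body_to_buffer(compressed, "gzip")
```

Without access to the response headers, the caller has no reliable way to pick between these two branches, which is the hurdle described next.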

I don't see response header information attached to the file after being read from the network in fsspec, so I don't know how to tell if the server sent back gzipped data. I know you can try to request unzipped data so maybe that's a way to make it work if we want to convert to fsspec for HTTP. Though I figure if there's an existing way to tell whether the response is gzipped in fsspec, you're probably the one to know of it!

If we don't like that route I can try working the storage_options into the existing HTTP implementation. Thanks for the thoughts.

@cdknox cdknox mentioned this issue Nov 20, 2020
5 tasks
@cdknox
Member

cdknox commented Nov 20, 2020

Since fsspec does not seem to be a 100% drop-in replacement at the moment, I copped out and just passed storage_options through to the headers in the event that there is an HTTP URL.
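
A rough sketch of what such a pass-through could look like, using only the standard library. This is not the code from the pull request: `is_http_url` and `build_request` are hypothetical names, and the URL check is a simplified stand-in for pandas' own `is_url`.

```python
from typing import Dict, Optional
from urllib.parse import urlparse
from urllib.request import Request


def is_http_url(path: str) -> bool:
    """Simplified, hypothetical check; pandas' is_url accepts more schemes."""
    return urlparse(path).scheme in ("http", "https")


def build_request(
    path: str, storage_options: Optional[Dict[str, str]] = None
) -> Request:
    """Treat storage_options as HTTP headers when the path is an HTTP(S) URL."""
    if not is_http_url(path):
        raise ValueError("storage_options as headers only applies to HTTP(S) URLs")
    # An empty dict means urllib falls back to its default headers.
    return Request(path, headers=storage_options or {})


request = build_request(
    "http://localhost:8000/temp.csv", {"User-Agent": "pandas"}
)
```

Under this approach the user-facing call would presumably be along the lines of `pd.read_csv(url, storage_options={"User-Agent": "pandas"})`, with no extra top-level import.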

I'm a first-time contributor, so if something is awry with the pull request, simply let me know and I'd be glad to fix it. I didn't notice the commit message guidelines until the last commit I made, so sorry if that's an issue.

@stragu
Contributor

stragu commented Dec 11, 2020

Will PR #37966 assign a default (pandas-specific) header to read_csv() and read_json(), or will it only resolve the "set custom headers" part of this issue?

@cdknox
Member

cdknox commented Dec 11, 2020

As it sits currently, it resolves the "set custom headers" aspect. I went that route because urllib, which does the grunt work of making the http(s) request, has its own User-Agent header value that it sends by default. I don't see a ton of upside to changing the default User-Agent, as it would likely still only convey to the server that an automated process is requesting the data. On top of that, I could see some downside: someone out there has probably whitelisted that default User-Agent value, and a change here would warrant a change there. So my logic netted out to continuing to use the default User-Agent but adding the ability to change it.
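
For reference, the default value mentioned above can be inspected from the standard library. This is a small sketch; the exact version suffix depends on the Python interpreter running it.

```python
from urllib.request import build_opener

# A freshly built opener carries urllib's default headers.
opener = build_opener()
default_headers = dict(opener.addheaders)

# The value has the form "Python-urllib/<major>.<minor>".
print(default_headers["User-agent"])
```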

@jreback jreback modified the milestones: Contributions Welcome, 1.3 Dec 14, 2020
@stragu
Contributor

stragu commented Dec 15, 2020

Thank you for your work on this, and the detailed answer, @cdknox 😃

@cdknox
Member

cdknox commented Dec 15, 2020

No problem, glad to help!

5 participants