ENH: Change Pandas User-Agent and add possibility to set custom http_headers to pd.read_* functions #36688

astromatt · 2020-09-27T20:38:20Z

Currently, pandas makes HTTP requests with "Python-urllib/3.8" as the User-Agent.
This prevents downloading some resources and static files from various sites.
What if pandas made requests with a "Pandas/1.1.0" User-Agent instead?
It should also be possible to add custom headers (auth, CSRF tokens, API versions, and so on).

Use Case:

I am writing a book on Pandas:

https://python.astrotech.io/numerical-analysis/index.html#pandas

I published data in CSV and JSON to use in code listings:

You can access those resources with a browser, curl, or even requests, but not with pandas.
The only change needed is to set the User-Agent.
This is because readthedocs.io blocks the "Python-urllib/3.8" User-Agent for whatever reason.
The same problem affects many other data sources (not only readthedocs.io).

Currently I fetch those resources with requests and then pass response.text to one of:

pd.read_csv
pd.read_json
pd.read_html

Unfortunately, this makes even the simplest code listings quite complex, because I have to explain the requests library and why it is needed.

Pandas calls urllib.request.urlopen with a plain URL, which leaves no way to set HTTP headers:
https://github.com/pandas-dev/pandas/blob/master/pandas/io/common.py#L146

However, urllib.request.urlopen also accepts a urllib.request.Request object as its argument,
and a urllib.request.Request object can carry custom HTTP headers:
https://docs.python.org/3/library/urllib.request.html#urllib.request.Request
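
A minimal sketch of that mechanism, using only the standard library. The URL and the User-Agent value are placeholders chosen for illustration, and no request is actually sent here.

```python
from urllib.request import Request

# Placeholder URL and User-Agent value, used only for illustration.
url = "https://example.com/data.csv"
request = Request(url, headers={"User-Agent": "pandas/1.1.0"})

# Request normalizes header names with str.capitalize(),
# so the stored key is "User-agent".
user_agent = request.get_header("User-agent")
print(user_agent)
```

Passing `request` (instead of the bare URL) to urllib.request.urlopen would send this header with the request.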

The ability to add custom http_headers should be available in the pd.read_csv, pd.read_json and pd.read_html functions.

From what I see, the read_* call stack is three to four functions deep.
There are only 6 references in 4 files to the urlopen(*args, **kwargs) function,
so the change shouldn't be hard to implement.

The http_headers parameter can be Optional[List], which would be fully backward compatible and would not require any changes to other people's code.


jreback · 2020-09-27T20:49:58Z

we have had this request before

pls search for these issues

astromatt · 2020-09-27T22:34:55Z

Related to #10526

jreback · 2020-10-02T23:27:14Z

@martindurant can we pass these thru using StorageOptions?

martindurant · 2020-10-06T13:32:47Z

HTTP is the only one of the "protocol://" URLs which is not handled by fsspec, because it already had its own code (whereas s3fs and gcs were already using fsspec second-hand).

For HTTPFileSystem, you can include headers as a key in client_kwargs, which could contain your custom user agent or anything else you want. That would look a little bit untidy, but OK:

storage_options={"client_kwargs": {"headers": {"User-Agent": "pandas"}}}

jreback · 2020-10-06T15:39:19Z

ok, I think a PR to add an example in read_csv / io.rst would be sufficient then

@astromatt if u are interested

cdknox · 2020-11-11T00:30:24Z

take

cdknox · 2020-11-11T02:16:14Z

If a URL and storage_options are passed into read_csv, a ValueError is raised, as shown by this excerpt from pandas/io/common.py (line 326):

if isinstance(filepath_or_buffer, str) and is_url(filepath_or_buffer):
    # TODO: fsspec can also handle HTTP via requests, but leaving this unchanged
    if storage_options:
        raise ValueError(
            "storage_options passed with file object or non-fsspec file path"
        )

By using a context manager from fsspec, I was able to change the User-Agent successfully, as shown by the code below.

import pandas as pd
import fsspec

url = 'http://localhost:8000/temp.csv'
client_kwargs = {'headers': {'User-Agent': 'pandas'}}
with fsspec.open(url, client_kwargs=client_kwargs) as f:
    df = pd.read_csv(f)

Are we okay with the additional top level import being part of the documentation, or should we modify the code as opposed to the documentation?

martindurant · 2020-11-11T02:23:52Z

HTTP is the only legacy non-fsspec remote IO in pandas (because s3 and gcs were already using fsspec on the backend at the time of transition). You could change the HTTP implementation in pandas, which is simple, to accept storage_options, or you could use the fsspec variant (which may result in a change of behaviour).

cdknox · 2020-11-11T05:19:07Z

I'm assuming there is appetite to have all remote IO switched over for consistency's sake. It looks like it could almost be a drop-in replacement for HTTP. Obviously I'd continue to test, but the biggest hurdle I see is a server sending back gzipped content. Currently pandas checks the Content-Encoding header the server sends back and decompresses accordingly. Snippet from pandas/io/common.py (line 332):

req = urlopen(filepath_or_buffer)
content_encoding = req.headers.get("Content-Encoding", None)
if content_encoding == "gzip":
    # Override compression based on Content-Encoding header
    compression = {"method": "gzip"}
reader = BytesIO(req.read())
req.close()
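
The check in that snippet can be illustrated without any network access. The helper below, body_reader, is a hypothetical name and not part of pandas. Unlike pandas, which records the compression method and lets the reader decompress later, this sketch decompresses eagerly to keep the example short.

```python
import gzip
from io import BytesIO


def body_reader(headers, raw):
    """Wrap a response body in BytesIO, un-gzipping it when the
    Content-Encoding header says the server compressed it.

    Hypothetical helper for illustration only; not a pandas function.
    """
    if headers.get("Content-Encoding") == "gzip":
        raw = gzip.decompress(raw)
    return BytesIO(raw)


payload = b"a,b\n1,2\n"

# Simulated gzipped response: header present, body compressed.
from_gzip = body_reader({"Content-Encoding": "gzip"}, gzip.compress(payload)).read()

# Simulated plain response: no header, body passed through unchanged.
from_plain = body_reader({}, payload).read()
```

Without access to the response headers, as in the fsspec case described next, this decision cannot be made from the header alone.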

I don't see response header information attached to the file after being read from the network in fsspec, so I don't know how to tell if the server sent back gzipped data. I know you can try to request unzipped data so maybe that's a way to make it work if we want to convert to fsspec for HTTP. Though I figure if there's an existing way to tell whether the response is gzipped in fsspec, you're probably the one to know of it!

If we don't like that route I can try working the storage_options into the existing HTTP implementation. Thanks for the thoughts.

cdknox · 2020-11-20T00:09:31Z

Since fsspec does not seem to be a 100% drop-in replacement at the moment, I copped out and just passed storage_options through as the request headers when the path is an HTTP URL.
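
For readers landing here later: assuming a pandas version that includes this change (1.3 or newer, going by the milestone on this issue), the key-value pairs in storage_options are sent as request headers for HTTP(S) URLs. The sketch below checks that against a throwaway local server; the handler class, the CSV payload, and the seen_user_agents list are made up for the demonstration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import pandas as pd

# Collects the User-Agent of every request the local server receives.
seen_user_agents = []


class CsvHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that serves a tiny CSV and records the User-Agent."""

    def do_GET(self):
        seen_user_agents.append(self.headers.get("User-Agent"))
        body = b"a,b\n1,2\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/csv")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Silence the default per-request logging.
        pass


# Port 0 lets the OS pick a free port.
server = HTTPServer(("127.0.0.1", 0), CsvHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/temp.csv"

# Assumes pandas >= 1.3: for HTTP(S) URLs the storage_options
# dict is forwarded as request headers.
df = pd.read_csv(url, storage_options={"User-Agent": "pandas"})

server.shutdown()
server.server_close()
```

On older pandas versions the same call raises the ValueError quoted earlier in this thread.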

First-time contributor, so if something is awry with the pull request, just let me know and I'd be glad to fix it. I didn't notice the commit message guidelines until the last commit I made, so sorry if that's an issue.

stragu · 2020-12-11T06:04:14Z

Will PR #37966 assign a default (pandas-specific) header to read_csv() and read_json(), or will it only resolve the "set custom headers" part of this issue?

cdknox · 2020-12-11T15:13:06Z

As it sits currently, it resolves the "set custom headers" aspect. I went that route because urllib, which does the grunt work of making the HTTP(S) request, does have its own User-Agent header value that it sends by default. I don't see a ton of upside to changing the default User-Agent, as it would likely still only convey to the server that an automated process is requesting the data. On top of that, I could see some downside: someone out there has probably whitelisted that default User-Agent value, and a change here would warrant a change there. So my logic netted out to continuing to use the default User-Agent but adding the ability to change it.

stragu · 2020-12-15T05:54:47Z

Thank you for your work on this, and the detailed answer, @cdknox 😃

cdknox · 2020-12-15T13:34:23Z

No problem, glad to help!

astromatt added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 27, 2020

astromatt changed the title ~~ENH: Change Pandas User-Agent and add possibility to set custom http_headers to pd.read_* functions~~ ENH: Change Pandas User-Agent and add possibility to set custom http_headers to pd.read_* functions Sep 27, 2020

Antetokounpo mentioned this issue Sep 30, 2020

ENH: Add headers paramater to read_json and read_csv #36754

Closed

5 tasks

jreback mentioned this issue Oct 2, 2020

read_json from url, Accept Header? #10526

Closed

jreback added IO Data IO issues that don't fit into a more specific label and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 2, 2020

jreback mentioned this issue Oct 2, 2020

ENH: read_csv(data_url, verify = False) #36807

Closed

jreback added the Docs label Oct 6, 2020

jreback added this to the Contributions Welcome milestone Oct 6, 2020

jreback added the good first issue label Oct 6, 2020

github-actions bot assigned cdknox Nov 11, 2020

cdknox mentioned this issue Nov 20, 2020

Read csv headers #37966

Merged

5 tasks

jreback modified the milestones: Contributions Welcome, 1.3 Dec 14, 2020

jreback closed this as completed in #37966 Dec 15, 2020

mroeschke mentioned this issue Mar 29, 2023

read_csv from HTTPs + basic-auth + custom port throws an error (urlopen error) #16716

Closed
