ENH: Change Pandas User-Agent and add possibility to set custom http_headers to pd.read_* functions #36688
We have had this request before; please search for those issues.
Related to #10526
@martindurant, can we pass these through?

HTTP is the only one of the "protocol://" URLs not handled by fsspec, because it already had its own code (whereas s3fs and gcs were already using fsspec second-hand). For HTTPFileSystem, you can include custom headers.
OK, do you think a PR to add an example in read_csv / io.rst would be sufficient then? @astromatt, if you are interested.
take
If a URL and storage_options are passed into read_csv, a ValueError is raised, as shown by the current excerpt from pandas/io/common.py(326):
By using a context manager from fsspec I was able to change the User-Agent successfully, as shown by the code below.
Are we okay with the additional top-level import being part of the documentation, or should we modify the code as opposed to the documentation?
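The fsspec approach mentioned above might look like the following (a sketch assuming `fsspec` with its HTTP filesystem and `aiohttp` installed; the URL and header values are hypothetical):

```python
import fsspec

url = "https://example.com/data.csv"  # hypothetical URL
headers = {"User-Agent": "pandas/1.1.0"}

# client_kwargs is forwarded to aiohttp's ClientSession, which accepts a
# headers mapping; nothing is fetched until the file is actually opened.
of = fsspec.open(url, client_kwargs={"headers": headers})

# with of as f:            # entering the context performs the HTTP request
#     df = pd.read_csv(f)  # pandas can then read from the file-like object
```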
HTTP is the only legacy non-fsspec remote IO in pandas (because s3 and gcs were already using fsspec on the backend at the time of transition). You could change the HTTP implementation in pandas, which is simple, to accept storage_options, or you could use the fsspec variant (which may result in a change of behaviour).
I'm assuming there is appetite to have all remote IO switched over for consistency's sake. It looks like fsspec could almost be a drop-in replacement for HTTP. Obviously I'd continue to test, but the biggest hurdle I see is around a server sending back gzipped content. Currently pandas checks what the server sends back in the header and will decompress accordingly. Snippet from pandas/io/common.py(332):
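The snippet is not reproduced above; the check being described works roughly like this (an illustrative sketch, not the actual pandas source):

```python
import gzip
from typing import Mapping


def maybe_decompress(raw: bytes, response_headers: Mapping[str, str]) -> bytes:
    # If the server declares gzip content encoding, decompress before
    # parsing; otherwise pass the bytes through unchanged.
    if response_headers.get("Content-Encoding", "") == "gzip":
        return gzip.decompress(raw)
    return raw
```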
I don't see response-header information attached to the file after it is read from the network in fsspec, so I don't know how to tell whether the server sent back gzipped data. I know you can try to request unzipped data, so maybe that's a way to make it work if we want to convert HTTP to fsspec. Though I figure if there's an existing way to tell whether the response is gzipped in fsspec, you're probably the one who would know! If we don't like that route, I can try working storage_options into the existing HTTP implementation. Thanks for the thoughts.
Since fsspec currently seems not to be a 100% drop-in replacement, I copped out and just passed storage_options through to the headers in the event the URL is an http(s) URL. First-time contributor, so if something is awry with the pull request, simply let me know and I'd be glad to fix it. I didn't notice the commit-message guidelines until my last commit, so sorry if that's an issue.
Will PR #37966 assign a default (pandas-specific) value to the User-Agent header?
As it sits currently, it resolves the "set custom headers" aspect. I went that route because urllib, which does the grunt work of making the http(s) request, has its own User-Agent header value that it sends by default. I don't see a ton of upside to changing the default User-Agent, as it would likely still only convey to the server that an automated process is requesting the data. In combination with that, I could see some downside: someone out there has probably whitelisted that default User-Agent value, and a change here would warrant a change there. So my logic netted out to continuing to use the default User-Agent while adding the ability to change it.
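The approach described, passing storage_options through as request headers while leaving urllib's default User-Agent in place unless the caller overrides it, can be sketched like this (illustrative only; the real change lives in pandas/io/common.py and may differ in detail):

```python
from typing import Mapping, Optional
from urllib.request import Request, urlopen


def build_http_request(
    url: str, storage_options: Optional[Mapping[str, str]] = None
) -> Request:
    # Any user-supplied headers are attached to the request; urllib still
    # sends its default User-Agent unless the caller overrides it here.
    return Request(url, headers=dict(storage_options or {}))


# Usage (would perform a real network call):
# with urlopen(build_http_request(url, {"User-Agent": "my-agent/1.0"})) as resp:
#     data = resp.read()
```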
Thank you for your work on this, and the detailed answer, @cdknox 😃 |
No problem, glad to help! |
Currently Pandas makes HTTP requests using "Python-urllib/3.8" as a User-Agent.
This prevents downloading some resources and static files from various places.
What if Pandas made requests using a "Pandas/1.1.0" User-Agent instead?
There should be a possibility to add custom headers too (`auth`, `csrf tokens`, `api versions`, and so on).

Use Case:
I am writing a book on Pandas:
I published data in CSV and JSON to use in code listings:
You can access those resources via browser, `curl`, or even `requests`, but not using Pandas. The only change you'd need to make is to set the User-Agent.
This is due to `readthedocs.io` blocking the "Python-urllib/3.8" User-Agent for whatever reason. The same problem affects many other places where you can get data (not only `readthedocs.io`).

Currently I get those resources with `requests` and then put `response.text` into one of:
- `pd.read_csv`
- `pd.read_json`
- `pd.read_html`
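The workaround looks roughly like this (a sketch; the URL is hypothetical and `requests` is a third-party dependency):

```python
import io

import pandas as pd
import requests  # third-party dependency; the reason listings get complex


def read_csv_with_headers(url: str) -> pd.DataFrame:
    # Fetch with requests, which lets us send a User-Agent the server
    # accepts, then hand the response text to pandas.
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()
    return pd.read_csv(io.StringIO(resp.text))


# df = read_csv_with_headers("https://example.com/data.csv")  # hypothetical URL
```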
Unfortunately this makes even the simplest code listings... quite complex (due to the explanation of the `requests` library and why I do it like that).

Pandas uses `urllib.request.urlopen`, which does not allow setting `http_headers`:
https://github.com/pandas-dev/pandas/blob/master/pandas/io/common.py#L146
However, `urllib.request.urlopen` can take a `urllib.request.Request` object as an argument, and the `urllib.request.Request` object allows setting custom `http_headers`:
https://docs.python.org/3/library/urllib.request.html#urllib.request.Request
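For example (standard library only; the URL is hypothetical):

```python
from urllib.request import Request, urlopen

# Request accepts a headers mapping, and urlopen() accepts a Request
# object in place of a plain URL string.
req = Request(
    "https://example.com/data.csv",  # hypothetical URL
    headers={"User-Agent": "Pandas/1.1.0"},
)
# urlopen(req) would now send the custom User-Agent with the request.
```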
The possibility to add custom `http_headers` should be in the `pd.read_csv`, `pd.read_json`, and `pd.read_html` functions.

From what I see, the `read_*` call stack is three to four functions deep, and there are only 6 references in 4 files to the `urlopen(*args, **kwargs)` function, so the change shouldn't be too hard to implement.
The `http_headers` parameter can be `Optional[List]`, which will be fully backward compatible and would not require any changes to other code.