BUG, ENH: Read Data From Password-Protected URL's and allow self signed SSL certs #16910
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -40,7 +40,8 @@ Other Enhancements
- :func:`DataFrame.clip()` and :func:`Series.clip()` have gained an ``inplace`` argument. (:issue:`15388`)
- :func:`crosstab` has gained a ``margins_name`` parameter to define the name of the row / column that will contain the totals when ``margins=True``. (:issue:`15972`)
- :func:`Dataframe.select_dtypes` now accepts scalar values for include/exclude as well as list-like. (:issue:`16855`)

- :func:`read_csv` `read_html` `read_json` `read_html` now accept auth in url //<user>:<password>@<host>:<port>/<url-path>, or ``auth`` tuple (username, password) parameter
- :func:`read_csv` `read_html` `read_json` `read_html` now accept ``verify_ssl`` False to disable https/ssl certificate verification (eg: self signed ssl certs in testing)
.. _whatsnew_0210.api_breaking:
Let's condense into one line:
- It is possible to read data (i.e. CSV, JSON, HTML) from
a URL that is password-protected (:issue:`16716`)
Note that I also put the issue number at the end of the line.
pandas/io/common.py
Outdated
compression:
auth: (str,str), default None. (username, password) for HTTP(s) basic auth
verify_ssl: Default True. If False, allow self signed and invalid SSL
certificates for https
- Why is the compression field empty?
- The formatting for auth and verify_ssl should be patched. The general format is the following:
<var_name> : <data_type>, <defaults>
<description>
Actually, compression already existed but was not in the comments. I simply added it to the comments and left it empty because I was not familiar enough with it to write good docs. I'll fix the rest.
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -40,7 +40,8 @@ Other Enhancements
- :func:`DataFrame.clip()` and :func:`Series.clip()` have gained an ``inplace`` argument. (:issue:`15388`)
- :func:`crosstab` has gained a ``margins_name`` parameter to define the name of the row / column that will contain the totals when ``margins=True``. (:issue:`15972`)
- :func:`Dataframe.select_dtypes` now accepts scalar values for include/exclude as well as list-like. (:issue:`16855`)

- :func:`read_csv` `read_html` `read_json` `read_html` now accept auth in url //<user>:<password>@<host>:<port>/<url-path>, or ``auth`` tuple (username, password) parameter
use commas in between, and you need :func: on each one
@skynss : Sure thing. I would refrain from doing the detailed section right now, as I would like to see what @jreback thinks of that suggestion. If he thinks it's a good idea, the section would go right below the "Other Enhancements" title. In that section, you would provide examples of how to use your newly added functionality.
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -40,7 +40,8 @@ Other Enhancements
- :func:`DataFrame.clip()` and :func:`Series.clip()` have gained an ``inplace`` argument. (:issue:`15388`)
- :func:`crosstab` has gained a ``margins_name`` parameter to define the name of the row / column that will contain the totals when ``margins=True``. (:issue:`15972`)
- :func:`Dataframe.select_dtypes` now accepts scalar values for include/exclude as well as list-like. (:issue:`16855`)

- :func:`read_csv` `read_html` `read_json` `read_html` now accept auth in url //<user>:<password>@<host>:<port>/<url-path>, or ``auth`` tuple (username, password) parameter
- :func:`read_csv` `read_html` `read_json` `read_html` now accept ``verify_ssl`` False to disable https/ssl certificate verification (eg: self signed ssl certs in testing)
is this user visible?
Yes, it is. These parameters are available at the top level API.
pandas/io/common.py
Outdated
@@ -186,7 +190,12 @@ def get_filepath_or_buffer(filepath_or_buffer, encoding=None,
----------
filepath_or_buffer : a url, filepath (str, py.path.local or pathlib.Path),
or buffer
supports 'https://username:password@fqdn.com:port/aaa.csv'
pls clarify and add a versionadded tag (0.21.0)
pandas/io/common.py
Outdated
encoding : the encoding to use to decode py3 bytes, default is 'utf-8'
compression:
compression: string, default None
# the expl is indented on the next line
auth: (string, string), default None
username, password......
same for verify_ssl
add a versionadded tag
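As a rough illustration of the layout requested above, a docstring for a hypothetical reader function might look like the following. The function name `read_remote`, its body, and the wording of each description are assumptions made for this sketch; only the parameter names, types, defaults, and the version number come from the review comments.

```python
def read_remote(filepath_or_buffer, auth=None, verify_ssl=True):
    """
    Hypothetical reader used only to illustrate the docstring layout.

    Parameters
    ----------
    filepath_or_buffer : str
        URL or path to read from.
    auth : (str, str), default None
        Username and password for HTTP(S) basic auth.

        .. versionadded:: 0.21.0
    verify_ssl : bool, default True
        If False, allow self-signed and invalid SSL certificates.

        .. versionadded:: 0.21.0
    """
    # The body is a placeholder; the point of the sketch is the docstring.
    return filepath_or_buffer, auth, verify_ssl
```

Each entry puts the name, type, and default on one line and the indented explanation on the next, which is the shape both reviewers describe.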
pandas/io/common.py
Outdated
-------
(username, password), url_no_usrpwd : username or "", password or "",
url without username or password (if it contained it )
"""
show what this Raises
pandas/io/common.py
Outdated
url_with_uname : a url that may or may not contain username and password
see section 3.1 RFC 1738 https://www.ietf.org/rfc/rfc1738.txt
//<user>:<password>@<host>:<port>/<url-path>
auth : ( username/""/None, password/"", None) tuple
same as above
pandas/tests/io/test_common.py
Outdated
'',
'https://ccc.com:1010/aaa.txt'
)]:
un, pw, mod_url = common.split_uname_from_url(url)
does this raise at all?
This function name actually doesn't exist anymore. @skynss : could you rename this to the correct function?
pandas/tests/test_common.py
Outdated
@@ -1,223 +1,262 @@
# -*- coding: utf-8 -*-

"""
what is changed in this file that is relevant to this PR?
pandas/io/common.py
Outdated
-------
(username, password), url_no_usrpwd : username or "", password or "",
url without username or password (if it contained it )
"""
- See my comment here to patch the formatting for url_with_uname.
- The return format will need to be changed. The general format is this:
<var_name> : <data_type>
<Description>
However, in this case, it would be preferable to describe the returned object without any naming, since this is a nested tuple
object e.g.:
Returns
--------
A length-two tuple containing the following:
- A length-two tuple of username and password. These will be empty strings if none were extracted
- The URL stripped of the username and password if provided in the URL.
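One way to produce the nested return value described above is sketched below using only the standard library. The helper name `split_credentials_from_url` is hypothetical (the PR's own function was renamed during review), and this is not the implementation from the PR.

```python
from urllib.parse import urlsplit, urlunsplit


def split_credentials_from_url(url):
    """Return ((username, password), url_without_credentials).

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # urlsplit reports None when a component is absent; use "" instead.
    username = parts.username or ""
    password = parts.password or ""
    # Keep only what follows the last "@" in the network location.
    netloc = parts.netloc.rpartition("@")[2]
    stripped = urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )
    return (username, password), stripped
```

Note that this sketch does not percent-decode the username or password, so credentials containing escaped characters would need an extra `urllib.parse.unquote` step.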
pandas/io/common.py
Outdated
//<user>:<password>@<host>:<port>/<url-path>
auth : ( username/""/None, password/"", None) tuple
verify_ssl: If False, SSL certificate verification is disabled.
See my comment here to patch the formatting for your parameters listed above.
pandas/io/common.py
Outdated
Returns
-------
Request, kwargs to pass to urlopen. kwargs may be {} or {'context': obj }
See my comment here to patch the formatting for your return variable listed above.
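For readers unfamiliar with the pattern under review, here is a minimal sketch of a helper that returns a `Request` plus keyword arguments for `urlopen`, as the docstring above describes. The name `build_request` and its exact behaviour are assumptions for illustration, not the code from the PR.

```python
import base64
import ssl
from urllib.request import Request


def build_request(url, auth=None, verify_ssl=True):
    """Return (Request, kwargs) suitable for urlopen(request, **kwargs).

    Hypothetical helper for illustration only.
    """
    request = Request(url)
    if auth is not None:
        username, password = auth
        # HTTP basic auth: base64 of "username:password".
        raw = "{}:{}".format(username, password).encode("utf-8")
        token = base64.b64encode(raw).decode("ascii")
        request.add_header("Authorization", "Basic " + token)

    kwargs = {}
    if not verify_ssl:
        # Disable hostname checking first, then certificate verification.
        context = ssl.create_default_context()
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
        kwargs["context"] = context
    return request, kwargs
```

A caller would then run `urlopen(request, **kwargs)`; `kwargs` stays empty unless verification is explicitly disabled, which matches the "may be {} or {'context': obj}" wording.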
pandas/io/parsers.py
Outdated
float_precision=None):
float_precision=None,

# Basic auth (http/https) (username, password)
Remove this comment. Your documentation of the parameters in the docstring should make this clear.
pandas/io/parsers.py
Outdated
# Basic auth (http/https) (username, password)
auth=None,

# skip verify self signed SSL certificates
See my comment above. You should also be able to remove this comment.
pandas/tests/io/test_common.py
Outdated
"""
Test extraction of username, pwd from url, if contained.
"""
for url, uname, pwd, nurl in [('https://aaa:bbb@ccc.com:1010/aaa.txt',
You can take advantage of the pytest parametrisation decorator. See here for how to do it.
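A sketch of what that suggestion could look like, replacing the `for` loop over tuples with `pytest.mark.parametrize`. To stay self-contained it checks the standard library's `urlsplit` rather than the PR's helper, so the test name and assertions are assumptions; the URLs are the ones from the diff.

```python
from urllib.parse import urlsplit

import pytest


@pytest.mark.parametrize("url, uname, pwd", [
    ("https://aaa:bbb@ccc.com:1010/aaa.txt", "aaa", "bbb"),
    ("https://ccc.com:1010/aaa.txt", "", ""),
])
def test_credentials_in_url(url, uname, pwd):
    # Each tuple above becomes its own test case with its own report line.
    parts = urlsplit(url)
    assert (parts.username or "") == uname
    assert (parts.password or "") == pwd
```

Compared with a loop, a failure in one case no longer hides the results of the others.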
843f135 to eb03fd3
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -40,7 +40,8 @@ Other Enhancements
- :func:`DataFrame.clip()` and :func:`Series.clip()` have gained an ``inplace`` argument. (:issue:`15388`)
- :func:`crosstab` has gained a ``margins_name`` parameter to define the name of the row / column that will contain the totals when ``margins=True``. (:issue:`15972`)
- :func:`Dataframe.select_dtypes` now accepts scalar values for include/exclude as well as list-like. (:issue:`16855`)

- :func:`read_csv`, :func:`read_html`, :func:`read_json`, :func:`read_html` now accept auth in url //<user>:<password>@<host>:<port>/<url-path>, or ``auth`` tuple (username, password) parameter
- :func:`read_csv`, :func:`read_html`, :func:`read_json`, :func:`read_html` now accept ``verify_ssl`` False to disable https/ssl certificate verification (eg: self signed ssl certs in testing) (:issue:`16716`)
I'm personally still in favor of providing something more general like what I had suggested before and have a section explaining what you can do now.
@jreback : Thoughts?
can you elaborate.
I was thinking that we should condense into one line:
- It is possible to read data (i.e. CSV, JSON, HTML) from
a URL that is password-protected (:issue:`16716`)
In addition, we should add a section about what you can do now with password-authenticated URL's.
@skynss : In addition to failing your own test, you seem to have broken several others because the file paths on some of the tests didn't work anymore. Could you run tests locally on your machine to confirm and investigate this?
@skynss : whoa, what happened to
@gfyoung Sorry I had made a silly mistake. I ran the tests that I changed and they seem to work. I cannot run all tests because of my current environment (ssh does not allow for Displayable tests to run)
@skynss : No worries! No one's going to get mad if you make mistakes like this, especially since you are very new to the process 😄 I would focus right now on trying to address all of our comments as well as you can and ensuring that tests do not fail. Make as many commits as you need to get that done.
@jreback : Yes, absolutely. We wouldn't need to set up any authentication, as
i would make this an optional dep then
Makes sense. Probably should first clean up the PR as is before adding this in.
@jreback @gfyoung wrt Here are few options to and their advantages disadvantages:
Option 2: Auto-detect and use
Option 3: Forget adding
Option 4: Stop using urllib and switch with strong dependency to
Option 5: Forget requests, allow user to just pass in urllib.request.Request object and possibly other headers etc directly to
I am not happy with Option 2 as it will produce inconsistent behavior, eg Issue #17019, and has a greater chance of causing regression. I don't recommend Option 4 or Option 5 due to complexity, especially in the Py2/Py3 scenario. I think Option 1 is best, but it should be checked in 2 phases:
@skynss : The signature for Auto-detecting would be the best way to incorporate I agree with you on option 3: username and password are the simplest hands down. Option 5 is our backup. We don't need to add
…tml.py. Common logic update. whatsnew fixed
…r 'test_html.py' with error AssertionError: Did not see expected warning of class 'InsecureRequestWarning'
the way to do this is:
the idea is that we can push as much code as possible to use requests, rather than re-invent the wheel
@jreback Why push only auth scenarios to requests if installed? Instead, if requests is installed, use it by default for all http(s) cases. If not installed, use existing codepath - as-is? Also, instead of passing in Also at a quick glance, verb other than Right now, I don't see how pandas returns status code, and response headers. By utilizing requests
import pandas as pd
from requests import Request, Session
# req_session is optional and replaces username, password,
# by default, no requests.session needed. Backwards compatible api.
df = pd.read_csv('https://uname:pwd@aa.com/bb.csv') # will work automatically because
# requests will handles it if installed, else handle with existing codebase
# custom auth
s = Session()
s.auth = MyAuthProvider('secret-key') # custom auth provider supported by requests
df = pd.read_csv(url, req_session=s)
# optional advanced scenarios
s = Session()
s.auth = ('darth', 'l0rd') # if user wants to perform basic auth Skip if url itself contains username and pwd
s.timeout = (3.05, 27) # if user wants to modify timeout
s.verify = False # if user wants to disable ssl cert verification
s.headers.update( {'User-Agent': 'Custom user agent'} ) # extensible to set any custom header needed
s.proxies = { 'http': 'http://a.com:100'} # if user has proxies
s.cert = '/path/client.cert' # if custom cert is needed
df = pd.read_csv( 'https://aa.com/bbb.csv', req_session=s)
# support verbs other than 'GET' such as 'POST'
req = Request('POST', 'http://joker:pwd@nlp_service.api/email_sentiment_extract?out=json')
prepped = req.prepare()
prepped.body = 'from: aaa@aa.bb\nto: cc@dd.ee\nsubject:Complaint letter\n\nbody: I am feeling :(' # multiple lines
df = pd.read_json( prepped) # minor update pandas code to detect type(Request) and submit it using requests session in lieu of URL.
"""
[{
'from': 'aaa@aa.bb',
'to': 'cc@dd.ee',
'email_type': 'complaint',
'sentiment': 'unhappy',
}]
"""
# Event hooks callback (eg log http status codes)
def print_http_status(r, *args, **kwargs):
    print(r.status_code)
    print(r.headers['Content-Length'])
s = Session()
s.hooks = dict(response=print_http_status)
df = pd.read_csv( 'https://aa.com/bbb.csv', req_session=s)
It does mean that some of the work I did becomes unnecessary, but if you are ok to use requests, I think the most extensible and simplest scenario is to allow passing in an optional requests.Session parameter. For the above scenarios, following changes would need to be made to code checked into
I am far from being an expert at
EDIT: I think I might already have this mostly working (except for read_html)
@jreback @gfyoung : Take a look at a new branch
It is probably ready to be merged in. Please review. Let me know if you want me to start a different pull request.
@jreback : I wouldn't say we're re-inventing the wheel. It's rather than
@skynss You mentioned before that you were concerned about inconsistencies between
@skynss : That's a bit of reversal since your comment earlier here. That being said, I'm still wary about allowing a user to pass this in. Since it seems we are moving to incorporate This interface allows us to have the flexibility of implementing whatever handling we would like in terms of authentication in the future WITHOUT impacting how the user has to interact with it. Internally, we would check if certain fields existed and utilize them for authentication (e.g.
so the current behavior must work w/o requests installed at all. I would push to have as minimal code added as possible, IOW use requests for the auth and just use the current code for existing things; that said if you want to make a mock requests internally then this might be simpler.
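The constraint stated above (existing behaviour must work without requests, with requests used only for auth) could be sketched roughly as follows. The names `requests_available` and `fetch_bytes` are hypothetical, and this is only one possible reading of the comment, not code from either branch.

```python
import importlib.util
from urllib.request import urlopen


def requests_available():
    """Report whether the optional 'requests' package can be imported."""
    return importlib.util.find_spec("requests") is not None


def fetch_bytes(url, auth=None):
    """Hypothetical helper: urllib by default, requests only for auth."""
    if auth is None:
        # Existing code path: no new dependency is needed.
        with urlopen(url) as response:
            return response.read()
    if not requests_available():
        raise ImportError("passing 'auth' requires the optional 'requests' package")
    import requests  # imported lazily so it stays an optional dependency

    response = requests.get(url, auth=auth)
    response.raise_for_status()
    return response.content
```

Keeping the import inside the auth branch means users who never pass credentials are unaffected whether or not requests is installed.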
Can you clarify?
# for basic auth
df = pd.read_csv(url, auth={'uname': 'aa', 'pwd': 'bb'})
# bypassing ssl cert verification is not authentication, so passing it into the auth param would be confusing
df = pd.read_csv(url, auth={'verify_ssl': False})
Perhaps what you meant (and what I think is good idea) was you prefer end user to not make any imports of
# simple user never has to import requests
up = {'auth': ('user', 'pwd'), 'verify_ssl': False}
df = pd.read_csv(url, url_params=up)
# or power user can choose to use requests.Session directly
import requests
up = requests.Session()
up.auth = ('uname', 'pwd')
up.headers.update({'User-Agent': 'My custom user agent'})
df = pd.read_csv(url, url_params=up)
# internally we check
if isinstance(url_params, requests.Session):
    sess = url_params
elif isinstance(url_params, dict):
    sess = requests.Session()
    # then we add in auth/verify_ssl in here
Agreed. Both basic-auth-https-self-signed as well as use-requests already satisfy them.
If we just want to use requests for auth (and I assume you meant ssl cert verification bypass too) then why are we even having a dependency on requests? First option of use-requests has the least amount of code change ( compared to basic-auth-https-self-signed ). It uses requests for all scenarios, only if it is installed. This functionality is passing all I got engaged as I opened an issue for basic auth + self signed certs.
df = pd.read_csv('http://handsome-equator.000webhostapp.com/no_auth/aaa.csv') # works fine
df = pd.read_csv('http://handsome-equator.000webhostapp.com:80/no_auth/aaa.csv') # fails with 404 if port number is added. It shouldn't; it doesn't in a browser or with requests
It fails because So, what is the direction for http(s) in pandas? Natively use Wish there was a bit more synchronous way to communicate, to hammer out the issues. faster
@skynss : You understood my point correctly! True that
@gfyoung Take a look at latest version of use-requests, this change is implemented. Added new tests, all pytests are passing. Should I create a pull request for that branch of fork instead of this one?
I would wait to see what @jreback has to say before you go through the effort of preparing a PR. I would try to clean up this PR to best you can for now.
@skynss your branch above looks much better than the current solution. I have a number of comments, but will comment directly on the PR.
@jreback So I should go ahead and submit PR for
Yes, absolutely!
Don't close this PR yet. When we merge, we'll close whichever PR we don't want to use.
I mean
@skynss I think would be good for 0.21.0. can you rebase / update and get this passing?
pls rebase / update & move whatsnew to 0.22.0
can you rebase
closing as stale. if you want to continue working, pls ping. @skynss this is a nice change, but we would need to integrate to the current infrastructure.
git diff upstream/master --name-only -- '*.py' | flake8 --diff