Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG, ENH: Read Data From Password-Protected URL's and allow self signed SSL certs #16910

Closed
wants to merge 14 commits into from

Conversation

skynss
Copy link

@skynss skynss commented Jul 13, 2017

@gfyoung gfyoung added Bug IO Data IO issues that don't fit into a more specific label labels Jul 13, 2017
@@ -40,7 +40,8 @@ Other Enhancements
- :func:`DataFrame.clip()` and :func:`Series.clip()` have gained an ``inplace`` argument. (:issue:`15388`)
- :func:`crosstab` has gained a ``margins_name`` parameter to define the name of the row / column that will contain the totals when ``margins=True``. (:issue:`15972`)
- :func:`Dataframe.select_dtypes` now accepts scalar values for include/exclude as well as list-like. (:issue:`16855`)

- :func:`read_csv` `read_html` `read_json` `read_html` now accept auth in url //<user>:<password>@<host>:<port>/<url-path>, or ``auth`` tuple (username, password) parameter
- :func:`read_csv` `read_html` `read_json` `read_html` now accept ``verify_ssl`` False to disable https/ssl certificate verification (eg: self signed ssl certs in testing)
.. _whatsnew_0210.api_breaking:
Copy link
Member

@gfyoung gfyoung Jul 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's condense into one line:

- It is possible to read data (i.e. CSV, JSON, HTML) from
a URL that is password-protected (:issue:`16716`)

Note that I also put the issue number at the end of the line.

compression:
auth: (str,str), default None. (username, password) for HTTP(s) basic auth
verify_ssl: Default True. If False, allow self signed and invalid SSL
certificates for https

Copy link
Member

@gfyoung gfyoung Jul 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Why is the compression field empty?

  2. The formatting for auth and verify_ssl should be patched. The general format is the following:

<var_name> : <data_type>, <defaults>
     <description>

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the compression already existed, but was not in the comments. I simply added it - to comments and left it empty because I was not too familiar to add best docs. I'll fix the rest

@@ -40,7 +40,8 @@ Other Enhancements
- :func:`DataFrame.clip()` and :func:`Series.clip()` have gained an ``inplace`` argument. (:issue:`15388`)
- :func:`crosstab` has gained a ``margins_name`` parameter to define the name of the row / column that will contain the totals when ``margins=True``. (:issue:`15972`)
- :func:`Dataframe.select_dtypes` now accepts scalar values for include/exclude as well as list-like. (:issue:`16855`)

- :func:`read_csv` `read_html` `read_json` `read_html` now accept auth in url //<user>:<password>@<host>:<port>/<url-path>, or ``auth`` tuple (username, password) parameter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use commans in between, and you need :func: on each one

Copy link
Member

@gfyoung gfyoung Jul 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback : See my comment above. Do we need this level of detail here? Perhaps a separate section describing what you can now might be good?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback @gfyoung Pls let me know which is the way to go. Also, if summary here with detailed section.. where does the section belong in this doc?

Copy link
Member

@gfyoung gfyoung Jul 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skynss : Sure thing. I would refrain from doing the detailed section right now, as I would like to see what @jreback thinks of that suggestion. If he thinks it's a good idea, the section would go right below the "Other Enhancements" title. In that section, you would provide examples of how to use your newly added functionality.

@@ -40,7 +40,8 @@ Other Enhancements
- :func:`DataFrame.clip()` and :func:`Series.clip()` have gained an ``inplace`` argument. (:issue:`15388`)
- :func:`crosstab` has gained a ``margins_name`` parameter to define the name of the row / column that will contain the totals when ``margins=True``. (:issue:`15972`)
- :func:`Dataframe.select_dtypes` now accepts scalar values for include/exclude as well as list-like. (:issue:`16855`)

- :func:`read_csv` `read_html` `read_json` `read_html` now accept auth in url //<user>:<password>@<host>:<port>/<url-path>, or ``auth`` tuple (username, password) parameter
- :func:`read_csv` `read_html` `read_json` `read_html` now accept ``verify_ssl`` False to disable https/ssl certificate verification (eg: self signed ssl certs in testing)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this user visible?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is. These parameters are available at the top level API.

@@ -186,7 +190,12 @@ def get_filepath_or_buffer(filepath_or_buffer, encoding=None,
----------
filepath_or_buffer : a url, filepath (str, py.path.local or pathlib.Path),
or buffer
supports 'https://username:password@fqdn.com:port/aaa.csv'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls clarify and add a versionadded tag (0.21.)

encoding : the encoding to use to decode py3 bytes, default is 'utf-8'
compression:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compression: string, default None

# the expl is indented on the next line
auth: (string, string), default None
    username, password......

same for verify_ssl

add a versionadded tag

-------
(username, password), url_no_usrpwd : username or "", password or "",
url without username or password (if it contained it )
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

show what this Raises

url_with_uname : a url that may or may not contain username and password
see section 3.1 RFC 1738 https://www.ietf.org/rfc/rfc1738.txt
//<user>:<password>@<host>:<port>/<url-path>
auth : ( username/""/None, password/"", None) tuple
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

'',
'https://ccc.com:1010/aaa.txt'
)]:
un, pw, mod_url = common.split_uname_from_url(url)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this raise at all?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function name actually doesn't exist anymore. @skynss : could you rename this to the correct function?

@@ -1,223 +1,262 @@
# -*- coding: utf-8 -*-

"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is changed in this file that is releveant to this PR?

-------
(username, password), url_no_usrpwd : username or "", password or "",
url without username or password (if it contained it )
"""
Copy link
Member

@gfyoung gfyoung Jul 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. See my comment here to patch the formatting for url_with_uname.

  2. The return format will need to be changed. The general format is this:

<var_name> : <data_type>
    <Description>

However, in this case, it would be preferable to describe the returned object without any naming, since this is a nested tuple object e.g.:

Returns
--------
A length-two tuple containing the following:
    - A length-two tuple of username and password.  These will be empty strings if none were extracted
    - The URL stripped of the username and password if provided in the URL.

//<user>:<password>@<host>:<port>/<url-path>
auth : ( username/""/None, password/"", None) tuple
verify_ssl: If False, SSL certificate verification is disabled.

Copy link
Member

@gfyoung gfyoung Jul 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment here to patch the formatting for your parameters listed above.


Returns
-------
Request, kwargs to pass to urlopen. kwargs may be {} or {'context': obj }
Copy link
Member

@gfyoung gfyoung Jul 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment here to patch the formatting for your return variable listed above.

float_precision=None):
float_precision=None,

# Basic auth (http/https) (username, password)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this comment. Your documentation of the parameters in the docstring should make this clear.

# Basic auth (http/https) (username, password)
auth=None,

# skip verify self signed SSL certificates
Copy link
Member

@gfyoung gfyoung Jul 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment above. You should also be able to remove this comment.

"""
Test extraction of username, pwd from url, if contained.
"""
for url, uname, pwd, nurl in [('https://aaa:bbb@ccc.com:1010/aaa.txt',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can take advantage of pytest parametrisation decorator. See here for how to do it.

@gfyoung gfyoung changed the title Basic auth https self signed BUG, ENH: Read Data From Password-Protected URLBasic auth https self signed Jul 13, 2017
@gfyoung gfyoung changed the title BUG, ENH: Read Data From Password-Protected URLBasic auth https self signed BUG, ENH: Read Data From Password-Protected URL's Jul 13, 2017
@skynss skynss force-pushed the basic-auth-https-self-signed branch from 843f135 to eb03fd3 Compare July 14, 2017 01:02
@@ -40,7 +40,8 @@ Other Enhancements
- :func:`DataFrame.clip()` and :func:`Series.clip()` have gained an ``inplace`` argument. (:issue:`15388`)
- :func:`crosstab` has gained a ``margins_name`` parameter to define the name of the row / column that will contain the totals when ``margins=True``. (:issue:`15972`)
- :func:`Dataframe.select_dtypes` now accepts scalar values for include/exclude as well as list-like. (:issue:`16855`)

- :func:`read_csv`, :func:`read_html`, :func:`read_json`, :func:`read_html` now accept auth in url //<user>:<password>@<host>:<port>/<url-path>, or ``auth`` tuple (username, password) parameter
- :func:`read_csv`, :func:`read_html`, :func:`read_json`, :func:`read_html` now accept ``verify_ssl`` False to disable https/ssl certificate verification (eg: self signed ssl certs in testing) (:issue:`16716`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm personally still in favor of providing something more general like what I had suggested before and have a section explaining what you can do now.

@jreback : Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you elaborate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that we should condense into one line:

- It is possible to read data (i.e. CSV, JSON, HTML) from
a URL that is password-protected (:issue:`16716`)

In addition, we should add a section about what you can do now with password-authenticated URL's.

@skynss skynss changed the title BUG, ENH: Read Data From Password-Protected URL's BUG, ENH: Read Data From Password-Protected URL's and allow self signed SSL certs Jul 14, 2017
@gfyoung
Copy link
Member

gfyoung commented Jul 14, 2017

@skynss : In addition to failing your own test, you seemed to have broken several others because the file paths on some of the tests didn't work anymore. Could you run tests locally on your machine to confirm and investigate this?

@gfyoung
Copy link
Member

gfyoung commented Jul 14, 2017

@skynss : whoa, what happened to test_common.py ? Why did you delete everything there?

@skynss
Copy link
Author

skynss commented Jul 14, 2017

@gfyoung Sorry I had made a silly mistake. I ran the tests that I changed and they seem to work. I cannot run all tests because of my current environment (ssh does not allow for Displayable tests to run)
The issue was I had wrongfully overwritten python/tests/test_common.py with python/tests/io/test_common.py originally and submitted it - which obviously caused tests to fail. I think I have corrected the mistake. I am a bit new at this process.

@gfyoung
Copy link
Member

gfyoung commented Jul 14, 2017

@skynss : No worries! No one's going to get mad if you make mistakes like this, especially since you are very new to the process 😄

I would focus right now on trying to address all of our comments as well as you can and ensuring that tests do not fail. Make as many commits as you need to get that done.

@gfyoung
Copy link
Member

gfyoung commented Jul 19, 2017

if we used requests as an optional dep to handle url like things would pandas code be much simpler?

@jreback : Yes, absolutely. We wouldn't need to setup any authentication, as requests would handle that for us. You can look at the code examples from the issue to see the difference.

@jreback
Copy link
Contributor

jreback commented Jul 19, 2017

i would make this an optional dep then

@gfyoung
Copy link
Member

gfyoung commented Jul 19, 2017

i would make this an optional dep then

Makes sense. Probably should first clean-up the PR as is before adding this in.

@skynss
Copy link
Author

skynss commented Jul 19, 2017

@jreback @gfyoung wrt requests as optional dependency, what do we expect as end user experience?

Here are few options to and their advantages disadvantages:
Option 1: Add requests.Session() as a optional param to read_csv( url, req_session=my_session) or read_csv( url, use_requests_lib=True)- in which case, the user is knowingly opting to bypass urllib from pandas and use requests instead. If read_csv( url, username='darth', password='mtfbwy', verify_ssl=False, req_session=my_session), only if my_session is valid or use_requests_lib=True is passed, is requests library used, else urllib is used even if requests package is already installed.
Advantage: user has greater control. Passing requests session might enable power user functionality such as accessing AWS API gateway with token, and other scenarios.
Disadvantage: Existing code stays.. Add new code to optionally handle requests

Option 2: Auto-detect and use requests package if installed, and use urllib if not installed. So end user would possibly call read_csv( url, username='darth', password='mtfbwy', verify_ssl=False) and has no control over whether urllib is used or requests is used.
Advantage: read_csv api is cleaner as it doesnt have additional parameter of use_requests_lib.
Disadvantage: User may notice inconsistent behavior in different environments without warning - and might be very hard to detect the cause is the presence or absence of requests package. Example: Imagine AWS Lambda does not have requests package installed in their python runtime.. and I upload a Lambda python function with locally packaged pandas v.0.20 in single zip. AWS decides that requests is common ask and includes it in their runtime. The behavior of my lambda function might change without me having any clue of why.

Option 3: Forget adding username, password, verify_ssl to read_csv and just add req_session parameter to read_csv instead. Expect end user to know how to build a requests session with username/pwd/verify set and pass that object to read_csv.
Advantage: Cleaner and powerful read_csv.
Disadvantage: Dev now has to know about auth/ssl using requests. Possibly, if end user is going to do this, it might just be easier to just utilize requests + StringIO and bypass the http functionality in read_csv

Option 4: Stop using urllib and switch with strong dependency to requests.
Advantages: Simpler code
Disadvantages: Potential regressions. Bulkier packages going up AWS Lambda. read_csv will still need to be able to accept a req_session object because there is no way to add all potential scenarios as parameters to read_csv

Option 5: Forget requests, allow user to just pass in urllib.request.Request object and possibly other headers etc directly to read_csv.
Advantage: No regressions. No external dependency. Great for power user eg: change useragent string, etc. Use for other scenarios such as AWS API gateway authentication. Easy to code it in existing codebase.
Disadvantage: For simple use cases, a novice user has to learn urllib.

I am not happy with Option 2 as it will produce inconsistent behavior eg Issue #17019 and has greater chance of causing regression.
While Option 3 is better than released version, I might as well consider getting data separately and passing in StringIO object.

I dont recommend Option4 or Option 5 due to complexity esp in Py2/Py3 scenario.

I think Option 1 is best but it be checked in 2 phases:
Phase 1: without requests i.e this issue.
Phase 2: allow requests session as parameter. Have you folks observed lot of issues that requests might resolve? If yes, Phase 2 should be sooner rather than later. I think making the request call is probably easy to implement, but handling and testing various scenarios of response (mimetypes, gzip, multipart, redirections, chunked, various verbs, encoding) will probably need time and expertise. I'm bit too tight on bandwidth to dig deeper into this.

@gfyoung
Copy link
Member

gfyoung commented Jul 19, 2017

@skynss : The signature for read_csv is bloated enough as it is. Username and password should be the only way to authenticate and is indeed the easiest. Thus, I would be against option 1.

Auto-detecting would be the best way to incorporate requests behavior. I agree that inconsistencies might be a concern, but that's bound to happen no matter how much you try to standardize. That's why we're making it an optional dependency. We would also need to document that potential issue so that users are aware of this.

I agree with you on option 3: username and password are the simplest hands down.
I agree with you on option 4: we are not relying on requests beyond being an optional dependency.

Option 5 is our backup. We don't need to add requests support at the moment. You could even do that as a follow-up.

Sky NSS added 2 commits July 19, 2017 19:56
…tml.py. Common logic update. whatsnew fixed
…r 'test_html.py' with error AssertionError: Did not see expected warning of class 'InsecureRequestWarning'
@jreback
Copy link
Contributor

jreback commented Jul 20, 2017

the way to do this is:

  • if no auth is provided use current behavior
  • if auth is provide, import requests, if its not available, raise a nice error message that auth requires requests
  • use request to service the request.

the idea is that we can push as much code as possible to use requests, rather that re-invent the wheel

@skynss
Copy link
Author

skynss commented Jul 20, 2017

@jreback Why push only auth scenarios to requests if installed? Instead, if requests is installed, use it by default for all http(s) cases. If not installed, use existing codepath - as-is?

Also, instead of passing in username, pwd, verify_ssl, it is probably easier to instead just pass in optional only requests.session object which accepts not only auth but also enables a ton of other scenarios such as adding proxies, client certs, useragent changes, and custom auth scenarios such as 'Calling AWS api-gateway with cognito (just googled a sample scenario)'

Also at a quick glance, verb other than GET such as POST, etc might be automatically supported using prepared requests

Right now, I don't see how pandas returns status code, and response headers. By utilizing requests event hooks functionality built into requests, this can be handled.

import pandas as pd
from requests import Session

# req_session is optional and replaces username, password, 
# by default, no requests.session needed. Backwards compatible api.
df = pd.read_csv('https://uname:pwd@aa.com/bb.csv') # will work automatically because 
# requests will handles it if installed, else handle with existing codebase

# custom auth
s = Session()
s.auth = MyAuthProvider('secret-key') # custom auth provider supported by requests
df = pd.read_csv(url, req_session=s)

# optional advanced scenarios
s = Session()
s.auth = ('darth', 'l0rd')  # if user wants to perform basic auth Skip if url itself contains username and pwd
s.timeout = (3.05, 27)                           # if user wants to modify timeout
s.verify = False                                      # if user wants to disable ssl cert verification
s.headers.update( {'User-Agent': 'Custom user agent'} )  # extensible to set any custom header needed
s.proxies = { 'http': 'http://a.com:100'}  # if user has proxies 
s.cert = '/path/client.cert'                     # if custom cert is needed
df = pd.read_csv( 'https://aa.com/bbb.csv', req_session=s)

# support verbs other than 'GET' such as 'POST'
r = Request('POST', 'http://joker:pwd@nlp_service.api/email_sentiment_extract?out=json')
prepped = req.prepare()
prepped.body = 'from: aaa@aa.bb\nto: cc@dd.ee\nsubject:Complaint letter\n\nbody: I am feeling :(' # multiple lines
df = pd.read_json( prepped) # minor update pandas code to detect type(Request) and submit it using requests session in lieu of URL.
"""
[{
  'from': 'aaa@aa.bb',
  'to': 'cc@dd.ee',
  'email_type': 'complaint',
  'sentiment': 'unhappy',
}]
"""

# Event hooks callback (eg log http status codes)
def print_http_status(r, *args, **kwargs):
    print(r.status_code)
    print(r.headers['Content-Length'])
s = Session()
s.hooks = dict(response=print_http_status)
df = pd.read_csv( 'https://aa.com/bbb.csv', req_session=s)

It does mean that some of the work I did becomes unnecessary, but if you are ok to use requests, I think the most extensible and simplest scenario is to allow passing in optional request.session parameter.

For the above scenarios, following changes would need to be made to code checked into master:

  1. read_* accept new optional parameter req_session, and detect if 'url' is passed or 'prepared' is.
  2. if requests installed, use url/prepared/req_session as specified. Call session.send()
  3. if requests not installed, don't change current behavior of read_*
  4. Testing: Don't have to test stuff that requests already tests.

I am far from being an expert at requests library and encourage others to review and correct me.

EDIT: I think I might already have this mostly working (except for read_html)

@skynss
Copy link
Author

skynss commented Jul 21, 2017

@jreback @gfyoung : Take a look at a new branch use-requests . Its a completely different codebase which uses requests

  • If installed uses python-requests by default.. else use existing codebase of urlopen
  • only one new optional param of req_session added in lieu of username/pwd, etc
  • fully working
  • all pytest unit tests passed in pandas/tests/io - only thing remaining is to figure out how to add test case to emulate python-requests is not installed?
  • whatsnew updated
  • codebase rebased with pandas-dev/pandas master as of few mins ago.

It is probably ready to be merged in. please review. let me know if you want me to start a different pull request.

@gfyoung
Copy link
Member

gfyoung commented Jul 21, 2017

the idea is that we can push as much code as possible to use requests, rather that re-invent the wheel

@jreback : I wouldn't say we're re-inventing the wheel. It's rather than requests came up with a nice way to provide an even higher-level interface to Python urllib libraries. That being said, I'm perfectly fine with using requests so long as we're okay with making it an optional dependency that is required when authentication with data IO is needed.

@gfyoung
Copy link
Member

gfyoung commented Jul 21, 2017

Why push only auth scenarios to requests if installed? Instead, if requests is installed, use it by default for all http(s) cases. If not installed, use existing codepath - as-is?

@skynss You mentioned before that you were concerned about inconsistencies between requests and urllib. If we were to use requests (at least for now) solely for authentication, we would not have to worry about this issue. Not to mention, it will make things easier to test for the time being.

@gfyoung
Copy link
Member

gfyoung commented Jul 21, 2017

Also, instead of passing in username, pwd, verify_ssl, it is probably easier to instead just pass in optional only requests.session object which accepts not only auth but also enables a ton of other scenarios

@skynss : That's a bit of reversal since your comment earlier here. That being said, I'm still wary about allowing a user to pass this in.

Since it seems we are moving to incorporate requests into pandas, IMO we should keep the interface relatively agnostic to the user i.e., instead of accepting a session object, we accept an auth parameter, which is a dictionary mapping to different types of authentication that we could accept such as username / password with verify_ssl or a session object.

This interface allows us to have the flexibility of implementing whatever handling we would like in terms of authentication in the future WITHOUT impacting how the user has to interact with it. Internally, we would check if certain fields existed and utilize them for authentication (e.g. if 'session' in auth, then use the session to authenticate).

@jreback
Copy link
Contributor

jreback commented Jul 21, 2017

so the current behavior must work w/o requests installed at all.

I would push to have as minimal code added as possible, IOW use requests for the auth and just use the current code for existing things; that said if you want to make a mock requests internally then this might be simpler.

@skynss
Copy link
Author

skynss commented Jul 21, 2017

@gfyoung

IMO we should keep the interface relatively agnostic to the user i.e., instead of accepting a session object, we accept an auth parameter, which is a dictionary mapping to different types of authentication that we could accept such as username / password with verify_ssl or a session object.

Can you clarify?

# for basic auth
df = pd.read_csv(url, auth={ 'uname':'aa', 'pwd':'bb'})

# to bypass ssl cert verification (which is not authentication) it would be confusing to pass it into auth param
df = pd.read_csv(url, auth={ 'verify_ssl':False})

Perhaps what you meant (and what I think is good idea) was you prefer end user to not make any imports of requests but allow to pass in dict or requests.Session such as following:

# simple user never has to import requests
up = {  'auth' : ('user','pwd'), 'verify_ssl' : False } 
df = pd.read_csv( url, url_params=up)

# or power user can choose to use request.Session directly
import requests
up = requests.Session()
up.auth = ('uname','pwd')
up.header.modify( {'User-Agent' : 'My custom user agent'})
df = pd.read_csv( url, url_params=up)

# internally we check 
if type(url_params) is requests.Session:
    sess = url_params
elif type(url_params) is dict:
    sess = requests.Session()
    # then we add in auth/verify_ssl in here

@jreback

so the current behavior must work w/o requests installed at all

Agreed. Both basic-auth-https-self-signed as well as use-requests already satisfy them.

use requests for the auth and just use the current code for existing things

If we just want to use requests for auth (and I assume you meant ssl cert verification bypass too) then why are we even having a dependency on requests? First option of basic-auth-https-self-signed already does that, and its tested.

use-requests has the least amount of code change ( compared to basic-auth-https-self-signed ). It uses requests for all scenarios, only if it is installed. This functionality is passing all pytest . Also, if requests is not installed, existing functionality continues to work as is.

I got engaged as I opened an issue for basic auth + self signed certs.
Tomorrow someone might open an issue that pandas doesnt support today eg: proxies. Yet another issue might be setting custom headers, etc.
Here is a bug in released version of pandas. This does not involve https, or auth of any kind. This is supposed to work

df = pd.read_csv('http://handsome-equator.000webhostapp.com/no_auth/aaa.csv') # works fine
df = pd.read_csv('http://handsome-equator.000webhostapp.com:80/no_auth/aaa.csv') # fails with 404 if port number is added. it shouldnt. It doesnt in browser, or requests

It fails because urllib behavior. It works in use-requests if requests is installed and it is used even for existing scenarios when installed.

So, what is the direction for http(s) in pandas? Natively use urllib or move to requests? If the goal is to try to use requests the most we can, then what is the test acceptance criteria for trying to use requests most when available? use-requests passes all tests. Besides api design, what else remains?

Wish there was a bit more synchronous way to communicate, to hammer out the issues. faster

@gfyoung
Copy link
Member

gfyoung commented Jul 21, 2017

@skynss : You understood my point correctly! True that verify_ssl is technically not an authentication parameter, so I agree that url_params might be better then.

@skynss
Copy link
Author

skynss commented Jul 24, 2017

You understood my point correctly! True that verify_ssl is technically not an authentication parameter, so I agree that url_params might be better then.

@gfyoung Take a look at latest version of use-requests, this change is implemented. Added new tests, all pytests are passing. Should I create a pull request for that branch of fork instead of this one?

@gfyoung
Copy link
Member

gfyoung commented Jul 24, 2017

Should I create a pull request for that branch of fork instead of this one?

I would wait to see what @jreback has to say before you go through the effort of preparing a PR. I would try to clean up this PR to best you can for now.

@jreback
Copy link
Contributor

jreback commented Jul 26, 2017

@skynss you branch above looks much better than the current soln. I have a number of comments, but will comment directly on the PR.

@skynss
Copy link
Author

skynss commented Jul 26, 2017

I would wait to see what @jreback has to say before you go through the effort of preparing a PR

@skynss you branch above looks much better than the current soln. I have a number of comments, but will comment directly on the PR.

@jreback So I should go ahead and submit PR for use-requests branch? I am asking to be sure you meant use-requests branch as "your branch above"

@gfyoung
Copy link
Member

gfyoung commented Jul 26, 2017

So I should go ahead and submit PR for use-requests branch? I am asking to be sure you meant use-requests branch as "your branch above"

Yes, absolutely!

@gfyoung
Copy link
Member

gfyoung commented Jul 26, 2017

Don't close this PR yet. When we merge, we'll close whichever PR we don't want to use.

@jreback
Copy link
Contributor

jreback commented Jul 26, 2017

I mean use-requests (which you pointed to above). actually pls close this one when you push. That's going to be a replacement for this one.

@jreback
Copy link
Contributor

jreback commented Sep 23, 2017

@skynss I think would be good for 0.21.0. can you rebase / update and get this passing?

@jreback
Copy link
Contributor

jreback commented Nov 10, 2017

pls rebase / update & move whatsnew to 0.22.0

@jreback
Copy link
Contributor

jreback commented Nov 25, 2017

can you rebase

@jreback
Copy link
Contributor

jreback commented Jan 21, 2018

closing as stale. if you want to continue working, pls ping.

@skynss this is a nice change, but we would need to integrate to the current infrastructure.

@jreback jreback closed this Jan 21, 2018
@ocefpaf ocefpaf mentioned this pull request Oct 10, 2019
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Enhancement IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read_csv from HTTPs + basic-auth + custom port throws an error (urlopen error)
4 participants