read_csv from HTTPs + basic-auth + custom port throws an error (urlopen error) #16716
Comments
Apparently not with But joking aside, that's an awkward problem to have (can't replicate since I have no endpoint against which to test this). Also, this is not easy to test without publicly providing credentials to an endpoint. I read the SO post, and I was wondering: can you confirm that If that is the case, I suppose we could implement a wrapper that makes the request and then returns the response content in place of |
With requests and StringIO it works well. I am travelling today, but over the next 1-2 days I can post a reproducible test URL or simple repro code. I suspect it reproduces with plain HTTP and no real auth, because I suspect this is simply a parsing error in the hostname and port before the outbound call even starts. I can confirm in the next 1-2 days when I am back, unless you beat me to it. |
Yeah if you could provide repro code that would be useful since it is hard to reproduce without our own endpoint. However, if you could confirm none of the other solutions proposed in the SO work besides the |
It appears to me that an existing URL is not needed, because the issue is just in the parsing of the URL, e.g.:
df = pd.read_csv('http://username:pwd@cnn.com:8080/get_content.csv')
# returns URLError: <urlopen error [Errno 11003] getaddrinfo failed>
df = pd.read_csv('http://username:pwd@cnn.com/get_content.csv')
# returns InvalidURL: nonnumeric port: 'pwd@cnn.com' (failing to find port number)
However, the simplest way to fake it is:
with open('/path1/aaa.csv', 'w') as f:
    f.write('animal,bird\ncat,pigeon\nmonkey,swan')
cd /path1
Now the URL http://localhost:8080/aaa.csv should exist.
import requests
import pandas as pd
u1 = 'http://localhost:8080/aaa.csv'
print( requests.get(u1).text) # as expected prints contents of aaa.csv
df = pd.read_csv(u1) # as expected, loads contents of aaa.csv into df
u2 = 'http://uname:pwd@localhost:8080/aaa.csv'
print( requests.get(u2).text) # as expected, prints contents of aaa.csv (ignores the unnecessary uname and pwd)
df = pd.read_csv(u2) # URLError: <urlopen error [Errno 11003] getaddrinfo failed>
I'll provide a working positive test case URL later tomorrow once I set it up. It would be good to engineer a fix to allow self-signed certs for testing too, e.g. requests has verify=False. Helpful for testing. |
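The two error messages above are consistent with the host string being split on the wrong colon. A minimal sketch, assuming Python 3's standard `urllib.parse` (this is not pandas' internal code), shows that the URL itself parses cleanly and where a split that ignores the `user:password@` part would go wrong. The helper `naive_host_port` is hypothetical and only imitates that kind of split for illustration.

```python
from urllib.parse import urlsplit


def naive_host_port(netloc):
    """Hypothetical helper: split on the last colon, ignoring any userinfo."""
    host, _, port = netloc.rpartition(":")
    return host, port


with_port = urlsplit("http://uname:pwd@localhost:8080/aaa.csv")
no_port = urlsplit("http://uname:pwd@localhost/aaa.csv")

# A userinfo-aware parser separates credentials, host, and port.
print(with_port.username, with_port.password, with_port.hostname, with_port.port)

# A naive split keeps the credentials glued to the host name ...
print(naive_host_port(with_port.netloc))
# ... or mistakes 'pwd@localhost' for the port when no port is given.
print(naive_host_port(no_port.netloc))
```

The first naive result would explain a failed name lookup (the "host" still contains the credentials), and the second matches the shape of the "nonnumeric port" message.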
So since we can't use requests, we would follow https://stackoverflow.com/a/4188709/1889400
|
|
This alone isn't worth adding it as a dependency I think. |
That's very different from "we can't use For starters, it's builtin with Python 3.x, so we would only be adding it as a dependency for Python 2.x. However, if we could find a way to use just |
Can't really test that since I'd have to write the implementation first :) I think it'll work though.
I don't think it is... It's documented as the recommended way for making high-level http calls though (but still requires a separate install). |
Oh right, I stand corrected. 😄 |
Here are working URLs using self-signed certs. It would be good to get it to work with self-signed certs too, as that can be very useful, especially in testing. Please let me know once you are done testing, so I can shut down the demo, because that's all it is up for.
import pandas as pd
import requests
from io import StringIO
# both urls use self signed cert. Both will remain working for few days
u1 = 'https://pandasusr:pandaspwd@pandastest.mooo.com:5000/aaa.csv' # non default ssl port
u2 = 'https://pandasusr:pandaspwd@pandastest.mooo.com/aaa.csv' # default ssl port
r1 = requests.get(u1, verify=False)
print(r1.text) # prints ok
df1 = pd.read_csv(StringIO(r1.text)) # works
r2 = requests.get(u2, verify=False)
print(r2.text) # prints ok
df2 = pd.read_csv(StringIO(r2.text)) # works
# without requests
df1 = pd.read_csv(u1) # URLError: <urlopen error [Errno 11003] getaddrinfo failed>
df2 = pd.read_csv(u2) # InvalidURL: nonnumeric port: 'pandaspwd@pandastest.mooo.com' |
Is anyone going to use the live endpoints referred to above (and below) to repro and test? If not, I will shut down the server in the next 4 days. I haven't seen anyone attempt to use it over the past 11 days.
|
@skynss : Sorry about that! I imagine that outside work has caught up with a bunch of us (including myself). I'll see if I can look at it later today. |
@skynss : Can replicate the issues you were experiencing. |
Please run the source code I pasted above.. and I can replicate the issues I am experiencing. |
Did you read the comment I made above here? |
I am not sure which comment you are referring to, as the link doesn't work. But trying my best to answer:
The only way I know that worked for me is 1) get the text content, 2) load it into StringIO, 3) give the StringIO buffer to pandas to read. For step 1), I imagine any method of getting the text content would work. I used requests and that worked, and the repro code above follows those steps. I just verified that if I copy-paste the repro code, it replicates the issue. If I didn't answer your question, kindly restate it. |
@skynss : Not sure why the link doesn't work. It's just a URL to an earlier comment I made. However, I'm wondering if you can access those files by passing into A solution that doesn't use |
The following working code does not depend on
import urllib2, base64, ssl
from urlparse import urlparse
#from io import StringIO # python 3.x
from StringIO import StringIO
import pandas as pd

def split_uname_from_url(url_with_uname):
    o = urlparse( url_with_uname)
    uname = o.username
    pwd = o.password
    # create url without username and pwd
    usrch = '{}:{}@{}'.format( uname, pwd, o.hostname)
    url_no_usrpwd = url_with_uname.replace( usrch , o.hostname)
    return uname, pwd, url_no_usrpwd

def get_https_basic_auth_ignore_invalid_cert( url_with_uname, verify_ssl=True):
    uname, pwd, url_no_usrpwd = split_uname_from_url(url_with_uname)
    print('Calling [{}] -- uname:[{}] -- pwd[{}]'.format(url_no_usrpwd, uname, pwd))
    request = urllib2.Request( url_no_usrpwd )
    base64string = base64.encodestring('%s:%s' % (uname, pwd)).replace('\n', '')
    request.add_header("Authorization", "Basic %s" % base64string)
    # I hope pandas can support self signed certs too
    # because it is very difficult to get official SSL certs in testing scenarios
    if verify_ssl:
        result = urllib2.urlopen(request)
    else: # in case of self signed SSL certificates.
        result = urllib2.urlopen(request, context=ssl._create_unverified_context() )
    txt = result.read()
    return txt

url_with_uname = 'https://pandasusr:pandaspwd@pandastest.mooo.com:5000/aaa.csv'
csv_txt = get_https_basic_auth_ignore_invalid_cert( url_with_uname, verify_ssl=False)
df = pd.read_csv( StringIO(csv_txt.strip()) ) # forgot to close the StringIO buffer
|
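The workaround above is Python 2 only (`urllib2`, `urlparse`, and `base64.encodestring`, which was removed in Python 3.9). A minimal Python 3 sketch of the same idea, using only the standard library: move the credentials out of the URL and into an `Authorization` header. The function name `build_basic_auth_request` is hypothetical, and the sketch does not handle IPv6 literals or percent-encoded credentials.

```python
import base64
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request


def build_basic_auth_request(url_with_creds):
    """Return a Request for the URL with credentials moved into a header."""
    parts = urlsplit(url_with_creds)
    # Rebuild the network location without the 'user:password@' prefix.
    # Note: .hostname is lower-cased and drops IPv6 brackets (not handled here).
    netloc = parts.hostname
    if parts.port is not None:
        netloc = "{}:{}".format(netloc, parts.port)
    clean_url = urlunsplit(parts._replace(netloc=netloc))
    request = Request(clean_url)
    if parts.username is not None:
        raw = "{}:{}".format(parts.username, parts.password or "").encode("utf-8")
        token = base64.b64encode(raw).decode("ascii")
        request.add_header("Authorization", "Basic " + token)
    return request


req = build_basic_auth_request(
    "https://pandasusr:pandaspwd@pandastest.mooo.com:5000/aaa.csv"
)
print(req.full_url)
```

The resulting `req` could then be passed to `urllib.request.urlopen`, and the decoded body wrapped in `io.StringIO` for `pd.read_csv`, following the same three steps described earlier in the thread.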
The following code:
import sys
import ssl

def split_uname_from_url(url_with_uname):
    try:
        from urlparse import urlparse
    except:
        from urllib.parse import urlparse
    o = urlparse( url_with_uname)
    uname = o.username
    pwd = o.password
    # create url without username and pwd
    usrch = '{}:{}@{}'.format( uname, pwd, o.hostname)
    url_no_usrpwd = url_with_uname.replace( usrch , o.hostname)
    return uname, pwd, url_no_usrpwd

def get_https_basic_auth_ignore_invalid_cert( url_with_uname, verify_ssl=True):
    uname, pwd, url_no_usrpwd = split_uname_from_url(url_with_uname)
    print('Calling [{}] -- uname:[{}] -- pwd[{}]'.format(url_no_usrpwd, uname, pwd))
    if sys.version_info[0] < 3:
        fn= get_py2_https_basic_auth_ignore_invalid_cert
    else:
        fn = get_py3_https_basic_auth_ignore_invalid_cert
    return fn( uname, pwd, url_no_usrpwd, verify_ssl=verify_ssl)

def get_py2_https_basic_auth_ignore_invalid_cert( uname, pwd, url_no_usrpwd, verify_ssl=True):
    import urllib2, base64
    request = urllib2.Request( url_no_usrpwd )
    base64string = base64.encodestring('%s:%s' % (uname, pwd)).replace('\n', '')
    request.add_header("Authorization", "Basic %s" % base64string)
    # I hope pandas can support self signed certs too
    if verify_ssl:
        result = urllib2.urlopen(request)
    else: # in case of self signed SSL certificates.
        result = urllib2.urlopen(request, context=ssl._create_unverified_context() )
    return result.read()

def get_py3_https_basic_auth_ignore_invalid_cert( uname, pwd, url_no_usrpwd, verify_ssl=True):
    import urllib.request
    passman = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    passman.add_password(None, url_no_usrpwd, uname, pwd)
    authhandler = urllib.request.HTTPBasicAuthHandler(passman)
    if verify_ssl:
        opener = urllib.request.build_opener(authhandler)
    else:
        context = ssl.create_default_context()
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
        opener = urllib.request.build_opener(authhandler, urllib.request.HTTPSHandler(context=context))
    urllib.request.install_opener(opener)
    res = urllib.request.urlopen(url_no_usrpwd)
    return res.read().decode('utf-8')

def csv_to_df(csv_txt, **kwargs):
    '''
    @param csv_txt: text of csv rows.
    @param kwargs: to pass to pd.read_csv
    @return df
    '''
    import pandas as pd
    try:
        from StringIO import StringIO #python2.7
    except:
        from io import StringIO #python3.x.
    buf = None
    df = None
    try:
        buf = StringIO(csv_txt)
        df = pd.read_csv(buf, **kwargs)
    finally:
        if buf:
            try:
                buf.close()
            except:
                pass
    return df

url_with_uname = 'https://pandasusr:pandaspwd@pandastest.mooo.com:5000/aaa.csv'
csv_txt = get_https_basic_auth_ignore_invalid_cert( url_with_uname, verify_ssl=False)
df = csv_to_df( csv_txt.strip() )
print(df.to_string(index=False))
|
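The self-signed-certificate branch in the code above can be isolated into a small helper. This is a minimal sketch assuming Python 3 and the standard library only; `make_opener` is a hypothetical name. It returns an opener instead of calling `install_opener`, so the relaxed setting does not become the process-wide default. Turning verification off is only reasonable for test endpoints like the one in this thread.

```python
import ssl
from urllib.request import HTTPSHandler, build_opener


def make_opener(verify_ssl=True):
    """Build a urllib opener; skip certificate checks only when asked to."""
    context = ssl.create_default_context()
    if not verify_ssl:
        # check_hostname must be disabled before verify_mode is relaxed,
        # otherwise the ssl module raises ValueError.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return build_opener(HTTPSHandler(context=context)), context


opener, context = make_opener(verify_ssl=False)
print(context.verify_mode == ssl.CERT_NONE)
```

A caller would then use `opener.open(request)` in place of `urllib.request.urlopen(request)`.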
@skynss : Awesome! Thanks for doing this (I can check this later today). Now that we have something that acts as a workaround, I think the next step is seeing whether you can incorporate parts of this into the existing codebase. Want to give that a shot? |
@gfyoung I am in midst of travel. Feel free to check in incorporate, i wont get chance to look until couple of days. |
The following is updated code which makes it easier to merge into
import sys
is_py3 = sys.version_info[0] >= 3 # replace with 'compat.PY3'
## BEGIN SECTION modifications to pandas/io/common.py
import ssl
import base64
if is_py3:
    from urllib.parse import urlparse as parse_url
    from urllib.request import urlopen, build_opener, install_opener, \
        HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, HTTPSHandler
    _urlopen = urlopen
else:
    from urlparse import urlparse as parse_url
    from urllib2 import urlopen as _urlopen
    from urllib2 import Request
    from contextlib import contextmanager, closing # noqa
    from functools import wraps # noqa
    # @wraps(_urlopen)
    @contextmanager
    def urlopen(*args, **kwargs):
        with closing(_urlopen(*args, **kwargs)) as f:
            yield f

def split_uname_from_url(url_with_uname):
    o = parse_url( url_with_uname)
    usrch = '{}:{}@{}'.format( o.username, o.password, o.hostname)
    url_no_usrpwd = url_with_uname.replace( usrch , o.hostname)
    return o.username, o.password, url_no_usrpwd

def get_urlopen_args( url_with_uname, verify_ssl=True):
    uname, pwd, url_no_usrpwd = split_uname_from_url(url_with_uname)
    print('Calling [{}] -- uname:[{}] -- pwd[{}]'.format(url_no_usrpwd, uname, pwd))
    if is_py3:
        fn= get_urlopen_args_py3
    else:
        fn = get_urlopen_args_py2
    req, kwargs = fn( uname, pwd, url_no_usrpwd, verify_ssl=verify_ssl)
    return req, kwargs

def get_urlopen_args_py2( uname, pwd, url_no_usrpwd, verify_ssl=True):
    req = Request( url_no_usrpwd )
    base64string = base64.encodestring('{}:{}'.format(uname, pwd)).replace('\n', '')
    req.add_header("Authorization", "Basic {}".format( base64string) )
    # I hope pandas can support self signed certs too
    kwargs = {}
    if not verify_ssl:
        kwargs['context'] = ssl._create_unverified_context()
    return req, kwargs

def get_urlopen_args_py3( uname, pwd, url_no_usrpwd, verify_ssl=True):
    passman = HTTPPasswordMgrWithDefaultRealm()
    passman.add_password(None, url_no_usrpwd, uname, pwd)
    authhandler = HTTPBasicAuthHandler(passman)
    if verify_ssl:
        opener = build_opener(authhandler)
    else:
        context = ssl.create_default_context()
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
        opener = build_opener(authhandler, HTTPSHandler(context=context))
    install_opener(opener)
    return url_no_usrpwd, {}
## END SECTION modifications to pandas/io/common.py

def call_urlopen( url_with_uname, verify_ssl=False):
    # in get_filepath_or_buffer prior to calling _urlopen get params
    # not sure where to obtain verify_ssl from pd.read_csv
    req, kwargs = get_urlopen_args(url_with_uname, verify_ssl)
    resp = _urlopen(req , **kwargs)
    return resp_to_csv(resp)

def resp_to_csv(resp):
    csv = resp.read()
    if is_py3:
        csv = csv.decode('utf-8')
    return csv

def csv_to_df(csv_txt, **kwargs):
    '''
    @param csv_txt: text of csv rows.
    @param kwargs: to pass to pd.read_csv
    @return df
    '''
    import pandas as pd
    try:
        from StringIO import StringIO #python2.7
    except:
        from io import StringIO #python3.x.
    buf = None
    df = None
    try:
        buf = StringIO(csv_txt)
        df = pd.read_csv(buf, **kwargs)
    finally:
        if buf:
            try:
                buf.close()
            except:
                pass
    return df

url_with_uname = 'https://pandasusr:pandaspwd@pandastest.mooo.com:5000/aaa.csv'
csv_txt = call_urlopen( url_with_uname, verify_ssl=False)
df = csv_to_df( csv_txt.strip() )
print(df.to_string(index=False))
I am not quite familiar with the process involved in contributing to pandas, so please feel free to take over. |
@jreback Thx. Followed it. @gfyoung I forked the codebase and modified it to the best of my ability. Here it is: https://github.com/skynss/pandas/tree/basic-auth-https-self-signed I don't know how to check in a live test scenario, so I am going to leave that out.
# pip install --upgrade https://github.com/skynss/pandas/archive/basic-auth-https-self-signed.zip
# live working test that tests both scenarios:
# pd.read_csv('https://uname:pwd@fqdn:<port>/fname.csv', verify_ssl=False)
# pd.read_csv('https://fqdn:<port>/fname.csv', username='uname', password='pwd', verify_ssl=False)
import pandas as pd
uname='pandasusr'
pwd='pandaspwd'
url = 'https://{}pandastest.mooo.com:5000/'
verify_ssl=False

def get_df(url, uname, pwd, verify_ssl, pd_read_fn, fname):
    furl = url + fname
    kwargs = {}
    if uname:
        kwargs['username']=uname
    if pwd:
        kwargs['password']=pwd
    if verify_ssl is not None:
        kwargs['verify_ssl']=verify_ssl
    print('\n' +furl)
    df = pd_read_fn(furl, **kwargs)
    if type(df) is list: # html
        df = df[0]
    print(df.to_string(index=False))
    print(df.to_json())

fparams = [ (pd.read_csv, 'aaa.csv'),
            (pd.read_json, 'jdoc.json'),
            (pd.read_excel, 'ex_doc.xlsx'),
            (pd.read_html, 'html_file.html') ]
for pd_read_fn, fname in fparams:
    u = url.format('{}:{}@'.format(uname, pwd))
    get_df( u, None, None, verify_ssl, pd_read_fn, fname) #1 url with username/pwd as part of url
    u2 = url.format('')
    get_df( u2, uname, pwd, verify_ssl, pd_read_fn, fname) # url with username/pwd as params
|
@skynss : Thanks for doing this! A couple of points:
One place to examine: why can't you use the |
@gfyoung Thx - implemented the changes. Please view. I had already tried the 'Request' from py3 and it didn't work. But I kept the py3 code the way it is, because it seems the extensible and correct way long term. I changed auth to match the requests lib, and kept verify_ssl separate, just like requests.
import pandas as pd
uname='pandasusr'
pwd='pandaspwd'
url = 'https://{}pandastest.mooo.com:5000/'
verify_ssl=False

def get_df(url, uname, pwd, verify_ssl, pd_read_fn, fname):
    furl = url + fname
    kwargs = {}
    if uname or pwd:
        kwargs['auth']=(uname, pwd)
    if verify_ssl is not None:
        kwargs['verify_ssl']=verify_ssl
    print('\n' +furl)
    df = pd_read_fn(furl, **kwargs)
    if type(df) is list: # html
        df = df[0]
    print(df.to_string(index=False))
    print(df.to_json())

fparams = [ (pd.read_csv, 'aaa.csv'), (pd.read_json, 'jdoc.json'), (pd.read_excel, 'ex_doc.xlsx'), (pd.read_html, 'html_file.html') ]
for pd_read_fn, fname in fparams:
    u = url.format('{}:{}@'.format(uname, pwd))
    get_df( u, None, None, verify_ssl, pd_read_fn, fname)
    u2 = url.format('')
    get_df( u2, uname, pwd, verify_ssl, pd_read_fn, fname)
|
@skynss : Cool! Thanks for the
|
|
Sorry I haven't been following this closely, but
We don't need to test this against a live server. We just need to ensure that we structure the request properly, and mock the actual @skynss when you're able, could you submit a pull request? It'll make reviewing this easier. |
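To illustrate the "structure the request properly and mock the call" idea, here is a minimal sketch assuming Python 3 and `unittest.mock`. It is not taken from the pandas test suite: `fetch_text` and `run_mocked_fetch` are hypothetical names, and `example.invalid` is a placeholder host that is never contacted because `urlopen` is patched out.

```python
import io
import urllib.request
from unittest import mock


def fetch_text(url, auth_header=None):
    """Fetch a URL with the standard library and return the decoded body."""
    request = urllib.request.Request(url)
    if auth_header is not None:
        request.add_header("Authorization", auth_header)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def run_mocked_fetch():
    """Call fetch_text with urlopen patched out; return (text, sent request)."""
    fake_response = mock.MagicMock()
    # The 'with' statement in fetch_text receives this in-memory body.
    fake_response.__enter__.return_value = io.BytesIO(b"animal,bird\ncat,pigeon\n")
    with mock.patch("urllib.request.urlopen", return_value=fake_response) as fake_urlopen:
        text = fetch_text("https://example.invalid/aaa.csv", auth_header="Basic abc")
    # The first positional argument of the patched call is the Request object.
    sent_request = fake_urlopen.call_args[0][0]
    return text, sent_request


text, sent_request = run_mocked_fetch()
print(sent_request.full_url, sent_request.get_header("Authorization"))
```

A test built this way checks the URL and headers that would be sent, without needing credentials or a live endpoint.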
@TomAugspurger : Do we mock any requests in our current tests? I think we should still test that it actually works against some endpoint. That's the best way to confirm that it works and is not just a special case for @skynss IMO. We could just decorate such a test with |
@skynss : I second @TomAugspurger on this now. The code looks in a lot better shape. We'll definitely make more changes to it, but it's at a stage where I think we can review it now. A couple of things you will need to do:
|
At least one in
|
@TomAugspurger : I looked at the code. How does it actually use the |
Huh, it kind of looks like it's not actually used anymore :) |
So my question still stands: Do we mock any requests in our current tests? A quick GitHub search suggests not AFAICT. |
@TomAugspurger just created a pull request |
There is also the same request for read_json |
Closing, since #36688 has been addressed, which should resolve this issue.
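For readers on newer pandas: as far as I can tell, the change referenced here lets `storage_options` carry request headers for plain HTTP(S) URLs (pandas 1.3 and later; treat this as an assumption and check the release notes for your version). A minimal sketch under that assumption follows. The helper names are hypothetical, and pandas is imported lazily so the header helper can be checked on its own.

```python
import base64


def basic_auth_headers(username, password):
    """Build a headers dict carrying HTTP Basic credentials."""
    token = base64.b64encode("{}:{}".format(username, password).encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}


def read_csv_with_basic_auth(url, username, password, **kwargs):
    """Assumes pandas >= 1.3, where storage_options are sent as HTTP(S) headers."""
    import pandas as pd  # imported lazily so the helper above has no dependency

    headers = basic_auth_headers(username, password)
    return pd.read_csv(url, storage_options=headers, **kwargs)


headers = basic_auth_headers("pandasusr", "pandaspwd")
print(headers)
```

Note that this only covers the credentials; it does not address the self-signed certificate case discussed earlier in the thread.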
Code Sample, a copy-pastable example if possible
Problem description
HTTPS basic auth is very common. This URL format works in Excel, other text editors, etc. This URL works in the requests library. It seems like the scenario doesn't work because the underlying urlopen doesn't work; see the Stack Overflow issue "urllib basic auth".
Is the only way to overcome this to use requests + StringIO?
Expected Output
Be able to get a CSV loaded into a dataframe.
Output of
pd.show_versions()