Boilerpipe fails to extract certain urls with 406 Error #24

Closed
rshiva opened this issue Jun 26, 2014 · 5 comments

Comments


rshiva commented Jun 26, 2014

I'm trying to extract content from a website, but boilerpipe fails with this error:

  File "/local/lib/python2.7/site-packages/boilerpipe/extract/__init__.py", line 36, in __init__
    connection = urllib2.urlopen(request)
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 406, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 406: Not Acceptable

At first I thought the error was coming from urllib2, so I tried urllib2 on its own, but it worked fine.
These are a few of the URLs where it fails:
http://mlmgod.com/onlinedeals/20-off-on-max-3000-ebay-rs-0-0-other-free-deals-coupons/
http://swapmyapp.com/how-to/use-microsoft-office-for-iphone/
http://www.techeyetech.com/micromax-canvas-tube-specifications-features-price-canvas-tube-palm-theater.html

@rshiva rshiva changed the title Boilerpipe fails to extractor for certain urls with 406 Error Boilerpipe fails to extractor certain urls with 406 Error Jun 26, 2014
@rshiva rshiva changed the title Boilerpipe fails to extractor certain urls with 406 Error Boilerpipe fails to extract certain urls with 406 Error Jun 26, 2014
Author

rshiva commented Aug 6, 2014

I traced it back and found that the error comes from the headers passed to urllib2 in
/local/lib/python2.7/site-packages/boilerpipe/extract/__init__.py:

headers = {'User-Agent': 'Mozilla/5.0'}
request = urllib2.Request(kwargs['url'], headers=self.headers)

If I instead pass headers = {'User-Agent': 'Mozilla'}, it works fine.
But I can't understand why it breaks when we pass Mozilla/5.0 in the headers.
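
For anyone who wants to check this outside of boilerpipe, here is a minimal sketch of the workaround described above. It tries each User-Agent value in turn and moves on to the next one only when the server answers with HTTP 406. The helper names `build_request` and `fetch_with_fallback` and the `opener` parameter are made up for illustration and are not part of python-boilerpipe. The imports fall back to `urllib2` so the same code should run on Python 2.7 as well as Python 3.

```python
try:
    # Python 3 location of the same classes
    from urllib.request import Request, urlopen
    from urllib.error import HTTPError
except ImportError:
    # Python 2.7, as used in the traceback above
    from urllib2 import Request, urlopen, HTTPError


def build_request(url, user_agent="Mozilla"):
    """Build a request that carries only a User-Agent header (hypothetical helper)."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_with_fallback(url, user_agents=("Mozilla/5.0", "Mozilla"), opener=urlopen):
    """Return (user_agent, body) for the first User-Agent that is not answered with 406.

    `opener` defaults to urlopen; it is a parameter only so the function
    can be exercised without network access.
    """
    if not user_agents:
        raise ValueError("at least one User-Agent value is required")
    last_error = None
    for user_agent in user_agents:
        try:
            response = opener(build_request(url, user_agent))
            return user_agent, response.read()
        except HTTPError as err:
            if err.code != 406:
                raise  # any other HTTP error is not what this issue is about
            last_error = err
    # every User-Agent was rejected with 406
    raise last_error
```

Calling `fetch_with_fallback(url)` on one of the URLs listed above would show which header value, if any, the server accepts. Whether those sites still respond the same way is not something this sketch can confirm.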

Collaborator

tuxdna commented Sep 12, 2016

@rshiva This seems like an issue outside of python-boilerpipe. Can you still reproduce it?

@jonjrodriguez

I just started using python-boilerpipe on a project and am running into the same issue. Is there no update on this? Thanks.

Collaborator

tuxdna commented Apr 9, 2017

@thoughts1053 Can you provide a sample script that helps replicate this issue?

Contributor

krishnaupadhyay3 commented Jul 21, 2021

@tuxdna tuxdna closed this as completed Jul 21, 2021