Unnecessary double encoding of url #839

Closed
mircowidmer opened this issue Feb 24, 2017 · 6 comments

@mircowidmer

The URL http://test.com/& is not handled correctly as of version 1.10.2 (possibly since 1.10.1).

Jsoup.connect("http://test.com/" + URLEncoder.encode("&", "UTF-8")).get();

The URL passed to Jsoup is http://test.com/%26, because the & has already been URL-encoded. In 1.9.2 this works: the encodeUrl(String url) method in the HttpConnection class only replaces spaces, and this URL contains none, so it is left unchanged. In 1.10.2, encodeUrl() encodes the same URL again and produces http://test.com/%2526: the percent sign of the already-encoded URL is unnecessarily encoded a second time.
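To make the effect concrete, here is a minimal sketch that reproduces the double encoding with the JDK's own URLEncoder instead of jsoup. The class name DoubleEncodingDemo and its helper methods are made up for illustration, and this is not jsoup's code; it only shows what happens when an already-encoded value is encoded a second time.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class, not part of jsoup.
public class DoubleEncodingDemo {

    // Percent-encodes a value as UTF-8 (form encoding, as in the report above).
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Reverses exactly one round of percent-encoding.
    static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String once = encode("&");    // "&" becomes "%26"
        String twice = encode(once);  // the "%" itself becomes "%25", giving "%2526"

        System.out.println("http://test.com/" + once);
        System.out.println("http://test.com/" + twice);

        // A server decodes only once, so it sees the literal text "%26", not "&".
        System.out.println(decode(twice));
    }
}
```

The last line is why the request breaks: after a single decode the server is left with the three characters %26 rather than the intended &.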

A workaround is to downgrade to 1.9.2, where the encodeUrl method was implemented differently (see below):

1.10.2:

private static String encodeUrl(String url) {
    try {
        URL u = new URL(url);
        return encodeUrl(u).toExternalForm();
    } catch (Exception e) {
        return url;
    }
}

1.9.2:

private static String encodeUrl(String url) {
	if(url == null)
		return null;
	return url.replaceAll(" ", "%20");
}
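For comparison, one way to encode a URL without touching escapes that are already present is to escape only a "bare" percent sign, meaning one that is not followed by two hex digits. The following is a rough sketch under that assumption; the class EscapeAwareEncoder and the method encodePreservingEscapes are hypothetical names, and this is not what jsoup does in either version shown above, nor necessarily how the issue was fixed.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup.
public class EscapeAwareEncoder {

    // Matches a '%' that is NOT followed by two hex digits,
    // i.e. a percent sign that is not already part of an escape like %26.
    private static final Pattern BARE_PERCENT = Pattern.compile("%(?![0-9A-Fa-f]{2})");

    // Escapes spaces and bare percent signs, leaving existing %XX escapes alone.
    static String encodePreservingEscapes(String url) {
        if (url == null) {
            return null;
        }
        String percentsEscaped = BARE_PERCENT.matcher(url).replaceAll("%25");
        return percentsEscaped.replace(" ", "%20");
    }

    public static void main(String[] args) {
        // Already-encoded input passes through unchanged.
        System.out.println(encodePreservingEscapes("http://test.com/%26"));
        // Spaces are still escaped, as in the 1.9.2 implementation.
        System.out.println(encodePreservingEscapes("http://test.com/a b"));
        // Running the method twice gives the same result as running it once.
        System.out.println(encodePreservingEscapes(encodePreservingEscapes("http://test.com/a b")));
    }
}
```

The trade-off is that such a method cannot tell an intentional literal like "%26" (meant as the three characters) from an escape, so it is only suitable when callers are expected to pass URLs that may already be encoded.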
@ghost

ghost commented Mar 12, 2017

Faced the same issue!

URL url = new URL("http://en.wikipedia.org/wiki/Adolph_St%C3%B6hr");
Document notLoaded = Jsoup.parse(url, 10_000);

This fails with 'Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=400, URL=https://en.wikipedia.org/wiki/Adolph_St%25C3%25B6hr' (double encoding).

A possible workaround is to fetch the page with Apache HttpClient and pass the response to Jsoup for parsing:

HttpClient client = HttpClientBuilder.create().setRedirectStrategy(new LaxRedirectStrategy()).build();
RequestConfig conf = RequestConfig.copy(RequestConfig.DEFAULT).setConnectionRequestTimeout(10_000)
				.setMaxRedirects(10).setCookieSpec(CookieSpecs.STANDARD).build();

URL url = new URL("http://en.wikipedia.org/wiki/Adolph_St%C3%B6hr");
HttpGet request = new HttpGet(url.toExternalForm());
request.setConfig(conf);
request.addHeader("User-Agent", your_user_agent);

HttpResponse response = client.execute(request);
Document doc = Jsoup.parse(response.getEntity().getContent(), "UTF-8", url.toExternalForm());

@lsoarestd

Same here. This changed in this version and my scraper "died".

@axzhcode

Same. It worked well in 1.10.1.

@dgavrus

dgavrus commented May 23, 2017

I have the same issue.
Document document = Jsoup.connect(URL).get();
My URL has the parameter ...&AGE[]=32...; it is converted to &AGE%255b%255d=32, but it should be &AGE%5b%5d=32 (25 is the hex code of the percent sign).

@jhy
Owner

jhy commented Jun 10, 2017

Sorry about the issues here. This is fixed in 1.10.3 (upcoming). I fixed it back in 56a728d

Will close when 1.10.3 is released

@jhy added the "bug" (Confirmed bug that we should fix) and "fixed" labels and removed the "bug" label on Jun 10, 2017
@jhy added this to the 1.10.3 milestone on Jun 10, 2017
@jhy
Owner

jhy commented Jun 11, 2017

jsoup 1.10.3 is out now: https://jsoup.org/news/release-1.10.3
