Allow setting referrer in download request #12

Open
matt-gardner opened this issue Sep 18, 2014 · 1 comment

Comments

@matt-gardner

Thanks for the tool, it's pretty useful. A nice addition would be the ability to set the referrer (and perhaps other settings, like the user-agent) in the HTTP request that's sent to download a particular site. Some sites don't function correctly without a correct referrer.

I'm pretty sure this just needs an additional line here that sets the referrer. I can try to do this and submit a pull request, but I'm pretty new to Scala and I might handle things the wrong way (I haven't used implicits much, and this code uses them pretty heavily, so I'm not that confident in my ability to do this right).

@Rovak
Owner

Rovak commented Sep 19, 2014

That would indeed be a good addition, and it adds much-needed configurability. It's been a while since I wrote this code, and after reading it again I think I overused implicits a bit too much and added unneeded complexity. So a solution with implicits is not necessarily the "right" solution.

Moving the jsoup configuration to an overridable method should be enough.

class WebsiteScraper extends Scraper {

  // Jsoup.connect returns org.jsoup.Connection, so the hook takes that interface.
  def download(jsoup: org.jsoup.Connection): org.jsoup.Connection = jsoup
    .userAgent("Mozilla")
    .followRedirects(true)
    .timeout(0) // jsoup treats 0 as "no timeout"

  def downloadPage(pageUrl: String) = Future {
    new WebPage(new URL(pageUrl)) {
      doc = download(Jsoup.connect(pageUrl)).get
    }
  }
}

which can then be overridden

class CustomWebsiteScraper extends WebsiteScraper {

  override def download(jsoup: org.jsoup.Connection): org.jsoup.Connection = jsoup
    .userAgent("Mozilla")
    .followRedirects(true)
    .referrer("Referrer")
    .timeout(0)
}

and then used in a spider

new Spider {

  override implicit val scraper = new CustomWebsiteScraper

  onReceivedPage ::= { page: WebPage =>
    // Page received
  }

}.start()
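
The override above repeats the default settings. A subclass could instead call `super` and add only what differs. Here is a minimal sketch of that pattern, using a hypothetical `RequestSettings` stand-in instead of jsoup so that it runs on its own. All names in it (`RequestSettings`, `BaseScraper`, `ReferrerScraper`, `configure`, `requestSettings`) are illustrative assumptions, not part of the library:

```scala
// Hypothetical stand-in for a jsoup connection: an immutable settings holder.
// None of these names come from the library; they only illustrate the pattern.
final case class RequestSettings(
  userAgent: String = "",
  referrer: Option[String] = None,
  followRedirects: Boolean = false
) {
  def withUserAgent(ua: String): RequestSettings = copy(userAgent = ua)
  def withReferrer(ref: String): RequestSettings = copy(referrer = Some(ref))
  def withFollowRedirects(on: Boolean): RequestSettings = copy(followRedirects = on)
}

// Base class: the request configuration lives in one overridable hook.
class BaseScraper {
  def configure(settings: RequestSettings): RequestSettings =
    settings.withUserAgent("Mozilla").withFollowRedirects(true)

  // Every request is built through the hook, so subclasses only override configure.
  def requestSettings: RequestSettings = configure(RequestSettings())
}

// Subclass: keep the defaults from the base class and add a referrer on top.
class ReferrerScraper(referrer: String) extends BaseScraper {
  override def configure(settings: RequestSettings): RequestSettings =
    super.configure(settings).withReferrer(referrer)
}
```

With the real scraper, the equivalent would presumably be `super.download(jsoup).referrer(...)` inside the overridden `download`, so a change to the defaults in `WebsiteScraper` carries over to every subclass.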

This is just a suggestion, and I would love to hear your ideas.
