Allow setting referrer in a page download #13

Closed

matt-gardner

This is what I did for #12. It's not that elegant, but it worked. I would say that expanding the API that's input to the download method makes more sense to me than overriding that method in subclasses of Scraper - what if I need a different referrer for each page? The referrer is naturally a property of the page that's being downloaded, not the scraper. So to handle more general settings, expanding the QueryBuilder seems like the right way to go. Maybe creating a new user-facing class like PageSpec, or something, would be appropriate.
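
To make the `PageSpec` idea a bit more concrete, here is a minimal sketch of what such a class could look like. Everything in it is an assumption for illustration: `PageSpec`, its fields, `requestHeaders`, and `describeRequest` are hypothetical names and not part of the library's current API. The only fixed detail is that the HTTP header itself is spelled `Referer`.

```scala
// Hypothetical sketch: none of these names exist in the library today.
final case class PageSpec(
  url: String,
  referrer: Option[String] = None,              // per-page referrer, if any
  headers: Map[String, String] = Map.empty      // any other per-page headers
) {
  // Headers to send for this page. The HTTP header name is spelled "Referer".
  def requestHeaders: Map[String, String] =
    referrer.fold(headers)(r => headers + ("Referer" -> r))
}

object PageSpecExample {
  // Stand-in for a download method that takes a PageSpec instead of a bare URL.
  // It only renders the request as text, so it runs without network access.
  def describeRequest(spec: PageSpec): String = {
    val headerLines = spec.requestHeaders.toSeq.sorted.map { case (k, v) => s"$k: $v" }
    (s"GET ${spec.url}" +: headerLines).mkString("\n")
  }
}
```

With something like this, the referrer travels with the page being downloaded rather than with the scraper, so two pages fetched by the same scraper can carry different referrers.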

@Rovak
Owner

Rovak commented Sep 19, 2014

Thanks for the PR. Being able to set the referrer on a per-page basis is indeed a better solution. I'm not sure how this would work with Spiders, though: a spider only knows the start URLs in advance and simply visits every link that it finds.

My suggestion is to add another hook that is called before a page is downloaded. In this hook the user can configure the settings that are used to scrape the page.

new Spider {
  startUrls ::= "http://events.stanford.edu/"
  allowedDomains ::= "events.stanford.edu"

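  beforeDownloadPage ::= { page: WebPage =>
    // Set page/referrer before page download.
    // The more specific pattern comes first: a URL containing "something_else"
    // also contains "something", so the reverse order would never reach it.
    page.url match {
      case url if url.contains("something_else") => // Set referrer settings 2
      case url if url.contains("something") => // Set referrer settings 1
      case _ => // default page settings
    }
  }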

  onReceivedPage ::= { page: WebPage =>

  }
}.start()

@matt-gardner
Author

Yeah, I didn't even think about the spider, as my particular application doesn't really need it. What you suggest sounds like a good idea. In the end, you would still use the same basic technique for passing additional settings to the page downloader (i.e., either a simple tuple that has both the page and the referrer, or a more complicated object if there are more interesting settings you need), and giving the caller a hook to customize it for each page would be great.
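
As a rough illustration of how a per-page hook and a settings object could fit together, here is a small self-contained sketch. `DownloadSettings`, `Hook`, `byUrl`, and `plan` are hypothetical names, the `/other` referrer URL is a placeholder, and the spider loop is replaced by a plain function so the example runs without the library or network access.

```scala
// Hypothetical sketch: these names are illustrative, not the library's API.
final case class DownloadSettings(referrer: Option[String] = None)

object HookExample {
  // The hook maps a URL that is about to be downloaded to its settings.
  type Hook = String => DownloadSettings

  // Example hook. The more specific pattern is checked first, because a URL
  // containing "something_else" also contains "something".
  val byUrl: Hook = {
    case url if url.contains("something_else") =>
      DownloadSettings(Some("http://events.stanford.edu/other")) // placeholder referrer
    case url if url.contains("something") =>
      DownloadSettings(Some("http://events.stanford.edu/"))
    case _ =>
      DownloadSettings() // default: no referrer
  }

  // Stand-in for the spider loop: pair each discovered URL with the
  // settings the hook chooses for it, before any download happens.
  def plan(urls: Seq[String], hook: Hook): Seq[(String, DownloadSettings)] =
    urls.map(url => url -> hook(url))
}
```

The spider would not need to know the referrer rules itself; it would only call the hook for each link it finds and hand the resulting settings to the downloader.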

I got this working well enough for me with the simple modification that I made. I guess you could treat these changes as inspiration or a starting point for addressing this issue, whenever you feel like you want to add this feature. As for me, I'm satisfied for the time being with what I have now, so I'm going to close the pull request (unless you really want to merge these exact changes). Thanks a lot for providing the library!
