Allow per-domain wildcard link filters #121

aecio · 2017-09-06T03:24:41Z

Sometimes it is desirable to crawl only URLs that match specific patterns within a domain, with a different set of patterns for each domain.
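
To make the request concrete, here is a minimal sketch of what per-domain wildcard filtering could look like. It is only an illustration under stated assumptions, not ACHE's implementation or configuration format: the names `wildcard_match`, `FILTERS`, and `accept`, the example domains, and the rule that domains without patterns are left unfiltered are all hypothetical. The matcher uses the common two-pointer backtracking technique, which needs only a few integer indices of extra memory.

```python
from urllib.parse import urlparse


def wildcard_match(pattern: str, text: str) -> bool:
    """Return True if text matches pattern; '*' matches any run of characters."""
    p = t = 0
    star = -1  # index of the most recent '*' seen in pattern, or -1
    mark = 0   # position in text where that '*' started matching
    while t < len(text):
        if p < len(pattern) and pattern[p] == "*":
            # Remember the star and first try matching it against nothing.
            star, mark = p, t
            p += 1
        elif p < len(pattern) and pattern[p] == text[t]:
            p += 1
            t += 1
        elif star != -1:
            # Mismatch: let the last '*' absorb one more character and retry.
            mark += 1
            p, t = star + 1, mark
        else:
            return False
    # Text is consumed; any remaining pattern characters must all be '*'.
    while p < len(pattern) and pattern[p] == "*":
        p += 1
    return p == len(pattern)


# Hypothetical per-domain rules: domain -> list of allowed URL patterns.
FILTERS = {
    "example.com": ["http://example.com/docs/*", "*/blog/*.html"],
    "example.org": ["https://example.org/*/articles/*"],
}


def accept(url: str, filters=FILTERS) -> bool:
    """Accept a URL if it matches any pattern registered for its domain."""
    patterns = filters.get(urlparse(url).hostname)
    if patterns is None:
        return True  # assumption: domains without rules are not filtered
    return any(wildcard_match(pattern, url) for pattern in patterns)
```

With these rules, `accept("http://example.com/docs/intro")` passes while `accept("http://example.com/about")` is rejected, and URLs from domains that have no entry in `FILTERS` are accepted unchanged.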

binh-vu · 2017-09-09T19:31:24Z

It would also be very helpful to have documentation for LinkClassifier.

aecio · 2017-09-09T22:45:26Z

Hi @binh-vu, there is some documentation on how to run a focused crawl using a link classifier and online learning at: http://ache.readthedocs.io/en/latest/tutorial-focused-crawl.html
We also plan to add more detailed documentation, as soon as time allows, at: http://ache.readthedocs.io/en/latest/crawling-strategies.html#link-classifiers
If you have more specific questions, please open a separate issue for them.

aecio · 2017-09-12T18:37:46Z

Documentation for per-domain link filters is available at: http://ache.readthedocs.io/en/latest/link-filters.html

aecio added the new-feature label Sep 6, 2017

aecio added this to the 0.9 milestone Sep 6, 2017

aecio added a commit that referenced this issue Sep 6, 2017

Allow per-domain wildcard link filters (issue #121)

9817adf

aecio self-assigned this Sep 8, 2017

aecio added a commit that referenced this issue Sep 11, 2017

Constant memory algorithm for wildcard pattern matching (issue #121)

f9e9fb1

aecio closed this as completed in c0083cc Sep 12, 2017
