Allow per-domain wildcard link filters #121

Closed
aecio opened this issue Sep 6, 2017 · 3 comments


aecio commented Sep 6, 2017

Sometimes we want to crawl only the URLs that match specific patterns within a domain, and each site needs its own set of patterns for its domain.
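
To make the request concrete, here is a minimal Python sketch of what a per-domain wildcard filter could look like. It is only an illustration of the idea, not ACHE's implementation or configuration format: the `DOMAIN_PATTERNS` table, the `accept` function, and the example hostnames are all hypothetical, and the choice to let through URLs from domains that have no rules is an assumption.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical per-domain rules: hostname -> wildcard patterns.
# A URL on a listed domain is kept only if it matches one of that domain's patterns.
DOMAIN_PATTERNS = {
    "www.example.com": ["http://www.example.com/blog/*"],
    "news.example.org": ["http://news.example.org/*/article-*"],
}


def accept(url, rules=DOMAIN_PATTERNS):
    """Return True if the URL passes the wildcard rules of its own domain."""
    host = urlsplit(url).hostname
    patterns = rules.get(host)
    if patterns is None:
        # Assumption: domains without any rules are not filtered at all.
        return True
    # fnmatchcase treats "*" as "any run of characters", including "/".
    return any(fnmatchcase(url, pattern) for pattern in patterns)
```

With these rules, `accept("http://www.example.com/blog/post-1")` is true while `accept("http://www.example.com/about")` is false, and each domain is checked only against its own patterns. Note that `fnmatchcase` also gives `?` and `[...]` a special meaning, so a real filter would probably need its own wildcard matcher.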

@aecio aecio added this to the 0.9 milestone Sep 6, 2017
@aecio aecio self-assigned this Sep 8, 2017

binh-vu commented Sep 9, 2017

It would also be very helpful to have documentation for LinkClassifier.


aecio commented Sep 9, 2017

Hi @binh-vu, there is some documentation on how to run a focused crawl using a link classifier and online learning at: http://ache.readthedocs.io/en/latest/tutorial-focused-crawl.html
We also plan to add more detailed documentation as soon as time allows at: http://ache.readthedocs.io/en/latest/crawling-strategies.html#link-classifiers
If you have more specific questions, we kindly ask you to open a separate issue for them.


aecio commented Sep 12, 2017

Documentation for per-domain link filters is available at: http://ache.readthedocs.io/en/latest/link-filters.html
