A dead simple crawler to generate a site map of your website
The code has been compiled with Java 11, so it may not compile with earlier versions of the JDK.
This project uses Gradle as its build tool, so building it is as simple as checking out the code and running:

```
gradle build
```

Gradle builds the application JAR in `./build/libs`.
The app JAR can then be run directly with:

```
java -jar crawler-0.0.1-SNAPSHOT.jar https://your-website.com
```

where `https://your-website.com` is the website you want to get a site map for.
The app produces a JSON array of results for every link found on your website, starting from the URL provided at startup. For every page visited, the app prints information about:
- links to pages in the same domain
- static assets contained in the current page (images, CSS and JS links)
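The exact shape of each entry is defined by the aggregator in use; as a rough illustration only (the field names below are hypothetical, not taken from the project), a result entry might look like:

```json
[
  {
    "url": "https://your-website.com/about",
    "internalLinks": ["https://your-website.com/", "https://your-website.com/contact"],
    "staticAssets": ["https://your-website.com/css/main.css", "https://your-website.com/img/logo.png"]
  }
]
```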
The result is printed to the console once all pages have been processed. If you intend to run this in production, we strongly recommend swapping `ConsoleResultAggregatorService` for `FileResultAggregatorService`. You can do so by removing the `@Primary` annotation from the former and adding it to the latter; this tells Spring which implementation of `ResultAggregatorService` takes priority.
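The swap amounts to moving the annotation between the two classes, roughly like this (the class bodies and file layout are illustrative, not copied from the project):

```java
// ConsoleResultAggregatorService.java — remove @Primary from this class
@Service
public class ConsoleResultAggregatorService implements ResultAggregatorService { /* ... */ }

// FileResultAggregatorService.java — and add it here, so Spring injects this
// implementation wherever a ResultAggregatorService is required
@Primary
@Service
public class FileResultAggregatorService implements ResultAggregatorService { /* ... */ }
```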
By default the app fetches at most 100 links; it will stop crawling once that limit is reached. If you expect your target website to have more than 100 links in total, raise the limit by setting the environment variable `maxPagesToSearch=n`, where `n` is an integer greater than zero. The value can also be passed as a JVM argument with `-DmaxPagesToSearch=n`.
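For reference, resolving a setting from either source typically looks like the following minimal, self-contained sketch (the precedence of the JVM argument over the environment variable, and the fallback to 100, are assumptions here, not lifted from the project's code):

```java
public class CrawlerSettings {
    static final int DEFAULT_MAX_PAGES = 100;

    // Assumed precedence: the -D JVM argument (system property) wins over the
    // environment variable; if neither is set, fall back to the default of 100.
    static int maxPagesToSearch() {
        String value = System.getProperty("maxPagesToSearch", System.getenv("maxPagesToSearch"));
        if (value == null) {
            return DEFAULT_MAX_PAGES;
        }
        int n = Integer.parseInt(value);
        if (n <= 0) {
            throw new IllegalArgumentException("maxPagesToSearch must be greater than zero");
        }
        return n;
    }
}
```

For example, `java -DmaxPagesToSearch=500 -jar crawler-0.0.1-SNAPSHOT.jar https://your-website.com` would raise the limit to 500 for a single run.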