Site Hound

Site Hound (previously THH) is a Domain Discovery Tool that extends the capabilities of commercial search engines using automation and human-in-the-loop (HITL) machine learning, allowing the user to efficiently expand the set of relevant web pages within their domains or topics of interest.
Site Hound is the UI to a more complex set of tools described below. Site Hound was developed under the Memex Program by HyperionGray LLC in partnership with Scrapinghub, Ltd. (2015/2017)

Main Features

Role Based Access Control (RBAC).
Multiple workspaces for keeping things tidy.
Input of keywords, to be included in or excluded from the search.
Input of seed URLs, an initial list of websites that you already know are on-topic.
Expands the list of sites by querying multiple commercial search engines with the keywords.
Displays screenshots (powered by Splash), title, text, HTML, and relevant terms in the text.
Allows the user to iteratively train a topic model based on these results by labeling them (Relevant/Irrelevant/Neutral), as well as re-scoring the associated keywords.
Allows an unbounded training module based on user-defined categories.
Language detection (powered by Apache Tika) and page-type classification.
Allows the user to view the trained topic model through a human-interpretable explanation of the model, powered by our machine learning explanation toolkit ELI5.
Performs a broad crawl of thousands of sites, using machine learning provided by DeepDeep-crawler to filter the ones matching the defined domain.
Displays the results in an interface similar to Pinterest for easy scrolling of the findings.
Provides summarized data about the broad crawl and exporting of the broad-crawl results in CSV format.
Provides real time information about the progress of the crawlers.
Allows searching the dark web via integration with an onion index.

Infrastructure Components

When the app starts up, it will first try to connect to all of these components:

Mongo (>3.0.*) stores the data about users and workspaces, and metadata about the crawls.
Elasticsearch (2.0) stores the results of the crawls (screenshots, HTML, extracted text).
Kafka (10.1.*) handles the communication between the backend components regarding the crawls.

Custom Docker versions of these components, with the extra arguments needed to set up the stack correctly, are provided in the Containers section below.
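Since the app tries to connect to all three infrastructure components at startup, it can help to confirm they are reachable before launching it. The following is a minimal sketch, not part of Site Hound itself: the hostnames and ports are assumptions (each service's usual default port on localhost), and `is_reachable` and `check_endpoints` are hypothetical helper names. It only tests that a TCP connection can be opened, not that the service is healthy.

```python
import socket

# Assumed endpoints: the usual default port of each service on localhost.
# Adjust these to match your own deployment.
DEFAULT_ENDPOINTS = {
    "mongo": ("localhost", 27017),
    "elasticsearch": ("localhost", 9200),
    "kafka": ("localhost", 9092),
}


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False


def check_endpoints(endpoints, timeout=2.0):
    """Map each component name to whether its endpoint accepts connections."""
    return {
        name: is_reachable(host, port, timeout)
        for name, (host, port) in endpoints.items()
    }
```

Calling `check_endpoints(DEFAULT_ENDPOINTS)` returns a dictionary keyed by component name, so any `False` entry points at the service to start or reconfigure first.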

Service Components:

These components offer a suite of capabilities to Site Hound. Only the first three components are required.

Sitehound-Frontend: The user interface web application that handles auth, metadata and the labeled data.
Sitehound-Backend: Performs queries on the search engines, follows the relevant links, and orchestrates the screenshots, text extraction, language identification, page classification, and naive scoring using the cosine difference of TF*IDF, then stores the result sets.
Splash: Used for screenshot and HTML capturing.
HH-DeepDeep: Allows the user to train a page model to perform on-topic crawls.
[ExcavaTor]: Our own Tor index. This is currently a private database. Ask us about it!
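To illustrate the kind of naive TF*IDF scoring mentioned for Sitehound-Backend, here is a small standard-library sketch. It is not the backend's actual code: the function names, the tokenizer, and the smoothed IDF formula are all assumptions made for illustration. It computes cosine similarity between the keywords and each page; a cosine distance, if that is what is wanted, would be one minus this value.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF*IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption) keeps every weight positive.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_pages(keywords, pages):
    """Score each page's text against the keywords treated as one query."""
    vectors = tfidf_vectors([" ".join(keywords)] + list(pages))
    query, page_vectors = vectors[0], vectors[1:]
    return [cosine_similarity(query, v) for v in page_vectors]
```

For example, `score_pages(["solar", "panel"], ["solar panel installation guide", "chocolate cake recipe"])` ranks the first page above the second, because only the first shares terms with the keywords.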

Here is the components diagram for reference

Install:

Check the installation guide

How to use it:

Check the walkthrough guide

