webdanica

System for finding Danish webpages outside the .dk domain

The system consists of a ROOT Tomcat application (the webapp) running on port 8080, with two embedded workflows: a filtering workflow, used to reject undesirable seeds, and a harvesting workflow, which makes small single-seed harvests in a local NetarchiveSuite system. The results of these harvests are pushed to the third major component, an automatic analysis workflow. This workflow takes the harvest logs that the harvesting workflow writes to a common directory writable by both, makes parsedText out of the warc.gz files from Heritrix 3, runs criteria analysis on this text, and finally ingests the results into the database.

The database backend is HBase (currently 1.1.5), accessed through Apache Phoenix. Using Apache Phoenix requires phoenix-4.7.0-HBase-1.1-client.jar, which is part of the distribution phoenix-4.7.0-HBase-1.1-bin.tar.gz, downloadable from https://archive.apache.org/dist/phoenix/phoenix-4.7.0-HBase-1.1/bin/ .
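As a minimal sketch of fetching the Phoenix distribution named above: the URL and file name come from this README, while the `fetch_phoenix` helper, the variable names, and the use of `curl` are assumptions made for illustration. The download step is left commented out so the snippet only prints the URL by default.

```shell
#!/bin/sh
# Version string and tarball name as given in this README.
PHOENIX_VERSION="4.7.0-HBase-1.1"
PHOENIX_TARBALL="phoenix-${PHOENIX_VERSION}-bin.tar.gz"
PHOENIX_URL="https://archive.apache.org/dist/phoenix/phoenix-${PHOENIX_VERSION}/bin/${PHOENIX_TARBALL}"

# Hypothetical helper (not part of webdanica): download the tarball
# unless it is already present, then unpack it in the current directory.
fetch_phoenix() {
  [ -f "$PHOENIX_TARBALL" ] || curl -fLO "$PHOENIX_URL" || return 1
  tar -xzf "$PHOENIX_TARBALL"
}

echo "Phoenix tarball URL: $PHOENIX_URL"
# Uncomment to download and unpack:
# fetch_phoenix
```

After unpacking, locate phoenix-4.7.0-HBase-1.1-client.jar and psql.py inside the extracted directory; their exact locations are not stated in this README.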

Installation of HBase is not yet properly documented.

The webdanica tables are installed using the psql.py script and the create scripts found in webdanica-core/src/main/resources/scripts/hbase-phoenix. Use the latest scripts, as the scripts in the 1.X branch could be out of date.

There is one create script for each of the required HBase tables:

create_blacklists.sql
create_criteria_results.sql
create_domains.sql
create_harvests.sql
create_ingestlog.sql
create_seeds.sql

Sample command to create the blacklists table with the connection string kb-test-hadoop-01.kb.dk:2181:/hbase:

psql.py kb-test-hadoop-01.kb.dk:2181:/hbase create_blacklists.sql
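The sample command can be repeated for all six create scripts. The following is a sketch only: the `create_tables` function and the `ZK_CONNECT`, `SCRIPT_DIR`, and `RUN` variables are hypothetical names introduced here, and it assumes psql.py is on the PATH and that the script is run from the repository root. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Connection string from the sample command; override for your cluster.
ZK_CONNECT="${ZK_CONNECT:-kb-test-hadoop-01.kb.dk:2181:/hbase}"
# Location of the create scripts, relative to the repository root.
SCRIPT_DIR="${SCRIPT_DIR:-webdanica-core/src/main/resources/scripts/hbase-phoenix}"
# Set RUN=1 to execute psql.py; any other value only prints the commands.
RUN="${RUN:-0}"

# Hypothetical helper: one psql.py call per required table.
create_tables() {
  for table in blacklists criteria_results domains harvests ingestlog seeds; do
    if [ "$RUN" = "1" ]; then
      # Stop at the first failing script.
      psql.py "$ZK_CONNECT" "$SCRIPT_DIR/create_$table.sql" || return 1
    else
      echo "psql.py $ZK_CONNECT $SCRIPT_DIR/create_$table.sql"
    fi
  done
}

create_tables
```

Printing the commands first makes it easy to check the connection string and script paths before anything is created in HBase.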

Building the war-file

Installation and configuration of the webapp

Installation and configuration of the automatic workflow

Installation and configuration of the webdanica Netarchivesuite

Tools manual
