THIS CODE SUCKS. I'll work on it in the coming weeks.
Turn any generic web address into semantic RDF content, where possible, by using:
- site-specific rules (if they exist)
- software-specific rules (if particular software, e.g. WordPress or phpBB, is detected)
- heuristics otherwise
- a basic learning algorithm to improve the quality of the heuristics (a rough sketch of this cascade follows the list)
- These rules and heuristics can get pretty advanced...
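Roughly, the cascade above could look like the Python sketch below. Everything in it is illustrative rather than real code from this repo: the SITE_RULES/SOFTWARE_RULES registries, detect_software(), and the heuristics dict are made-up names, and the "learning" is nothing more than per-heuristic success counts.

```python
from urllib.parse import urlparse

# Hypothetical registries (nothing like this exists in the code yet):
# hostname -> extractor function, software name -> extractor function.
SITE_RULES = {}       # e.g. {"example.org": extract_example_org}
SOFTWARE_RULES = {}   # e.g. {"wordpress": extract_wordpress}

# The "basic learning algorithm": per-heuristic success counts, so heuristics
# that produce non-empty RDF more often get tried first next time.
heuristic_scores = {}  # name -> (successes, attempts)

def score(name):
    ok, tries = heuristic_scores.get(name, (0, 0))
    return (ok + 1) / (tries + 2)  # Laplace-smoothed success rate

def detect_software(html):
    # Placeholder; a rough real version is sketched after the report list below.
    return None

def extract(url, html, heuristics):
    """Try site-specific rules, then software-specific rules, then heuristics."""
    host = urlparse(url).hostname or ""

    rule = SITE_RULES.get(host)
    if rule:
        return rule(url, html)

    rule = SOFTWARE_RULES.get(detect_software(html))
    if rule:
        return rule(url, html)

    # Generic heuristics, best-scoring first; update the counts as we go.
    for name, heuristic in sorted(heuristics.items(), key=lambda kv: -score(kv[0])):
        ok, tries = heuristic_scores.get(name, (0, 0))
        triples = heuristic(url, html)
        heuristic_scores[name] = (ok + bool(triples), tries + 1)
        if triples:
            return triples
    return []
```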
Also report the following (sketches of the report structure and a couple of the detection heuristics come after this list):
- Type of website
- Type of software (e.g. WordPress, phpBB, etc.)
- If it appears the page is static rather than dynamic
- License of content (copyright, CC, public domain, etc.)
- If the website is friendly to this method
- Whether the website attempts to block this method (via HTTP User-Agent checks, etc.)
- If the website uses ads or user tracking (and list them)
- If the website exposes direct semantic endpoints that can be used instead of scraping
- An estimate of how accurate the parsing was, in [0.0, 1.0]
- Basic sitemap
- Locations of likely "feed" URLs (which can be used like RSS/Atom)
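The report could be a plain structure that mirrors the list above; all of the field names below are illustrative only, not an existing API:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PageReport:
    """One report per parsed URL; field names are made up to mirror the list above."""
    site_type: Optional[str] = None         # "blog", "forum", "news", ...
    software: Optional[str] = None          # "wordpress", "phpbb", ...
    static_page: Optional[bool] = None      # True if the page looks static rather than dynamic
    license: Optional[str] = None           # "copyright", "CC-BY-SA", "public domain", ...
    friendly: Optional[bool] = None         # site seems friendly to this method
    blocks_scraping: Optional[bool] = None  # e.g. rejects unknown User-Agent headers
    ads_and_trackers: list = field(default_factory=list)    # ad/tracking hosts seen
    semantic_endpoints: list = field(default_factory=list)  # direct endpoints, if any
    accuracy: float = 0.0                   # estimated parsing accuracy in [0.0, 1.0]
    sitemap: list = field(default_factory=list)             # basic sitemap (list of URLs)
    feed_urls: list = field(default_factory=list)           # likely RSS/Atom-style feeds
```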
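As a taste of the heuristics involved, software detection and feed discovery might look like the following, using only the standard library. The markers checked here (the generator meta tag, /wp-content/ paths, <link rel="alternate"> feed links) are common web conventions, not rules this project has committed to:

```python
from html.parser import HTMLParser

class PageProbe(HTMLParser):
    """Collects the generator meta tag and alternate feed links from raw HTML."""
    def __init__(self):
        super().__init__()
        self.generator = None
        self.feed_urls = []

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "meta" and a.get("name", "").lower() == "generator":
            self.generator = a.get("content", "")
        if tag == "link" and "alternate" in a.get("rel", "").lower():
            if "rss" in a.get("type", "") or "atom" in a.get("type", ""):
                self.feed_urls.append(a.get("href", ""))

def detect_software(html):
    """Very rough software guess: generator tag first, then path fingerprints."""
    probe = PageProbe()
    probe.feed(html)
    gen = (probe.generator or "").lower()
    if "wordpress" in gen or "/wp-content/" in html:
        return "wordpress"
    if "phpbb" in gen or "phpbb" in html.lower():
        return "phpbb"
    return None
```

detect_software() and probe.feed_urls would then fill the software and feed_urls fields of the report sketched above.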
All ads will be filtered out of the returned RDF.
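That filtering could be as blunt as dropping any triple that mentions a known ad or tracking host. This sketch assumes rdflib holds the returned graph (not something stated above), and the blocklist is a stand-in for a real, maintained list:

```python
from rdflib import Graph, URIRef

# Placeholder blocklist; a real one would come from a maintained ad/tracker list.
AD_HOSTS = ("doubleclick.net", "googlesyndication.com", "adservice.")

def strip_ads(graph: Graph) -> Graph:
    """Remove every triple whose subject or object points at a known ad host."""
    def is_ad(node):
        return isinstance(node, URIRef) and any(h in str(node) for h in AD_HOSTS)

    doomed = [(s, p, o) for s, p, o in graph if is_ad(s) or is_ad(o)]
    for triple in doomed:
        graph.remove(triple)
    return graph
```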
This project is a work in progress, and it is designed to support my other distributed web project: Sylph.
The project is AGPL-3.0 licensed (it may be relicensed on request). Site-specific rules and templates are licensed under BSD/MIT and GFDL/CC-BY-SA.