Data Sources
Since this is a knowledge-based application, we need serious sources of knowledge to power it, so there is a lot of data to download. We try to make it easy to set up: just two steps. But depending on your connection and machine, these two steps could take quite some time.
We distribute our data on Dropbox, but occasionally we get throttled; report an issue (on the right) if there are problems. Download the data/ archive and extract it into the already existing data/ directory, overwriting if necessary.
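If you want to script the extraction step, a minimal Python sketch like the one below should work. We're assuming here that the download is a zip archive named data.zip; adjust the name and format to whatever the Dropbox link actually serves.

```python
# Minimal sketch: extract the downloaded archive into the existing data/ directory.
# The archive name "data.zip" and the zip format are assumptions.
import zipfile

with zipfile.ZipFile("data.zip") as archive:
    # extractall() writes into data/ and overwrites files that are already there,
    # matching the manual step described above.
    archive.extractall("data/")
```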
Depending on the scale of the installation, you may prefer to use the Postgres database backup, but we recommend using the SQLite database-in-a-file until you are sure you have outgrown it. (Unpack it first though; it's compressed to save bandwidth.)
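Once unpacked, the SQLite file is an ordinary database that any SQLite client can open. As a quick sanity check, a sketch like this lists the tables in the dump; the path data/watsonsim.db is an assumption, so use whatever filename you actually unpacked.

```python
# Minimal sketch: open the unpacked SQLite dump and list its tables.
# The filename "data/watsonsim.db" is an assumption.
import sqlite3

conn = sqlite3.connect("data/watsonsim.db")
# sqlite_master is SQLite's built-in catalog of schema objects.
for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
    print(name)
conn.close()
```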
Our data comes from several sources, including:
- Wikipedia, Wiktionary and Wikiquotes XML dumps, from which we gather and use:
- Articles, both as contiguous text and split into paragraphs
- Links, including source, target, label, and count (see the query sketch after this list)
- Redirects, including source and target
- Wikipedia pageview statistics, used as a scorer
- Full texts of Shakespeare
- Bing Search (several thousand queries are cached in the database to reduce traffic and to enable reproducibility)
- The DBPedia ontologies, as well as English labels, instance types, and short abstracts.
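As a rough illustration of how the link data can be used, here is a hypothetical query against the SQLite dump. The table name wiki_links and the exact column names are assumptions for illustration only; inspect the real schema (for example with .schema in the sqlite3 shell) before relying on them.

```python
# Hypothetical example: look up the most frequent outgoing links for a source page.
# Table name "wiki_links", the column names, and the database path are assumptions.
import sqlite3

conn = sqlite3.connect("data/watsonsim.db")
rows = conn.execute(
    'SELECT target, label, "count" FROM wiki_links '
    'WHERE source = ? ORDER BY "count" DESC LIMIT 10',
    ("Albert Einstein",),
)
for target, label, count in rows:
    print(target, label, count)
conn.close()
```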
We have also pre-indexed these in the archive:
- Using Indri search on all the articles and paragraphs
- Using Lucene on articles, paragraphs, and DBPedia labels
- Using Jena on everything else we downloaded from DBPedia.
- Using PostgreSQL and SQLite for the relational data; both dumps contain many tables indexed in several ways.
We're still deciding on a better way to synchronize the data this project needs, and we think BitTorrent Sync could replace Dropbox and avoid the bandwidth problems. So you should eventually be able to load the external data archive using a BitTorrent Sync link.