Home

Version 2

To address the challenge of larger files, such as recent Freebase dumps, Infovore is being rebuilt on the Hadoop framework. Progress is rapid; join our Google Group to follow it.

You will probably control Infovore 2 with the Haruhi command line application. Once installed, Haruhi can launch jobs on a locally available Hadoop cluster (i.e. the "hadoop" command is in the $PATH), or it can provision a cluster in Amazon EMR at prices starting from 7.5 cents an hour.
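The choice between a local cluster and Amazon EMR can be sketched roughly as follows. This is not Haruhi's code: the function names are hypothetical, the PATH check only mirrors the condition described above, and the cost estimate assumes the quoted starting rate applies per instance per hour, which the text does not state.

```python
import shutil

# "7.5 cents an hour" is the starting price quoted above; actual rates vary.
EMR_STARTING_RATE_USD = 0.075


def choose_cluster(hadoop_command="hadoop"):
    """Return "local" if the given command is found on $PATH, else "emr".

    Hypothetical helper: it only illustrates the PATH check described above.
    """
    return "local" if shutil.which(hadoop_command) else "emr"


def estimate_emr_cost(hours, instances=1, rate_usd=EMR_STARTING_RATE_USD):
    """Rough estimate: hours x instances x hourly rate (assumed per instance)."""
    return hours * instances * rate_usd


if __name__ == "__main__":
    print("cluster:", choose_cluster())
    print("estimated cost: $%.2f" % estimate_emr_cost(hours=3, instances=4))
```

Under these assumptions, four instances running for three hours at the starting rate come to about 90 cents.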

Most of Infovore 2 is packaged in the Bakemono super jar -- a jar that contains multiple Hadoop applications. Infovore also contains a second super jar, Haruhi, that can deploy a Bakemono configuration to any Hadoop-compatible platform, for ease of manual use and automation.

Bakemono contains a number of applications (see the list of applications), such as freebaseRDFPrefilter, pse3, sieve3, and ranSample.
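The idea of one package holding several named applications can be illustrated with a small dispatch table. This is only a conceptual sketch and does not reflect how Bakemono is implemented: the application names come from the paragraph above, but the placeholder functions, the `APPLICATIONS` table, and `dispatch` are assumptions made for illustration.

```python
def _placeholder(name):
    """Build a stand-in application that only reports how it was called."""
    def run(args):
        return "%s called with %s" % (name, list(args))
    return run


# Names taken from the text; the behavior behind each is a placeholder,
# not the real application.
APPLICATIONS = {
    name: _placeholder(name)
    for name in ("freebaseRDFPrefilter", "pse3", "sieve3", "ranSample")
}


def dispatch(argv):
    """Run the application named by argv[0], passing it the remaining arguments."""
    if not argv:
        raise ValueError("expected: <application> [args...]")
    name, args = argv[0], argv[1:]
    if name not in APPLICATIONS:
        known = ", ".join(sorted(APPLICATIONS))
        raise KeyError("unknown application %r; known: %s" % (name, known))
    return APPLICATIONS[name](args)


if __name__ == "__main__":
    print(dispatch(["sieve3", "input-path", "output-path"]))
```

The benefit of this arrangement is that a single artifact is uploaded once and any of its applications can then be selected by name.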

Work currently in progress includes:
perfecting the automation of a process that deploys :BaseKB Lime in the Amazon cloud weekly
developing tools for rapidly exploring large RDF files with Pig (the Chopper project)
developing Bakemono apps for further processing of data sets

To get started quickly, see Hadoop for the Impatient, how to run your own Jar, and how to use Persistent AWS Clusters. Paul gave a talk about the application of Infovore to Freebase at SemtechBiz 2013 NY -- see the slides.

See Academic Papers About BaseKB and Projects that use :BaseKB

Infovore 2 Documentation

Editions of :BaseKB
  :BaseKB-Lime
  :BaseKB-Lite
  :BaseKB-Pro
Developer's Notes
  Building Infovore
  Choosing the number of reducers
  Cluster Types Supported By Haruhi
  Coding Conventions
  Design of a data processing path
  External Documentation
  Downloading Large Files
  Economics of Amazon EMR
  Primitive Triples and Primitive Nodes
  Using the Eclipse IDE
Notes on Hadoop
  Alternative Frameworks For Hadoop Programs
  Fetching results out of Hadoop
  Hadoop For the Impatient
  Organizing Pig Code
  Real Multiple Outputs For Hadoop
Major Components
  Haruhi
Particulars for Freebase
  Horizontal Divisions of Freebase
  The top 25 URI Nodes in Freebase
Gotchas
  Don't upload Super JARs over slow connections
  Cannot pass null to reducer
  Empty Files Are Normal In Early Processing
  Type mismatch in key from map (expected)
  Unit Testing Hadoop Mappers and Reducers

Historical Versions of Infovore

The development of Infovore has passed through three phases so far.

Infovore 1.0 -- a proprietary system for converting the old Freebase quad dump to RDF that was later released as open source
Infovore 1.1 -- an open source toolset for processing data sets such as Wikipedia and Freebase; like Infovore 1.0, one-computer concurrency was enabled with our own Millipede framework.
Infovore 2 -- as growing data sets broke components of Millipede, we switched to Hadoop
