what exactly is the input data format expected by Metronome? #2

pchalasani · 2014-07-03T00:52:40Z

subject says it all

jpatanooga · 2014-07-07T14:30:12Z

It's really similar to the SVMLight format, in that it's just a CSV-style,
line-oriented format, but we changed it slightly to accommodate multiple outputs.
The best reference is the unit test:

https://github.com/jpatanooga/Metronome/blob/master/src/test/java/tv/floe/metronome/io/records/TestMetronomeVectorizatonFormat.java

but in general it comes down to a mapping of an input vector to an output
vector:

[i0 i1 i2 | o0 o1 o2]

where spaces separate the vector entries, and each entry is indexed to save
space. We provide the vectorization class (MetronomeRecordFactory) with a
schema, as shown in the unit test.
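
To make the layout concrete, here is a minimal Python sketch of reading one such line. It is only an illustration, not Metronome's parser: it assumes each entry is written as `index:value` (as in SVMLight), and `parse_record` is a hypothetical helper name. The unit test linked above remains the authoritative reference for the real format.

```python
def parse_record(line):
    """Split 'i:v i:v | o:v' into (inputs, outputs) dicts of index -> value.

    Assumption: entries are 'index:value' pairs separated by spaces, with a
    single '|' between the input vector and the output vector.
    """
    if line.count("|") != 1:
        raise ValueError("expected exactly one '|' separating input and output")
    input_part, output_part = line.split("|")

    def parse_side(text):
        entries = {}
        for token in text.split():
            index, value = token.split(":")
            entries[int(index)] = float(value)
        return entries

    return parse_side(input_part), parse_side(output_part)


# Example line: three-slot input with entry 1 omitted, one active output.
inputs, outputs = parse_record("0:1.0 2:0.5 | 1:1.0")
```

Storing each side as an index-to-value mapping means entries that are absent from the line simply do not appear, which is the space saving that indexing is meant to buy.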

So yeah, it's a bit custom, but after looking around and thinking about it we
just wanted something simple to map input to output, and this made sense.

Adam and I are working on some more robust and complete vectorization tools
(https://github.com/jpatanooga/Canova, still a work in progress) that
will interoperate with a number of formats and run serially or in MapReduce,
which should make all of this simpler. Today Metronome should be considered
alpha/beta software at best, and that's why you don't see a more robust set
of input formats for every tool. If you compare it to, say, MLlib in Spark,
you'll see that we're at about a similar state (some of their stuff is
hardcoded to arbitrary CSV formats).

TL;DR: yes, vectorization and input formats are important, and we're thinking
hard about it all holistically (Canova).

Thanks!

JP

agibsonccc · 2014-07-07T14:59:35Z

I would like to add here that this is a big problem. Rather than take an ad hoc approach, Canova will also support different modes of feature extraction for various kinds of data.

Lots of people don't think about word vectors, moving windows over images, and other harder kinds of input.

Featurization is a huge problem we'll be tackling here in the coming weeks. As ambitious as it sounds, much of this is being incubated in the deeplearning4j project now, and a more "neutral" version of it, with support for SVMLight and other formats, will be provided by Canova.

pchalasani · 2014-07-07T18:55:55Z

Thanks for the clarifications. I was just trying to figure out how I can (say) use Metronome to deploy deep-learning on Hadoop for one of our data-sets. Eventually, I'll probably put a friendly Clojure wrapper around it.

jpatanooga · 2014-07-07T19:51:40Z

Glad we could help. Let me know if you need help getting it going; I can
help you triage errors, etc.

JP
