Add Hadoop Converter Job and task #1351
Conversation
/**
 * Most useful for unit testing when `org.apache.derby.jdbc.EmbeddedDriver` is needed
 */
public class DerbyEmbeddedMetadataStorageDruidModule extends SQLMetadataStorageDruidModule
I'm not sure I understand why we need a new module here?
DerbyEmbeddedConnector uses org.apache.derby.jdbc.EmbeddedDriver instead of the client driver. Another alternative is to have a setting that allows the configuration of the driver class.
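For context, here is the difference in practice, as a minimal sketch using plain JDBC (the in-memory database name is illustrative): the embedded driver runs Derby in-process with no server, which is exactly what unit tests want, while the client driver requires a separately running Derby network server.

import java.sql.Connection;
import java.sql.DriverManager;

public class DerbyDriverDemo
{
  public static void main(String[] args) throws Exception
  {
    // Embedded driver: Derby runs in-process, no server required -- handy for unit tests.
    Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
    try (Connection c = DriverManager.getConnection("jdbc:derby:memory:druidTest;create=true")) {
      System.out.println("connected via embedded driver: " + !c.isClosed());
    }
    // The client driver (org.apache.derby.jdbc.ClientDriver) would instead expect a
    // running Derby network server, e.g. jdbc:derby://localhost:1527/druidTest;create=true
  }
}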
I managed to eliminate it and add a new test @Rule for anything that needs a metadata connector.
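A minimal sketch of such a rule, assuming JUnit 4's ExternalResource and an in-memory embedded Derby database (class and method names here are hypothetical, not the PR's actual code):

import java.sql.Connection;
import java.sql.DriverManager;
import org.junit.rules.ExternalResource;

public class DerbyConnectorRule extends ExternalResource
{
  private Connection connection;

  @Override
  protected void before() throws Throwable
  {
    // Spin up an in-process Derby database for the duration of the test.
    Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
    connection = DriverManager.getConnection("jdbc:derby:memory:druidTest;create=true");
  }

  @Override
  protected void after()
  {
    try {
      connection.close();
    } catch (Exception e) {
      // ignore teardown failures
    }
  }

  public Connection getConnection()
  {
    return connection;
  }
}

A test then declares @Rule public final DerbyConnectorRule derby = new DerbyConnectorRule(); and each test method gets a fresh connector without any per-test setup code.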
If I'm understanding this correctly, this looks like a way to run a conversion task over Hadoop MR. That is, logically, it is executing tasks using Hadoop MR (and ultimately YARN) as the task manager instead of using whatever the Indexing Service is using. I'm not against this, but I also kinda wonder if we shouldn't make a generic "run tasks on hadoop" job and then have the conversion task be a part of that?
@cheddar : Yes, that is a longer-term goal. Some aspects of this PR will be combined with the hadoop task once this has proven stable. A future goal would be to make a better task container that can run on the indexing service, YARN, or Mesos.
I had discussed with @xvrl briefly about what to do regarding the metadata update. I'm in agreement with his point of view that only allowing the task to work as an indexing service task (and NOT as a standalone hadoop job) would greatly simplify things overall. @cheddar : is there any objection to simply having this as ONLY an indexing task? (EDIT: an indexing task which spawns a hadoop job, meaning it requires the indexing service to run on Hadoop)
The existing conversion task is also not standalone, so I don't see why this one would need to be. It would also greatly simplify this PR to remove all metadata-related stuff, since that's unrelated to the task itself if we remove the standalone option.
  return writtenBytes;
}

private static void setupClassPath(Job job, JobConf jobConf)
Can we reuse code from JobHelper here?
No, not without modification. JobHelper relies exclusively on HadoopDruidIndexerConfig rather than on interfaces to something more abstract. I tried not to touch any existing hadoop codepaths until this is stable, because hadoop is incredibly picky.
@drcrallen it should be easy to reuse the JobHelper code for this. We can simply change the JobHelper.setupClassPath method to pass the workingPath as opposed to the entire config.
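A sketch of what that signature change might look like (the class name and body are illustrative; the point is that the method only needs the distributed classpath location and the job, not the whole HadoopDruidIndexerConfig):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class JobHelperSketch
{
  public static void setupClassPath(Path distributedClassPath, Job job) throws IOException
  {
    // For each jar on the local classpath: copy it under distributedClassPath
    // (the copy itself is elided here) and register it with the job's distributed cache.
    for (String jarPath : System.getProperty("java.class.path").split(java.io.File.pathSeparator)) {
      if (jarPath.endsWith(".jar")) {
        job.addFileToClassPath(new Path(distributedClassPath, new java.io.File(jarPath).getName()));
      }
    }
  }
}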
I'll be happy to do that, but I tried to touch the existing hadoop stuff as little as possible. If this PR is generally agreed upon and proves pretty stable in our tests, then I'll be happy to migrate the existing hadoop stuff over to this PR's "framework" of doing things.
I would rather use the existing code unless there is a good reason to rewrite things. Sometimes little changes can make a big difference, and for things that can be reused I would prefer we take the old code unless there is a good reason to rewrite it or it would be difficult to refactor.
I'm just going to refactor it as part of this PR.
Splitting into its own PR
Fixed by factoring out some more common stuff.
Force-pushed from 05c88ec to c62a1d8.
Travis failed due to #1393; restarting.
@@ -41,6 +41,25 @@
  @JsonProperty("password")
  private PasswordProvider passwordProvider;

  public MetadataStorageConnectorConfig() {
    // NOOP
can we just call this(null, null, null, ...) to make it clear the new constructor is the main one?
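A sketch of the suggested delegation (the field list here is abbreviated; the real config also has host, port, and a PasswordProvider): the no-arg constructor forwards to the full one, so it is obvious which constructor is canonical.

public class MetadataStorageConnectorConfig
{
  private final String connectURI;
  private final String user;
  private final String password;

  public MetadataStorageConnectorConfig()
  {
    // Delegate with defaults rather than duplicating initialization logic.
    this(null, null, null);
  }

  public MetadataStorageConnectorConfig(String connectURI, String user, String password)
  {
    this.connectURI = connectURI;
    this.user = user;
    this.password = password;
  }
}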
Force-pushed from a642675 to 673bc47.
@Override
public void setupJob(JobContext jobContext) throws IOException
{
can we comment these as noop as well?
fixed
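The fix, roughly: intentionally empty OutputCommitter callbacks get an explicit NOOP comment so a reader knows nothing was forgotten. A self-contained sketch (the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class NoopOutputCommitter extends OutputCommitter
{
  @Override
  public void setupJob(JobContext jobContext) throws IOException
  {
    // NOOP
  }

  @Override
  public void setupTask(TaskAttemptContext taskContext) throws IOException
  {
    // NOOP
  }

  @Override
  public boolean needsTaskCommit(TaskAttemptContext taskContext) throws IOException
  {
    return false; // nothing to commit
  }

  @Override
  public void commitTask(TaskAttemptContext taskContext) throws IOException
  {
    // NOOP
  }

  @Override
  public void abortTask(TaskAttemptContext taskContext) throws IOException
  {
    // NOOP
  }
}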
  }
}

public static class DataSegmentSplit extends InputSplit
It is more conventional to implement org.apache.hadoop.io.Writable as well, so that you don't have to create and set up serde separately. I think that will reduce some code (DataSegmentSplitSerializer) and the setupSerializers(..) method.
fixed
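A minimal sketch of the suggested pattern (the payload field is illustrative; the real split carries a serialized DataSegment): once the split implements org.apache.hadoop.io.Writable, Hadoop serializes it directly, so no separate serializer class or setupSerializers(..) registration is needed.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

public class DataSegmentSplit extends InputSplit implements Writable
{
  private String segmentJson; // illustrative stand-in for the segment payload

  public DataSegmentSplit()
  {
    // Hadoop requires a no-arg constructor so it can instantiate before readFields().
  }

  public DataSegmentSplit(String segmentJson)
  {
    this.segmentJson = segmentJson;
  }

  @Override
  public long getLength() throws IOException, InterruptedException
  {
    return segmentJson.length();
  }

  @Override
  public String[] getLocations() throws IOException, InterruptedException
  {
    return new String[]{}; // no locality preference
  }

  @Override
  public void write(DataOutput out) throws IOException
  {
    out.writeUTF(segmentJson);
  }

  @Override
  public void readFields(DataInput in) throws IOException
  {
    segmentJson = in.readUTF();
  }
}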
Force-pushed from 4f626d9 to 2f1a6d5.
@@ -425,4 +440,108 @@ public static Path prependFSIfNullScheme(FileSystem fs, Path path)
    }
    return path;
  }

  // TODO: Replace this whenever hadoop gets their act together and stops breaking with more recent versions of Guava
  public static long unzipNoGuava(
lol
+1 on this
how much of this method is copypasta from guava?
Zero. It is actually because com.metamx.common.CompressionUtils uses ByteSource and ByteSink in the "useful" methods, and com.google.common.io.ByteStreams#copy(java.io.InputStream, java.io.OutputStream) is used quite a bit in there.
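A sketch of what a Guava-free unzip can look like using only JDK streams (names and buffer size are illustrative; the real unzipNoGuava works against Hadoop filesystems): since nothing from com.google.common.io is touched, the method is immune to Hadoop/Guava version clashes.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class UnzipSketch
{
  public static long unzipNoGuava(InputStream zipStream, File outDir) throws IOException
  {
    long writtenBytes = 0;
    final byte[] buffer = new byte[1 << 13];
    try (ZipInputStream zin = new ZipInputStream(zipStream)) {
      for (ZipEntry entry = zin.getNextEntry(); entry != null; entry = zin.getNextEntry()) {
        if (entry.isDirectory()) {
          continue; // only materialize files
        }
        final File outFile = new File(outDir, entry.getName());
        outFile.getParentFile().mkdirs(); // ensure nested directories exist
        try (OutputStream out = new FileOutputStream(outFile)) {
          int read;
          while ((read = zin.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            writtenBytes += read;
          }
        }
      }
    }
    return writtenBytes;
  }
}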
Force-pushed from 01b5128 to 3df16c7.
<dependency>
  <groupId>org.apache.derby</groupId>
  <artifactId>derby</artifactId>
  <version>10.11.1.1</version>
derbynet in the parent pom already depends on derby, so we can just use derbynet and remove the version here.
This was only needed when the Derby server rule was present; since it has vanished, this can too. Will fix.
fixed
* Fixes apache#1363
* Add extra utils in JobHelper based on PR feedback
+1
Add Hadoop Converter Job and task
Conversations directly with FJ and others who have commented on this PR show that everyone is generally ok with this, so I went ahead and merged.
The following should merge first:
#1367 (done)
#1366 (done)
#1428 (done)
Then I'll rebase this one.