
Add Hadoop Converter Job and task #1351

Merged 1 commit into apache:master on Jun 10, 2015

Conversation

@drcrallen (Contributor):

  • Allows submitting conversion tasks through Hadoop
  • Allows submitting conversion tasks through the indexing service for Hadoop
  • Adds unit tests for the Hadoop conversion task
  • Makes small changes to SQLMetadataSegmentManager to facilitate better unit tests
  • Fixes #1363 (Convert Segment Task does not close properly on error)

The following should merge first:
#1367 (done)
#1366 (done)
#1428 (done)

Then I'll rebase this one.

@drcrallen changed the title from "Add Hadoop Converter Job and task" to "Add Hadoop Converter Job and task (DISCUSSION / INITIAL REVIEW)" on May 9, 2015
@xvrl added the Discuss label on May 11, 2015
/**
 * Most useful for unit testing when `org.apache.derby.jdbc.EmbeddedDriver` is needed
 */
public class DerbyEmbeddedMetadataStorageDruidModule extends SQLMetadataStorageDruidModule
Member:

I'm not sure I understand why we need a new module here?

Contributor Author:

DerbyEmbeddedConnector uses org.apache.derby.jdbc.EmbeddedDriver instead of the client driver. An alternative is to have a setting that allows configuring the driver class.
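For illustration, a minimal sketch of that alternative, assuming a Jackson-bound connector config (the class name, property name, and default here are hypothetical, not part of this PR):

import com.fasterxml.jackson.annotation.JsonProperty;

public class DerbyConnectorConfig
{
  // Hypothetical setting: defaults to the network client driver; unit tests
  // could override it with "org.apache.derby.jdbc.EmbeddedDriver".
  @JsonProperty("driverClassName")
  private String driverClassName = "org.apache.derby.jdbc.ClientDriver";

  public String getDriverClassName()
  {
    return driverClassName;
  }
}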

Contributor Author:

I managed to eliminate it and add a new test @Rule for anything that needs a metadata connector.
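For reference, such a rule could look roughly like the following sketch, built on JUnit's ExternalResource and an in-memory Derby database (the class name and JDBC URL are illustrative, not the rule actually added in this PR):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import org.junit.rules.ExternalResource;

public class DerbyConnectorRule extends ExternalResource
{
  private Connection connection;

  @Override
  protected void before() throws Throwable
  {
    // "create=true" makes Derby create the in-memory database on first use.
    connection = DriverManager.getConnection("jdbc:derby:memory:druidTest;create=true");
  }

  @Override
  protected void after()
  {
    try {
      // "drop=true" removes the in-memory database entirely.
      DriverManager.getConnection("jdbc:derby:memory:druidTest;drop=true");
    }
    catch (SQLException e) {
      // Derby reports a successful drop by throwing SQLException; ignore it.
    }
  }

  public Connection getConnection()
  {
    return connection;
  }
}

A test would then declare `@Rule public final DerbyConnectorRule derby = new DerbyConnectorRule();` and call `derby.getConnection()` wherever it needs metadata storage.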

@cheddar (Contributor) commented May 12, 2015

If I'm understanding this correctly, this looks like a way to run a conversion task over Hadoop MR. That is, logically, it is executing tasks using Hadoop MR (and ultimately YARN) as the task manager instead of using whatever the Indexing Service is using.

I'm not against this, but I also kinda wonder if we shouldn't make a generic "run tasks on hadoop" job and then have the conversion task be a part of that?

@drcrallen (Contributor Author):

@cheddar : Yes, that is a longer goal. Some of the aspects of this PR will be combined with the hadoop task once this has proven stable. A future goal would be to make a better task container that can run on the indexing service, yarn, or mesos.

@drcrallen (Contributor Author):

I had discussed with @xvrl briefly about what to do regarding the metadata update.

I'm in agreement with his point of view that only allowing the task to work as an indexing service task (and NOT as a standalone Hadoop job) would greatly simplify things overall.

@cheddar : is there any objection to simply having this as ONLY an indexing task? (EDIT: an indexing task which spawns a Hadoop job, meaning it requires the indexing service in order to run on Hadoop)

@drcrallen changed the title from "Add Hadoop Converter Job and task (DISCUSSION / INITIAL REVIEW)" to "Add Hadoop Converter Job and task" on May 13, 2015
@xvrl (Member) commented May 13, 2015

The existing conversion task is also not standalone, so I don't see why this one would need to be.

It would also greatly simplify this PR to remove all the metadata-related stuff, since that's unrelated to the task itself if we remove the standalone option.

@drcrallen (Contributor Author):

@cheddar / @xvrl : I removed the metadata updating in the job and left it as ONLY an indexing task.

I also fixed the last known bug (#1363) that I have encountered in local tests.

return writtenBytes;
}

private static void setupClassPath(Job job, JobConf jobConf)
Member:

can we reuse code from JobHelper here?

Contributor Author:

No, not without modification. JobHelper relies exclusively on HadoopDruidIndexerConfig rather than on a more abstract interface.

I tried not to touch any existing Hadoop codepaths until this is stable, because Hadoop is incredibly picky.

Member:

@drcrallen it should be easy to reuse the JobHelper code for this. We can simply change the JobHelper.setupClassPath method to pass the workingPath as opposed to the entire config.
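As a sketch of that suggestion (the signature and jar-upload details are illustrative, not the actual JobHelper code), the method would take only the working path:

import java.io.File;
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

class JobHelperSketch
{
  // Upload each jar on the local classpath under the job's working path and
  // register it on the distributed classpath. Takes a Path rather than a
  // HadoopDruidIndexerConfig, so any job type can call it.
  public static void setupClassPath(Path workingPath, Job job) throws IOException
  {
    FileSystem fs = workingPath.getFileSystem(job.getConfiguration());
    for (String jarFilePath : System.getProperty("java.class.path").split(File.pathSeparator)) {
      File jarFile = new File(jarFilePath);
      if (jarFile.isFile() && jarFile.getName().endsWith(".jar")) {
        Path hdfsPath = new Path(workingPath, jarFile.getName());
        if (!fs.exists(hdfsPath)) {
          fs.copyFromLocalFile(new Path(jarFile.getAbsolutePath()), hdfsPath);
        }
        job.addFileToClassPath(hdfsPath);
      }
    }
  }
}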

Contributor Author:

I'll be happy to set it, but I tried to touch the existing Hadoop stuff as little as possible.

If this PR is generally agreed upon and it proves pretty stable in our tests, then I'll be happy to migrate the existing Hadoop stuff over to this PR's "framework" of doing things.

Member:

I would rather use the existing code unless there is a good reason to rewrite things. Sometimes little changes can make a big difference, and for things that can be reused I would prefer we take the old code unless there is a good reason to rewrite it, or it would be difficult to refactor.

Contributor Author:

I'm just going to refactor it as part of this PR.

Contributor Author:

Splitting this into its own PR.

Contributor Author:

Fixed by pulling out some more common stuff.

@drcrallen force-pushed the hadoopConvert branch 3 times, most recently from 05c88ec to c62a1d8 on May 27, 2015 13:05
@drcrallen (Contributor Author):

Travis failed due to #1393; restarting.

@drcrallen closed this May 27, 2015
@drcrallen reopened this May 27, 2015
@drcrallen closed this May 27, 2015
@drcrallen reopened this May 27, 2015
@drcrallen (Contributor Author):

@cheddar / @xvrl This has been sitting here for 2 days and is blocking stuff on my side. Please either comment or merge.

@@ -41,6 +41,25 @@
@JsonProperty("password")
private PasswordProvider passwordProvider;

public MetadataStorageConnectorConfig(){
// NOOP
Member:

can we just call this(null, null, null, ...) to make it clear the new constructor is the main one?
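For illustration, a trimmed-down sketch of that delegation (the real config has more fields than shown here; PasswordProvider is Druid's existing password-holder interface):

import com.fasterxml.jackson.annotation.JsonProperty;

public class MetadataStorageConnectorConfig
{
  @JsonProperty("connectURI")
  private String connectURI;

  @JsonProperty("password")
  private PasswordProvider passwordProvider;

  public MetadataStorageConnectorConfig()
  {
    // Delegate so the parameterized constructor is clearly the canonical one.
    this(null, null);
  }

  public MetadataStorageConnectorConfig(String connectURI, PasswordProvider passwordProvider)
  {
    this.connectURI = connectURI;
    this.passwordProvider = passwordProvider;
  }
}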

@drcrallen force-pushed the hadoopConvert branch 2 times, most recently from a642675 to 673bc47 on June 1, 2015 17:50
@Override
public void setupJob(JobContext jobContext) throws IOException
{

Contributor:

can we comment these as noop as well?

Contributor Author:

fixed

}
}

public static class DataSegmentSplit extends InputSplit
Contributor:

it is more conventional to implement org.apache.hadoop.io.Writable as well, so that you don't have to create and set up serde separately. I think that will remove some code (the DataSegmentSplitSerializer class and the setupSerializers(..) method).
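A minimal sketch of that suggestion, assuming the segment is carried as its serialized JSON form (the field and method bodies are illustrative, not the PR's actual implementation):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

public class DataSegmentSplit extends InputSplit implements Writable
{
  private String segmentJson;

  // Hadoop instantiates splits reflectively, so a no-arg constructor is required.
  public DataSegmentSplit()
  {
  }

  public DataSegmentSplit(String segmentJson)
  {
    this.segmentJson = segmentJson;
  }

  @Override
  public long getLength() throws IOException, InterruptedException
  {
    return segmentJson.length();
  }

  @Override
  public String[] getLocations() throws IOException, InterruptedException
  {
    // No locality preference for remote segments.
    return new String[]{};
  }

  @Override
  public void write(DataOutput out) throws IOException
  {
    Text.writeString(out, segmentJson);
  }

  @Override
  public void readFields(DataInput in) throws IOException
  {
    segmentJson = Text.readString(in);
  }
}

WritableSerialization is registered with Hadoop's SerializationFactory by default, so a split like this needs no extra serializer setup.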

Contributor Author:

fixed

@drcrallen force-pushed the hadoopConvert branch 2 times, most recently from 4f626d9 to 2f1a6d5 on June 8, 2015 20:21
@@ -425,4 +440,108 @@ public static Path prependFSIfNullScheme(FileSystem fs, Path path)
}
return path;
}

// TODO: Replace this whenever hadoop gets their act together and stops breaking with more recent versions of Guava
public static long unzipNoGuava(
Contributor:

lol
+1 on this

Contributor:

how much of this method is copypasta from guava?

Contributor Author:

Zero; it actually exists because com.metamx.common.CompressionUtils uses Guava's ByteSource and ByteSink in the "useful" methods. Also, com.google.common.io.ByteStreams#copy(java.io.InputStream, java.io.OutputStream) is used quite a bit in there.
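For context, a Guava-free unzip can be written with plain java.util.zip and a manual byte[] copy loop; the following is a sketch under that assumption (names are illustrative, and it skips concerns like path sanitization):

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class CompressionSketch
{
  // Copies each entry with a manual buffer loop instead of Guava's
  // ByteSource/ByteSink helpers, returning the total bytes written.
  static long unzipNoGuava(InputStream in, File outDir) throws IOException
  {
    long writtenBytes = 0;
    byte[] buffer = new byte[1 << 13];
    ZipInputStream zipIn = new ZipInputStream(in);
    try {
      for (ZipEntry entry = zipIn.getNextEntry(); entry != null; entry = zipIn.getNextEntry()) {
        File outFile = new File(outDir, entry.getName());
        if (entry.isDirectory()) {
          outFile.mkdirs();
          continue;
        }
        OutputStream out = new FileOutputStream(outFile);
        try {
          for (int len = zipIn.read(buffer); len != -1; len = zipIn.read(buffer)) {
            out.write(buffer, 0, len);
            writtenBytes += len;
          }
        } finally {
          out.close();
        }
      }
    } finally {
      zipIn.close();
    }
    return writtenBytes;
  }
}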

@drcrallen force-pushed the hadoopConvert branch 2 times, most recently from 01b5128 to 3df16c7 on June 8, 2015 22:11
<dependency>
  <groupId>org.apache.derby</groupId>
  <artifactId>derby</artifactId>
  <version>10.11.1.1</version>
Member:

derbynet in the parent pom already depends on derby, so we can just use derbynet and remove the version here.

Contributor Author:

It was only needed when the Derby server rule was present; since that has vanished, this can too. Will fix.

Contributor Author:

fixed

* Fixes apache#1363
* Add extra utils in JobHelper based on PR feedback
@xvrl (Member) commented Jun 9, 2015

+1

cheddar added a commit that referenced this pull request Jun 10, 2015
Add Hadoop Converter Job and task
@cheddar merged commit 0004ba2 into apache:master on Jun 10, 2015
@drcrallen deleted the hadoopConvert branch on June 10, 2015 18:07
@cheddar (Contributor) commented Jun 10, 2015

Conversations directly with FJ and others who have commented on this PR show that everyone is generally ok with this, so I went ahead and merged.
