Add Hadoop Converter Job and task #1351
Conversation
/**
 * Most useful for unit testing when `org.apache.derby.jdbc.EmbeddedDriver` is needed
 */
public class DerbyEmbeddedMetadataStorageDruidModule extends SQLMetadataStorageDruidModule
I'm not sure I understand why we need a new module here?
DerbyEmbeddedConnector uses org.apache.derby.jdbc.EmbeddedDriver instead of the client driver. Another alternative is to have a setting that allows the configuration of the driver class.
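For context, here is the difference in practice, as a minimal sketch using plain JDBC (the in-memory database name is illustrative): the embedded driver runs Derby in-process with no server, which is exactly what unit tests want, while the client driver requires a separately running Derby network server.

import java.sql.Connection;
import java.sql.DriverManager;

public class DerbyDriverDemo
{
  public static void main(String[] args) throws Exception
  {
    // Embedded driver: Derby runs in-process, no server required -- handy for unit tests.
    Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
    try (Connection c = DriverManager.getConnection("jdbc:derby:memory:druidTest;create=true")) {
      System.out.println("connected via embedded driver: " + !c.isClosed());
    }
    // The client driver (org.apache.derby.jdbc.ClientDriver) would instead expect a
    // running Derby network server, e.g. jdbc:derby://localhost:1527/druidTest;create=true
  }
}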
I managed to eliminate it and add a new test @Rule for anything that needs a metadata connector.
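A minimal sketch of such a rule, assuming JUnit 4's ExternalResource and an in-memory embedded Derby database (class and method names here are hypothetical, not the PR's actual code):

import java.sql.Connection;
import java.sql.DriverManager;
import org.junit.rules.ExternalResource;

public class DerbyConnectorRule extends ExternalResource
{
  private Connection connection;

  @Override
  protected void before() throws Throwable
  {
    // Spin up an in-process Derby database for the duration of the test.
    Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
    connection = DriverManager.getConnection("jdbc:derby:memory:druidTest;create=true");
  }

  @Override
  protected void after()
  {
    try {
      connection.close();
    } catch (Exception e) {
      // ignore teardown failures
    }
  }

  public Connection getConnection()
  {
    return connection;
  }
}

A test then declares @Rule public final DerbyConnectorRule derby = new DerbyConnectorRule(); and each test method gets a fresh connector without any per-test setup code.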
If I'm understanding this correctly, this looks like a way to run a conversion task over Hadoop MR. That is, logically, it is executing tasks using Hadoop MR (and ultimately YARN) as the task manager instead of using whatever the Indexing Service is using. I'm not against this, but I also kinda wonder if we shouldn't make a generic "run tasks on hadoop" job and then have the conversion task be a part of that?
@cheddar : Yes, that is a longer-term goal. Some aspects of this PR will be combined with the hadoop task once this has proven stable. A future goal would be to make a better task container that can run on the indexing service, YARN, or Mesos.
I had discussed with @xvrl briefly about what to do regarding the metadata update. I'm in agreement with his point of view that only allowing the task to work as an indexing service task (and NOT as a standalone hadoop job) would greatly simplify things overall. @cheddar : is there any objection to simply having this as ONLY an indexing task? (EDIT: an indexing task which spawns a hadoop job, meaning it requires the indexing service to run on Hadoop)
The existing conversion task is also not standalone, so I don't see why this one would need to be. It would also greatly simplify this PR to remove all metadata-related stuff, since that's unrelated to the task itself if we remove the standalone option.
  return writtenBytes;
}

private static void setupClassPath(Job job, JobConf jobConf)
Can we reuse code from JobHelper here?
No, not without modification. JobHelper relies exclusively on HadoopDruidIndexerConfig rather than on interfaces to something more abstract. I tried not to touch any existing hadoop codepaths until this is stable, because hadoop is incredibly picky.
@drcrallen it should be easy to reuse the JobHelper code for this. We can simply change the JobHelper.setupClassPath method to pass the workingPath as opposed to the entire config.
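A sketch of what that signature change might look like (the class name and body are illustrative; the point is that the method only needs the distributed classpath location and the job, not the whole HadoopDruidIndexerConfig):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class JobHelperSketch
{
  public static void setupClassPath(Path distributedClassPath, Job job) throws IOException
  {
    // For each jar on the local classpath: copy it under distributedClassPath
    // (the copy itself is elided here) and register it with the job's distributed cache.
    for (String jarPath : System.getProperty("java.class.path").split(java.io.File.pathSeparator)) {
      if (jarPath.endsWith(".jar")) {
        job.addFileToClassPath(new Path(distributedClassPath, new java.io.File(jarPath).getName()));
      }
    }
  }
}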
I'll be happy to do that, but I tried to touch the existing hadoop stuff as little as possible. If this PR is generally agreed upon and proves pretty stable in our tests, then I'll be happy to migrate the existing hadoop stuff over to this PR's "framework" of doing things.
I would rather use the existing code unless there is a good reason to rewrite things. Sometimes little changes can make a big difference, and for things that can be reused I would prefer we take the old code unless there is a good reason to rewrite it or it would be difficult to refactor.
I'm just going to refactor it as part of this PR.
Splitting into its own PR
Fixed by factoring out some more common stuff.
Force-pushed from 05c88ec to c62a1d8.
Travis failed due to #1393; restarting.
@@ -41,6 +41,25 @@
  @JsonProperty("password")
  private PasswordProvider passwordProvider;

  public MetadataStorageConnectorConfig() {
    // NOOP
can we just call this(null, null, null, ...) to make it clear the new constructor is the main one?
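A sketch of the suggested delegation (the field list here is abbreviated; the real config also has host, port, and a PasswordProvider): the no-arg constructor forwards to the full one, so it is obvious which constructor is canonical.

public class MetadataStorageConnectorConfig
{
  private final String connectURI;
  private final String user;
  private final String password;

  public MetadataStorageConnectorConfig()
  {
    // Delegate with defaults rather than duplicating initialization logic.
    this(null, null, null);
  }

  public MetadataStorageConnectorConfig(String connectURI, String user, String password)
  {
    this.connectURI = connectURI;
    this.user = user;
    this.password = password;
  }
}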
Force-pushed from a642675 to 673bc47.
@Override
public void setupJob(JobContext jobContext) throws IOException
{
can we comment these as noop as well?
fixed
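The fix, roughly: intentionally empty OutputCommitter callbacks get an explicit NOOP comment so a reader knows nothing was forgotten. A self-contained sketch (the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class NoopOutputCommitter extends OutputCommitter
{
  @Override
  public void setupJob(JobContext jobContext) throws IOException
  {
    // NOOP
  }

  @Override
  public void setupTask(TaskAttemptContext taskContext) throws IOException
  {
    // NOOP
  }

  @Override
  public boolean needsTaskCommit(TaskAttemptContext taskContext) throws IOException
  {
    return false; // nothing to commit
  }

  @Override
  public void commitTask(TaskAttemptContext taskContext) throws IOException
  {
    // NOOP
  }

  @Override
  public void abortTask(TaskAttemptContext taskContext) throws IOException
  {
    // NOOP
  }
}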
  }
}

public static class DataSegmentSplit extends InputSplit
It is more conventional to implement org.apache.hadoop.io.Writable as well, so that you don't have to create and set up serde separately. I think that will reduce some code (DataSegmentSplitSerializer) and the setupSerializers(..) method.
fixed
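A minimal sketch of the suggested pattern (the payload field is illustrative; the real split carries a serialized DataSegment): once the split implements org.apache.hadoop.io.Writable, Hadoop serializes it directly, so no separate serializer class or setupSerializers(..) registration is needed.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

public class DataSegmentSplit extends InputSplit implements Writable
{
  private String segmentJson; // illustrative stand-in for the segment payload

  public DataSegmentSplit()
  {
    // Hadoop requires a no-arg constructor so it can instantiate before readFields().
  }

  public DataSegmentSplit(String segmentJson)
  {
    this.segmentJson = segmentJson;
  }

  @Override
  public long getLength() throws IOException, InterruptedException
  {
    return segmentJson.length();
  }

  @Override
  public String[] getLocations() throws IOException, InterruptedException
  {
    return new String[]{}; // no locality preference
  }

  @Override
  public void write(DataOutput out) throws IOException
  {
    out.writeUTF(segmentJson);
  }

  @Override
  public void readFields(DataInput in) throws IOException
  {
    segmentJson = in.readUTF();
  }
}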
Force-pushed from 4f626d9 to 2f1a6d5.
@@ -425,4 +440,108 @@ public static Path prependFSIfNullScheme(FileSystem fs, Path path)
    }
    return path;
  }

  // TODO: Replace this whenever hadoop gets their act together and stops breaking with more recent versions of Guava
  public static long unzipNoGuava(
lol
+1 on this
how much of this method is copypasta from guava?
Zero. It is actually because com.metamx.common.CompressionUtils uses ByteSource and ByteSink in the "useful" methods, and com.google.common.io.ByteStreams#copy(java.io.InputStream, java.io.OutputStream) is used quite a bit in there.
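A sketch of what a Guava-free unzip can look like using only JDK streams (names and buffer size are illustrative; the real unzipNoGuava works against Hadoop filesystems): since nothing from com.google.common.io is touched, the method is immune to Hadoop/Guava version clashes.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class UnzipSketch
{
  public static long unzipNoGuava(InputStream zipStream, File outDir) throws IOException
  {
    long writtenBytes = 0;
    final byte[] buffer = new byte[1 << 13];
    try (ZipInputStream zin = new ZipInputStream(zipStream)) {
      for (ZipEntry entry = zin.getNextEntry(); entry != null; entry = zin.getNextEntry()) {
        if (entry.isDirectory()) {
          continue; // only materialize files
        }
        final File outFile = new File(outDir, entry.getName());
        outFile.getParentFile().mkdirs(); // ensure nested directories exist
        try (OutputStream out = new FileOutputStream(outFile)) {
          int read;
          while ((read = zin.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            writtenBytes += read;
          }
        }
      }
    }
    return writtenBytes;
  }
}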
Force-pushed from 01b5128 to 3df16c7.
<dependency>
  <groupId>org.apache.derby</groupId>
  <artifactId>derby</artifactId>
  <version>10.11.1.1</version>
derbynet in the parent pom already depends on derby, so we can just use derbynet and remove the version here.
This was only needed when the Derby server rule was present; since it has vanished, this can too. Will fix.
fixed
* Fixes apache#1363
* Add extra utils in JobHelper based on PR feedback
+1
Add Hadoop Converter Job and task
Conversations directly with FJ and others who have commented on this PR show that everyone is generally ok with this, so I went ahead and merged.
The following should merge first:
#1367 (done)
#1366 (done)
#1428 (done)
Then I'll rebase this one.