-
Notifications
You must be signed in to change notification settings - Fork 833
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add clean missing data module #22
Add clean missing data module #22
Conversation
@imatiach-msft, |
LGTM, let's remember to add example notebook once this gets into build. |
createMockDataset | ||
} | ||
|
||
override def schemaForDataset: StructType = ??? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why leave this unimplemented?
@@ -121,6 +121,18 @@ trait HasOutputCol extends Wrappable { | |||
def getOutputCol: String = $(outputCol) | |||
} | |||
|
|||
trait HasInputCols extends Wrappable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great thanks but we should be very careful about versioning Core now that we are public. Is there a way to add a @since
tag or would that be too superfluous?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we can definitely do this later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, we can do this later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new estimator looks great, I'm concerned about changes to Core at this point even though they look useful. Do you see any concerns with adding new APIs? Could you please discuss with Eli/AK? Thanks!
discussed, we can add the "since" tag later, we should add it to all APIs, similar to Spark |
ee412f5
to
38abb47
Compare
Updated PR - added support for string, boolean types for custom value and added test cases |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks!
38abb47
to
44a495e
Compare
[ Pass! — The build has succeeded. MMLSpark 0.5.dev14+1.g44a495e
Copy the link to this HDInsight Script Action to setup this build on a cluster. This is a build for github PR #22Associated Changes
|
Add clean missing data module with modes:
1.) Replace with mean - replaces missings with mean of fit column
2.) Replace with median - replaces missings with approximate median of fit column
3.) Replace with custom value - replaces missings with custom value specified by user
As well as several smoke tests
Also added HasInputCols and HasOutputCols as traits for multiple columns (which will be needed when imputation will be added).
Resolved issue: Add support for missing value cleaner #11