Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML][Inference] adding tree model #47044

Merged

Conversation

benwtrent
Copy link
Member

This adds the base tree model that can be used for our future ensemble model.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

@benwtrent
Copy link
Member Author

@valeriy42 Let me know what you think.

Additionally, I cannot find a use for BOTH split_feature and split_index. It seems to me we just need ONE of those because they will always point to the same thing, no?

Copy link

@hendrikmuhs hendrikmuhs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

if (treeNode.getRightChild() != null) {
toVisit.add(treeNode.getRightChild());
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could add an easy check for disconnected nodes by checking visited.size() == nodes.size()

@valeriy42
Copy link
Contributor

Additionally, I cannot find a use for BOTH split_feature and split_index. It seems to me we just need ONE of those because they will always point to the same thing, no?

The difference is split_index is an index of the tree node, which is references e.g. by left_child or right_child, while split_feature refers to the index of the feature (after pre-processing) wrt which we are splitting.

Copy link
Contributor

@valeriy42 valeriy42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. There is a mix-up between "nodeIndex" and "splitIndex". Also "model" is called "evaluation" in JSON schema. We need to synchronize the definitions.

* @param decisionThreshold The decision threshold
* @return The created node
*/
public TreeNode.Builder addJunction(int nodeIndex, int featureIndex, boolean isDefaultLeft, double decisionThreshold) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the nodeIndex variable here is what is called split_index in JSON. We should homogenize the names.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think node_index is a more friendly name. Honestly, the value will probably be passed down from the tree as the node array order needs to be guaranteed for serialization to work well.

new NamedWriteableRegistry.Entry(PreProcessor.class, FrequencyEncoding.NAME.getPreferredName(), FrequencyEncoding::new),
new NamedWriteableRegistry.Entry(PreProcessor.class, OneHotEncoding.NAME.getPreferredName(), OneHotEncoding::new),
new NamedWriteableRegistry.Entry(PreProcessor.class, TargetMeanEncoding.NAME.getPreferredName(), TargetMeanEncoding::new),
// ML - Inference models
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to avoid using "models" since there is a long tradition of overloading this term. In JSON schema the section is called evaluation. I am not particularly invested in the term, but we should stick to the same terminology everywhere.

This also may mean that we rename "evaluation" to "model" in the JSON schema.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like using model as it is a unified name with the ensemble and its nested models object.

It seems logical to me that a model that is an ensemble will have many models. If we choose to use evaluation I think ensemble should have an evaluations field.

@benwtrent benwtrent merged commit 85f1272 into elastic:master Sep 25, 2019
@benwtrent benwtrent deleted the feature/ml-inference-model-parsing branch September 25, 2019 20:02
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Sep 25, 2019
* [ML][Inference] adding tree model

* renaming features for updated schema
benwtrent added a commit that referenced this pull request Sep 25, 2019
* [ML][Inference] adding tree model (#47044)

* [ML][Inference] adding tree model

* renaming features for updated schema

* fixing 7.x compilation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants