-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX: Decision Tree Visualization #386
FIX: Decision Tree Visualization #386
Conversation
Return empty if node is a constructor.
Swap LogisticRegression or RandomForestRegressor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR.
skops/io/tests/test_visualize.py
Outdated
("clf", LogisticRegression(random_state=0, solver="liblinear")), | ||
("clf", RandomForestRegressor(random_state=0)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could test both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good point. One challenge here is that I don't think it'll be straightforward without a minor refactor of the pipeline fixture. The simplest approaches seem to be to parameterize the fixture or to use a GridSearchCV. Any preferences or suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also simply have a separate test only for RandomForestRegressor
and RandomForestClsasifier
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added separate test for DecisionTreeClassifier
and DecisionTreeRegressor
.
skops/io/_visualize.py
Outdated
if node_name == "constructor": | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems a bit shady, it might skip things that the user should see. Can you show here an example of what's inside this node in our case?
One would also need to update the node_name
docstring if we're skipping here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very sympathetic to the concerns. Suspect you all know better here, so would defer to your recommendations.
As far as I am aware, sklearn tree-based models are the only ones that produce a "constructor" object of this type (presumably as an artifact of how skops handles C Extension types).
Taking the example:
from sklearn.tree import DecisionTreeClassifier
from skops.io import dumps, loads, visualize
dumped = dumps(DecisionTreeClassifier().fit([[0, 1], [2, 3], [4, 5]], [0, 1, 2]))
loaded = loads(dumped)
visualize(dumped)
The skipped node is <class 'sklearn.tree._tree.Tree'>
. Skipping processing of this object, the code outputs:
root: sklearn.tree._classes.DecisionTreeClassifier
└──
attrs: builtins.dict├──
criterion: json-type("gini")├──
splitter: json-type("best")├──
max_depth: json-type(null)├──
min_samples_split: json-type(2)├──
min_samples_leaf: json-type(1)├──
min_weight_fraction_leaf: json-type(0.0)├──
max_features: json-type(null)├──
max_leaf_nodes: json-type(null)├──
random_state: json-type(null)├──
min_impurity_decrease: json-type(0.0)├──
class_weight: json-type(null)├──
ccp_alpha: json-type(0.0)├──
n_features_in_: json-type(2)├──
n_outputs_: json-type(1)├──
classes_: numpy.ndarray├──
n_classes_: numpy.int64├──
max_features_: json-type(2)├──
tree_: sklearn.tree._tree.Tree
│
├──
attrs: builtins.dict
│ │
├──
max_depth: json-type(2)
│ │
├──
node_count: json-type(5)
│ │
├──
nodes: numpy.ndarray
│ │
└──
values: numpy.ndarray
│
├──
args: builtins.tuple
│ │
├──
content: json-type(2)
│ │
├──
content: numpy.ndarray
│ │
└──
content: json-type(1)└──
_sklearn_version: json-type("1.3.0")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I think the best solution here would be to actually handle type
, as in, in the visualized tree print what the type is, rather than skipping it. I'm sure there would be other cases where we'd encounter types as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gave this a first pass. So if isinstance(node, type)
, then print information about the type rather than skipping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also needs a changelog.
skops/io/_visualize.py
Outdated
is_self_safe=False, | ||
is_safe=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here we should defer to the information in the Node instead of assuming it to be unsafe all the time. the Tree
constructor for instance, is safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I agree. However the challenge I'm running into is that it's not a valid Node object, so it does not have the necessary attributes:
AttributeError: type object 'sklearn.tree._tree.Tree' has no attribute 'is_self_safe'
If I understand correctly, it's because the constructor class is directly set as a child node here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so probably the parent when doing the walk_tree
should send this info to the child.
skops/io/_visualize.py
Outdated
if "is_self_safe" in kwargs: | ||
is_self_safe = kwargs["is_self_safe"] | ||
if "is_safe" in kwargs: | ||
is_safe = kwargs["is_safe"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd make sense to pass these as actual args rather than **kwargs
, and they'd need to be added to the docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks quite nice, thank you @reidjohnson
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, the extension of the walk_tree
API to include is_self_safe
and is_safe
made me skeptical, as it looks like we have a leaky abstraction here. I tried to understand why that is necessary and I think I found a deeper underlying issue.
I think the correct solution right now is that we should not visualize the constructor. It is not an attribute of the object, just a utility class we add to construct it, so its presence is not really needed. Therefore, I think we should actually just skip the node info on the constructor.
This should be perfectly safe with the current code, since we only have two examples where we should have type(node) is type
, namely TreeNode
and SGDNode
, where we ensure that the constructor is indeed safe. However, this may of course change in the future. Therefore, we could do a check that the type is either a Tree
or in ALLOWED_SGD_LOSSES
.
If we make this change, then we can roll back the changes in the signature of walk_tree
and avoid the leaky abstraction.
As a more long term solution, I think we might want to consider something like this: the children
of a ReduceNode
should not directly contain the constructor or any type
. Instead, we should wrap the constructor into a TypeNode
, which is capable of determining if the constructor is safe or not.
If we make that change, the walk_tree
function would, if it encounters a ReduceNode
, just rely on ReduceNode.is_safe
, which under the hood calls is_safe
on the TypeNode
.
WDYT @adrinjalali?
@BenjaminBossan Thanks for the comments. Updated with wrapping constructor into a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is much nicer, but we should handle this change in protocol the same way it's done via the files in the io.old
folder.
skops/io/_sklearn.py
Outdated
"constructor": TypeNode( | ||
{ | ||
"__class__": constructor.__name__, | ||
"__module__": constructor.__module__, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__module__
doesn't always exist, we should use get_module
instead.
The failing test seems to be that in the latest sklearn version, they added a new parameter:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot, this LGTM.
I think the codecov complaints are false positives, so if @adrinjalali approves of this final version, we can merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for the PR @reidjohnson , and thank you for being patient with our reviews.
Thank you for the reviews, have been happy to contribute. |
Fixes issue #385.
The root cause of the bug appears to be that the base sklearn tree constructor object is not wrapped as a valid skops
Node
type, but is still included in the visualized object tree.This PR fixes the issue by appropriately processing this node.