Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tree refactor stage I #3983

Closed
wants to merge 11 commits into from
Closed

Conversation

RAMitchell
Copy link
Member

  • Renaming classes
  • Combine TreeModel/RegTree class
  • Added serialisation test
  • Remove redundant functions
  • Remove redundant member variables
  • Move definitions to .cc file

One thing I noticed while working on this, MetaInfo.root_index_ seems to be completely redundant.

@trivialfis
Copy link
Member

I'm starting to think about if back porting a bug fix release is needed. The next release is gonna be heavy, which might not be stable enough. And we had some important bug fixes after the last release.

@RAMitchell
Copy link
Member Author

Yes it will be risky if we do this and the serialisation changes at once.

@tqchen
Copy link
Member

tqchen commented Dec 10, 2018

Can you elaborate a bit on why some of the changes. I can see that most of the changes has things to do with the generic, function renaming and remove the root index.

The current proposed changes contains two sets:

  • the code that renamed the functions
  • the code that removes certain features(root index, recycle tree node)

The root index was originally used for building multiple trees when each of them has a given group, each in a group can have unique tree. Given that it was not a strict generalization in the code, I feel we could keep it as it is.

The leaf vector was useful for implementing vector tree. Given how much we need to go through that path, it is important to think how can we deal with this

// allocate a new node,
// !!!!!! NOTE: may cause BUG here, nodes.resize
inline int AllocNode() {
if (param.num_deleted != 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the logic for recycling pruned nodes and need to be kept here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put an exit(-1) statement inside this and ran all of the Python tests without any issues. Unless I'm missing something, these recycled nodes never get used again. The pruner deletes nodes but there is nothing that adds nodes after pruning has occurred.

The reason for removing this is to reduce the number of member variables and statefullness of the tree structure. This will make it more flexible in future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will only be triggered if certain nodes are pruned and alloc reuse them in the future tree builds. I think it is still an useful feature to have here.

@tqchen
Copy link
Member

tqchen commented Dec 10, 2018

Given that this is a major data structure change, can you please send an RFC discussion on this, thanks!

@RAMitchell
Copy link
Member Author

@tqchen thanks your comments!

The root index was originally used for building multiple trees when each of them has a given group, each in a group can have unique tree. Given that it was not a strict generalization in the code, I feel we could keep it as it is.

This is not used anywhere in the code so it causes some confusion for me (and I assume others) as to how it should be used or what the future use might be. I also suspect if we try to begin using it a large amount of code wouldn't work because it is built on the assumption of a tree having a single root. I prefer not to have code lying around without clear function or use cases, it makes it very difficult to understand and reorganise code.

The leaf vector was useful for implementing vector tree. Given how much we need to go through that path, it is important to think how can we deal with this

Again, this is not used anywhere in the code base. I have no idea what its purpose is, can we implement this according to our needs in future instead of keeping it redundantly?

Given that this is a major data structure change, can you please send an RFC discussion on this, thanks!

There is no change in end user functionality or data structure here (the internal node vector and stats vector remains unchanged) so an RFC seems like a lot of work. Serialisation remains the same. I have only renamed and removed code that was not being used.

@RAMitchell
Copy link
Member Author

See these two references identifying the above problems as code smells:
https://refactoring.guru/smells/speculative-generality
https://refactoring.guru/smells/dead-code

// allocate a new node,
// !!!!!! NOTE: may cause BUG here, nodes.resize
inline int AllocNode() {
if (param.num_deleted != 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will only be triggered if certain nodes are pruned and alloc reuse them in the future tree builds. I think it is still an useful feature to have here.

@tqchen
Copy link
Member

tqchen commented Dec 11, 2018

I am fine removing the leaf_vectors, but let us keep the root index as they will not hurt the flow as much, and a generally growing n-th layer of the tree already did that.

@khotilov
Copy link
Member

While I feel a bit attached to having the vector leaves available (I tried some hacking attempts at vector trees a while ago), I've also realized that there would be quite more work needed in order to properly utilize them. So it would make sense to not lug them around while not in use.

But as for the renaming the RegTree and switching from the [nid] overloading to a getter function, I really don't see any tangible benefit to these changes which would justify touching the hundreds of lines of code.

@trivialfis
Copy link
Member

@RAMitchell When it's done, please take some notes in #3986 , these things started to become a headache.

@RAMitchell
Copy link
Member Author

@trivialfis there will be no modification of user facing parameters in this PR. The parameters that are changed and are only used internally and are not documented.

@khotilov These changes update the code with google code style (descriptive unabbreviated naming). I do agree that their impact is small but renaming is also fairly low risk.

There is probably far too much going on in this PR so I will consider separating out the key elements into several PRs. It has stimulated a nice discussion anyway :)

@RAMitchell RAMitchell closed this Dec 11, 2018
@lock lock bot locked as resolved and limited conversation to collaborators Mar 11, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants