
[jvm-packages] cleaning checkpoint file after a successful training #4754

Merged · 2 commits · Aug 14, 2019

Conversation

CodingCat (Member)

No description provided.

CodingCat (Member Author):

@trams would you please help review?

trams (Contributor) left a comment:

This is a good change. I think we should generally also update the docs and the future changelog to communicate the API change (however minor).

(checkpointPath, checkpointInterval)

    val skipCheckpointFile: Boolean = params.get("skip_clean_checkpoint") match {
      case None => false
trams (Contributor):

I am a bit worried that this changes the "API": before this change xgboost-spark does not clean its checkpoint folder; after this change it will do the cleaning by default.

What do you think about adding a param "clean_checkpoint" instead of "skip_clean_checkpoint"? Those who want the cleaning could then enable it (a sketch follows this comment).

One use case where cleaning the checkpoint folder may not be a good idea: we want to train N trees, optionally validate the model, and then continue training another M trees (possibly with different hyperparameters).
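
A minimal sketch of the suggested opt-in flag, modeled on the snippet quoted above; the param name and the Boolean coercion here are assumptions for illustration, not the code under review:

  // hypothetical opt-in variant: cleaning is OFF unless the caller asks for it
  val cleanCheckpoint: Boolean = params.get("clean_checkpoint") match {
    case Some(value) => value.toString.toBoolean
    case None => false // default keeps the pre-change behavior: leave the checkpoint folder alone
  }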

CodingCat (Member Author):

Actually, the previous implementation is buggy: even if you wanted N trees, the leftover checkpoint is the one produced after N-1 iterations.

The reason I want to make cleaning the default behavior is that I have encountered, several times, a leftover checkpoint making my next training run start from that checkpoint instead of from scratch, because I didn't change the checkpoint path.

trams (Contributor):

👍 Now I understand your motivation.

@@ -53,6 +53,12 @@ private[spark] class CheckpointManager(sc: SparkContext, checkpointPath: String)
      }
    }

    def cleanPath(): Unit = {
      if (checkpointPath != "") {
        FileSystem.get(sc.hadoopConfiguration).delete(new Path(checkpointPath), true)
trams (Contributor):

This assumes that CheckpointManager owns the folder (i.e. all files in the folder have been created by this or an earlier CheckpointManager), so it is safe to remove the whole folder.

This is true for our use case; I am not sure it is actually true for everybody. At the least we should update the docs (and the 1.0 changelog) to mention this.

One way to solve this would be to reuse cleanUpHigherVersions: call it here to clean all versions, then remove the folder non-recursively, so only an empty directory gets removed (a sketch follows this comment).
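
A sketch of that alternative, assuming cleanUpHigherVersions(round: Int) deletes every checkpoint file at or above the given round (its exact signature here is an assumption), with the same FileSystem/Path imports as the hunk above:

  def cleanPath(): Unit = {
    if (checkpointPath != "") {
      // delete every checkpoint version this manager may have produced
      cleanUpHigherVersions(0)
      // non-recursive delete: Hadoop removes the directory only if it is now
      // empty (and throws an IOException for a non-empty one), so files
      // written by anyone else are never silently destroyed
      FileSystem.get(sc.hadoopConfiguration).delete(new Path(checkpointPath), false)
    }
  }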

@@ -473,6 +475,11 @@ object XGBoost extends Serializable {
          tracker.stop()
        }
      }.last
      // we should delete the checkpoint directory after a successful training
      if (!skipCleanCheckpoint) {
trams (Contributor):

[Really minor] I am not sure whether xgboost has a Java/Scala coding style guide, but I generally prefer a cleanCheckpoint flag over skipCleanCheckpoint: it avoids the extra negation, which makes the code slightly harder to read.
P.S. This is really bike-shedding.

CodingCat (Member Author):

As explained above, cleaning the checkpoint is the desired behavior; skipCleanCheckpoint is there mainly for testing.

Regarding the use case where you want to continue training, I think that belongs to another feature: starting training from an existing model.
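
A rough sketch of what that separate feature could look like with the single-node xgboost4j-scala API; the booster parameter of XGBoost.train, the paths, and the hyperparameters are assumptions for illustration, not the Spark-level API under review:

  import ml.dmlc.xgboost4j.scala.{Booster, DMatrix, XGBoost}

  // phase 1: train N trees and persist the model
  val trainData = new DMatrix("train.libsvm")
  val params = Map("eta" -> 0.1, "max_depth" -> 6, "objective" -> "binary:logistic")
  val firstModel: Booster = XGBoost.train(trainData, params, round = 100)
  firstModel.saveModel("model-after-n-trees.bin") // hypothetical path

  // phase 2, possibly a separate job: reload and train M more trees,
  // optionally with different hyperparameters
  val reloaded: Booster = XGBoost.loadModel("model-after-n-trees.bin")
  val continued: Booster = XGBoost.train(trainData, params + ("eta" -> 0.05),
    round = 50, booster = reloaded)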

trams (Contributor):

Agreed on your proposed feature: training that starts from an existing model.

codecov-io commented Aug 8, 2019:

Codecov Report

Merging #4754 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #4754   +/-   ##
=======================================
  Coverage   79.59%   79.59%           
=======================================
  Files          11       11           
  Lines        1965     1965           
=======================================
  Hits         1564     1564           
  Misses        401      401


CodingCat merged commit 7b5cbcc into dmlc:master on Aug 14, 2019.
The lock bot locked the conversation as resolved and limited it to collaborators on Nov 12, 2019.