-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[jvm-packages] cleaning checkpoint file after a successful training #4754
Conversation
@trams would you please help to review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good change. I think generally we should update also docs and future changelog to communicate API change (however minor)
...ackages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/CheckpointManager.scala
Outdated
Show resolved
Hide resolved
(checkpointPath, checkpointInterval) | ||
|
||
val skipCheckpointFile: Boolean = params.get("skip_clean_checkpoint") match { | ||
case None => false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit worried that it changes "API".
Before this change xgboost-spark does not clean it checkpoint folder
After this change it will do the cleaning by default.
What do you think about creating a param
"clean_checkpoint" instead of "skip_clean_checkpoint"
Those who wants can enable cleaning.
One use case when cleaning checkpoint folder may not be a good idea is if we desire to train N trees, optionally validate it and then continue training (may be with different hyperparameters) another M trees.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually the previous implementation is buggy, even you wanted N trees, the leftover of checkpoint is the one produced after N-1 iterations,
the reason I want to make cleaning as a default behavior is that I encountered several times that the left over makes my successive training starts with a checkpoint instead of from scratch if I didn't change checkpoint path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Now I understand you motivation
@@ -53,6 +53,12 @@ private[spark] class CheckpointManager(sc: SparkContext, checkpointPath: String) | |||
} | |||
} | |||
|
|||
def cleanPath(): Unit = { | |||
if (checkpointPath != "") { | |||
FileSystem.get(sc.hadoopConfiguration).delete(new Path(checkpointPath), true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assumes that CheckpontManager owns the folder (i.e. all files in this folder has been created by this or earlier CheckpointManager) so it is safe to remove the whole folder.
This is true for our use case. I am not sure it is actually true for everybody. At least we should update the docs (and 1.0 changelog) to mention this
One way to solve this problem would be to actually reuse cleanUpHigherVersions
. We can pass it here to clean all versions. After that we can remove the folder non recursively. That would remove the empty directory if any
@@ -473,6 +475,11 @@ object XGBoost extends Serializable { | |||
tracker.stop() | |||
} | |||
}.last | |||
// we should delete the checkpoint directory after a successful training | |||
if (!skipCleanCheckpoint) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Really minor] I am not sure whether xgboost has java|scala coding style. I generally prefer to have cleanCheckpoint
flag not skipCleanCheckpoint
. That would avoid adding extra "not" which makes reading the code slightly harder
P.S. This is really bike shadding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as explained above, cleaning checkpoint is the desired behavior. skipCleanCheckpoint is there mainly for testing
regarding the use cases you want to continue training, I think that belongs to another feature that you can start training from an existing model
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree on your proposed feature: train starting from an existing model
...es/xgboost4j-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/CheckpointManagerSuite.scala
Outdated
Show resolved
Hide resolved
...es/xgboost4j-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/CheckpointManagerSuite.scala
Outdated
Show resolved
Hide resolved
...es/xgboost4j-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/CheckpointManagerSuite.scala
Outdated
Show resolved
Hide resolved
Codecov Report
@@ Coverage Diff @@
## master #4754 +/- ##
=======================================
Coverage 79.59% 79.59%
=======================================
Files 11 11
Lines 1965 1965
=======================================
Hits 1564 1564
Misses 401 401 Continue to review full report at Codecov.
|
No description provided.