[jvm-packages] cleaning checkpoint file after a successful training #4754

CodingCat · 2019-08-08T19:09:12Z

No description provided.

CodingCat · 2019-08-08T19:09:36Z

@trams would you please help to review?

trams

This is a good change. In general, I think we should also update the docs and the upcoming changelog to communicate the API change (however minor).

trams · 2019-08-08T19:30:03Z

...ackages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/CheckpointManager.scala

-    (checkpointPath, checkpointInterval)
+
+    val skipCheckpointFile: Boolean = params.get("skip_clean_checkpoint") match {
+      case None => false


I am a bit worried that this changes the "API".
Before this change, xgboost-spark does not clean its checkpoint folder.
After this change, it will clean the folder by default.

What do you think about creating a param
"clean_checkpoint" instead of "skip_clean_checkpoint"?
Those who want cleaning can enable it.

One use case where cleaning the checkpoint folder may not be a good idea: we want to train N trees, optionally validate the model, and then continue training another M trees (maybe with different hyperparameters).

Actually, the previous implementation is buggy: even if you wanted N trees, the leftover checkpoint is the one produced after N-1 iterations.

The reason I want to make cleaning the default behavior is that I have run into this several times: if I didn't change the checkpoint path, the leftover made my next training run start from a checkpoint instead of from scratch.

👍 Now I understand your motivation.
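The default discussed in this thread can be sketched as follows. This is a minimal illustration, not the PR's actual code: the object name `CheckpointFlagSketch`, the helper `skipCleanCheckpoint`, and the handling of non-Boolean values are assumptions; only the `"skip_clean_checkpoint"` key and the `case None => false` default come from the diff above.

```scala
object CheckpointFlagSketch {
  // Hypothetical helper (not the PR's code): reads "skip_clean_checkpoint"
  // from a loosely typed params map. An absent key means "do not skip",
  // so cleaning stays the default behavior.
  def skipCleanCheckpoint(params: Map[String, Any]): Boolean =
    params.get("skip_clean_checkpoint") match {
      case None => false
      case Some(flag: Boolean) => flag
      // Assumption: any non-Boolean value is rejected rather than coerced.
      case Some(other) =>
        throw new IllegalArgumentException(
          s"skip_clean_checkpoint must be a Boolean, got: $other")
    }
}
```

With an opt-in "clean_checkpoint" param instead, the `None` case would simply map to the opposite behavior (no cleaning unless requested); the trade-off is that stale checkpoints would keep being picked up by default.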

trams · 2019-08-08T19:33:37Z

...ackages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/CheckpointManager.scala

@@ -53,6 +53,12 @@ private[spark] class CheckpointManager(sc: SparkContext, checkpointPath: String)
    }
  }

+  def cleanPath(): Unit = {
+    if (checkpointPath != "") {
+      FileSystem.get(sc.hadoopConfiguration).delete(new Path(checkpointPath), true)


This assumes that CheckpointManager owns the folder (i.e. all files in this folder have been created by this or an earlier CheckpointManager), so it is safe to remove the whole folder.

This is true for our use case, but I am not sure it is true for everybody. At the very least, we should update the docs (and the 1.0 changelog) to mention this.

One way to solve this problem would be to reuse cleanUpHigherVersions: we could call it here in a way that cleans all versions, and then remove the folder non-recursively. That would remove the directory only if it is empty.
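The idea of deleting only the manager's own files and then removing the folder non-recursively can be sketched like this. It is an illustration under stated assumptions, not the project's implementation: it uses `java.io.File` on a local directory as a stand-in for the Hadoop `FileSystem` calls in the diff, the object `OwnedCheckpointCleanup` and method `cleanOwnedFiles` are hypothetical, and the `<version>.model` naming of checkpoint files is assumed.

```scala
import java.io.File

object OwnedCheckpointCleanup {
  // Assumption for this sketch: checkpoint files are named "<version>.model".
  private val CheckpointName = """\d+\.model""".r

  /** Deletes only files that look like checkpoints, then tries to remove the
    * directory itself. Returns true only when the directory was removed,
    * which happens only if nothing else was left inside it. */
  def cleanOwnedFiles(dir: File): Boolean = {
    // listFiles() returns null for a missing or unreadable directory.
    val entries = Option(dir.listFiles()).getOrElse(Array.empty[File])
    entries
      .filter(f => f.isFile && CheckpointName.pattern.matcher(f.getName).matches())
      .foreach(_.delete())
    // File.delete() on a directory is non-recursive: it returns false when
    // the directory is not empty, so files created by others keep it alive.
    dir.delete()
  }
}
```

Compared with a recursive delete of the whole checkpoint path, this variant never touches files it does not recognize, at the cost of depending on a naming convention.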

trams · 2019-08-08T19:38:32Z

jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala

@@ -473,6 +475,11 @@ object XGBoost extends Serializable {
            tracker.stop()
          }
      }.last
+      // we should delete the checkpoint directory after a successful training
+      if (!skipCleanCheckpoint) {


[Really minor] I am not sure whether xgboost has a Java/Scala coding style. I generally prefer a cleanCheckpoint flag over skipCleanCheckpoint. That would avoid the extra "not", which makes the code slightly harder to read.
P.S. This is really bikeshedding.

As explained above, cleaning the checkpoint is the desired behavior; skipCleanCheckpoint is there mainly for testing.

Regarding the use case where you want to continue training, I think that belongs to another feature: starting training from an existing model.

Agreed on your proposed feature: training starting from an existing model.

...es/xgboost4j-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/CheckpointManagerSuite.scala

codecov-io · 2019-08-08T20:26:01Z

Codecov Report

Merging #4754 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #4754   +/-   ##
=======================================
  Coverage   79.59%   79.59%           
=======================================
  Files          11       11           
  Lines        1965     1965           
=======================================
  Hits         1564     1564           
  Misses        401      401

Last update 19f9fd5...82cdd76.

cleaning checkpoint file after a successful file

15f4d45

trams reviewed Aug 8, 2019

address comments

82cdd76

trams approved these changes Aug 14, 2019

CodingCat merged commit 7b5cbcc into dmlc:master Aug 14, 2019

lock bot locked as resolved and limited conversation to collaborators Nov 12, 2019
