
[SPARK-49883][SS] State Store Checkpoint Structure V2 Integration with RocksDB and RocksDBFileManager #48355

Closed
wants to merge 60 commits

Conversation

@WweiL (Contributor) commented Oct 4, 2024

What changes were proposed in this pull request?

This PR enables RocksDB to read <zip, changelog> files named with unique ids (e.g. version_uniqueId.zip, version_uniqueId.changelog). Below is an explanation of the changes and the rationale.

Now, for each changelog file, we put a "version: uniqueId" lineage on its first line, covering every version from its current version back to the previous snapshot version. For each snapshot (zip) file, nothing changes other than its name (version_uniqueId.zip), because snapshot files are the single source of truth.

In addition to lastCommitBasedCheckpointId, lastCommittedCheckpointId and loadedCheckpointId added in #47895 (review), this PR also adds an in-memory instance lineageManager that holds the mapping from version to unique id. This is useful when the same RocksDB instance is reused in the executor, so the lineage does not need to be loaded from the changelog file again.
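For illustration, the first line of a changelog file such as 12_uuidC.changelog might then carry a JSON lineage like the following (the exact field names and layout are an assumption here, given only that the lineage is stored as JSON on the first line; assume the previous snapshot was version 10):

```
[{"version":10,"checkpointUniqueId":"uuidA"},
 {"version":11,"checkpointUniqueId":"uuidB"},
 {"version":12,"checkpointUniqueId":"uuidC"}]
```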

RocksDB:

Load

  • When loadedVersion != version, try to load a changelog file with version_checkpointUniqueId.changelog. There could be multiple cases:

    1. Version corresponds to a zip file:
      1. version_uniqueId.zip: either changelog was not enabled, or it was enabled but this version happens to have a snapshot (zip) file, and the previous query run used checkpoint format v2
      2. version.zip: either changelog was not enabled, or it was enabled but this version happens to have a snapshot (zip) file, and the previous query run used checkpoint format v1
    2. Version corresponds to a changelog file:
      1. version_uniqueId.changelog: changelog was enabled, and the previous query run used checkpoint format v2
      2. version.changelog: changelog was enabled, and the previous query run used checkpoint format v1
  • For case i.a, we construct a new empty lineage (version, sessionCheckpointId).

  • For case ii.a, we read the lineage stored in the first line of version_uniqueId.changelog.

  • For cases i.b and ii.b, there is no need to load the lineage because it was not present before; we just load the corresponding file without a uniqueId, while newer files will be constructed with a uniqueId. This covers the transition from checkpoint version v1 to checkpoint version v2. (A file-name resolution sketch is given at the end of this Load section.)

Next, the code finds all snapshot versions through file listing. When there are multiple snapshot files with the same version but different unique ids (the main problem this project is trying to solve), the correct one is loaded based on the checkpoint id.

Then the changelog is replayed with awareness of the lineage. The lineage is stored in memory for the next load().

Last, open the changelog writer for version + 1 and write the lineage, including (version + 1, sessionCheckpointId), to the first line of the file. While it may seem that the lineage is written early, this is safe because the changelog writer has not been committed yet.

  • When loadedVersion == version, the same RocksDB instance is reused and the lineage already kept in memory (versionToUniqueIdLineage) is used directly.
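As a rough sketch of the file-name resolution behind the four cases above (the helper below is illustrative only, not the actual method in this PR):

```scala
// Illustrative sketch: resolve the DFS file name for a (version, Option[uniqueId])
// pair. With a uniqueId (checkpoint format v2) the id is embedded in the name;
// without one (v1) the plain name is used, matching cases i.a/i.b and ii.a/ii.b.
def resolveFileName(version: Long, uniqueId: Option[String], isSnapshot: Boolean): String = {
  val suffix = if (isSnapshot) "zip" else "changelog"
  uniqueId match {
    case Some(id) => s"${version}_$id.$suffix" // e.g. 12_uuidC.changelog (ckpt v2)
    case None     => s"$version.$suffix"       // e.g. 12.changelog (ckpt v1)
  }
}
```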

Commit

  • Also save sessionCheckpointId to latestSnapshot
  • Add (newVersion, sessionCheckpointId) to versionToUniqueIdLineage

Abort

Also clear up versionToUniqueIdLineage

RocksDBFileManager:

  • A number of additions to make the code uniqueId-aware. Every place that used to return a version now returns a <version, Option[uniqueId]> pair; in the v1 format, the option is None.

deleteOldVersions:

If there are multiple files for the same version, e.g. version_uniqueId1.zip (or .changelog) and version_uniqueId2.zip, all of them are deleted.
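A minimal sketch of that deletion rule, with illustrative helper names (the real code works on listed DFS files, not plain strings):

```scala
// Sketch: parse the version prefix regardless of whether a uniqueId suffix is
// present, then delete every file whose version is below the retention threshold,
// no matter which uniqueId it carries.
def versionOf(fileName: String): Long =
  fileName.stripSuffix(".zip").stripSuffix(".changelog").split("_").head.toLong

def filesToDelete(files: Seq[String], minVersionToRetain: Long): Seq[String] =
  files.filter(f => versionOf(f) < minVersionToRetain)
```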

Changelog Reader / Writer

We propose to save the lineage in the first line of the changelog files.

For the changelog reader, there is an abstract function readLineage. In the RocksDBCheckpointManager.getChangelogReader function, readLineage is called right after the changelog reader is initialized, to advance the file pointer past the lineage. Subsequent getNext calls are therefore not affected.

For the changelog writer, there is an abstract function writeLineage that writes the lineage. It is called from RocksDB.load(), before any actual changelog data is written.
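A simplified sketch of that reader/writer contract (readLineage and writeLineage are the PR's function names; the JSON shape and stream handling below are assumptions for illustration):

```scala
import java.io.{BufferedReader, BufferedWriter}

case class LineageItem(version: Long, checkpointUniqueId: String)

// The lineage occupies the first line of the changelog file: the writer emits it
// before any changelog record, and the reader consumes it right after opening the
// file, so subsequent getNext() calls start at the first real record.
def writeLineage(out: BufferedWriter, lineage: Seq[LineageItem]): Unit = {
  val json = lineage
    .map(i => s"""{"version":${i.version},"checkpointUniqueId":"${i.checkpointUniqueId}"}""")
    .mkString("[", ",", "]")
  out.write(json)
  out.newLine()
}

def readLineage(in: BufferedReader): String =
  in.readLine() // advances the stream past the lineage line
```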

Why are the changes needed?

Improve fault tolerance of the RocksDB state store.

Does this PR introduce any user-facing change?

Not yet. After the V2 format project is finished, customers can use the new config to enable it and get better RocksDB state store fault tolerance.

How was this patch tested?

Modified existing unit tests. For unit tests and backward compatibility tests please refer to: #48356

Was this patch authored or co-authored using generative AI tooling?

No

@WweiL (Contributor Author) commented Oct 4, 2024

This PR depends on #47932 and #47895

@WweiL WweiL changed the title [DO-NOT-REVIEW] RocksDB StateV2 integration [DO-NOT-REVIEW] [SPARK-49883] State Store Checkpoint Structure V2 Integration with RocksDB and RocksDBFileManager Oct 4, 2024
@WweiL WweiL changed the title [DO-NOT-REVIEW] [SPARK-49883] State Store Checkpoint Structure V2 Integration with RocksDB and RocksDBFileManager [DO-NOT-REVIEW] [SPARK-49883][SS] State Store Checkpoint Structure V2 Integration with RocksDB and RocksDBFileManager Oct 4, 2024
@WweiL WweiL force-pushed the integration-for-review branch from c6b8cb6 to f7b4a29 Compare October 18, 2024 18:14
@WweiL WweiL changed the title [DO-NOT-REVIEW] [SPARK-49883][SS] State Store Checkpoint Structure V2 Integration with RocksDB and RocksDBFileManager [SPARK-49883][SS] State Store Checkpoint Structure V2 Integration with RocksDB and RocksDBFileManager Oct 22, 2024
@WweiL WweiL marked this pull request as ready for review November 11, 2024 15:29
/**
* In the case of checkpointFormatVersion 2, if we find multiple latest snapshot files of
* the same version but they have different uniqueIds, we need to find the correct one based on
* the lineage we have.
Contributor

Not sure whether the comment is wrong or the implementation is wrong, but no matter whether we find multiple latest snapshot files, we need to use the correct one. If it doesn't exist, we should not use it.

Contributor Author

This comment means the same as what you said; can you tell me where the confusion is?

@@ -965,7 +966,8 @@ class MicroBatchExecution(
updateStateStoreCkptId(execCtx, latestExecPlan)
}
execCtx.reportTimeTaken("commitOffsets") {
if (!commitLog.add(execCtx.batchId, CommitMetadata(watermarkTracker.currentWatermark))) {
if (!commitLog.add(execCtx.batchId,
CommitMetadata(watermarkTracker.currentWatermark, currentStateStoreCkptId.toMap))) {
Contributor

I don't get your comment. When customMetrics is empty, it means we have no metric reported. But here in the commit log, it is better if we can distinguish between "there are no shuffle partitions" and "there are shuffle partitions, but we use V1 so there is no checkpoint ID". It provides additional information to sanity check whether something has gone wrong.
The commit log is an on-disk format, so it is hard to change. We should hold it to a higher standard, because it cannot easily be refactored.

* the latest snapshot (version, uniqueId) pair from file listing, this function finds
* the ground truth latest snapshot (version, uniqueId) the db instance needs to load.
*
* @param lineage the ground truth lineage loaded from changelog file
Contributor

Is lineage required to be sorted in descending order of version? If it is, please add a comment for that requirement. I actually think it may be a good idea not to make the assumption and just sort it here, to guarantee we match the largest version.

Contributor Author

Lineage is updated when:

  1. in commit, lineageManager.appendLineageItem(LineageItem(newVersion, sessionStateStoreCkptId.get)), where newVersion is loadedVersion + 1
  2. in load, append current (version, id) to the lineage loaded from the changelog.

So logically it is sorted, but I could add a sort per your request

@WweiL (Contributor Author) commented Dec 5, 2024

After the refactor here

* @param lineage the ground truth lineage loaded from changelog file

lineage being sorted is not an enforced requirement because there is a sort in the function itself.

I still sorted the array here:

currVersionLineage = currVersionLineage.sortBy(_.version)

because I believe this is the only place a sort is needed, based on searching through where appendLineage and resetLineage are used.

@@ -965,7 +966,8 @@ class MicroBatchExecution(
updateStateStoreCkptId(execCtx, latestExecPlan)
}
execCtx.reportTimeTaken("commitOffsets") {
if (!commitLog.add(execCtx.batchId, CommitMetadata(watermarkTracker.currentWatermark))) {
if (!commitLog.add(execCtx.batchId,
CommitMetadata(watermarkTracker.currentWatermark, currentStateStoreCkptId.toMap))) {
@siying (Contributor) commented Dec 4, 2024

@WweiL [UPDATE: I was wrong. Ignore it.] The added test is great. This is not what I meant in the comment though. I didn't mean downgrading to V1 in the same release, but downgrading to an older release where the code of CommitMetadata doesn't have the extra column added here.

HeartSaVioR pushed a commit that referenced this pull request Dec 5, 2024
…rmation

### What changes were proposed in this pull request?

Break down #48355 into smaller PRs.

## Changelog Reader / Writer
We propose to save the lineage in the first line of the changelog files.

For the changelog reader, there is an abstract function `readLineage`. In the `RocksDBCheckpointManager.getChangelogReader` function, `readLineage` is called right after the changelog reader is initialized, to advance the file pointer past the lineage. Subsequent `getNext` calls are therefore not affected.

For the changelog writer, there is an abstract function `writeLineage` that writes the lineage. It is called from `RocksDB.load()`, before any actual changelog data is written.

The lineage is stored as JSON.

### Why are the changes needed?

Continue development of SPARK-49374

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #48880 from WweiL/changelog.

Authored-by: WweiL <z920631580@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
case None =>
  logWarning("Cannot find latest snapshot based on lineage: "
    + printLineageItems(currVersionLineage))
  (0L, None)
Contributor Author

@siying
After the refactor requested, the error "cannot_load_base_snapshot" is removed

def cannotFindBaseSnapshotCheckpoint(lineage: String): Throwable = {
    new SparkException (
      errorClass =
        "CANNOT_LOAD_STATE_STORE.CANNOT_FIND_BASE_SNAPSHOT_CHECKPOINT",
      messageParameters = Map("lineage" -> lineage),
      cause = null)
  }

When there is no valid value returned from getLatestSnapshotVersionAndUniqueIdFromLineage, we use (0L, None). This covers the following scenario (sketched below):

  1. load(0, None), load(1, id1), load(2, id2), load(3, id3)
  2. commit -> now only the 4_id4.zip file is present
  3. maintenance -> 4_id4.zip is uploaded and becomes the ground truth
  4. load(1, id1) -> now getLatestSnapshotVersionAndUniqueIdFromLineage returns nothing, because the lineage loaded from 1_id1.changelog only contains <1, id1>, while file listing returns <4, id4>.
    Now you have to reload from the beginning and use 0 as latestSnapshotVersion.
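A sketch of that fallback, with illustrative names (the final code adds a stricter check, discussed in the follow-up comments below):

```scala
case class LineageItem(version: Long, checkpointUniqueId: String)

// Intersect the snapshots found by file listing with the lineage read from the
// changelog and take the newest match; fall back to (0L, None), i.e. "replay from
// scratch", when nothing in the lineage is present on DFS (the scenario above).
def latestSnapshotVersionAndUniqueId(
    listedSnapshots: Seq[(Long, String)],
    lineage: Seq[LineageItem]): (Long, Option[String]) = {
  val inLineage = lineage.map(i => (i.version, i.checkpointUniqueId)).toSet
  listedSnapshots
    .filter(inLineage.contains)
    .sortBy(_._1)
    .lastOption
    .map { case (v, id) => (v, Some(id)) }
    .getOrElse((0L, None))
}
```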

Contributor

The handling looks OK but can you explain why cannot_load_base_snapshot is removed? We still need to throw it if none of 1-4 is applicable?

Contributor Author

Yeah, I think this is a flaw that goes deeper into the RocksDB changelog... because we don't have an empty 0_id0.zip file.

Now, when there are no matches, there could be two cases (even without file listing I believe this still persists):

  1. When you load(v, id) but no snapshot has been created yet.
  2. When snapshots have been created but the lineage doesn't match.

For 1 we just need to set latestSnapshot to 0 and latestSnapshotUniqueId to None for now. For 2 we should throw an error.

How about fixing it like this:

latestSnapshotVersionsAndUniqueId match {
  case Some(pair) => pair
  case None if currVersionLineage.head.version == 1 && !fileManager.existsSnapshotFile(1, id1) =>
    logWarning("Cannot find latest snapshot based on lineage: "
      + printLineageItems(currVersionLineage))
    (0L, None)
  case None => throw cannot_load_base_snapshot
}

When the first version in currentLineage is 1, it means no snapshot has been created yet, and we use 0 as the default. In other cases we throw.

This looks a bit messy though.

Contributor

@WweiL whether to throw an error is determined by whether we are able to load a correct checkpoint. The condition should be simple:

  1. we have a valid snapshot for version v, and all valid changelog files v+1, v+2, ..., versionToLoad, or
  2. we have changelog files 1, 2, 3, ..., versionToLoad.

If we cannot satisfy either, there is no way we can load a checkpoint, so we should throw an error.

Contributor

I think your proposed change is good if you remove the && !fileManager.existsSnapshotFile(1, id1) part.

Contributor Author

Thanks, changed

@@ -965,7 +966,8 @@ class MicroBatchExecution(
updateStateStoreCkptId(execCtx, latestExecPlan)
}
execCtx.reportTimeTaken("commitOffsets") {
if (!commitLog.add(execCtx.batchId, CommitMetadata(watermarkTracker.currentWatermark))) {
if (!commitLog.add(execCtx.batchId,
CommitMetadata(watermarkTracker.currentWatermark, currentStateStoreCkptId.toMap))) {
Contributor

@WweiL there must be a misunderstanding here. My previous comment

Same as commented above. It is better to fill in a None if it is V1.
Did you validate that for V1, the commit log generated can be read by an older version?

is two different comments. Even if the validation for V1 is done, I still hope None is used for V1, and I've explained why above. See my explanation at https://github.com/apache/spark/pull/48355/files#r1870082653
I didn't see it addressed yet.
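To make the distinction concrete, a sketch of the shape being asked for (the value type of stateUniqueIds below is illustrative; the PR may use a different one):

```scala
// Sketch: an Option lets a V1 run write None (no checkpoint ids at all), which is
// distinguishable from a V2 run that simply has zero shuffle partitions
// (Some(Map.empty)).
case class CommitMetadata(
    nextBatchWatermarkMs: Long = 0,
    stateUniqueIds: Option[Map[Long, Array[Array[String]]]] = None)
```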

snapshot: RocksDB#RocksDBSnapshot,
fileManager: RocksDBFileManager,
snapshotsPendingUpload: Set[RocksDBVersionSnapshotInfo],
loggingId: String): RocksDBFileManagerMetrics = {
Contributor

@WweiL Oh I got it. It now needs to access lineageManager.

recordedMetrics = None
logInfo(log"Loading ${MDC(LogKeys.VERSION_NUM, version)} with stateStoreCkptId: ${
MDC(LogKeys.UUID, stateStoreCkptId.getOrElse(""))}")
if (enableStateStoreCheckpointIds) {
Contributor

This should key off whether stateStoreCkptId is non-empty, rather than enableStateStoreCheckpointIds. Whether or not we support V1->V2 compatibility, the source of truth for what we can load is the stateStoreCkptId loaded from the commit log, not the configuration, which is more about what version should be written, not what is read. Even if we don't support V1->V2, we should still use stateStoreCkptId but add a check and explicitly fail the query, saying it is not supported. This is perhaps already done in the driver?

@WweiL (Contributor Author) commented Dec 6, 2024

We also cannot do it here because at version 0 the snapshot unique id is None; then you would always use loadV1().

Contributor

Good point. Can we have a special treatment of version 0 here?

Contributor Author

sure

stateStoreCkptId: Option[String],
readOnly: Boolean = false): RocksDB = {
// An array contains lineage information from [snapShotVersion, version] (inclusive both end)
var currVersionLineage: Array[LineageItem] = lineageManager.getLineage()
Contributor

Can it be moved inside the if below?

Contributor Author

No, this is intentionally placed here. In the case of loadedVersion == version, we don't need to reload everything from DFS, which is the most common case in practice. Then we need to reuse the existing lineage from memory.

Contributor

I see, it makes sense.
Here is an issue though: the assumption now is that currVersionLineage corresponds to loadedVersion. currVersionLineage is set in lines 339 and 345, but loadedVersion is reset in line 382, so an exception thrown in the middle would make the two inconsistent. Then you need to do both resets in the same place. Also, it is better to rename lineageManager.getLineage() to lineageManager.getLineageForCurVersion() or something like that.

Contributor Author

Can you elaborate more? I think every exception thrown would invalidate data from the memory:

} catch {
  case t: Throwable =>
    loadedVersion = -1 // invalidate loaded data
    lastCommitBasedStateStoreCkptId = None
    lastCommittedStateStoreCkptId = None
    loadedStateStoreCkptId = None
    sessionStateStoreCkptId = None
    lineageManager.clear()
    throw t

Renamed the function

Contributor

It's a good point. Then there is no correctness issue here, which is good. It still feels better for those final updates to happen in the same place, which is less prone to issues when we change the code in the future and do an early return or handle exceptions differently.

val metadata = fileManager.loadCheckpointFromDfs(latestSnapshotVersion,
workingDir, rocksDBFileMapping, latestSnapshotUniqueId)

loadedVersion = latestSnapshotVersion
Contributor

I think we should also set the lineage here, perhaps with a single entry. It needs to be set in the else case below anyway.

for (v <- loadedVersion + 1 to endVersion) {
logInfo(log"Replaying changelog on version " +
log"${MDC(LogKeys.VERSION_NUM, v)}")
log"${MDC(LogKeys.END_VERSION, versionsAndUniqueIds.lastOption.map(_._1))}")
Contributor

Should assert here that the first item of the lineage has version loadedVersion.

/**
* Initialize key metrics based on the metadata loaded from DFS and open local RocksDB.
*/
private def init(metadata: RocksDBCheckpointMetadata): Unit = {
Contributor

Probably better to call it openLocalRocksDB() or something like that.

if (stateStoreCkptId.isDefined || enableStateStoreCheckpointIds && version == 0) {
loadV2(version, stateStoreCkptId, readOnly)
} else {
loadV1(version, readOnly)
Contributor

More descriptive names would probably be better, something like loadWithCheckpointId() and loadWithoutCheckpointId().

* for the first few versions. Because they are solely loaded from changelog file.
* (i.e. with default minDeltasForSnapshot, there is only 1_uuid1.changelog, no 1_uuid1.zip)
*
* The last item of "lineage" corresponds to one version before the to-be-committed version.
Contributor

Why do we need to have a non-committed version in the lineage? Can you remind me in which scenario we need to deal with an uncommitted lineage item?

Contributor Author

Ah no, this means the lineage only contains committed versions. It basically means we only append a lineage item during commit:

lineageManager.appendLineageItem(LineageItem(newVersion, sessionStateStoreCkptId.get))

Contributor

Got it. Thanks for your explanation.

@@ -313,6 +323,11 @@ class RocksDBFileManager(
metadata
}

def existsSnapshotFile(version: Long, checkpointUniqueId: Option[String] = None): Boolean = {
val path = new Path(dfsRootDir)
fm.exists(path) && fm.exists(dfsBatchZipFile(version, checkpointUniqueId))
Contributor

We should check rootDirChecked before checking dfsRootDir existence again.

HeartSaVioR pushed a commit that referenced this pull request Dec 9, 2024
…mit log

### What changes were proposed in this pull request?

Per comment from another PR: #48355 (comment)
Add a new compatibility test. This test verifies the following scenario:
A Spark running version 3.5 tries to read a commit log generated with the new changelog change but under V1.
Then the commit log would look like:
```
{"nextBatchWatermarkMs":1,"stateUniqueIds":{}}
```

But in Spark 3.5 and before, there is no such `stateUniqueIds` field, so we need to make sure queries won't throw errors in such cases.

In the new test, I create a `CommitMetadataLegacy` that only has `nextBatchWatermarkMs` and no `stateUniqueIds`, to read from a commit log as above. This simulates the scenario stated above, and the test passed.

### Why are the changes needed?

More testing

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Test only addition

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #49063 from WweiL/commitLog-followup.

Authored-by: WweiL <z920631580@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
@siying (Contributor) left a comment

LGTM. I left some easy-to-address comments. I will probably take a closer look at RocksDBSuite.scala later, but I don't have any serious comments anymore.

@brkyvz (Contributor) left a comment

LGTM, overall very small comments. Just one area where I'm concerned about potential race conditions

@@ -233,6 +233,11 @@
"An error occurred during loading state."
],
"subClass" : {
"CANNOT_FIND_BASE_SNAPSHOT_CHECKPOINT" : {
"message" : [
"Cannot find a base snapshot checkpoint. lineage: <lineage>."
Contributor

Cannot find a base snapshot checkpoint with lineage: <lineage>.

Comment on lines +1755 to +1757
def contains(item: LineageItem): Boolean = {
lineage.contains(item)
}
Contributor

How often will this be called, and how long can the list be? Do we want to keep around a map or set to make this lookup faster?

Contributor Author

It is controlled by this config:

val STATE_STORE_MIN_DELTAS_FOR_SNAPSHOT =
  buildConf("spark.sql.streaming.stateStore.minDeltasForSnapshot")
    .internal()
    .doc("Minimum number of state store delta files that needs to be generated before they " +
      "consolidated into snapshots.")
    .version("2.0.0")
    .intConf
    .createWithDefault(10)

The default is 10, and here I believe we store at most about 2x minDeltasForSnapshot entries, so it should be a fairly small array.
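For context, a minimal stand-in for the in-memory lineage manager being discussed (the method names appear elsewhere in this PR; the implementation below is a simplification, not the actual code):

```scala
case class LineageItem(version: Long, checkpointUniqueId: String)

// Simplified stand-in: a small synchronized buffer is enough because the lineage
// stays short (roughly bounded by minDeltasForSnapshot), so a linear contains()
// is not a hot spot.
class LineageManager {
  private var lineage: Vector[LineageItem] = Vector.empty

  def appendLineageItem(item: LineageItem): Unit = synchronized { lineage :+= item }
  def resetLineage(newLineage: Seq[LineageItem]): Unit = synchronized { lineage = newLineage.toVector }
  def getLineageForCurrVersion(): Seq[LineageItem] = synchronized { lineage }
  def contains(item: LineageItem): Boolean = synchronized { lineage.contains(item) }
  def clear(): Unit = synchronized { lineage = Vector.empty }
}
```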

Comment on lines 379 to 381
val versionsAndUniqueIds = currVersionLineage
.filter(_.version > loadedVersion)
.filter(_.version <= version)
Contributor

nit: to reduce intermediate garbage objects, can you squash these into a single filter? You can even call collect on it and just make it a single operation to even get rid of the map

currVersionLineage.collect {
  case i if i.version > loadedVersion && i.version <= version => (i.version, Option(i.checkpointUniqueId))
}

Comment on lines 820 to 822
snapshot = Some(createSnapshot(
checkpointDir, newVersion, colFamilyNameToIdMap.asScala.toMap,
maxColumnFamilyId.get().toShort, sessionStateStoreCkptId))
Contributor

uber nit: one parameter per line please

Comment on lines +1241 to +1242
lineageManager.resetLineage(lineageManager.getLineageForCurrVersion()
.filter(i => i.version >= snapshot.version))
Contributor

if the upload of snapshots is async, could there be some race conditions happening here?

Contributor Author

It is possible that there are race conditions, but it is fine, because we only require the first version in the lineage to be a snapshot version, not necessarily the latest snapshot version. So this could happen:

  1. in loadWithCheckpointId(): lineage = lineageManager.getLineageForCurrVersion()
  2. uploadSnapshot() // lineageManager.resetLineage
  3. the lineage is written to the changelog file.

Now the lineage written to the changelog file actually contains two snapshot versions. But this is fine, because later, when we really need getLatestSnapshotVersionAndUniqueIdFromLineage, we list all snapshot files and find the latest one based on the lineage stored in the changelog file. So even if the lineage has two snapshot versions, we can still find the latest one.


// (version, checkpointUniqueId) -> immutable files
private val versionToRocksDBFiles =
new ConcurrentHashMap[(Long, Option[String]), Seq[RocksDBImmutableFile]]()
Contributor

nit: I probably would define a UniqueVersion class or something that can be used across everything and maybe that replaces LineageItem too. Don't need to address, but something to think about

fm.list(path, onlyZipFiles)
.map(_.getPath.getName.stripSuffix(".zip").split("_"))
.filter {
case Array(ver, id) => lineage.contains(LineageItem(ver.toLong, id))
Contributor

does this not return a MatchError if there is no _ to split on? What if you see old files using checkpoint v1?

Contributor Author

Yes, then there would be errors... Originally it was implemented in a way that handled both v1 and v2, like:

.filter {
  case Array(version, _) => xxx
  case Array(version) => xxx
}

cc @siying, who says we should separate the logic.

Comment on lines 390 to 394
.map(_.getName.stripSuffix(".changelog").split("_"))
.map {
case Array(version, _) => version.toLong
case Array(version) => version.toLong
}
Contributor

try to combine these map functions to avoid creating intermediate garbage - it may harm GC performance for low latency queries

Comment on lines +543 to +547
.map(_.getName.stripSuffix(".zip").split("_"))
.map {
case Array(version, uniqueId) => (version.toLong, Some(uniqueId))
case Array(version) => (version.toLong, None)
}
Contributor

ditto

@brkyvz (Contributor) commented Dec 13, 2024

Thank you for addressing the comments! Merging to master

@brkyvz (Contributor) commented Dec 13, 2024

I'm having trouble merging. Asking @HeartSaVioR for help in merging

@HeartSaVioR (Contributor)

I'll take over and keep helping to merge till @brkyvz gets his power again :)

Thanks! Merging to master (on behalf of @brkyvz )

ericm-db pushed a commit to ericm-db/spark that referenced this pull request Dec 17, 2024
…h RocksDB and RocksDBFileManager


Closes apache#48355 from WweiL/integration-for-review.

Lead-authored-by: WweiL <z920631580@gmail.com>
Co-authored-by: Wei Liu <wei.liu@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>