Persist and recover individual deleted messages #192
Conversation
I am not in favor of storing individualDeletedMessages in the z-node.
It will raise a question: when can we delete the ledger?
@rdhabalia this would require much more work and yet another ledger for every cursor. Also, if a broker were to crash unexpectedly, it would not have persisted individualDeletedMessages. Maybe we could use the same ledger we have now, and force a read to its last entry on recovery.
What about persisting a max number of intervals, stopping after that?
This way we can control how big the BK entry or z-node can grow. It will not be perfectly accurate, though better than today. I think storing 1k intervals should be pretty safe.
Can we persist this info into a ledger instead of storing it in ZooKeeper? Otherwise we will introduce far more reads to zk during startup, which will be an issue as the number of topics grows.
@saandrews The mark-delete position is already stored either in a ledger or in a z-node. Normally it just gets written into the cursor ledger. During cursor ledger roll-over it also gets written in the z-node (and used as a fallback, in case the cursor ledger fails to recover). The other time the mark-delete position is stored in the z-node is when we do the graceful close of the topic: we store it in the z-node and throw away the cursor ledger. With this change, the number of writes stays exactly the same. It only changes (potentially) the amount of information stored.
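To make those three write paths concrete, here is a minimal sketch; all names are hypothetical stand-ins, not the actual ManagedCursorImpl API:

```java
// Hedged sketch of the three persistence paths for the mark-delete position.
// Method and field names are illustrative only.
class CursorPersistenceSketch {
    boolean cursorLedgerOpen;

    void persistMarkDelete(long ledgerId, long entryId) {
        if (cursorLedgerOpen) {
            // Normal path: append the mark-delete position to the cursor ledger.
            appendToCursorLedger(ledgerId, entryId);
        } else {
            // Cursor ledger roll-over: also snapshot to the z-node, used as a
            // fallback if the cursor ledger fails to recover.
            writeToZNode(ledgerId, entryId);
        }
    }

    void closeTopicGracefully(long ledgerId, long entryId) {
        // Graceful close: snapshot to the z-node and throw the cursor ledger away.
        writeToZNode(ledgerId, entryId);
        deleteCursorLedger();
    }

    void appendToCursorLedger(long l, long e) { /* BookKeeper addEntry */ }
    void writeToZNode(long l, long e) { /* ZooKeeper setData */ }
    void deleteCursorLedger() { /* BookKeeper deleteLedger */ }
}
```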
@merlimat if we're concerned with storing too much on ZK, we can always force a read to the last position in the last ledger to get the list of individualDeletedMessages.
No, the broker will create only one ledger, where all the cursors will write their list when the bundle is unloaded or on some interval. The broker can roll over that ledger, with the expiry time stored at some zk location, and one of the tasks on the leader broker can purge expired ledgers.
Yes, when a broker crashes it also fails to update the metadata in zk. So a crashing broker always fails to store its current state and falls back to recovery with the fallback information.
Yes, but I am just thinking it may not solve the problem entirely: if there are many intervals in the initial range of messages, that will prevent storing the latest individualDeletedMessages list and will reset the cursor to the old message range only.
But this entry will potentially grow very large and hit some limit?
Yes, but with what I've currently done, this information is also stored on the cursor ledger.
@rdhabalia That would require opening and recovering one more ledger when loading the topic.
@merlimat another approach to reduce the size a bit would be to only store the unacked messages of the ranges, and then build the ranges at runtime; this would require more logic than just picking up the ranges, though.
I was mostly concerned about the amount of data stored in ZK. Storing ranges or having a limit is better to contain its growth, though it won't help with pockets of single unacked messages.
It will help, right? If you put a cap at 1000 (or whatever number) intervals, that means we can record 1000 disjoint "holes" and remember the acked messages. After that, the rest will be replayed in case of broker restart (like today, when we replay everything).
It will help. In a bigger cluster, though, 1000 itself might be big. Knowing the overall size of individually deleted entries across topics might help to realize and contain its growth, but that's not straightforward. We can go with this for now.
I like this approach. As I commented before, I'd put a configurable max of how many ranges to store and, after that, fall back to the current behavior.
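A minimal sketch of what such a cap could look like, assuming the in-memory state is a Guava RangeSet; Long stands in for the real position type, and the real code would build protobuf range entries (lowerEndpoint/upperEndpoint) instead:

```java
import com.google.common.collect.Range;
import com.google.common.collect.RangeSet;
import com.google.common.collect.TreeRangeSet;
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of capping how many individually deleted ranges get persisted.
class RangeCapSketch {
    static List<Range<Long>> rangesToPersist(RangeSet<Long> deleted, int maxRanges) {
        List<Range<Long>> result = new ArrayList<>();
        for (Range<Long> range : deleted.asRanges()) {
            if (result.size() >= maxRanges) {
                // Beyond the cap we stop recording; those messages are simply
                // replayed after a restart, as all of them are today.
                break;
            }
            result.add(range);
        }
        return result;
    }

    public static void main(String[] args) {
        RangeSet<Long> deleted = TreeRangeSet.create();
        deleted.add(Range.closed(10L, 20L));
        deleted.add(Range.closed(25L, 30L));
        System.out.println(rangesToPersist(deleted, 1000));
    }
}
```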
@@ -15,19 +15,12 @@
  */
 package org.apache.bookkeeper.mledger;

 import com.google.common.annotations.Beta;
 import org.apache.bookkeeper.mledger.AsyncCallbacks.*;
Can you keep the import in the same format as Eclipse? :)
changed
@@ -1619,7 +1615,9 @@ public void asyncClose(final AsyncCallbacks.CloseCallback callback, final Object
 // hence we write it as -1. The cursor ledger is deleted once the z-node write is confirmed.
 ManagedCursorInfo info = ManagedCursorInfo.newBuilder().setCursorsLedgerId(-1)
         .setMarkDeleteLedgerId(markDeletePosition.getLedgerId())
-        .setMarkDeleteEntryId(markDeletePosition.getEntryId()).build();
+        .setMarkDeleteEntryId(markDeletePosition.getEntryId())
+        .addAllIndividualDeletedMessages(buildIndividualDeletedMessageRanges())
Could we skip this line if we have no individually deleted messages? We should avoid creating an empty list in that case.
addressed
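For reference, a hedged sketch of how the conditional could look in the context of the diff above, assuming an isEmpty check on the in-memory set:

```java
ManagedCursorInfo.Builder builder = ManagedCursorInfo.newBuilder()
        .setCursorsLedgerId(-1)
        .setMarkDeleteLedgerId(markDeletePosition.getLedgerId())
        .setMarkDeleteEntryId(markDeletePosition.getEntryId());
// Skip building the ranges list entirely when there is nothing to persist.
if (!individualDeletedMessages.isEmpty()) {
    builder.addAllIndividualDeletedMessages(buildIndividualDeletedMessageRanges());
}
ManagedCursorInfo info = builder.build();
```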
@@ -1834,6 +1851,7 @@ void switchToNewLedger(final LedgerHandle lh, final VoidCallback callback) {
 // ledger and delete the old one.
 ManagedCursorInfo info = ManagedCursorInfo.newBuilder().setCursorsLedgerId(lh.getId())
         .setMarkDeleteLedgerId(markDeletePosition.getLedgerId())
+        .addAllIndividualDeletedMessages(buildIndividualDeletedMessageRanges())
same here, avoid if possible
addressed
@@ -233,84 +224,76 @@ public void asyncUpdateCursorInfo(final String ledgerName, final String cursorName
 info.getCursorsLedgerId(), info.getMarkDeleteLedgerId(), info.getMarkDeleteEntryId());

 String path = prefix + ledgerName + "/" + cursorName;
-byte[] content = info.toString().getBytes(Encoding);
+byte[] content = info.toByteArray();
Is this maintaining the protobuf text format? I think toByteArray() is using the binary format.
Nope, this is using the binary format, and it's intentional; I mentioned it in the description of the PR.
Text format is REALLY verbose, and its size grows A LOT for each entry.
Here's what an entry with only 71 messages looks like:
cursorsLedgerId: -1 markDeleteLedgerId: 459 markDeleteEntryId: 59 individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 60 } upperEndpoint { ledgerId: 459 entryId: 1845 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1846 } upperEndpoint { ledgerId: 459 entryId: 1848 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1849 } upperEndpoint { ledgerId: 459 entryId: 1851 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1852 } upperEndpoint { ledgerId: 459 entryId: 1854 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1855 } upperEndpoint { ledgerId: 459 entryId: 1857 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1858 } upperEndpoint { ledgerId: 459 entryId: 1860 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1861 } upperEndpoint { ledgerId: 459 entryId: 1863 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1864 } upperEndpoint { ledgerId: 459 entryId: 1866 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1867 } upperEndpoint { ledgerId: 459 entryId: 1869 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1870 } upperEndpoint { ledgerId: 459 entryId: 1872 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1873 } upperEndpoint { ledgerId: 459 entryId: 1875 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1876 } upperEndpoint { ledgerId: 459 entryId: 1878 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1879 } upperEndpoint { ledgerId: 459 entryId: 1881 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1882 } upperEndpoint { ledgerId: 459 entryId: 1884 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1885 } upperEndpoint { ledgerId: 459 entryId: 1887 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1888 } upperEndpoint { ledgerId: 459 entryId: 1890 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1891 } upperEndpoint { ledgerId: 459 entryId: 1893 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1894 } upperEndpoint { ledgerId: 459 entryId: 1896 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1897 } upperEndpoint { ledgerId: 459 entryId: 1899 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1900 } upperEndpoint { ledgerId: 459 entryId: 1902 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1903 } upperEndpoint { ledgerId: 459 entryId: 1905 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1906 } upperEndpoint { ledgerId: 459 entryId: 1908 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1909 } upperEndpoint { ledgerId: 459 entryId: 1911 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1912 } upperEndpoint { ledgerId: 459 entryId: 1914 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1915 } upperEndpoint { ledgerId: 459 entryId: 1917 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1918 } upperEndpoint { ledgerId: 459 entryId: 1920 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1921 } upperEndpoint { ledgerId: 459 entryId: 1923 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1924 } upperEndpoint { ledgerId: 459 entryId: 1926 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1927 } 
upperEndpoint { ledgerId: 459 entryId: 1929 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1930 } upperEndpoint { ledgerId: 459 entryId: 1932 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1933 } upperEndpoint { ledgerId: 459 entryId: 1935 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1936 } upperEndpoint { ledgerId: 459 entryId: 1938 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1939 } upperEndpoint { ledgerId: 459 entryId: 1941 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1942 } upperEndpoint { ledgerId: 459 entryId: 1944 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1945 } upperEndpoint { ledgerId: 459 entryId: 1947 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1948 } upperEndpoint { ledgerId: 459 entryId: 1950 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1951 } upperEndpoint { ledgerId: 459 entryId: 1953 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1954 } upperEndpoint { ledgerId: 459 entryId: 1956 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1957 } upperEndpoint { ledgerId: 459 entryId: 1959 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1960 } upperEndpoint { ledgerId: 459 entryId: 1962 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1963 } upperEndpoint { ledgerId: 459 entryId: 1965 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1966 } upperEndpoint { ledgerId: 459 entryId: 1968 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1969 } upperEndpoint { ledgerId: 459 entryId: 1971 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1972 } upperEndpoint { ledgerId: 459 entryId: 1974 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1975 } upperEndpoint { ledgerId: 459 entryId: 1977 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1978 } upperEndpoint { ledgerId: 459 entryId: 1980 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1981 } upperEndpoint { ledgerId: 459 entryId: 1983 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1984 } upperEndpoint { ledgerId: 459 entryId: 1986 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1987 } upperEndpoint { ledgerId: 459 entryId: 1989 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1990 } upperEndpoint { ledgerId: 459 entryId: 1992 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1993 } upperEndpoint { ledgerId: 459 entryId: 1995 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1996 } upperEndpoint { ledgerId: 459 entryId: 1998 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 1999 } upperEndpoint { ledgerId: 459 entryId: 2001 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 2002 } upperEndpoint { ledgerId: 459 entryId: 2004 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 2005 } upperEndpoint { ledgerId: 459 entryId: 2007 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 2008 } upperEndpoint { ledgerId: 459 entryId: 2010 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 2011 } upperEndpoint { ledgerId: 459 entryId: 2013 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 2014 } upperEndpoint { 
ledgerId: 459 entryId: 2016 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 2017 } upperEndpoint { ledgerId: 459 entryId: 2019 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 2020 } upperEndpoint { ledgerId: 459 entryId: 2022 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 2023 } upperEndpoint { ledgerId: 459 entryId: 2025 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 2026 } upperEndpoint { ledgerId: 459 entryId: 2028 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 2029 } upperEndpoint { ledgerId: 459 entryId: 2031 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 2032 } upperEndpoint { ledgerId: 459 entryId: 2034 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 2035 } upperEndpoint { ledgerId: 459 entryId: 2037 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 2038 } upperEndpoint { ledgerId: 459 entryId: 2040 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 2041 } upperEndpoint { ledgerId: 459 entryId: 2043 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 2044 } upperEndpoint { ledgerId: 459 entryId: 2046 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 2047 } upperEndpoint { ledgerId: 459 entryId: 2048 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 2050 } upperEndpoint { ledgerId: 459 entryId: 2051 } } individualDeletedMessages { lowerEndpoint { ledgerId: 459 entryId: 2053 } upperEndpoint { ledgerId: 459 entryId: 2054 } }
That's about 8 KB, and it's only going to get larger: larger IDs mean larger string sizes.
I know that introducing a change in format is a breaking change, but I changed the usages of this data to parse both formats.
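A hedged sketch of that dual-format read path: try the compact binary encoding first and fall back to the legacy text format, as the PR description says. ManagedCursorInfo is the generated protobuf class; error handling is simplified compared to the real MetaStoreImplZookeeper:

```java
import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.TextFormat;
import java.nio.charset.StandardCharsets;

class DualFormatParseSketch {
    static ManagedCursorInfo parse(byte[] content) throws Exception {
        try {
            return ManagedCursorInfo.parseFrom(content); // new binary format
        } catch (InvalidProtocolBufferException e) {
            ManagedCursorInfo.Builder builder = ManagedCursorInfo.newBuilder();
            TextFormat.merge(new String(content, StandardCharsets.UTF_8), builder);
            return builder.build(); // legacy text format
        }
    }
}
```

One caveat worth testing: a binary parse of text data is not guaranteed to fail cleanly, so the order of the two attempts deserves care.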
Yes, the size concern for the text format is very real, but the breaking change would be dangerous in 2 ways:
- While a rolling upgrade is happening, an updated broker crashes and topics can end up in a non-updated broker
- In case there is any issue with the release, we need to be able to roll back to the previous release
In general this problem can be solved in a couple of ways:
- Do one release that will understand both formats. Next release will start writing the new format.
- Use a config switch to choose the format, then retire the config variable once the feature is widely activated.
For the sake of this PR, I would say not to mix it with format changes; we need to tackle that separately in a controlled way.
In this case, we can simply avoid snapshotting this information into the z-node. In normal behavior the cursor position is appended in binary form into a ledger.
When doing graceful topic close, as an optimization to save on the number of zk writes to do, we write the information with the last mark-delete position in the z-node and throw the ledger away.
What we could do, when closing the topic, is:
- If we have individually-deleted-messages, we close the cursor ledger
- If not, continue with the current behavior of storing the snapshot in the z-node
In both cases the information is preserved.
Later, when we enable the binary format (btw: we should also do that for managed-ledger z-nodes), we can revert to a unified behavior again.
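A minimal sketch of that close-time decision, with hypothetical helper names; the real logic would live in the managed-cursor close path:

```java
// Hedged sketch: preserve the ranges either in the cursor ledger or in the
// z-node snapshot, depending on whether individually deleted messages exist.
void closeCursor() {
    if (!individualDeletedMessages.isEmpty()) {
        // Ranges present: keep the cursor ledger (just close it), since the
        // text-format z-node snapshot cannot hold them without a format change.
        closeCursorLedger();
    } else {
        // No ranges: current behavior, snapshot the mark-delete position to
        // the z-node and delete the cursor ledger.
        snapshotToZNode(markDeletePosition);
        deleteCursorLedger();
    }
}
```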
@merlimat what do you suggest we do then?
I realized the broker also saves in ZK the load-balance info of all bundles it owns; this can potentially grow very large too, right?
@merlimat I added a configuration field to specify the number of ranges to persist; it defaults to 1000.
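For context, a sketch of what such a knob might look like in the managed-ledger configuration; the property name here is hypothetical, and 1000 is the default mentioned above:

```java
// Hypothetical configuration field capping how many individually deleted
// ranges a cursor persists; ranges beyond the cap are simply replayed
// after a broker restart, as all of them are today.
private int maxUnackedRangesToPersist = 1000;

public int getMaxUnackedRangesToPersist() {
    return maxUnackedRangesToPersist;
}

public void setMaxUnackedRangesToPersist(int maxUnackedRangesToPersist) {
    this.maxUnackedRangesToPersist = maxUnackedRangesToPersist;
}
```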
@merlimat I pushed a new commit rearranging some of the locking and synchronization in ManagedCursorImpl; we were getting deadlocks with the previous locking.
@sschepens I have added a few changes on top of this PR in #276. Please take a look. I haven't included your second commit about the lock refactoring. I haven't seen any deadlock so far, and that's kind of "sensitive" code :). It's easy to break something unexpected there.
Correcting myself... I do see the same deadlock.
@merlimat we've been running with this code for almost two weeks and we have not seen any more deadlocks nor, I believe, any unwanted behavior. Please check the lock refactoring I made: most changes make a lot of sense to me and shouldn't trigger any side effects. It will also probably reduce contention a little, since some locks are never required and others are released sooner. In some cases we were synchronizing on one object, and in other cases on another, where it wasn't needed. I don't see how these changes could cause any issues, but maybe you can see something I don't.
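A generic illustration of the lock-scope narrowing described above, not the actual ManagedCursorImpl code: do pure computation outside the lock, hold it only for the shared-state mutation, and call out only after releasing it.

```java
import com.google.common.collect.Range;
import com.google.common.collect.RangeSet;
import com.google.common.collect.TreeRangeSet;

class NarrowLockSketch {
    private final RangeSet<Long> deleted = TreeRangeSet.create();

    void markDeleted(long entryId, Runnable whenDone) {
        Range<Long> range = Range.closed(entryId, entryId); // pure computation, no lock
        synchronized (this) {
            deleted.add(range); // only the shared mutation is guarded
        }
        whenDone.run(); // calling out while holding the lock risks deadlock
    }
}
```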
@@ -1462,7 +1458,19 @@ public void asyncDelete(Position pos, final AsyncCallbacks.DeleteCallback callback
                 newMarkDeletePosition = range.upperEndpoint();
             }
         }
+    } catch (Exception e) {
Yes, I think this was the source of the deadlock; I had it fixed in my branch as well: ecf4501
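To illustrate the pattern this catch block enables (a hedged sketch; the real code paths differ), the failure is reported through the callback instead of letting the exception escape while internal state is inconsistent:

```java
try {
    // ... compute newMarkDeletePosition from the acked ranges ...
} catch (Exception e) {
    // Fail the delete through the callback rather than letting the exception
    // propagate out of the critical section with the cursor in a bad state.
    callback.deleteFailed(new ManagedLedgerException(e), ctx);
    return;
}
```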
// Resume normal mark-delete operations
STATE_UPDATER.set(ManagedCursorImpl.this, State.Open);
}
flushPendingMarkDeletes();
I'm not 100% sure why the 2 operations were grouped under the same lock, though I'm sure there was some reason :), maybe not a good one. In any case, this is unrelated to the specific change of persisting the individually deleted positions and should go into a separate PR.
Closing this one since the change was carried over in #276
Motivation
Issue #180
Modifications
Modified ManagedCursorInfo and PositionInfo to store a list of ranges of positions, just like individualDeletedMessages. When persisting ManagedCursorInfo or PositionInfo, they get populated with the current individualDeletedMessages, which is then used to repopulate individualDeletedMessages on recovery. Theoretically it should be as simple as repopulating individualDeletedMessages, and ManagedCursor should skip already-acked messages when reading.
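A rough sketch of that recovery step, assuming Guava's RangeSet as the in-memory structure; Long stands in for the real position type, and the actual code maps the persisted protobuf range messages back to positions:

```java
import com.google.common.collect.Range;
import com.google.common.collect.RangeSet;
import com.google.common.collect.TreeRangeSet;
import java.util.List;

// Hedged sketch: rebuild the in-memory individualDeletedMessages set from
// the persisted ranges when the cursor is recovered.
class RecoverySketch {
    static RangeSet<Long> recover(List<long[]> persistedRanges) {
        RangeSet<Long> individualDeletedMessages = TreeRangeSet.create();
        for (long[] r : persistedRanges) {
            // r[0]/r[1] stand in for the lower/upper endpoints of one range.
            individualDeletedMessages.add(Range.closed(r[0], r[1]));
        }
        return individualDeletedMessages;
    }
}
```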
Changed MetaStoreImplZookeeper to store Protobuf's byte representation rather than the string representation, since with this change the string representation grows enormously, due to the format of the Protobuf structures. When reading, MetaStoreImplZookeeper will attempt to decode the data as the byte representation; if that fails, it will fall back to parsing the string representation. I realize this is a shortcoming, since the data stored in ZooKeeper is no longer human-readable, but we could expose a command in pulsar-admin to ease reading it. This also benefits ZooKeeper, as the data written is now much smaller (9 bytes for a ManagedCursorInfo with an empty individualDeletedMessages).
Maybe persisting all individualDeletedMessages every time PositionInfo is stored in BookKeeper doesn't make much sense, but users can tune writes through mark-delete throttling. Also, setting an arbitrary limit on the number of messages between the first unacked message and the current read position doesn't make much sense.
Result
Should allow consumers to have unacked messages without affecting their backlog when bundles get unloaded.
I would like you guys @merlimat @rdhabalia to comment on this and tell me what you think, whether you see this could bring unforeseen issues, etc.