Exception in PersistentShardCoordinator ReceiveRecover #3414

Closed
joshgarnett opened this issue Apr 22, 2018 · 15 comments · Fixed by #3744

Comments

@joshgarnett
Contributor

Akka 1.3.5

This morning, while making some provisioning changes, we ended up in a state where two single-node clusters were running, both pointed at the same database. After fixing the error and starting only a single node, the underlying Akka code failed to recover.

2018-04-22 15:28:45.148 ERROR PersistentShardCoordinator Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [40650] for persistenceId [/system/sharding/zCoordinator/singleton/coordinator]
System.ArgumentException: Region [akka://AkkaCluster/system/sharding/z#673731278] not registered
Parameter name: e
   at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)
   at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Persistence.Eventsourced.<>c__DisplayClass92_0.<Recovering>b__0(Receive receive, Object message)

2018-04-22 15:28:45.438 ERROR PersistentShardCoordinator Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [50633] for persistenceId [/system/sharding/oCoordinator/singleton/coordinator]
System.ArgumentException: Shard 78 is already allocated
Parameter name: e
   at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)
   at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Persistence.Eventsourced.<>c__DisplayClass92_0.<Recovering>b__0(Receive receive, Object message)

2018-04-22 15:29:07.714 ERROR PersistentShardCoordinator Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [11774] for persistenceId [/system/sharding/pCoordinator/singleton/coordinator]
System.ArgumentException: Region [akka://AkkaCluster/system/sharding/p#1560991559] not registered
Parameter name: e
   at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)
   at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Persistence.Eventsourced.<>c__DisplayClass92_0.<Recovering>b__0(Receive receive, Object message)

2018-04-22 15:29:10.192 ERROR PersistentShardCoordinator Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [3087] for persistenceId [/system/sharding/wCoordinator/singleton/coordinator]
System.ArgumentException: Region [akka://AkkaCluster/system/sharding/w#43619595] not registered
Parameter name: e
   at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)
   at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Persistence.Eventsourced.<>c__DisplayClass92_0.<Recovering>b__0(Receive receive, Object message)

2018-04-22 15:29:15.210 ERROR PersistentShardCoordinator Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [11109] for persistenceId [/system/sharding/eCoordinator/singleton/coordinator]
System.ArgumentException: Region [akka://AkkaCluster/system/sharding/e#1963167556] not registered
Parameter name: e
   at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)
   at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Persistence.Eventsourced.<>c__DisplayClass92_0.<Recovering>b__0(Receive receive, Object message)

My expectation is that, when two nodes attempt to own the same data, one of them would eventually see a journal write error (since the journal sequence number would no longer be unique) and the ActorSystem would then shut itself down. On recovery, it should always be able to get back into a consistent state.

In our case this was caused by user error, but it could easily occur during a network partition in which two nodes claim to own the same underlying dataset.

@Aaronontheweb
Member

Going to look into this while I'm working on #3455.

@Aaronontheweb
Member

The issue is that the PersistentShardCoordinator doesn't save or recover its state in the correct order - the ShardHomeAllocated messages that are tripping the exception during recovery should only be persisted after a ShardRegionRegistered message (shards belong to shard regions).
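To make that invariant concrete, here is a minimal, self-contained sketch - not the actual Akka.NET source, and with deliberately simplified type shapes - of the kind of guard State.Updated applies while replaying these events. Feeding it a ShardHomeAllocated for an unregistered region, or for a shard that is already allocated, produces the same ArgumentException seen in the logs above:

using System;
using System.Collections.Immutable;

// Illustrative sketch only - NOT the real PersistentShardCoordinator.State.
interface IDomainEvent { }
sealed record ShardRegionRegistered(string Region) : IDomainEvent;
sealed record ShardHomeAllocated(string Shard, string Region) : IDomainEvent;

sealed record CoordinatorState
{
    public ImmutableHashSet<string> Regions { get; init; } = ImmutableHashSet<string>.Empty;
    public ImmutableDictionary<string, string> Shards { get; init; } = ImmutableDictionary<string, string>.Empty;

    public CoordinatorState Updated(IDomainEvent e) => e switch
    {
        // A region must be registered before any shard can be allocated to it.
        ShardRegionRegistered r => this with { Regions = Regions.Add(r.Region) },
        ShardHomeAllocated a when !Regions.Contains(a.Region) =>
            throw new ArgumentException($"Region [{a.Region}] not registered", nameof(e)),
        ShardHomeAllocated a when Shards.ContainsKey(a.Shard) =>
            throw new ArgumentException($"Shard {a.Shard} is already allocated", nameof(e)),
        ShardHomeAllocated a => this with { Shards = Shards.Add(a.Shard, a.Region) },
        _ => this
    };
}

class Demo
{
    static void Main()
    {
        var state = new CoordinatorState()
            .Updated(new ShardRegionRegistered("regionA"))
            .Updated(new ShardHomeAllocated("78", "regionA")); // fine: region registered first

        // Replaying the allocation again - or against a region that was never registered -
        // throws during recovery, which is what the coordinator hit above.
        state.Updated(new ShardHomeAllocated("78", "regionA"));
    }
}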

Three possible causes of this:

  1. There’s a fair bit of async code inside the PersistentShardCoordinator I’m still untangling - it’s possible a race condition could cause this if the actor were trying to Persist its events out of order. I’m still looking into that possibility. Technically, the actor should never even be asked to host a shard until its region gets created first. I doubt this is the issue, but I can’t rule it out 100%.
  2. I’m wondering if the data we write to Akka.Persistence when we save the sharding snapshot is accurate. I took a look through the serialization code and I’m a little suspicious about whether it’s persisting the region state correctly. That would also cause this issue: the data was saved, but not in the correct format.
  3. The last possibility is the Akka.Persistence implementation itself: if the journal isn’t saving or replaying events for the PersistentShardCoordinator in the correct order, that would certainly cause this.

Going to eliminate number 2 first since that's the simplest - will look into the others next.

@Aaronontheweb
Member

Manually verified the output of this spec:

[Fact]
public void ClusterShardingMessageSerializer_must_be_able_to_serializable_ShardCoordinator_snapshot_State()
{
    var shards = ImmutableDictionary
        .CreateBuilder<string, IActorRef>()
        .AddAndReturn("a", region1)
        .AddAndReturn("b", region2)
        .AddAndReturn("c", region2)
        .ToImmutableDictionary();

    var regions = ImmutableDictionary
        .CreateBuilder<IActorRef, IImmutableList<string>>()
        .AddAndReturn(region1, ImmutableArray.Create("a"))
        .AddAndReturn(region2, ImmutableArray.Create("b", "c"))
        .AddAndReturn(region3, ImmutableArray<string>.Empty)
        .ToImmutableDictionary();

    var state = new PersistentShardCoordinator.State(
        shards: shards,
        regions: regions,
        regionProxies: ImmutableHashSet.Create(regionProxy1, regionProxy2),
        unallocatedShards: ImmutableHashSet.Create("d"));

    CheckSerialization(state);
}

Can vouch for its accuracy - the sharding serializer appears to be working correctly.

@Aaronontheweb
Member

This is probably the same issue as #3204.

Going to create some reproduction specs and then see where things go.

@Aaronontheweb
Member

Working on a fun reproduction of this using an actual integration test against SQL Server spun up via docker-compose: https://github.com/Aaronontheweb/AkkaClusterSharding3414Repro

@Aaronontheweb modified the milestones: 1.3.12, 1.4.0 (Mar 18, 2019)
@izavala
Contributor

izavala commented Mar 18, 2019

I was able to reproduce this issue on my end with my copy of the above project: https://github.com/Aaronontheweb/AkkaClusterSharding3414Repro.

I received the same failure-to-recover error message:

[ERROR][03/18/2019 23:34:59][Thread 0003][[akka://ShardFight/system/sharding/fubersCoordinator/singleton/coordinator#329445421]] Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [113] for persistenceId [/system/sharding/fubersCoordinator/singleton/coordinator]
sharding.shard_1 | Cause: System.ArgumentException: Shard 23 is already allocated
sharding.shard_1 | Parameter name: e
sharding.shard_1 |    at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)
sharding.shard_1 |    at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)
sharding.shard_1 |    at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)

I've attached the data from the EventJournal database in hopes of finding more information on what is causing this behavior.
Journal.zip

@Aaronontheweb
Member

@izavala I'll deserialize the data gathered from the repo here and see what's up - that should paint a clearer picture as to what's going on.

@Aaronontheweb
Member

Wrote a custom tool using Akka.Persistence.Query to replay the dataset that created this error: https://github.com/Aaronontheweb/Cluster.Sharding.Viewer

Attached is the output. Haven't analyzed it yet, but this is the same data from @izavala's reproduction.

Shard-replay-crash-data.log
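For reference, a rough sketch of the kind of replay such a tool performs with Akka.Persistence.Query - this is not the code of the linked viewer, and it assumes the SQL read-journal plugin is configured for the journal that holds the data:

using System;
using Akka.Actor;
using Akka.Persistence.Query;
using Akka.Persistence.Query.Sql;
using Akka.Streams;

class ReplayDump
{
    static void Main()
    {
        // Assumes the journal and akka.persistence.query.journal.sql are configured
        // (connection string, provider, etc.) in the HOCON loaded by this system.
        var system = ActorSystem.Create("viewer");
        var materializer = system.Materializer();

        var readJournal = PersistenceQuery.Get(system)
            .ReadJournalFor<SqlReadJournal>(SqlReadJournal.Identifier);

        // Replay every journaled event for the coordinator, in sequence order, and print it.
        readJournal
            .CurrentEventsByPersistenceId(
                "/system/sharding/fubersCoordinator/singleton/coordinator", 0L, long.MaxValue)
            .RunForeach(env => Console.WriteLine($"[{env.SequenceNr}] {env.Event}"), materializer)
            .Wait();

        system.Terminate().Wait();
    }
}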

@Aaronontheweb
Member

Worth noting in these logs: no snapshots were ever saved for the PersistentShardCoordinator during this run - it logged 30-34 "ShardHomeAllocated" messages per run of our reproduction app. The default Akka.Cluster.Sharding settings only take a snapshot once every 1000 journaled entries.
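For anyone reproducing this locally, that interval is configurable, so the coordinator can be made to snapshot far sooner than every 1000 events. A minimal sketch, assuming the standard akka.cluster.sharding.snapshot-after setting (check the reference config of your Akka.Cluster.Sharding version for the exact key, and merge this with your full cluster/sharding config as usual):

using Akka.Actor;
using Akka.Configuration;

// Lower the coordinator's snapshot interval from the default (1000 journaled events)
// so snapshot behavior shows up quickly in a reproduction run.
var config = ConfigurationFactory.ParseString("akka.cluster.sharding.snapshot-after = 10");
var system = ActorSystem.Create("ShardFight", config);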

@Aaronontheweb
Member

So the logs we've produced confirm that #3204 is the issue - the exception in recovery only occurs when the same node with the same address tries to deserialize its own RemoteActorRefs each time. The issue doesn't occur when the node reboots with a new hostname after we tear down our Docker cluster and recreate it.
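Roughly, the #3204 failure mode is that the coordinator's persisted IActorRefs are written as actor-path strings, and if those strings are produced without the remote transport's address information they don't round-trip back to the region reference that originally registered. A hedged sketch of that round trip (the API calls are real, but this is a simplification, not the sharding serializer itself):

using System;
using Akka.Actor;
using Akka.Serialization;

// Sketch of persisting an IActorRef as a path string and resolving it again on recovery.
// If the serialized path carries only the local address ("akka://AkkaCluster/...") rather than
// the fully-qualified remote address ("akka.tcp://AkkaCluster@host:port/..."), the recovered
// reference may not match the region that registered - the mismatch described in #3204.
var system = (ExtendedActorSystem)ActorSystem.Create("AkkaCluster");
var region = system.ActorOf(Props.Create<RegionStub>(), "some-region");

// Serialize: SerializedActorPath emits the path plus whatever address information is in scope.
string path = Serialization.SerializedActorPath(region);

// Deserialize: resolve the persisted path string back into an IActorRef.
IActorRef recovered = system.Provider.ResolveActorRef(path);
Console.WriteLine($"{path} -> {recovered}");

class RegionStub : ReceiveActor { }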

@Aaronontheweb removed this from the 1.4.0 milestone (Mar 22, 2019)
@Aaronontheweb added this to the 1.3.13 milestone (Mar 22, 2019)
Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this issue (Mar 22, 2019)
@Aaronontheweb modified the milestones: 1.3.13, 1.4.0 (Mar 26, 2019)
@Aaronontheweb
Member

Moving this to 1.4.0 - changes are too big to put into a point release. We're going to need to make a lot of changes to the serialization system for IActorRefs to complete this.

@heatonmatthew
Contributor

Hey cool, I've run into this one too. I'm still in a prototype phase, but this was on my list of issues to address as I move toward a more production-ready phase.

@Aaronontheweb Since you're making serialization system changes, just a heads up that with your netstandard2.0 update in #3668 the difference between Framework and Core disappears. See my commit referencing the issue for the code that removes the difference.

@Aaronontheweb
Member

I've been able to verify via Aaronontheweb/AkkaClusterSharding3414Repro#10 that #3744 resolves this issue. I'm not done with #3744 yet - still need to make sure this works with serialize-messages and so on, but we're getting there.
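(For context, serialize-messages is the test-oriented setting that forces every message - even local ones - through the serializer; a minimal, hedged example of turning it on:)

using Akka.Actor;
using Akka.Configuration;

// Forces all messages through serialization even for in-process sends, which is how
// the remaining serialization work gets exercised end-to-end in the test suites.
var config = ConfigurationFactory.ParseString("akka.actor.serialize-messages = on");
var system = ActorSystem.Create("SerializationCheck", config);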

Aaronontheweb added a commit that referenced this issue Jul 18, 2019
* fixed typo in RemoteActorRefProvider comment

* Working on #3414 - bringing SerializeWithTransport API up to par with JVM

* added spec to help validate CurrentTransportInformation issues

Based on the equivalent JVM spec

* working on bringing serialization up to snuff

* brought serialization class up to snuff

* wrapping up RemoteActorRefProvider implementation

* WIP

* cleaning up Serialization class

* looks like there's a Lazy<SerializationInfo> translation from Scala to C# that we haven't quite done

* fixed Serialization class

* fixed bug with Akka.Remote.Serialization.SerializationTransportInformationSpec

* forced a couple of specs using default akka.remote configs to run sequentially

This was done in order to avoid the two specs trying to bind on the same port at the same time.

* added serialization verification to the Akka.Persistence.TCK

* fixed issues with default Akka.Persistence.TCK specs

* fixed IActorRef serialization support in Akka.Persistence journals and snapshot stores

* fixed compilation issues

* fixed Akka.Sql.Common serialization in a backwards-compatible fashion

* had to disable serialization specs for Sql Journals

* Added API approvals

* updated creator and serialize-all-messages serialization

* added ITestOutputHelper to Akka.Cluster.Sharding.Tests.SupervisionSpec

* made changes to LocalSnapshotSerializer

* fixed bug in WithTransport method

* updated Akka.Remote MessageSerializer
@Aaronontheweb
Member

This is now resolved as of #3744

Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this issue Jul 21, 2019 (same commit message as above)

Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this issue Jul 21, 2019 (same commit message as above)

Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this issue Jul 26, 2019 (same commit message as above)

Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this issue Jul 30, 2019 (same commit message as above)
@Caldas

Caldas commented Jun 12, 2021

Hey guys, since this issue has been fixed I recommend updating the README at https://github.com/petabridge/akkadotnet-cluster-workshop, since the end of it still points to this issue as an active one.
