Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing staking rewards #683

Closed
erikd opened this issue Jul 8, 2021 · 25 comments
Closed

Missing staking rewards #683

erikd opened this issue Jul 8, 2021 · 25 comments

Comments

@erikd
Copy link
Contributor

erikd commented Jul 8, 2021

So far, I have only seen this on the Shelley QA network (with the 10.0.0 release tag). I have looked for it on mainnet but have not yet found one. If it happens on the shelley_qa network then it is likely to happen on mainnet.

The address we are interested in:

select id, view from stake_address
    where hash_raw = '\xe08190331207ecedfcaf448340d8bb84354a0e8c285982a2d7f98bb234' limit 5 ;
  id  |                               view
------+------------------------------------------------------------------
 2829 | stake_test1uzqeqvcjqlkwml90gjp5pk9mss655r5v9pvc9gkhlx9mydqrpmwv6
(1 row)

Looking at its registrations:

select stake_registration.*, block.block_no, block.epoch_no from stake_registration
    inner join tx on tx.id = stake_registration.tx_id
    inner join block on tx.block_id = block.id
    where stake_registration.addr_id = 2829 order by epoch_no asc ;
 id  | addr_id | cert_index | tx_id | block_no | epoch_no
-----+---------+------------+-------+----------+----------
 885 |    2829 |          0 | 27141 |  1297117 |     3816
(1 row)

and de-registrations:

select stake_deregistration.*, block.block_no, block.epoch_no from stake_deregistration
    inner join tx on tx.id = stake_deregistration.tx_id
    inner join block on tx.block_id = block.id
    where stake_deregistration.addr_id = 2829 order by epoch_no asc ;
 id  | addr_id | cert_index | tx_id | block_no | epoch_no
-----+---------+------------+-------+----------+----------
 238 |    2829 |          0 | 27144 |  1301084 |     3827
(1 row)

looking at the rewards:

select * from reward where addr_id = 2829 order by epoch_no asc;
  id   | addr_id |  type  | amount | epoch_no | pool_id
-------+---------+--------+--------+----------+---------
 68634 |    2829 | member | 121039 |     3820 |       3
 68664 |    2829 | member |  99125 |     3821 |       3
 68693 |    2829 | member | 109972 |     3822 |       3
 68722 |    2829 | member |  95739 |     3823 |       3
 68752 |    2829 | member | 109854 |     3824 |       3
 68782 |    2829 | member | 111134 |     3825 |       3
(6 rows)

and orphaned rewards:

select * from orphaned_reward where addr_id = 2829 order by epoch_no asc;
 id | addr_id | type | amount | epoch_no | pool_id
----+---------+------+--------+----------+---------
(0 rows)

The odd thing here is that although the address was only de-registered in epoch 3827 it got no rewards after epoch 3825.

We would expect to get rewards for epoch 3826 and orphaned rewards for epochs 3827 and 3829.

So how was this stake address delegated?:

select * from delegation where addr_id = 2829 ;
 id  | addr_id | cert_index | pool_hash_id | active_epoch_no | tx_id | slot_no
-----+---------+------------+--------------+-----------------+-------+----------
 285 |    2829 |          0 |            3 |            3820 | 27142 | 27476134
(1 row)

The fact that active_epoch_no here is 3820 explains why the first reward is in that epoch and not earlier.
So, did this pool get any rewards for the relevant epochs?

select epoch_no, count (addr_id) as reward_count from
    reward where pool_id = 3 and epoch_no >= 3815 and epoch_no < 3830
    group by epoch_no order by epoch_no asc ;
 epoch_no | reward_count
----------+--------------
     3815 |           21
     3816 |           21
     3817 |           21
     3818 |           21
     3819 |           21
     3820 |           22
     3821 |           22
     3822 |           22
     3823 |           22
     3824 |           22
     3825 |           22
     3826 |           21
     3827 |           21
     3828 |           21
     3829 |           21
(15 rows)

So the pool did earn rewards for epochs 3826-3829 but db-sync did not capture them for this address. Why?

There is no change between epoch 3825 and 3826, but the former got rewards and the later did not. Why?

There is only a single pool update (ie when the pool is created), so there were no pool changes:

select count (*) from pool_update where hash_id = 3 ;
 count
-------
     1
(1 row)

Is there an era change in the region of interest?

No, it switches from Shelley to Allegra somewhere between epoch 1331 and epoch 1347 and from Allegra to Mary somewhere between epoch 1978 and epoch 1992 (can;t be more exact during to sparseness of logging) The era before, during and after the problem is Mary.

As I have said elsewhere, having two vastly different implementations of the same complex logic and hoping that they produce the same results is un-reasonable and is just an incredibly bad idea. The ledger-specs library should provide the rewards already tagged to show whether they are good or orphaned.

@JaredCorduan
Copy link

I agree that having logic in two places is asking for trouble. The problem is that the ledger cannot know what rewards are orphaned until the first block of a new epoch, at which time the information is used and discarded. For the ledger to provided such tagged rewards would mean storing a large map that is not used at all in the ledger.

@erikd
Copy link
Contributor Author

erikd commented Jul 9, 2021

Re-tested this with the master branch (commit 1c2262a) and the results are identical. The node is also running from a recent commit on its master branch (commit IntersectMBO/cardano-node@7a056fd).

@erikd
Copy link
Contributor Author

erikd commented Jul 9, 2021

Adding instrumentation to this was a HUGE pain in the neck, but I managed to do it.

Instrumenting the rewards for the address we are interested in before they are split into good vs orphaned rewards:

genericRewards: epoch 3820 reward is 121039
genericRewards: epoch 3821 reward is 99125
genericRewards: epoch 3822 reward is 109972
genericRewards: epoch 3823 reward is 95739
genericRewards: epoch 3824 reward is 109854
genericRewards: epoch 3825 reward is 111134

This matches the data in the database (addr_id has changed, but that is irrelevant):

select * from reward where addr_id = 2865 order by epoch_no asc;
  id   | addr_id |  type  | amount | epoch_no | pool_id 
-------+---------+--------+--------+----------+---------
 68955 |    2865 | member | 121039 |     3820 |       3
 68983 |    2865 | member |  99125 |     3821 |       3
 69012 |    2865 | member | 109972 |     3822 |       3
 69042 |    2865 | member |  95739 |     3823 |       3
 69072 |    2865 | member | 109854 |     3824 |       3
 69102 |    2865 | member | 111134 |     3825 |       3
(6 rows)

That means any error is in ledger-specs rather than db-sync .

If we had orphaned rewards in the rewards table (or vice versa) then that would be a db-sync issue. However, the reward map db-sync extracts from ledger state has no rewards at all from epochs 3826 and later.

@erikd
Copy link
Contributor Author

erikd commented Jul 9, 2021

Ah hang on. When Cardano.Sync.Era.Shelley.Generic.Rewards.genericRewards is called it is called with an EpochNo and from that the rwdEpoch field of the Cardano.Sync.Era.Shelley.Generic.Rewards.Rewards struct is calculated using:

      Rewards
        { rwdEpoch = epoch - 1 -- Epoch in whch rewards were earned.

So rewards for epoch 3826 would be calculated in epoch 3827 and would be distributed at the start of epoch 3828. However, we know that the stake address is de-registered in epoch 3827 and if it is de-registered it should be classified as an orphaned reward.

However, this is not just a mis-classification, the reward for the address in question is missing completely. That points to ledger-specs being the problem.

@erikd
Copy link
Contributor Author

erikd commented Jul 9, 2021

The part of this that is probably most amenable to debugging is that the address is de-regestered in epoch 3827 but receives no rewards for epoch 3826 (or later) even though the stake pool it was delegating to received rewards for that epoch 3826.

Recently, ledger-specs started calculating rewards incrementally (because the number of stake addresses had grown so large that the calculation could no longer be done in one go). I wonder if, as part of that change, an address that is valid at the epoch boundary, it is now invalid when the calculation is being done. Since it is now invalid, it no longer ends up in the reward map that db-sync extracts from the ledger state.

@JaredCorduan
Copy link

I can think of a scenario consistent with what we are seeing:

If the owner of \xe08190331207ecedfcaf448340d8bb84354a0e8c285982a2d7f98bb234 transferred all their ada to "another wallet" (ie to an address with a different stake credential) sometime in epoch 324, then it would be a registered stake address for what folks are calling the "epoch 326 rewards", but it would not show up in the reward update, since zero-valued member rewards are filtered out.

We could test this by seeing how much stake this address has in the snapshot taken on the 324 / 325 boundary.

@erikd
Copy link
Contributor Author

erikd commented Jul 9, 2021

@JaredCorduan Here are the transactions for that stake address:

PGPASSFILE=cardano-db-sync/config/pgpass-shelley-qa cardano-db-sync/cardano-db-tool-exe report transactions --stake-address stake_test1uzqeqvcjqlkwml90gjp5pk9mss655r5v9pvc9gkhlx9mydqrpmwv6

Transactions for: stake_test1uzqeqvcjqlkwml90gjp5pk9mss655r5v9pvc9gkhlx9mydqrpmwv6

                             tx_hash                              |        date/time        | direction |      amount
------------------------------------------------------------------+-------------------------+-----------+----------------
 27ec87b571d993868ced286ab6005556f6e8c407848c03ea9624d855d06e2b28 | 2021-07-01 11:54:13 UTC |  Incoming |    1000.000000
 ab7c8591fc962607c40b7aafac6527c082f086c5da20c29552a929b9e6952744 | 2021-07-01 13:02:21 UTC |  Outgoing |       4.200000
 47ba267adf4afa9734f9f461a16913bbd06715a22c5139378e9f0ed6c1e762cf | 2021-07-01 17:03:34 UTC |  Outgoing |       0.200000
 fda0900031df2ff3fb40f00dfe2b1d50f2517f3a3535f7cf8e2fcb7015d3ce15 | 2021-07-02 11:23:29 UTC |  Outgoing |       0.200000
 c6d51b157281c1971126b13f6321414c0b4457728c36e00a114abda20fc750ab | 2021-07-02 11:35:39 UTC |  Incoming |       1.800000

This is at epoch 3916 so your hypothesis that the balance was transferred out of the address is incorrect.

@JaredCorduan
Copy link

ok, thank you, back to the drawing board...

@erikd
Copy link
Contributor Author

erikd commented Jul 13, 2021

On the shelley_qa network there are over 200 stake addresses which have been de-registered. Finding addresses that have been de-registered and not re-registered can be found from the following table where the reg_count is 1 and it has received a non-zero number of rewards:

select stake_registration.*, count (stake_registration.addr_id) as reg_count,
        (select count (*) as rew_count from reward where addr_id = stake_registration.addr_id)
    from stake_deregistration
    inner join stake_registration on stake_registration.addr_id = stake_deregistration.addr_id
    group by stake_registration.id
    order by reg_count, rew_count asc ; 

From that table, entries of interest are:

 id  | addr_id | cert_index | tx_id | reg_count | rew_count 
-----+---------+------------+-------+-----------+-----------
269 |    1082 |          0 | 25308 |         1 |         1
176 |     649 |          0 | 23786 |         1 |         1
903 |    2865 |          0 | 27157 |         1 |         6
160 |     609 |          1 | 23693 |         1 |         8
162 |     612 |          5 | 23693 |         1 |         8
230 |     900 |          1 | 24755 |         1 |        13
232 |     903 |          5 | 24755 |         1 |        15
166 |     620 |          5 | 23719 |         1 |        45
164 |     617 |          1 | 23719 |         1 |        54

So, picking the first of these the registration:

select stake_registration.*, block.block_no, block.epoch_no from stake_registration
    inner join tx on tx.id = stake_registration.tx_id
    inner join block on tx.block_id = block.id
    where stake_registration.addr_id = 1082 order by epoch_no asc ;
 id  | addr_id | cert_index | tx_id | block_no | epoch_no
-----+---------+------------+-------+----------+----------
 269 |    1082 |          0 | 25308 |   869838 |     2609
(1 row)

the deregistration:

select stake_deregistration.*, block.block_no, block.epoch_no from stake_deregistration
    inner join tx on tx.id = stake_deregistration.tx_id
    inner join block on tx.block_id = block.id
    where stake_deregistration.addr_id = 1082 order by epoch_no asc ;
 id | addr_id | cert_index | tx_id | block_no | epoch_no
----+---------+------------+-------+----------+----------
 80 |    1082 |          0 | 25539 |   871267 |     2613
(1 row)

the delegation:

select * from delegation where addr_id = 1082 ;
 id  | addr_id | cert_index | pool_hash_id | active_epoch_no | tx_id | slot_no
-----+---------+------------+--------------+-----------------+-------+----------
 204 |    1082 |          1 |          152 |            2611 | 25308 | 18771236
(1 row)

the rewards:

select * from reward where addr_id = 1082 ;
  id   | addr_id |  type  | amount | epoch_no | pool_id
-------+---------+--------+--------+----------+---------
 36173 |    1082 | member |   9219 |     2611 |     152
(1 row)

absence of orphaned rewards:

select * from orphaned_reward where addr_id = 1082 order by epoch_no asc;
 id | addr_id | type | amount | epoch_no | pool_id
----+---------+------+--------+----------+---------
(0 rows)

but the pool did receive rewards for all the epochs of interest:

select epoch_no, count (addr_id) as reward_count from
    reward where pool_id = 152 and epoch_no >= 2608 and epoch_no < 2615
    group by epoch_no order by epoch_no asc ;
 epoch_no | reward_count 
----------+--------------
     2609 |            9
     2610 |           10
     2611 |           13
     2612 |           12
     2613 |           12
     2614 |           12
(6 rows)

Address add_id = 649 has the same issue. I did not check further on shelley_qa .

@JaredCorduan
Copy link

I know exactly what is going on now. I wrote a custom version of the ledger-state client to view the history of 8190331207ecedfcaf448340d8bb84354a0e8c285982a2d7f98bb234.

We've run into one of the known warts described in the formal spec in 17 Errata:

17.2 Active stake registrations and the Reward Calculation

The reward calculation takes the set of active reward accounts as a parameter (the forth parameter
of reward in Figure 48). It is only used at the end of the calculation for filtering out unregistered
accounts. Therefore, if a stake credential is deregistered just before the reward calculation starts,
it cannot reregister before the end of the epoch in order to get the rewards. The time within the
epoch when the reward calculation starts should not make any difference. Note that this does
not affect the earnings that stake pools make from their margin, since each pool's total stake is
summed before filtering. The solution is to not filter by the current active reward accounts, since
rewards for unregistered accounts go to the treasury already anyway.

In particular, the credential 8190331207ecedfcaf448340d8bb84354a0e8c285982a2d7f98bb234 was de-registered at slot 27542859. The reward calculation that started in epoch 3827 (which we've been calling "the rewards for epoch 326") began at slot 27543606 (747 slots after the de-registration). So this credential was filtered out of the reward update after its stake was accounted for, as described in the paragraph above. Which means, as I understand it, that db-sync would not be able to know there was an orphan, since it defines orphans as "things in the reward update that are not in the reward mapping at the epoch boundary" (which is a totally reasonable definition).

This logic has been around since the start of Shelley, there should be many instances of this on mainnet as well.

@JaredCorduan
Copy link

I guess my question now is, do we need to track orphans?

@erikd
Copy link
Contributor Author

erikd commented Jul 28, 2021

i am still detecting the presence of what should be orphaned rewards when using ledger-specs at commit bc5e54f0611a416db1e906d3783e4b973429f5eb .

@JaredCorduan
Copy link

yep, the ledger will continue to produce them

@dmitrystas
Copy link

I guess my question now is, do we need to track orphans?

in case we want to accurately calculate the pool's ROS - yes, we need :)

@JaredCorduan
Copy link

The easier way to compute the pool's ROS is to look at the pool provenance. In particular, poorRP:

https://github.com/input-output-hk/cardano-ledger-specs/blob/4933c8c32ce3adca7d048da37204f678e8042fee/shelley/chain-and-ledger/executable-spec/src/Shelley/Spec/Ledger/RewardProvenance.hs#L73-L74

The provenance is exposed by the node in one of the mini-protocols.

@dmitrystas
Copy link

The easier way to compute the pool's ROS is to look at the pool provenance. In particular, poorRP

good stuff, but is this value available for each epoch? we need to know this value for each epoch for pool's ROS calculation.

@erikd
Copy link
Contributor Author

erikd commented Aug 1, 2021

I am at least 90% certain that the orphaned rewards being retrieved by db-sync has changed recently. Specifically, if one runs master and an earlier version (either 8.0.0 or 9.0.0) on mainnet and compares the orphaned_reward table entries for a given epoch they will be different (ie this is at least partly what this bug is about).

If the orphaned_reward table is being used to calculate RoS then both master and 9.0.0 cannot be correct.

@erikd
Copy link
Contributor Author

erikd commented Aug 1, 2021

Probably the best solution is for ledger-specs to provide some way to extract a per epoch RoS for each pool. That way db-sync can provide the actual RoS values that the ledger uses to calculate rewards instead of users trying to calculate it themselves using data which is known to be incorrect/incomplete.

@dmitrystas
Copy link

or at least provide the sum of rewards distributed by the pool (including orphan rewards) for each epoch

@JaredCorduan
Copy link

good stuff, but is this value available for each epoch? we need to know this value for each epoch for pool's ROS calculation.

The node exposes a query that can be used to get it:

https://github.com/input-output-hk/ouroboros-network/blob/e9cda57df7ea6969edbc3bfc4e117668277d09c8/ouroboros-consensus-shelley/src/Ouroboros/Consensus/Shelley/Ledger/Query.hs#L162-L163

@erikd
Copy link
Contributor Author

erikd commented Aug 1, 2021

@JaredCorduan Users should not need to query the node to get data that should be in the database. I assume that is in ledger state somewhere. Please tell how and when I can access it.

@JaredCorduan
Copy link

it is not in the ledger state, but you can query the node for it

@redxaxder
Copy link

@dmitrystas

You almost certainly don't want to include the existing notion of orphaned rewards in any kind of calculation. The concept depends on incidental events.

In the reward scheme, the rewards calculated for epoch n are calculated using a snapshot of the stake distribution from the end of epoch n-2. These rewards are available to withdraw from the end of epoch n+1.

Rewards attributed to stake keys which are not registered go to the treasury. There are two points in time when this is checked. Once at the start of the reward calculation, and once at the end of epoch n+1.

What I suspect you want is the set of all rewards for an epoch that were lost due to deregistrations.

But this is not what the "orphaned rewards" give you! They only include the ones which were registered at the start of the reward calculation but not and the end of the epoch. Since the reward calculation starts partway into epoch n+1, I wouldn't expect this to be a reliable source of information.

The easier way to compute the pool's ROS is to look at the pool provenance

The approach Jared mentioned above will work to find the whole set.

@dmitrystas
Copy link

dmitrystas commented Aug 3, 2021

There are two points in time when this is checked. Once at the start of the reward calculation, and once at the end of epoch n+1.

Thanks for the info @redxaxder , I didn't know that

But this is not what the "orphaned rewards" give you! They only include the ones which were registered at the start of the reward calculation but not and the end of the epoch.

I see, thanks for the explanations

@erikd
Copy link
Contributor Author

erikd commented Aug 3, 2021

@redxaxder db-sync should provide a table providing something like ReturnOnStake. Users should not have to query the node for this information. Similarly, db-sync should not be querying the node for this information. If the node is currently the only place this is available, then that is a design flaw in the system and we need to fix it..

@erikd erikd closed this as completed in 92719ea Sep 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants