Validator crashes in 2.2.2 #676

Closed
jmcnevin opened this issue Apr 4, 2019 · 21 comments


jmcnevin commented Apr 4, 2019

System information

Geth version: 2.2.2

OS & Version: Linux

Expected behaviour

Validators should not crash.

Actual behaviour

With IBFT, under load, we are able to get validators to crash with errors such as fatal error: concurrent map writes.

Steps to reproduce the behaviour

We have a four-node validator set with validators running the following options:

--datadir /data/geth --emitcheckpoints --nat none --networkid 19827658 --permissioned --port 30303 --syncmode full --gcmode archive --nousb  --metrics --mine --miner.gasprice 0 --miner.gaslimit 700000000

Backtrace

dump.txt
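
For background, fatal error: concurrent map writes (and its read-and-write variant seen later in this thread) is the Go runtime aborting the whole process once it detects unsynchronized map access from more than one goroutine; unlike an ordinary panic, it cannot be recovered. A minimal standalone sketch (plain Go, not Quorum code) of the failing pattern and the usual mutex-guarded fix:

package main

import "sync"

// guardedMap wraps a map with a mutex so concurrent writers are safe.
type guardedMap struct {
	mu sync.Mutex
	m  map[int]int
}

func (g *guardedMap) put(k, v int) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.m[k] = v
}

func main() {
	var wg sync.WaitGroup

	// Unsafe: two goroutines writing the same plain map can abort the
	// process with "fatal error: concurrent map writes".
	plain := map[int]int{}
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			for j := 0; j < 100000; j++ {
				plain[j] = n
			}
		}(i)
	}
	wg.Wait()

	// Safe: the same workload through the mutex-guarded wrapper.
	safe := &guardedMap{m: map[int]int{}}
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			for j := 0; j < 100000; j++ {
				safe.put(j, n)
			}
		}(i)
	}
	wg.Wait()
}

When reads dominate, a sync.RWMutex or sync.Map is the usual alternative to a plain Mutex.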


jbhurat commented Apr 4, 2019

Hi @jmcnevin, can you please paste your genesis.json?


jmcnevin commented Apr 4, 2019

{
  "config": {
    "chainId": 19827658,
    "homesteadBlock": 1,
    "eip150Block": 2,
    "eip150Hash": "0x0000000000000000000000000000000000000000000000000000000000000000",
    "eip155Block": 3,
    "eip158Block": 3,
    "byzantiumBlock": 4,
    "istanbul": {
      "epoch": 30000,
      "policy": 0
    },
    "isQuorum": true,
    "txnSizeLimit": 64
  },
  "nonce": "0x0",
  "timestamp": "0x5a7fd1b0",
  "extraData": "0x0000000000000000000000000000000000000000000000000000000000000000f89af8549408e9e9cd8ca072b435832d00d3a266f6cf7444ac941daf95f9c9129c44c888345ae0ee4bde9152b3fa9446f4512d6cddb8c776046103b1022d5b33058ebd94fdbd79619bee02829e53b0331062ece15367676fb8410000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000c0",
  "gasLimit": "0x47b760",
  "difficulty": "0x1",
  "mixHash": "0x63746963616c2062797a616e74696e65206661756c7420746f6c6572616e6365",
  "coinbase": "0x0000000000000000000000000000000000000000",
  "alloc": {
    "08e9e9cd8ca072b435832d00d3a266f6cf7444ac": {
      "balance": "0x446c3b15f9926687d2c40534fdb564000000000000"
    },
    "1daf95f9c9129c44c888345ae0ee4bde9152b3fa": {
      "balance": "0x446c3b15f9926687d2c40534fdb564000000000000"
    },
    "46f4512d6cddb8c776046103b1022d5b33058ebd": {
      "balance": "0x446c3b15f9926687d2c40534fdb564000000000000"
    },
    "fdbd79619bee02829e53b0331062ece15367676f": {
      "balance": "0x446c3b15f9926687d2c40534fdb564000000000000"
    }
  },
  "number": "0x0",
  "gasUsed": "0x0",
  "parentHash": "0x0000000000000000000000000000000000000000000000000000000000000000"
}


jbhurat commented Apr 4, 2019

Thanks @jmcnevin. What is the TPS when you are seeing the issue? Was this network also upgraded from 2.2.1 to 2.2.2?


jmcnevin commented Apr 4, 2019

Yes, this was a network upgraded from 2.2.1.

I wasn't able to clock the TPS since it happens so quickly, but I did manage to capture a screen recording of an incident.

weird1.mp4.zip


jbhurat commented Apr 4, 2019

I have been trying to reproduce this by upgrading the network from v2.2.1 to v2.2.2, but so far no luck, even with a sustained 500 TPS. The only thing I had to change was to increase the gasLimit. Can you increase the gasLimit, try again, and see if you have better luck?


thpun commented Apr 6, 2019

I faced the same error in v2.2.3. It's a 4-node Istanbul network initialized with v2.2.3, doing only 1 TPS.

One of the 4 nodes returned fatal error: concurrent map read and map write, and another one returned fatal error: concurrent map writes.

Genesis as follows:

{
    "config": {
        "chainId": 2017,
        "homesteadBlock": 0,
        "byzantiumBlock": 0,
        "eip150Block": 0,
        "eip150Hash": "0x0000000000000000000000000000000000000000000000000000000000000000",
        "eip155Block": 0,
        "eip158Block": 0,
        "istanbul": {
            "epoch": 30000,
            "policy": 0
        },
        "isQuorum": true
    },
    "nonce": "0x0",
    "timestamp": "0x5ca70e02",
    "extraData": "0x0000000000000000000000000000000000000000000000000000000000000000f89af854948e8a891401c591089ae6af3adcad81851580a47794cc6b795beb02477975c58615924010881bc4ab229403848249a94f393867a1c2243c791a020b7176d79471b8ea022c3eba13cdc4b3182d7c65d1e59d8849b8410000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000c0",
    "gasLimit": "0x2faf0800",
    "difficulty": "0x1",
    "mixHash": "0x63746963616c2062797a616e74696e65206661756c7420746f6c6572616e6365",
    "coinbase": "0x0000000000000000000000000000000000000000",
    "alloc": {
        "03848249a94f393867a1c2243c791a020b7176d7": {
            "balance": "0x446c3b15f9926687d2c40534fdb564000000000000"
        },
        "71b8ea022c3eba13cdc4b3182d7c65d1e59d8849": {
            "balance": "0x446c3b15f9926687d2c40534fdb564000000000000"
        },
        "8e8a891401c591089ae6af3adcad81851580a477": {
            "balance": "0x446c3b15f9926687d2c40534fdb564000000000000"
        },
        "cc6b795beb02477975c58615924010881bc4ab22": {
            "balance": "0x446c3b15f9926687d2c40534fdb564000000000000"
        }
    },
    "number": "0x0",
    "gasUsed": "0x0",
    "parentHash": "0x0000000000000000000000000000000000000000000000000000000000000000"
}

Command to start geth (miner.start() is run later in the console):

geth --datadir qdata --nodiscover --verbosity 5 --unlock 0 --password password.txt --gasprice 0 --rpc --rpccorsdomain '*' --rpcaddr 0.0.0.0 --rpcport 21000 --rpcapi admin,db,eth,debug,net,shh,txpool,personal,web3,quorum,istanbul --port 30303 --networkid 2017

concurrent-map-read-and-map-write.txt
concurrent-map-writes.txt

jbhurat self-assigned this Apr 8, 2019

jbhurat commented Apr 8, 2019

Hi @thpun, it looks like it could be a race condition. I am trying to reproduce it by making my setup similar to yours and submitting transactions at 1 TPS, but so far no luck. Can you provide more details, such as what kind of transactions you are submitting and how? Also, how is the network set up? Are all the nodes running on different hosts?


thpun commented Apr 9, 2019

Actually, I was trying the same thing as in issue #669, with a clean network set up on v2.2.3.
That is, sending (public) transactions of almost 64KB at 1 TPS programmatically.
Each node in that 4-node network is on a different host.


jmcnevin commented Apr 9, 2019

I should note, too, that the transactions we are testing with range from 800K to 1M gas.


jbhurat commented Apr 9, 2019

Thanks for the updates @jmcnevin and @thpun, I will try to reproduce the issue with these details.


jbhurat commented Apr 10, 2019

Hi @jmcnevin and @thpun, I have been trying to reproduce the above issue, but have not been able to so far. These are the scenarios I have tried:

  • Straight up Quorum v2.2.3
  • Start with Quorum v2.2.1 and then update the validators to v2.2.3

In both cases, I sent transactions of 64KB each at 1, 10, and 20 TPS and did not see any issues; below is a sample block when running at 10 TPS:

> eth.getBlock(1269)
{
  difficulty: 1,
  extraData: "0xd983010812846765746888676f312e31312e348664617277696e000000000000f90164f854946571d97f340c8495b661a823f2c2145ca47d63c294d8dba507e85f116b1f7e231ca8525fc9008a696694e36cbeb565b061217930767886474e3cde903ac594f512a992f3fb749857d758ffda1330e590fa915eb8417f48e70ec287c1a50fc6f71ca5e0705a6edf8484c64dccabca9d50a2c06ee22d6f3c8e40ecbe850b3700ecef66ab1a6ee7b9721de426396772fb7811a0e3855100f8c9b841fad392c6800fb89827acaba4d45130b7a10ef7fa8df3d915600e504ddfdbfd825ad5da40611ef0178f45c0de6be7064b8a67a4dd130380dbcde9cc90a3a8643401b8411ee23c968f05639479354c22539eae833f3751dd539ea1ea27c5fdd042fa7ad608dc1d5e67f4f5affadcaf873d74d34790cc242d79fc09c4917c288ba15e147301b841d7965609eec05782c42c3e4c5593f1cf471e6641682107939282aafe552de0ee371d9fa9fd09bb34db1792beb5614b894c7cc632b38ff0f9692724630310fc6c01",
  gasLimit: 150247741193,
  gasUsed: 148283060,
  hash: "0x60e2afcd313439c3f758686eb66529d42d50d1aca4021b822668c9a7cfd5525f",
  logsBloom: "0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000",
  miner: "0x0000000000000000000000000000000000000000",
  mixHash: "0x63746963616c2062797a616e74696e65206661756c7420746f6c6572616e6365",
  nonce: "0x0000000000000000",
  number: 1269,
  parentHash: "0x298fdc5b04b744159d31303d0e27c3ae07655975d2727e9e54376f316ab43a42",
  receiptsRoot: "0xda5ea18001b1296be35aeff38d9af4a73d7c26c7633c14c1f03c300528140854",
  sha3Uncles: "0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347",
  size: 655463,
  stateRoot: "0x66b75d4b8d5f1fca2778c8e9b6b03ec4d0842c7ed152b7bf7af0b6c52a327623",
  timestamp: 1554917796,
  totalDifficulty: 1270,
  transactions: ["0xcb1bc31e1250b2e45bfc8626b8e67f3cd9a205b000f9f9041db7a84a2283b66c", "0x7a510ea8d612af3f2e39160a65c79aada4d342ffc290f49ca9f099cbc8278513", "0x991d5cf8e7f7d7108f4f3e03dabb9dffba659e5b7aa369f100523846c6158b7f", "0xf46591189f95c06d6e81f7272e274003e29bf852d3b081e8ac95a76460717bb1", "0x6699b6b7a5c63acbb6e7c14a7d65d7f756c367aba906dc160a3cea96348f82d6", "0xa48099904f7f413d00b9b5594772ef943073aa90bfd9b10d31976f62abe8c03f", "0x4c37dab34f70ff90d70782cebcb91f413d2e911c8fbe9f01078a940fab507687", "0xa1968790c09f87f4f6837b28484d8b87aa106b821778657bb6046438063369c3", "0xa97f31af0bee836f23a18d54586b3e675d0fab5feee8d29b0cf8d37fd0138cab", "0x0aa98f6d46247cb2006d14af557c9d0d9b87f800aa8cdd43d30c01053e46df94"],
  transactionsRoot: "0x1b088ccf8f31abd8b94990e11beca013812e7fdf757c18aef649961fbc4c3698",
  uncles: []
}
> eth.getBlockTransactionCount(1269)
10

As can be seen above, the average transaction size is 64KB and gas is around 1.4M. The block period is 1 second, which is the default when it is not specified as a geth startup argument.

I am using the contract below:

pragma solidity ^0.5.0;

contract SimpleStorage {
    string storedData;

    function set(string memory x) public {
        storedData = x;
    }

    function get() public view returns (string memory) {
        return storedData;
    }
}

Can you reproduce this issue at will, and if so, is it possible to provide a reproducible test case?


thpun commented Apr 12, 2019

Reproducible for me.

I'm using the following contract:

pragma solidity ^0.4.0;

contract SimpleKeyStorage {
    mapping(bytes32 => string) public map;

    function set(bytes32 x, string str) public {
        map[x] = str;
    }
}

Perhaps it is because I used a mapping to store the value. Gas used for each transaction is 45269026.

I keep firing transactions at 1 TPS: set(x, str), where x is a randomly generated 32-byte array and str is a fixed 65024-byte string (almost 64KB).

Two of the nodes got fatal error: concurrent map read and map write after running for some time.

I'm testing in a 4-node network with v2.2.3. Each node is on a different host, and each host has 4 vCPUs and 16GB of memory.
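
For reference, a rough sketch of this kind of driver in Go, assuming go-ethereum's ethclient and abi packages; the RPC endpoint, private key, contract address, and gas limit below are placeholders rather than values from this thread:

package main

import (
	"context"
	"crypto/rand"
	"log"
	"math/big"
	"strings"
	"time"

	"github.com/ethereum/go-ethereum/accounts/abi"
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core/types"
	"github.com/ethereum/go-ethereum/crypto"
	"github.com/ethereum/go-ethereum/ethclient"
)

// ABI for SimpleKeyStorage.set(bytes32, string) from the contract above.
const setABI = `[{"constant":false,"inputs":[{"name":"x","type":"bytes32"},{"name":"str","type":"string"}],"name":"set","outputs":[],"payable":false,"stateMutability":"nonpayable","type":"function"}]`

func main() {
	client, err := ethclient.Dial("http://127.0.0.1:21000") // placeholder RPC endpoint
	if err != nil {
		log.Fatal(err)
	}
	key, err := crypto.HexToECDSA("<sender private key hex>") // placeholder key
	if err != nil {
		log.Fatal(err)
	}
	from := crypto.PubkeyToAddress(key.PublicKey)
	contract := common.HexToAddress("0x...") // placeholder contract address

	parsed, err := abi.JSON(strings.NewReader(setABI))
	if err != nil {
		log.Fatal(err)
	}
	chainID, err := client.NetworkID(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	nonce, err := client.PendingNonceAt(context.Background(), from)
	if err != nil {
		log.Fatal(err)
	}

	payload := strings.Repeat("a", 65024) // fixed ~64KB string, as in the report

	ticker := time.NewTicker(time.Second) // 1 TPS
	defer ticker.Stop()
	for range ticker.C {
		var k [32]byte
		if _, err := rand.Read(k[:]); err != nil { // random 32-byte key
			log.Fatal(err)
		}
		data, err := parsed.Pack("set", k, payload)
		if err != nil {
			log.Fatal(err)
		}
		// Gas price 0 matches the zero-gas-price network; the 50M gas limit is a guess
		// sized to cover the ~45M gas reported per transaction.
		tx := types.NewTransaction(nonce, contract, big.NewInt(0), 50000000, big.NewInt(0), data)
		signed, err := types.SignTx(tx, types.NewEIP155Signer(chainID), key)
		if err != nil {
			log.Fatal(err)
		}
		if err := client.SendTransaction(context.Background(), signed); err != nil {
			log.Printf("send failed: %v", err)
			continue
		}
		nonce++
	}
}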


jbhurat commented Apr 12, 2019

@thpun, you are right, this could be because of the mapping used to store the values. I was able to reproduce it once I added the mapping.


jbhurat commented Apr 19, 2019

Hi @jmcnevin, @thpun, can you please verify whether #684 resolves the concurrent map access issue you are seeing?

jmcnevin commented:

I will try testing this branch today


jmcnevin commented Apr 19, 2019

I did have one of my validators crash with this code, but it's a different panic:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0xb16b5d]
goroutine 308 [running]:
github.com/ethereum/go-ethereum/consensus/istanbul/core.(*core).IsCurrentProposal(0xc0000f0ea0, 0x924a0bde40f31516, 0xf2b760b3f26e7e36, 0x901b6b11e4662e9b, 0x32f8be184fabe05c, 0x9221097bf0)
        /go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/consensus/istanbul/core/core.go:169 +0x2d
github.com/ethereum/go-ethereum/consensus/istanbul/backend.(*backend).HandleMsg(0xc0000f0d80, 0x614b6c42517821df, 0xc3b73bc81f898ef6, 0xc056ef4236, 0x7, 0x38e, 0x130e220, 0xc02f9a63f0, 0xbf269b8046624e13, 0x9221097bf0, ...)
        /go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/consensus/istanbul/backend/handler.go:122 +0xaa3
github.com/ethereum/go-ethereum/eth.(*ProtocolManager).handleMsg(0xc018b39110, 0xc000290300, 0x0, 0x0)
        /go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/eth/handler.go:365 +0x429f
github.com/ethereum/go-ethereum/eth.(*ProtocolManager).handle(0xc018b39110, 0xc000290300, 0x0, 0x0)
        /go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/eth/handler.go:333 +0x7eb
github.com/ethereum/go-ethereum/eth.NewProtocolManager.func1(0xc019b28900, 0x1312f60, 0xc0002cc090, 0x0, 0x0)
        /go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/eth/handler.go:157 +0x1b8
github.com/ethereum/go-ethereum/p2p.(*Peer).startProtocols.func1(0xc0002cc090, 0xc019b28900, 0x1312f60, 0xc0002cc090)
        /go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/p2p/peer.go:361 +0x66
created by github.com/ethereum/go-ethereum/p2p.(*Peer).startProtocols
        /go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/p2p/peer.go:360 +0x1fb
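
The trace shows the panic inside core.(*core).IsCurrentProposal when called from backend.HandleMsg, which suggests the handler touched consensus state that was not (or not yet) initialized when a peer message arrived. Purely as an illustration of the kind of nil guard that prevents this class of crash (this is not the actual fix in #684, and the types and fields below are hypothetical):

package main

import (
	"fmt"
	"sync"
)

// roundState is a hypothetical stand-in for per-round consensus state.
type roundState struct {
	pendingHash string
}

// core is a hypothetical stand-in for a consensus engine whose current
// round state may still be nil when a message handler calls into it.
type core struct {
	mu      sync.RWMutex
	current *roundState // nil until the first round is started
}

// isCurrentProposal returns false instead of panicking when no round state exists.
func (c *core) isCurrentProposal(hash string) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if c.current == nil { // guard: a handler may run before the engine is fully started
		return false
	}
	return c.current.pendingHash == hash
}

func main() {
	c := &core{}                              // current is nil, as it could be right after startup
	fmt.Println(c.isCurrentProposal("0xabc")) // prints false rather than panicking
}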


thpun commented Apr 22, 2019

Same as @jmcnevin. Got the same panic in one of the validators:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0xb5fb0d]

goroutine 24701 [running]:
github.com/ethereum/go-ethereum/consensus/istanbul/core.(*core).IsCurrentProposal(0xc0002386c0, 0xd9b3fd2bb2a755a6, 0x857cf92e165d4b, 0x6fe3274b652ac68b, 0x8e47472d75c55a8b, 0xad7332b633)
        /home/thpun/quorum/build/_workspace/src/github.com/ethereum/go-ethereum/consensus/istanbul/core/core.go:169 +0x2d
github.com/ethereum/go-ethereum/consensus/istanbul/backend.(*backend).HandleMsg(0xc0002385a0, 0x891c50114898a8e, 0x8581addc3aafe69a, 0xc077a48015, 0x7, 0x38e, 0x135ff60, 0xc029aa8420, 0xbf277fd7dcf98103, 0xad7332b633, ...)
        /home/thpun/quorum/build/_workspace/src/github.com/ethereum/go-ethereum/consensus/istanbul/backend/handler.go:122 +0xade
github.com/ethereum/go-ethereum/eth.(*ProtocolManager).handleMsg(0xc0002a32b0, 0xc00015aa80, 0x0, 0x0)
        /home/thpun/quorum/build/_workspace/src/github.com/ethereum/go-ethereum/eth/handler.go:365 +0x4bd1
github.com/ethereum/go-ethereum/eth.(*ProtocolManager).handle(0xc0002a32b0, 0xc00015aa80, 0x0, 0x0)
        /home/thpun/quorum/build/_workspace/src/github.com/ethereum/go-ethereum/eth/handler.go:333 +0x7ee
github.com/ethereum/go-ethereum/eth.NewProtocolManager.func1(0xc03828c0c0, 0x1373420, 0xc00123f050, 0x0, 0x0)
        /home/thpun/quorum/build/_workspace/src/github.com/ethereum/go-ethereum/eth/handler.go:157 +0x1c7
github.com/ethereum/go-ethereum/p2p.(*Peer).startProtocols.func1(0xc00123f050, 0xc03828c0c0, 0x1373420, 0xc00123f050)
        /home/thpun/quorum/build/_workspace/src/github.com/ethereum/go-ethereum/p2p/peer.go:361 +0x66
created by github.com/ethereum/go-ethereum/p2p.(*Peer).startProtocols
        /home/thpun/quorum/build/_workspace/src/github.com/ethereum/go-ethereum/p2p/peer.go:360 +0x203


jbhurat commented Apr 23, 2019

Thanks @jmcnevin and @thpun for testing the branch. I have investigated what is happening and pushed another commit to the PR branch. Is it possible to give it another try?

jmcnevin commented:

Looking much more stable on my end. 🤞


thpun commented Apr 29, 2019

I haven't seen the same error so far.


jbhurat commented May 20, 2019

Fixed in master

jbhurat closed this as completed May 20, 2019