Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgraded from vault 0.2.0 to 0.3.1 and now it fails to read mount table #708

Closed
msabramo opened this issue Oct 20, 2015 · 23 comments
Closed

Comments

@msabramo
Copy link

I upgraded from vault 0.2.0 to 0.3.1. After unsealing the vault, I get a zillion of these messages in the log:

015/10/20 15:55:20 [INFO] core: post-unseal setup starting
2015/10/20 15:55:20 [ERR] core: failed to read mount table: decryption failed: version bytes mis-match
2015/10/20 15:55:20 [ERR] core: post-unseal setup failed: failed to setup mount table
2015/10/20 15:55:20 [INFO] core: acquired lock, enabling active operation
2015/10/20 15:55:20 [INFO] core: post-unseal setup starting
2015/10/20 15:55:20 [ERR] core: failed to read mount table: decryption failed: version bytes mis-match
2015/10/20 15:55:20 [ERR] core: post-unseal setup failed: failed to setup mount table
2015/10/20 15:55:20 [INFO] core: acquired lock, enabling active operation
2015/10/20 15:55:20 [INFO] core: post-unseal setup starting
2015/10/20 15:55:20 [ERR] core: failed to read mount table: decryption failed: version bytes mis-match
2015/10/20 15:55:20 [ERR] core: post-unseal setup failed: failed to setup mount table
2015/10/20 15:55:20 [INFO] core: acquired lock, enabling active operation

I assume that I heave to clear out information from the old version...? How do I do this?

@jefferai
Copy link
Member

No, this upgrade should have happened transparently...there's no manual change that should be necessary. Definitely a bug of some sort!

I think the key to figuring this out will be understanding why you're hitting this bug but nobody else has reported a problem.

Some questions:

  • Are you running in HA mode? If so, did you upgrade all nodes?
  • Was this install upgraded from 0.1.x to 0.2 and then to 0.3.1, or was it first installed at 0.2?

@msabramo
Copy link
Author

Thanks for responding!

Yes, we am running in HA mode. We run Vault on each of our 3 Consul servers and Consul is used as the backend.

CORP\marca@consul01:~$ vault version
Vault v0.3.1

CORP\marca@consul01:~$ consul version
Consul v0.5.2
Consul Protocol: 2 (Understands back to: 1)

CORP\marca@consul01:~$ head -n 5 /etc/vault.d/vault.hcl
backend "consul" {
  address = "127.0.0.1:8500"
  advertise_addr = "https://consul01.mycompany.com:8200"
  path = "vault"
}

I did upgrade all of the nodes at once, because I ran an ansible playbook that upgraded them - actually in retrospect, I probably shouldn't have done that, because now Vault is totally down for us. 😦

I don't think we ever had Vault 0.1.x. If I'm not mistaken, we started with 0.2 and this was our first upgrade.

@jefferai
Copy link
Member

Unfortunately I'm not sure how to go about debugging this without some custom debugging code. We have unit tests that mimic an upgrade from 0.2 to 0.3, including reading and writing values to a generic backend, so this isn't an untested path, and as a result I'm not sure what the issue could be without getting more real data.

The most efficient way to do this would be if we could screen share and do some real-time debugging, but that would require a terminal with access to your systems, the ability to build Vault, and enough people around that can unseal Vault (or alternately, giving you the unseal keys and performing a rotation once we're done debugging). Is there any chance we could set this up?

Alternately I could upload some modified binaries containing debugging, and we could bounce back and forth a few times that way if need be. Even in that situation, if we can open up some sort of IM or real-time communication channel, it would help so that we can coordinate.

Let me know...thanks!

@jefferai
Copy link
Member

In each entry stored in a physical backend, a specific byte is set to indicate the version of the encryption/decryption function. There has been a byte set for this purpose since the first public release of Vault; in 0.3.1 this version was incremented and logic was included to handle decryption for both the old and new versions, but encryption will only happen with the new version.

The most likely cause of you seeing this would be that the mount table was rewritten with the new version of the encryption logic, and it's trying to be read by an old version server, which would not know how to handle decrypting it.

So I need to ask -- can you verify that all Vault nodes successfully upgraded to 0.3 and were restarted? Running vault version will give you the version of the Vault binary, but not the version of the running server.

@msabramo
Copy link
Author

Ah ok, sorry for not being responsive..let me check that out; unfortunately our ISP is down today, but I can check on the servers when it comes back up.

@msabramo
Copy link
Author

msabramo commented Nov 6, 2015

OK I went and and restarted vault on all of the servers. As mentioned previously, there are 3 servers running vault and they are the same servers that are running Consul, which still works.

Now I get:

$ vault read secret/foo/bar/dog
Error reading secret/foo/bar/dog: Error making API request.

URL: GET https://consul01.XXX:8200/v1/secret/foo/bar/dog
Code: 503. Errors:

* no active Vault instance found

and in the vault.log on the server we have a bunch of this:

CORP\marca@consul01:~$ tail -f /var/log/vault.log
2015/11/06 07:34:09 http: TLS handshake error from 10.128.40.8:44997: EOF
2015/11/06 07:34:11 http: TLS handshake error from 10.128.40.9:47551: EOF
2015/11/06 07:34:14 http: TLS handshake error from 10.128.40.8:45444: EOF
2015/11/06 07:34:16 http: TLS handshake error from 10.128.40.9:47985: EOF
2015/11/06 07:34:19 http: TLS handshake error from 10.128.40.8:45892: EOF
2015/11/06 07:34:21 http: TLS handshake error from 10.128.40.9:48437: EOF
2015/11/06 07:34:24 http: TLS handshake error from 10.128.40.8:46344: EOF
2015/11/06 07:34:26 http: TLS handshake error from 10.128.40.9:48867: EOF
2015/11/06 07:34:29 http: TLS handshake error from 10.128.40.8:46767: EOF
2015/11/06 07:34:31 http: TLS handshake error from 10.128.40.9:49318: EOF

If we can't figure this out, then we can think about setting up a screen share. Do you have a screen-sharing tool that's worked well for you in the past? I think a lot of folks here have used join.me pretty successfully. Since you're HashiCorp, maybe you prefer using Vagrant Share?

@jefferai
Copy link
Member

jefferai commented Nov 6, 2015

Can you send all of the Vault log? The first thing that needs to be sorted is why there is no active vault instance found. When an instance becomes active it might respond more normally to client requests.

Stupid question, maybe, but did you unseal Vault on each server after restarting it?

@msabramo
Copy link
Author

msabramo commented Nov 6, 2015

Oh hmmmm. I think I only unsealed Vault on the one server. That gives me something else to try, thanks! And if that doesn't get stuff going, then should we continue this over email so I can send you the logs and such?

@jefferai
Copy link
Member

jefferai commented Nov 6, 2015

If that doesn't work, gist the logs somewhere and link them. If there's nothing obvious in there, we can figure out some real-time debugging method.

@msabramo
Copy link
Author

msabramo commented Nov 6, 2015

It looks like I can eliminate the TLS handshake error by changing the config as follows:

listener "tcp" {
 address = "0.0.0.0:8200"
}

to

listener "tcp" {
 address = "consul01.company.com:8200"
}

I wonder if the former worked in older versions of Vault but doesn't work anymore?

@msabramo
Copy link
Author

msabramo commented Nov 6, 2015

Even after making the config change I just mentioned above, I still get an error trying to access vault:

CORP\marca@consul01:~$ vault read secret/devops/teamcity/email
Error reading secret/devops/teamcity/email: Error making API request.

URL: GET https://consul01.company.com:8200/v1/secret/devops/teamcity/email
Code: 503. Errors:

* no active Vault instance found

and the log files look pretty clean:

CORP\marca@consul01:~$ tail -f /var/log/vault.log
         Log Level: info
             Mlock: supported: true, enabled: false
           Backend: consul (HA available)
 Advertise Address: https://consul01.company.com:8200
        Listener 1: tcp (addr: "consul01.company:8200", tls: "enabled")

==> Vault server started! Log data will stream in below:

2015/11/06 13:46:29 [INFO] core: vault is unsealed
2015/11/06 13:46:29 [INFO] core: entering standby mode

@jefferai
Copy link
Member

jefferai commented Nov 6, 2015

Do you only have one instance of Vault running? If there's only one instance, and it's failing to become leader, this suggests that somehow a lock has gotten stale in Consul.

@msabramo
Copy link
Author

msabramo commented Nov 6, 2015

So I discovered that one of the 3 Consul servers (consul03) had its disk utilization go to 100% and wasn't working right.

I stopped Consul, fixed the disk issue (largely caused by a multi-gigabyte /var/log/consul file) and started Consul again and got Vault working again. That server then had this encouraging output in /var/log/consul:

CORP\marca@consul03:/var/log$ tail -f consul
    2015/11/06 14:32:36 [INFO] raft: Node at 10.128.40.149:8300 [Candidate] entering Candidate state
    2015/11/06 14:32:36 [ERR] raft: Failed to make RequestVote RPC to 10.128.40.126:8300: dial tcp 10.128.40.126:8300: connection refused
    2015/11/06 14:32:36 [INFO] raft: Node at 10.128.40.149:8300 [Follower] entering Follower state
    2015/11/06 14:32:37 [INFO] consul: New leader elected: consul02.corp.surveymonkey.com
    2015/11/06 14:34:15 [INFO] consul: New leader elected: consul02.corp.surveymonkey.com
    2015/11/06 14:34:15 [INFO] consul: New leader elected: consul02.corp.surveymonkey.com
    2015/11/06 14:34:15 [INFO] consul: New leader elected: consul02.corp.surveymonkey.com
    2015/11/06 14:34:15 [INFO] consul: New leader elected: consul02.corp.surveymonkey.com
    2015/11/06 14:34:15 [INFO] consul: New leader elected: consul02.corp.surveymonkey.com
    2015/11/06 14:34:15 [INFO] consul: New leader elected: consul02.corp.surveymonkey.com

I had high hopes that everything was going to be good, but now the vault logs on all 3 servers are spewing this over and over into /var/log/vault.log:

2015/11/06 14:35:01 [INFO] core: acquired lock, enabling active operation
2015/11/06 14:35:01 [INFO] core: post-unseal setup starting
2015/11/06 14:35:01 [ERR] core: failed to read mount table: decryption failed: version bytes mis-match
2015/11/06 14:35:01 [ERR] core: post-unseal setup failed: failed to setup mount table
2015/11/06 14:35:02 [INFO] core: acquired lock, enabling active operation

@jefferai
Copy link
Member

jefferai commented Nov 6, 2015

At least we're now back to the original problem :-) Did you check the Vault version on each node?

@msabramo
Copy link
Author

msabramo commented Nov 6, 2015

and also seeing occasional leadership badness things:

CORP\marca@consul01:~$ tail -f /var/log/vault.log
2015/11/06 14:40:51 [INFO] core: acquired lock, enabling active operation
2015/11/06 14:40:52 [ERR] core: leader advertisement setup failed: Unexpected response code: 500 (rpc error: node is not the leader)
2015/11/06 14:40:52 [ERR] core: failed to acquire lock: failed to create session: Unexpected response code: 500 (rpc error: No cluster leader)
2015/11/06 14:41:42 [ERR] core: failed to acquire lock: failed to read lock: Unexpected response code: 500
2015/11/06 14:42:14 [ERR] core: failed to acquire lock: failed to acquire lock: Unexpected response code: 500 (rpc error: rpc error: leadership lost while committing log)
CORP\marca@consul02:~$ tail -f /var/log/vault.log
2015/11/06 14:43:10 [INFO] core: acquired lock, enabling active operation
2015/11/06 14:43:11 [INFO] core: post-unseal setup starting
2015/11/06 14:43:11 [ERR] core: failed to read mount table: decryption failed: version bytes mis-match
2015/11/06 14:43:11 [ERR] core: post-unseal setup failed: failed to setup mount table
2015/11/06 14:43:12 [ERR] core: failed to acquire lock: failed to create session: Unexpected response code: 500 (No cluster leader)
2015/11/06 14:43:22 [ERR] core: failed to acquire lock: failed to create session: Unexpected response code: 500 (No cluster leader)
2015/11/06 14:43:27 [ERR] core: upgrade due to key rotation failed: Unexpected response code: 500
2015/11/06 14:43:57 [ERR] core: upgrade due to key rotation failed: Unexpected response code: 500
CORP\marca@consul03:/var/log$ tail -f vault.log
2015/11/06 14:43:16 [INFO] core: post-unseal setup starting
2015/11/06 14:43:16 [ERR] core: failed to read mount table: decryption failed: version bytes mis-match
2015/11/06 14:43:16 [ERR] core: post-unseal setup failed: failed to setup mount table
2015/11/06 14:43:18 [INFO] core: acquired lock, enabling active operation
2015/11/06 14:43:18 [INFO] core: post-unseal setup starting
2015/11/06 14:43:18 [ERR] core: failed to read mount table: decryption failed: version bytes mis-match
2015/11/06 14:43:18 [ERR] core: post-unseal setup failed: failed to setup mount table
2015/11/06 14:43:19 [ERR] core: failed to acquire lock: failed to create session: Unexpected response code: 500 (rpc error: leadership lost while committing log)
2015/11/06 14:43:29 [ERR] core: failed to acquire lock: failed to create session: Unexpected response code: 500 (rpc error: No cluster leader)
2015/11/06 14:44:56 [ERR] core: failed to acquire lock: failed to read lock: Unexpected response code: 500

@msabramo
Copy link
Author

msabramo commented Nov 6, 2015

CORP\marca@consul03:/var/log$ sudo stop vault
vault stop/waiting
CORP\marca@consul03:/var/log$ vault --version
Vault v0.3.1
CORP\marca@consul03:/var/log$ sudo start vault
vault start/running, process 374709

<<<< unseal vault >>>>

CORP\marca@consul03:/var/log$ tail -f vault.log
         Log Level: info
             Mlock: supported: true, enabled: false
           Backend: consul (HA available)
 Advertise Address: https://consul03.corp.surveymonkey.com:8200
        Listener 1: tcp (addr: "consul03.corp.surveymonkey.com:8200", tls: "enabled")

==> Vault server started! Log data will stream in below:

2015/11/06 14:48:16 [INFO] core: vault is unsealed
2015/11/06 14:48:16 [INFO] core: entering standby mode
2015/11/06 14:48:46 [INFO] core: acquired lock, enabling active operation
2015/11/06 14:48:46 [INFO] core: post-unseal setup starting
2015/11/06 14:48:46 [ERR] core: failed to read mount table: decryption failed: version bytes mis-match
2015/11/06 14:48:46 [ERR] core: post-unseal setup failed: failed to setup mount table
2015/11/06 14:48:48 [INFO] core: acquired lock, enabling active operation
2015/11/06 14:48:48 [INFO] core: post-unseal setup starting
2015/11/06 14:48:48 [ERR] core: failed to read mount table: decryption failed: version bytes mis-match
2015/11/06 14:48:48 [ERR] core: post-unseal setup failed: failed to setup mount table

@msabramo
Copy link
Author

msabramo commented Nov 6, 2015

CORP\marca@consul02:~$ vault --version
Vault v0.2.0

Aaagggh!

@msabramo
Copy link
Author

msabramo commented Nov 6, 2015

OK, found some of the servers had a /usr/local/bin/vault that was 0.2.0 and an /opt/vault/bin/vault that was 0.3.1.

Made sure that the first one is definitely running 0.3.1 and now I get different errors 😦

CORP\marca@consul01:~$ tail -f /var/log/vault.log
2015/11/06 15:51:00 [ERR] core: post-unseal setup failed: failed to setup mount table
2015/11/06 15:51:00 [INFO] core: acquired lock, enabling active operation
2015/11/06 15:51:01 [INFO] core: post-unseal setup starting
2015/11/06 15:51:01 [ERR] core: failed to mount entry secret/: cannot mount under existing mount 'secret/'
2015/11/06 15:51:01 [ERR] core: post-unseal setup failed: failed to setup mount table
2015/11/06 15:51:01 [INFO] core: acquired lock, enabling active operation
2015/11/06 15:51:01 [INFO] core: post-unseal setup starting
2015/11/06 15:51:01 [ERR] core: failed to mount entry secret/: cannot mount under existing mount 'secret/'
2015/11/06 15:51:01 [ERR] core: post-unseal setup failed: failed to setup mount table
2015/11/06 15:51:01 [INFO] core: acquired lock, enabling active operation
2015/11/06 15:51:01 [INFO] core: post-unseal setup starting
2015/11/06 15:51:01 [ERR] core: failed to mount entry secret/: cannot mount under existing mount 'secret/'
2015/11/06 15:51:01 [ERR] core: post-unseal setup failed: failed to setup mount table
2015/11/06 15:51:01 [INFO] core: acquired lock, enabling active operation
2015/11/06 15:51:01 [INFO] core: post-unseal setup starting
2015/11/06 15:51:01 [ERR] core: failed to mount entry secret/: cannot mount under existing mount 'secret/'
2015/11/06 15:51:01 [ERR] core: post-unseal setup failed: failed to setup mount table

and this

CORP\marca@consul01:~$ tail -f /var/log/vault.log
2015/11/06 15:54:52 [ERR] expire: failed to revoke 'secret/devops/ansible/vault/1cc721e5-40a5-7cd1-2889-b871663aa77b': failed to revoke entry: secret is unsupported by this backend
2015/11/06 15:54:52 [ERR] expire: failed to revoke 'secret/devops/ansible/vault/1cecaafb-b98b-ef90-df7e-913236e63cb0': failed to revoke entry: secret is unsupported by this backend
2015/11/06 15:54:52 [ERR] expire: failed to revoke 'secret/devops/ansible/vault/20c158bf-80c6-6daf-dbf8-9db1ffaeee50': failed to revoke entry: secret is unsupported by this backend
2015/11/06 15:54:52 [ERR] expire: failed to revoke 'secret/devops/ansible/vault/23442683-70b9-10bd-bb97-601731d43280': failed to revoke entry: secret is unsupported by this backend
2015/11/06 15:54:52 [ERR] expire: failed to revoke 'secret/devops/ansible/vault/25f77f49-11d1-65e0-1bb0-8ea7491fde22': failed to revoke entry: secret is unsupported by this backend
2015/11/06 15:54:52 [ERR] expire: failed to revoke 'secret/devops/ansible/vault/2876a42b-0ab9-e7df-939e-d18b88089515': failed to revoke entry: secret is unsupported by this backend
2015/11/06 15:54:52 [ERR] expire: failed to revoke 'secret/devops/ansible/vault/2c97ac47-5ebf-ddd0-2a82-afa1ac651dd1': failed to revoke entry: secret is unsupported by this backend
2015/11/06 15:54:52 [ERR] expire: failed to revoke 'secret/devops/ansible/vault/2fa650e4-36c9-9d6b-d5fe-647cdb517884': failed to revoke entry: secret is unsupported by this backend
2015/11/06 15:54:52 [ERR] expire: failed to revoke 'secret/devops/ansible/vault/3170996f-f12e-7786-fdbf-f48b48c04492': failed to revoke entry: secret is unsupported by this backend
2015/11/06 15:55:04 [ERR] core: failed to acquire lock: failed to read lock: Unexpected response code: 500
2015/11/06 15:55:08 [ERR] core: key rotation periodic upgrade check failed: Unexpected response code: 500

@msabramo
Copy link
Author

msabramo commented Nov 7, 2015

The failed to revoke entry: secret is unsupported by this backend error sounds kinda/sorta like #648 ?

@msabramo
Copy link
Author

msabramo commented Nov 7, 2015

or #733 ?

@msabramo
Copy link
Author

msabramo commented Nov 7, 2015

OK, I finally got Vault working again.

It was a harrowing combination of:

  1. making sure that all the vault servers were running the new version of vault
  2. fixing consul leader election problems by editing the peers.json file as described at http://consul.io/docs/guides/outage.html. Somehow, all of my peers.json files just had null in them.
  3. fixing failed to mount entry &vault.MountEntry{Path:"secret/" #733 by doing some stracing to figure out what audit log file it was trying to write and then making sure it was present with the right permissions and editing the upstart config so that vault always uses /var/log as its working directory.
  4. maybe other things that i'm forgetting because it took a long time to fix and I tried lots of things. 😄

Suggestions:

  • Regarding item 1 above, could vault help diagnose this problem by printing the Vault version when the server starts up? Maybe also the error message could be clearer, along the lines of "vault 0.2.0 is attempting to read data that was written by vault 0.3.1"
  • Regarding item 2, I don't think the error messages that you get when the leader election stuff is messed up do a great job of informing the user as to what is wrong and it was not easy at all to find out about editing the peers.json. It probably should be telling you that there are no known peers or printing out what peers it read from peers.json. And maybe when something is wrong, it can point the user to the outage page or otherwise hold their hand a bit more. I wonder if this is partly due to the fact that HashiCorp devs are probably very familiar with leader election, Raft, etc. whereas some developers are new to the fancy distributed stuff and need a bit more guidance, especially when things go wrong, I'd love for the error messages to be clear and actionable, because when things go down, panic tends to sit in and rational thinking might fly out the window.
  • Regarding item 3, a new version of vault with recent fixes that print the audit log filename and prevent infinite loop problems, would be handy. It might also be useful to have some mechanism to turn off audit logs.

I'd like to thank @jefferai for patiently working with me on this and @mitchellh for responding to my tweet that mentioned I was having trouble.

@jefferai
Copy link
Member

jefferai commented Nov 9, 2015

Regarding item 1 above, could vault help diagnose this problem by printing the Vault version when the server starts up? Maybe also the error message could be clearer, along the lines of "vault 0.2.0 is attempting to read data that was written by vault 0.3.1"

Printing the version in the log sounds great. As for the second part, we don't store the exact version of Vault that wrote data, because that would balloon the size of the stored information -- not a huge amount, but enough. But the version bytes themselves are a good clue when you're asking for support, as there aren't many cases that can cause problems there.

Regarding item 2, that's really a Consul issue.

Regarding item 3, I think that some of the changes I have made recently will cause a leader in a bad state to give up the lock. But this depends on having newer versions on all servers, and it also won't help if there is only one active machine.

@msabramo
Copy link
Author

msabramo commented Nov 9, 2015

Yeah, I think printing the Vault version on startup is super easy and would probably have helped us diagnose quite a bit faster. => #765

And I feel like the Consul docs could possibly be improved a bit.

Thanks again, @jefferai!

@msabramo msabramo closed this as completed Nov 9, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants