Upgraded from vault 0.2.0 to 0.3.1 and now it fails to read mount table #708
Comments
No, this upgrade should have happened transparently...there's no manual change that should be necessary. Definitely a bug of some sort! I think the key to figuring this out will be understanding why you're hitting this bug but nobody else has reported a problem. Some questions:
|
Thanks for responding! Yes, we are running in HA mode. We run Vault on each of our 3 Consul servers and Consul is used as the backend.
I did upgrade all of the nodes at once, because I ran an ansible playbook that upgraded them - actually in retrospect, I probably shouldn't have done that, because now Vault is totally down for us. 😦 I don't think we ever had Vault 0.1.x. If I'm not mistaken, we started with 0.2 and this was our first upgrade. |
Unfortunately I'm not sure how to go about debugging this without some custom debugging code. We have unit tests that mimic an upgrade from 0.2 to 0.3, including reading and writing values to a The most efficient way to do this would be if we could screen share and do some real-time debugging, but that would require a terminal with access to your systems, the ability to build Vault, and enough people around that can unseal Vault (or alternately, giving you the unseal keys and performing a rotation once we're done debugging). Is there any chance we could set this up? Alternately I could upload some modified binaries containing debugging, and we could bounce back and forth a few times that way if need be. Even in that situation, if we can open up some sort of IM or real-time communication channel, it would help so that we can coordinate. Let me know...thanks! |
In each entry stored in a physical backend, a specific byte indicates the version of the encryption/decryption function. That byte has been present since the first public release of Vault; in 0.3.1 the version was incremented, and logic was included to decrypt both the old and new versions, while encryption only happens with the new version. The most likely cause of what you're seeing is that the mount table was rewritten with the new version of the encryption logic and is now being read by a server still running the old version, which does not know how to decrypt it. So I need to ask -- can you verify that all Vault nodes successfully upgraded to 0.3 and were restarted? Running |
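To illustrate the version-byte behaviour described above, here is a minimal Go sketch. It is not Vault's actual barrier code: the entry layout (`version | nonce | ciphertext`), the constants `versionOld`/`versionNew`, the helpers `seal` and `open`, the `maxVersion` parameter standing in for "which release is running", and the idea that the newer version binds a path as additional data are all assumptions made for the example. It only shows why a server that knows one version fails on entries written with a newer one, while an upgraded server can still read both.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// Hypothetical version numbers for this sketch.
const (
	versionOld byte = 1
	versionNew byte = 2
)

var errUnknownVersion = errors.New("unsupported encryption version")

// newGCM builds an AES-GCM AEAD from a 16-, 24-, or 32-byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal writes an entry as [version byte | nonce | ciphertext+tag].
// Assumption: only the new version authenticates the path.
func seal(gcm cipher.AEAD, version byte, path string, plaintext []byte) ([]byte, error) {
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	var aad []byte
	if version == versionNew {
		aad = []byte(path)
	}
	out := append([]byte{version}, nonce...)
	return gcm.Seal(out, nonce, plaintext, aad), nil
}

// open inspects the version byte and dispatches on it. maxVersion models
// the newest version the running binary understands.
func open(gcm cipher.AEAD, maxVersion byte, path string, entry []byte) ([]byte, error) {
	if len(entry) < 1+gcm.NonceSize() {
		return nil, errors.New("entry too short")
	}
	version := entry[0]
	if version < versionOld || version > maxVersion {
		return nil, errUnknownVersion
	}
	nonce := entry[1 : 1+gcm.NonceSize()]
	body := entry[1+gcm.NonceSize():]
	var aad []byte
	if version == versionNew {
		aad = []byte(path)
	}
	return gcm.Open(nil, nonce, body, aad)
}

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	gcm, err := newGCM(key)
	if err != nil {
		panic(err)
	}
	// An upgraded writer always encrypts with the new version.
	entry, err := seal(gcm, versionNew, "example/entry", []byte("mount table"))
	if err != nil {
		panic(err)
	}
	// An upgraded reader succeeds; a reader limited to the old version does not.
	_, errUpgraded := open(gcm, versionNew, "example/entry", entry)
	_, errStale := open(gcm, versionOld, "example/entry", entry)
	fmt.Println("upgraded reader error:", errUpgraded)
	fmt.Println("stale reader error:", errStale)
}
```

In this model, a mixed cluster breaks as soon as any upgraded node rewrites an entry, which is why confirming the running version on every node is the first thing to check.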
Ah ok, sorry for not being responsive..let me check that out; unfortunately our ISP is down today, but I can check on the servers when it comes back up. |
OK, I went and restarted vault on all of the servers. As mentioned previously, there are 3 servers running vault and they are the same servers that are running Consul, which still works. Now I get:
and in the
If we can't figure this out, then we can think about setting up a screen share. Do you have a screen-sharing tool that's worked well for you in the past? I think a lot of folks here have used join.me pretty successfully. Since you're HashiCorp, maybe you prefer using Vagrant Share? |
Can you send all of the Vault log? The first thing that needs to be sorted is why there is no active vault instance found. When an instance becomes active it might respond more normally to client requests. Stupid question, maybe, but did you unseal Vault on each server after restarting it? |
Oh hmmmm. I think I only unsealed Vault on the one server. That gives me something else to try, thanks! And if that doesn't get stuff going, then should we continue this over email so I can send you the logs and such? |
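For anyone wanting to check the seal state of every node after a restart, the following Go sketch queries each server's `/v1/sys/seal-status` HTTP endpoint and reports the `sealed` field from the JSON response. The helper names (`parseSealStatus`, `isSealed`) are made up for this example, the server addresses are taken from the command line rather than assumed, and any TLS or token setup that a real deployment needs is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// sealStatus holds the one field of the seal-status response used here.
type sealStatus struct {
	Sealed bool `json:"sealed"`
}

// parseSealStatus extracts the "sealed" flag from a seal-status JSON body.
func parseSealStatus(body []byte) (bool, error) {
	var s sealStatus
	if err := json.Unmarshal(body, &s); err != nil {
		return false, err
	}
	return s.Sealed, nil
}

// isSealed asks one server (e.g. "http://host:8200") whether it is sealed.
func isSealed(client *http.Client, addr string) (bool, error) {
	resp, err := client.Get(addr + "/v1/sys/seal-status")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	return parseSealStatus(body)
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	// Each argument is the base address of one Vault server.
	for _, addr := range os.Args[1:] {
		sealed, err := isSealed(client, addr)
		if err != nil {
			fmt.Printf("%s: error: %v\n", addr, err)
			continue
		}
		fmt.Printf("%s: sealed=%t\n", addr, sealed)
	}
}
```

Run it with one base address per server as arguments; any node reporting `sealed=true` still needs to be unsealed before it can take part in HA.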
If that doesn't work, gist the logs somewhere and link them. If there's nothing obvious in there, we can figure out some real-time debugging method. |
It looks like I can eliminate the
to
I wonder if the former worked in older versions of Vault but doesn't work anymore? |
Even after making the config change I just mentioned above, I still get an error trying to access vault:
and the log files look pretty clean:
|
Do you only have one instance of Vault running? If there's only one instance, and it's failing to become leader, this suggests that somehow a lock has gotten stale in Consul. |
So I discovered that one of the 3 Consul servers ( I stopped Consul, fixed the disk issue (largely caused by a multi-gigabyte
I had high hopes that everything was going to be good, but now the vault logs on all 3 servers are spewing this over and over into
|
At least we're now back to the original problem :-) Did you check the Vault version on each node? |
and also seeing occasional leadership badness things:
|
|
Aaagggh! |
OK, found some of the servers had a Made sure that the first one is definitely running 0.3.1 and now I get different errors 😦
and this
|
The |
or #733 ? |
OK, I finally got Vault working again. It was a harrowing combination of:
Suggestions:
I'd like to thank @jefferai for patiently working with me on this and @mitchellh for responding to my tweet that mentioned I was having trouble. |
Printing the version in the log sounds great. As for the second part, we don't store the exact version of Vault that wrote data, because that would balloon the size of the stored information -- not a huge amount, but enough. But the version bytes themselves are a good clue when you're asking for support, as there aren't many cases that can cause problems there. Regarding item 2, that's really a Consul issue. Regarding item 3, I think that some of the changes I have made recently will cause a leader in a bad state to give up the lock. But this depends on having newer versions on all servers, and it also won't help if there is only one active machine. |
I upgraded from vault 0.2.0 to 0.3.1. After unsealing the vault, I get a zillion of these messages in the log:
I assume that I have to clear out information from the old version...? How do I do this?