Build failures due to 'hash mismatch importing path' in hydra-queue-runner #816

Closed
wizeman opened this issue Sep 16, 2020 · 13 comments · Fixed by #832
wizeman (Member) commented Sep 16, 2020

I'm running into a persistent issue with Hydra and I was hoping someone could help me debug it.

I'm getting the following hydra-queue-runner error on one of the build steps:

Sep 16 16:11:21 homeserv hydra-queue-runner[108594]: possibly transient failure building '/nix/store/9z25c4bngm1a3s72sgd9dnqcrz9197zq-perl5.30.1-XML-LibXML-2.0202.drv' on 'root@tvrasp': error: --- Error --- hydra-queue-runner
Sep 16 16:11:21 homeserv hydra-queue-runner[108594]: hash mismatch importing path '/nix/store/l39z03xkbqd97iv8737mqpb5ias93dh9-perl5.30.1-XML-LibXML-2.0202-devdoc';
Sep 16 16:11:21 homeserv hydra-queue-runner[108594]:   wanted: sha256:1fjm711i49dyhsdd2ygbr6bznbr289swp7x2xw70mvjjqwqridiz
Sep 16 16:11:21 homeserv hydra-queue-runner[108594]:   got:    sha256:0ms3z0rw17ldmgsq00i9874hs0hzx1qbhwhwmvyhakgmq5bc490c
Sep 16 16:11:21 homeserv hydra-queue-runner[108594]: will retry '/nix/store/9z25c4bngm1a3s72sgd9dnqcrz9197zq-perl5.30.1-XML-LibXML-2.0202.drv' after 544s

As far as I understand, this seems to happen while importing the derivation's outputs from the remote builder.
The error seems to be persistent, which means the builds end up getting aborted after some number of retries.
Previously, this was happening on only one of the remote builders and with a different derivation (the nix package itself).
Now, however, the same build step (this Perl package) fails on both remote builders, down to the same hash values.

These are all the `hydra-queue-runner` log messages printed when the build step fails.

Sep 16 16:11:20 homeserv hydra-queue-runner[108594]: killing process 36463
Sep 16 16:11:20 homeserv hydra-queue-runner[108594]: marking build 8113 as failed
Sep 16 16:11:20 homeserv hydra-queue-runner[108594]: finishing build step '/nix/store/dsbq2lifkjacf98dnd0sq25adj3fmbg4-linux-config-5.4.65-1.20200914.drv'
Sep 16 16:11:20 homeserv hydra-queue-runner[108594]: dispatcher woken up
Sep 16 16:11:20 homeserv hydra-queue-runner[108594]: dispatcher sleeping for 40s
Sep 16 16:11:20 homeserv hydra-queue-runner[108594]: performing step '/nix/store/9z25c4bngm1a3s72sgd9dnqcrz9197zq-perl5.30.1-XML-LibXML-2.0202.drv' 1 times on 'root@tvrasp' (needed by build 8109 and 0 others)
Sep 16 16:11:20 homeserv hydra-queue-runner[108594]: sending closure of '/nix/store/9z25c4bngm1a3s72sgd9dnqcrz9197zq-perl5.30.1-XML-LibXML-2.0202.drv' to 'root@tvrasp'
Sep 16 16:11:21 homeserv hydra-queue-runner[108594]: building '/nix/store/9z25c4bngm1a3s72sgd9dnqcrz9197zq-perl5.30.1-XML-LibXML-2.0202.drv' on 'root@tvrasp'
Sep 16 16:11:21 homeserv hydra-queue-runner[108594]: outputs of '/nix/store/9z25c4bngm1a3s72sgd9dnqcrz9197zq-perl5.30.1-XML-LibXML-2.0202.drv' substituted or already valid on 'root@tvrasp'
Sep 16 16:11:21 homeserv hydra-queue-runner[108594]: copying outputs of '/nix/store/9z25c4bngm1a3s72sgd9dnqcrz9197zq-perl5.30.1-XML-LibXML-2.0202.drv' from 'root@tvrasp' (1006192 bytes)
Sep 16 16:11:21 homeserv hydra-queue-runner[108594]: acquiring write lock on '/nix/var/nix/temproots/108610'
Sep 16 16:11:21 homeserv hydra-queue-runner[108594]: downgrading to read lock on '/nix/var/nix/temproots/108610'
Sep 16 16:11:21 homeserv hydra-queue-runner[108594]: acquiring write lock on '/nix/var/nix/temproots/108610'
Sep 16 16:11:21 homeserv hydra-queue-runner[108594]: downgrading to read lock on '/nix/var/nix/temproots/108610'
Sep 16 16:11:21 homeserv hydra-queue-runner[108594]: locking path '/nix/store/l39z03xkbqd97iv8737mqpb5ias93dh9-perl5.30.1-XML-LibXML-2.0202-devdoc'
Sep 16 16:11:21 homeserv hydra-queue-runner[108594]: lock acquired on '/nix/store/l39z03xkbqd97iv8737mqpb5ias93dh9-perl5.30.1-XML-LibXML-2.0202-devdoc.lock'
Sep 16 16:11:21 homeserv hydra-queue-runner[108594]: lock released on '/nix/store/l39z03xkbqd97iv8737mqpb5ias93dh9-perl5.30.1-XML-LibXML-2.0202-devdoc.lock'
Sep 16 16:11:21 homeserv hydra-queue-runner[108594]: killing process 45914
Sep 16 16:11:21 homeserv hydra-queue-runner[108594]: will disable machine 'root@tvrasp' for 65s
Sep 16 16:11:21 homeserv hydra-queue-runner[108594]: possibly transient failure building '/nix/store/9z25c4bngm1a3s72sgd9dnqcrz9197zq-perl5.30.1-XML-LibXML-2.0202.drv' on 'root@tvrasp': error: --- Error --- hydra-queue-runner
Sep 16 16:11:21 homeserv hydra-queue-runner[108594]: hash mismatch importing path '/nix/store/l39z03xkbqd97iv8737mqpb5ias93dh9-perl5.30.1-XML-LibXML-2.0202-devdoc';
Sep 16 16:11:21 homeserv hydra-queue-runner[108594]:   wanted: sha256:1fjm711i49dyhsdd2ygbr6bznbr289swp7x2xw70mvjjqwqridiz
Sep 16 16:11:21 homeserv hydra-queue-runner[108594]:   got:    sha256:0ms3z0rw17ldmgsq00i9874hs0hzx1qbhwhwmvyhakgmq5bc490c
Sep 16 16:11:21 homeserv hydra-queue-runner[108594]: will retry '/nix/store/9z25c4bngm1a3s72sgd9dnqcrz9197zq-perl5.30.1-XML-LibXML-2.0202.drv' after 544s
Sep 16 16:11:21 homeserv hydra-queue-runner[108594]: step '/nix/store/9z25c4bngm1a3s72sgd9dnqcrz9197zq-perl5.30.1-XML-LibXML-2.0202.drv' is now runnable
Sep 16 16:11:21 homeserv hydra-queue-runner[108594]: dispatcher woken up
Sep 16 16:11:21 homeserv hydra-queue-runner[108594]: dispatcher sleeping for 40s

This is the /etc/nix/machines file on my Hydra server, for reference:

root@tvrasp aarch64-linux /etc/nix/tvrasp 1 20 big-parallel 
localhost builtin,i686-linux,x86_64-linux - 32 1 big-parallel,kvm,nixos-test 
root@builda64 aarch64-linux /etc/nix/builda64 4 1 big-parallel 

Any ideas on how I can debug this problem?

wizeman (Member Author) commented Sep 16, 2020

For reference, I ran the following command on the Hydra server and it succeeded:

$ env NIX_SSHOPTS='-i /etc/nix/tvrasp' nix copy --from ssh://tvrasp /nix/store/l39z03xkbqd97iv8737mqpb5ias93dh9-perl5.30.1-XML-LibXML-2.0202-devdoc
[1 copied (0.1 MiB)]

However, the build is still failing due to another build step having the same problem.

wizeman (Member Author) commented Sep 22, 2020

This is happening much more often now.
I have been able to determine the point where the hash mismatch error is thrown, which is inside the Nix 2.3.6 code:

#4  0x000073a3a3f2f832 in nix::LocalStore::addToStore (this=0xa97cf0, info=..., source=..., repair=nix::NoRepair, checkSigs=nix::NoCheckSigs, 
    accessor=std::shared_ptr<class nix::FSAccessor> (empty) = {...}) at src/libstore/local-store.cc:1025
#5  0x000000000055cef3 in performOp (logger=0xa97b30, store=..., trusted=true, clientVersion=280, from=..., to=..., op=39) at src/nix-daemon/nix-daemon.cc:721
#6  0x000000000055e60f in processConnection (trusted=true, userName="hydra-queue-runner", userId=235) at src/nix-daemon/nix-daemon.cc:816
#7  0x000000000055f24d in <lambda()>::operator()(void) const (__closure=0xa6ab40) at src/nix-daemon/nix-daemon.cc:1059
#8  0x00000000005619f3 in std::_Function_handler<void(), daemonLoop(char**)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
    at /nix/store/b3zsk4ihlpiimv3vff86bb5bxghgdzb9-gcc-9.2.0/include/c++/9.2.0/bits/std_function.h:300
#9  0x0000000000563c5a in std::function<void ()>::operator()() const (this=0x7ffe42893520) at /nix/store/b3zsk4ihlpiimv3vff86bb5bxghgdzb9-gcc-9.2.0/include/c++/9.2.0/bits/std_function.h:690
#10 0x000073a3a3a46763 in nix::<lambda()>::operator()(void) const (__closure=0x7ffe42893250) at src/libutil/util.cc:962
#11 0x000073a3a3a4ab05 in std::_Function_handler<void(), nix::startProcess(std::function<void()>, const nix::ProcessOptions&)::<lambda()> >::_M_invoke(const std::_Any_data &) (
    __functor=...) at /nix/store/b3zsk4ihlpiimv3vff86bb5bxghgdzb9-gcc-9.2.0/include/c++/9.2.0/bits/std_function.h:300
#12 0x0000000000563c5a in std::function<void ()>::operator()() const (this=0x7ffe42893250) at /nix/store/b3zsk4ihlpiimv3vff86bb5bxghgdzb9-gcc-9.2.0/include/c++/9.2.0/bits/std_function.h:690
#13 0x000073a3a3a466a4 in nix::doFork(bool, std::function<void()>) (allowVfork=false, fun=...) at src/libutil/util.cc:946
#14 0x000073a3a3a4689b in nix::startProcess(std::function<void ()>, nix::ProcessOptions const&) (fun=..., options=...) at src/libutil/util.cc:974
#15 0x0000000000560391 in daemonLoop (argv=0x7ffe42894098) at src/nix-daemon/nix-daemon.cc:1040
#16 0x000000000056149e in _main (argc=2, argv=0x7ffe42894098) at src/nix-daemon/nix-daemon.cc:1142
#17 0x00000000005002d9 in std::_Function_handler<void (int, char**), int (*)(int, char**)>::_M_invoke(std::_Any_data const&, int&&, char**&&) (__functor=..., __args#0=@0x7ffe42893a84: 2, 
    __args#1=@0x7ffe42893a78: 0x7ffe42894098) at /nix/store/b3zsk4ihlpiimv3vff86bb5bxghgdzb9-gcc-9.2.0/include/c++/9.2.0/bits/std_function.h:300
#18 0x00000000005d1f80 in std::function<void (int, char**)>::operator()(int, char**) const (this=0x7ffe42893b40, __args#0=2, __args#1=0x7ffe42894098)
    at /nix/store/b3zsk4ihlpiimv3vff86bb5bxghgdzb9-gcc-9.2.0/include/c++/9.2.0/bits/std_function.h:690
#19 0x00000000005cfc06 in nix::mainWrapped (argc=2, argv=0x7ffe42894098) at src/nix/main.cc:132
#20 0x00000000005d001e in <lambda()>::operator()(void) const (__closure=0x7ffe42893f70) at src/nix/main.cc:176
#21 0x00000000005d0600 in std::_Function_handler<void(), main(int, char**)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
    at /nix/store/b3zsk4ihlpiimv3vff86bb5bxghgdzb9-gcc-9.2.0/include/c++/9.2.0/bits/std_function.h:300
#22 0x0000000000563c5a in std::function<void ()>::operator()() const (this=0x7ffe42893f70) at /nix/store/b3zsk4ihlpiimv3vff86bb5bxghgdzb9-gcc-9.2.0/include/c++/9.2.0/bits/std_function.h:690
#23 0x000073a3a410bf61 in nix::handleExceptions(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void ()>) (programName="nix-daemon", 
    fun=...) at src/libmain/shared.cc:287
#24 0x00000000005d009c in main (argc=2, argv=0x7ffe42894098) at src/nix/main.cc:175

As far as I can tell, there is a hash mismatch but there is also a size mismatch:

(gdb) frame 4
#4  0x000073a3a3f2f832 in nix::LocalStore::addToStore (this=0xa97cf0, info=..., source=..., repair=nix::NoRepair, checkSigs=nix::NoCheckSigs, 
    accessor=std::shared_ptr<nix::FSAccessor> (empty) = {...}) at src/libstore/local-store.cc:1025
1025    src/libstore/local-store.cc: No such file or directory.
(gdb) print hashResult.second
$1 = 4088
(gdb) print info.narSize
$2 = 54728
(gdb) print info.path
$3 = "/nix/store/yrjb0hb9gh1hdp98vzx76hmaxm8bp2fy-parted-3.3-dev"

However, I'm not sure yet why there are persistent size mismatches; in fact, I find the code a bit hard to follow due to the many layers of abstraction...

spacefrogg (Contributor) commented Oct 7, 2020

I can support this observation; I am also battling this kind of error. Normally, the path ends up getting realized by some other means, so restarting the failed build makes progress.

I have a parallel observation to contribute, which I suspect is related: the queue runner sometimes has trouble finishing off a job, hanging in "Receiving outputs" forever. The ssh connection on the build host is long gone by that time, but the ssh process forked by the queue runner still runs on the server and does nothing. I have also observed this behavior when running a build on localhost (which does not involve ssh, only a nix-store --serve).

EDIT: The build hosts run various versions of Nix (with localhost obviously running Hydra's version of Nix), so this is likely not related to a behavioral change in the protocol.

wizeman (Member Author) commented Oct 7, 2020

I have also observed a weird issue with hydra-queue-runner's ssh connections, but slightly different from yours.

Sometimes a build step gets stuck running for hours on one of the build machines without any progress, according to the Hydra web interface (this never happens on localhost; it only happens on the build machines). When I go to inspect the status of the processes on the Hydra server and the build machines, this is what I find:

On the Hydra server, hydra-queue-runner has an ssh child process which has a valid TCP connection to the build machine.
On the build machine, there is still an sshd process associated with the same TCP connection, which is also still open and valid.

However, this sshd process on the build machine has no child process! It seems like the sshd server should have terminated the ssh connection when the child process died, but for some reason it didn't.

Note that on both my Hydra server and my build machines, I have the following in my sshd_config:

ClientAliveInterval 30
ClientAliveCountMax 6

... and the following in my ssh_config:

ServerAliveInterval 30                                                                         
ServerAliveCountMax 6                                                                          
ConnectTimeout 60                           

The behavior I have just described happens a lot to me, and I suspect it has something to do with this bug, but I have no idea how it could be related...

wizeman (Member Author) commented Oct 29, 2020

This specific issue has been solved for me by upgrading my Hydra server to NixOS 20.09, even though my build machines weren't upgraded yet.

However, I've observed another issue, similar to what @spacefrogg has said above (but I didn't diagnose anything related to ssh processes yet).

I'm closing this issue for now, but feel free to reopen if you still observe this behavior in NixOS 20.09.
It would also be good to open another issue for the "receiving outputs" bug.

wizeman closed this as completed Oct 29, 2020
wizeman (Member Author) commented Nov 6, 2020

I'm still experiencing this issue on NixOS 20.09.

wizeman reopened this Nov 6, 2020
kquick (Contributor) commented Nov 6, 2020

One possibility is that there is a network connectivity issue causing corruption or packet dropping between the hosts involved. You might try repeated transfers with nix copy (or of similarly sized files) and verify the sha256sum after each transfer; even if only 1 out of 100 transfers is corrupted, that would be significant, since Hydra regularly transfers store contents between the workers and the Hydra machine.
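A rough sketch of such a test (purely illustrative; the host, key and store path are taken from the logs and machines file above, so adjust them to your setup) would be to repeatedly stream the NAR of one store path from the builder and compare each transfer's hash against the first one:

path=/nix/store/l39z03xkbqd97iv8737mqpb5ias93dh9-perl5.30.1-XML-LibXML-2.0202-devdoc
# reference hash from the first transfer
ref=$(ssh -i /etc/nix/tvrasp root@tvrasp "nix-store --dump $path" | sha256sum | cut -d' ' -f1)
for i in $(seq 1 100); do
  got=$(ssh -i /etc/nix/tvrasp root@tvrasp "nix-store --dump $path" | sha256sum | cut -d' ' -f1)
  [ "$got" = "$ref" ] || echo "transfer $i: mismatch ($got vs $ref)"
done

If the network were corrupting or truncating data, at least some iterations should report a mismatch.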

wizeman (Member Author) commented Nov 7, 2020

@kquick: Thanks for the suggestion!

It is true that in my case the communication between the two hosts runs over an experimental network, but I've been using this network for many months, making thousands of connections and transferring hundreds of GBs of data (over scp and rsync, which use ssh, as well as NFS), and I haven't observed any other problems so far (no dropped connections, etc.).

That said, I can see how it could be part of the problem: the transfers shouldn't be getting corrupted (because ssh guarantees integrity), but it's possible that some transfers are getting truncated, which could explain the size mismatches I saw above.

However, the bigger problem is that the failures become persistent. Once a derivation fails to get copied due to a hash mismatch, it always fails with a hash mismatch (always with the same erroneous hash), even if I restart the build several times... Manually doing nix copy does work, however.

Also, it is interesting that the hash mismatches happen when building on both remote builders, producing the same erroneous hashes for the same derivations even though the paths are copied from different remote hosts!

wizeman (Member Author) commented Nov 8, 2020

Ok, I've been doing some debugging and it doesn't seem like this is a network problem. As far as I can see there is no corruption or truncation happening.
Here's what I found so far:

  1. My hydra server is trying to import the dcm...-zlib-1.2.11-dev path from my remote build server but runs into a hash mismatch:
Nov 08 20:17:44 homeserv hydra-queue-runner[16111]: hash mismatch importing path '/nix/store/dcmfv3m9cdjjzzmi1nmg593f9fx6vfyk-zlib-1.2.11-dev';
Nov 08 20:17:44 homeserv hydra-queue-runner[16111]:   wanted: sha256:0ix3xfbssb9szdsha5fb280jz5hvpqdcjmvzk9ivvpis79aw851v
Nov 08 20:17:44 homeserv hydra-queue-runner[16111]:   got:    sha256:0yqjjddda8xi6qhcbnyfx0a91ag15n1waxlw1wqf37qn25y1jldi
  2. If you debug this issue by stopping the Nix daemon at the point where the hash mismatch is detected, you will see that it was not only expecting a different hash, but also a NAR file with a different size.

Backtrace:

(gdb) bt
...
#4  0x0000773b1193d7ad in nix::LocalStore::addToStore (this=0x250ace0, info=..., source=..., repair=<optimized out>, checkSigs=<optimized out>, accessor=...)
    at src/libstore/local-store.cc:1025
#5  0x0000000000488223 in performOp (logger=0x2700000000, store=..., trusted=<optimized out>, clientVersion=<optimized out>, op=<optimized out>, to=..., from=...)
    at /nix/store/h31cy7jm6g7cfqbhc5pm4rf9c53i3qfb-gcc-9.3.0/include/c++/9.3.0/bits/shared_ptr_base.h:1118
#6  0x0000000000489cfb in processConnection (trusted=<optimized out>, userName="hydra-queue-runner", userId=235)
    at /nix/store/h31cy7jm6g7cfqbhc5pm4rf9c53i3qfb-gcc-9.3.0/include/c++/9.3.0/ext/atomicity.h:96
...

(gdb) frame 4
#4  0x0000773b1193d7ad in nix::LocalStore::addToStore (this=0x250ace0, info=..., source=..., repair=<optimized out>, checkSigs=<optimized out>, accessor=...)
    at src/libstore/local-store.cc:1025
1025                    assert(false);

(gdb) print info.path
$6 = "/nix/store/dcmfv3m9cdjjzzmi1nmg593f9fx6vfyk-zlib-1.2.11-dev"

Expected size:

(gdb) print info.narSize
$10 = 114456

Received size:

(gdb) print hashResult.second
$11 = 140448

So the Nix daemon actually received a NAR file larger than it was expecting.

  3. On my build server, if a NAR file of the /nix/store/dcm...-zlib-1.2.11-dev path were to be generated, it would actually result in a NAR file with the expected hash and size, so nothing wrong here:
$ nix-store -q --hash /nix/store/dcmfv3m9cdjjzzmi1nmg593f9fx6vfyk-zlib-1.2.11-dev
sha256:0ix3xfbssb9szdsha5fb280jz5hvpqdcjmvzk9ivvpis79aw851v
$ nix-store -q --size /nix/store/dcmfv3m9cdjjzzmi1nmg593f9fx6vfyk-zlib-1.2.11-dev
114456
  4. However, if I inspect which files the Hydra server actually received as it was importing the NAR into the Nix store, here's what I find:
$ find /nix/store/dcmfv3m9cdjjzzmi1nmg593f9fx6vfyk-zlib-1.2.11-dev
/nix/store/dcmfv3m9cdjjzzmi1nmg593f9fx6vfyk-zlib-1.2.11-dev
/nix/store/dcmfv3m9cdjjzzmi1nmg593f9fx6vfyk-zlib-1.2.11-dev/lib
/nix/store/dcmfv3m9cdjjzzmi1nmg593f9fx6vfyk-zlib-1.2.11-dev/lib/libz.a
  5. After further digging, I discovered that the libz.a file that was being imported into /nix/store/dcm...-zlib-1.2.11-dev doesn't actually belong there at all! The remote build server has no such file in that path.
    In fact, the libz.a file belongs to the /nix/store/2fhhbwkm7nahx23nvm6k6nhhp5nvhy29-zlib-1.2.11-static path.
    No wonder there was a hash (and size) mismatch!

  6. In fact, what happened was that while trying to import the dcm...-zlib-1.2.11-dev path, hydra-queue-runner seems to have supplied nix-daemon with the NAR file for the 2fh...-zlib-1.2.11-static path instead: the contents of the /nix/store/dcm...-zlib-1.2.11-dev path on the Hydra server exactly match the contents of the /nix/store/2fh...-zlib-1.2.11-static path on the remote build server, with exactly the same files, directories and contents (confirmed with sha256sum; a sketch of one way to check this is at the end of this comment).

  7. Moreover, here's what my build server says about 2fh...-zlib-1.2.11-static:

$ nix-store -q --hash /nix/store/2fhhbwkm7nahx23nvm6k6nhhp5nvhy29-zlib-1.2.11-static
sha256:0yqjjddda8xi6qhcbnyfx0a91ag15n1waxlw1wqf37qn25y1jldi
$ nix-store -q --size /nix/store/2fhhbwkm7nahx23nvm6k6nhhp5nvhy29-zlib-1.2.11-static
140448

As you can see, it exactly matches the actual received hash and size on the Hydra server (but not the ones that were expected), even though the Hydra server was importing a different path!

  8. As a result of the hash mismatch, the /nix/store/dcm...-zlib-1.2.11-dev path on the Hydra server was not fully imported into the Nix store and therefore remains invalid according to Nix:
$ nix-store -q --hash /nix/store/dcmfv3m9cdjjzzmi1nmg593f9fx6vfyk-zlib-1.2.11-dev
error: path '/nix/store/dcmfv3m9cdjjzzmi1nmg593f9fx6vfyk-zlib-1.2.11-dev' is not valid

So as far as I can tell, there was no corruption at all: hydra-queue-runner was trying to import a path but supplied the wrong NAR file to the Nix daemon. The contents (of the wrong path) were copied correctly, but because the Nix daemon was expecting a different path/size/hash, it reported a hash mismatch, did not complete the import, and therefore did not mark the path as valid in Nix's database.
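For reference, the content comparison in step 6 can be done roughly like this (illustrative only; run each command on the host that holds the respective path, then compare the two listings, which should be identical):

On the Hydra server:
$ cd /nix/store/dcmfv3m9cdjjzzmi1nmg593f9fx6vfyk-zlib-1.2.11-dev && find . -type f -exec sha256sum {} + | sort

On the remote build server:
$ cd /nix/store/2fhhbwkm7nahx23nvm6k6nhhp5nvhy29-zlib-1.2.11-static && find . -type f -exec sha256sum {} + | sort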

(Two follow-up comments by wizeman were minimized.)
wizeman (Member Author) commented Nov 9, 2020

It turns out the previous two stack traces were due to an unrelated issue, so I'm back to where I was, but I'm now running the latest commits of nixUnstable master and hydra-unstable master (which still exhibit the same bug).

wizeman (Member Author) commented Nov 10, 2020

Ok, sorry for all the comments, but I've finally figured out where the bug is and it all makes sense now. This is definitely a bug in hydra-queue-runner.

What's happening is that hydra-queue-runner is trying to import a derivation's outputs (say, libunistring's), and it tries to import them in this order:

/nix/store/dddy9gck9893rmskq02112b128p17jir-libunistring-0.9.10-info
/nix/store/grsxfb7iki717xvq2yw9mn50rpd5h20x-libunistring-0.9.10-doc
/nix/store/wlmgvi1kip97wkj5835fp697qzls1c9z-libunistring-0.9.10-dev
/nix/store/x5p5aaqfrcig5jr3j3d749w4g38522bi-libunistring-0.9.10

Now let's say the network connection between the Hydra server and the remote build server breaks while importing the -dev output (this is what must have happened to me).
If the connection broke at that point, it means hydra-queue-runner was able to successfully import the -info and -doc outputs, but not the -dev or the main libunistring-0.9.10 output.

At some point, the network connection is reestablished and hydra-queue-runner tries to import libunistring again.
It will try to import the outputs in the same order.

But now the following events happen (see the sketch after this list for one way to check which outputs are already valid locally):

  1. hydra-queue-runner sends the remote build server the command to generate a NAR dump of the -info output.
  2. hydra-queue-runner tells the local Nix daemon to import the -info output, acting as a proxy/middle-man between the remote build server and the local Nix daemon (i.e. it will forward the NAR dump to the daemon).
  3. However (and here's where the bug starts...) the local Nix daemon realizes that it already has imported the -info output successfully before, so it does not even try to read the NAR dump from hydra-queue-runner.
  4. The same 3 steps above happen again, but now for the -doc output.
  5. Finally, hydra-queue-runner sends the remote build server the command to generate a NAR dump of the -dev output.
  6. Again, hydra-queue-runner tells the local Nix daemon to import the -dev output, acting as a middle-man.
  7. Now the local Nix daemon, since it hasn't imported the -dev output yet, tries to import it and therefore reads the NAR dump from hydra-queue-runner, but instead ends up reading the NAR dump of the -info output because it hadn't been read before!
  8. This leads to the Nix daemon importing the -info output into the directory of the -dev output, which is what causes the hash mismatch (and a size mismatch too, but the size mismatch error is hidden behind the hash mismatch).
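For what it's worth, a quick way to check whether this precondition holds, i.e. which outputs of the failing derivation are already valid on the Hydra server, might look like this (illustrative sketch; the .drv path here is the failing one from the logs above, substitute whichever derivation is affected):

drv=/nix/store/9z25c4bngm1a3s72sgd9dnqcrz9197zq-perl5.30.1-XML-LibXML-2.0202.drv
for out in $(nix-store --query --outputs "$drv"); do
  # --check-validity exits non-zero (and prints an error) for paths not registered as valid
  if nix-store --check-validity "$out" 2>/dev/null; then
    echo "already valid: $out"
  else
    echo "not yet valid: $out"
  fi
done

In the libunistring scenario above, the -info and -doc outputs would show up as already valid, while -dev and the main output would not.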

The actual code is a lot more convoluted than these events suggest, but one possible solution is simply for hydra-queue-runner to send the remote build server the command to generate the NAR dump only when the Nix daemon actually tries to read it. That's exactly what I did here locally, and as soon as I applied this simple fix it solved the bug for me, i.e. hydra-queue-runner is finally (!!!!!) able to finish importing these problematic outputs.

I will submit a PR soon.

wizeman added a commit to wizeman/hydra that referenced this issue Nov 10, 2020
This would happen if the network connection between the Hydra server and
the remote build server breaks after successfully importing at least one
output of a derivation, but before having finished importing all
outputs.

Fixes NixOS#816.
wizeman added a commit to wizeman/hydra that referenced this issue Nov 10, 2020
This would start happening if the network connection between the Hydra
server and the remote build server breaks after successfully importing
at least one output of a derivation, but before having finished
importing all outputs.

Fixes NixOS#816.