FreeIPA fails silently to add DNS record A #79

Closed
cmd-ntrf opened this issue Dec 1, 2020 · 9 comments · Fixed by #80

@cmd-ntrf
Member

cmd-ntrf commented Dec 1, 2020

In MC 8.5 and 9.1 at least, it has been reported that ipa-client-install sometimes fails silently to add the A record.

The DNS update error can be found in /var/log/ipaclient-install.log and looks like this:

2020-11-30T21:41:14Z DEBUG nsupdate failed: Command '/usr/bin/nsupdate -g /etc/ipa/.dns_update.txt' returned non-zero exit status 1
2020-11-30T21:41:14Z ERROR Failed to update DNS records.

The content of /etc/ipa/.dns_update.txt during the update would be something like this:

update delete node25.int.ivado-spark.calculquebec.cloud. IN A
show
send

update delete node25.int.ivado-spark.calculquebec.cloud. IN AAAA
show
send

update add node25.int.ivado-spark.calculquebec.cloud. 1200 IN A 192.168.239.136
show
send
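
Because the failure is only visible in the log, one way to spot an affected node is to search that log for the nsupdate failure message quoted above. This is a minimal sketch: the function name dns_update_failed is hypothetical, and the log path is passed as an argument rather than assumed.

```shell
#!/bin/bash
# Return 0 when the given ipa-client-install log records a failed nsupdate run.
# The pattern is the "nsupdate failed" DEBUG message quoted in this issue.
dns_update_failed() {
  local log="$1"
  grep -q 'nsupdate failed' "$log"
}
```

For example, `dns_update_failed /var/log/ipaclient-install.log && echo "DNS update failed"` would flag a node whose installer exited with 0 but did not register its records.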
@cmd-ntrf cmd-ntrf self-assigned this Dec 1, 2020
@cmd-ntrf cmd-ntrf added the bug Something isn't working label Dec 1, 2020
@cmd-ntrf cmd-ntrf changed the title FreeIPA fail silently to add DNS record A FreeIPA fails silently to add DNS record A Dec 3, 2020
@cmd-ntrf
Member Author

cmd-ntrf commented Dec 3, 2020

I did some digging in the FreeIPA install code and it appears that, as observed, a failure to update the DNS during client installation is not fatal:
https://github.com/freeipa/freeipa/blob/release-4-6-8/ipaclient/install/client.py#L1367

It is only logged as an error in /var/log/ipaclient-install.log.

I was also able to reproduce the problem by creating a new cluster with 25 compute nodes. The higher number of nodes trying to register their records at the same time appears to create some contention on BIND, and some records are dropped.

This means FreeIPA alone cannot be trusted and we will have to implement some mechanism with Puppet to validate that the records were created. A complementary solution would be to populate /etc/hosts with consul-template using the addresses registered by the consul agents.
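
As a rough idea of what such a validation could look like, the sketch below asks the resolver for the node's A record and succeeds only if an IPv4 address comes back. The names has_a_record and DIG_CMD are hypothetical; DIG_CMD only exists so the lookup command can be swapped out, and it defaults to dig.

```shell
#!/bin/bash
# Return 0 when the resolver answers with at least one IPv4 address for the
# A record of the given fully qualified host name.
# DIG_CMD is a hypothetical override for the lookup command (default: dig).
has_a_record() {
  local fqdn="$1"
  # "dig +short" prints one answer per line; keep only lines that look like
  # an IPv4 address so that error text is not mistaken for a record.
  "${DIG_CMD:-dig}" +short "$fqdn" A | grep -qE '^[0-9]+(\.[0-9]+){3}$'
}
```

A check like `has_a_record "$(hostname -f)"` could then serve as the condition that decides whether the node needs to be enrolled again.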

@cmd-ntrf
Member Author

cmd-ntrf commented Dec 3, 2020

It happens not only for A records but for SSHFP records too.

Closest issue I could find so far:
https://bugzilla.redhat.com/show_bug.cgi?id=1755643

@cmd-ntrf
Member Author

cmd-ntrf commented Dec 3, 2020

In my last test, nsupdate failed for three nodes out of 25 (2 A records and 1 SSHFP) at the same time a keytab was generated for Mokey. I will launch a new cluster with 25 nodes again, but using the MC 9.0 Puppet environment, to see if Mokey keytab generation could be the issue with 9.1.

@cmd-ntrf
Member Author

cmd-ntrf commented Dec 3, 2020

Rolling-back to 9.0 did not solve the problem, so Mokey is fortunately not the culprit.

After deciphering the error message in ipaclient-install.log a bit more, I now suspect an entropy issue, because of this sort of log output that precedes nsupdate failing to update the DNS:

> ;; QUESTION SECTION:
> ;671202487.sig-ipa.example.test.        ANY     TKEY
> ;; ANSWER SECTION:
> 671202487.sig-ipa.example.test. 0 ANY   TKEY    gss-tsig. 0 0 3 BADNAME 0  0

and the following analysis of what is happening in nsupdate client:
https://bugzilla.redhat.com/show_bug.cgi?id=1755643#c67

A potential solution currently tested would be to make sure there is enough entropy to avoid having two nodes generating the same key by making sure the service haveged is started before trying to install ipa client.

@cmd-ntrf
Member Author

cmd-ntrf commented Dec 3, 2020

Requiring haveged did not solve the issue: some nodes in a 50-node cluster were not able to add their SSHFP records:

2020-12-03T15:44:03Z DEBUG stderr=Reply from SOA query:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id:  44115
;; flags: qr aa rd ra; QUESTION: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; QUESTION SECTION:
;node34.int.vetoed.calculquebec.cloud. IN SOA

;; AUTHORITY SECTION:
int.vetoed.calculquebec.cloud. 3600 IN  SOA     mgmt1.int.vetoed.calculquebec.cloud. hostmaster.int.vetoed.calculquebec.cloud. 1607010268 3600 900 1209600 3600

Found zone name: int.vetoed.calculquebec.cloud
The master is: mgmt1.int.vetoed.calculquebec.cloud
start_gssrequest
Found realm from ticket: INT.VETOED.CALCULQUEBEC.CLOUD
send_gssrequest
recvmsg reply from GSS-TSIG query
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id:  19769
;; flags: qr ra; QUESTION: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;3964912905.sig-mgmt1.int.vetoed.calculquebec.cloud. ANY        TKEY

;; ANSWER SECTION:
3964912905.sig-mgmt1.int.vetoed.calculquebec.cloud. 0 ANY TKEY gss-tsig. 0 0 3 BADNAME 0  0

dns_tkey_gssnegotiate: TKEY is unacceptable

2020-12-03T15:44:03Z DEBUG nsupdate failed: Command '/usr/bin/nsupdate -g /etc/ipa/.dns_update.txt' returned non-zero exit status 1
2020-12-03T15:44:03Z WARNING Could not update DNS SSHFP records.

@cmd-ntrf
Member Author

cmd-ntrf commented Dec 3, 2020

I suspect the issue has occurred in previous versions of MC, but until 8.3, sssd dynamic DNS update was activated, which would fix the missing A record automatically.
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/windows_integration_guide/sssd-dyndns

@cmd-ntrf
Member Author

cmd-ntrf commented Dec 3, 2020

> and the following analysis of what is happening in nsupdate client:
> https://bugzilla.redhat.com/show_bug.cgi?id=1755643#c67
>
> A potential solution currently tested would be to make sure there is enough entropy to avoid having two nodes generating the same key by making sure the service haveged is started before trying to install ipa client.

The culprit is the key generated by the client in nsupdate, but haveged has no influence because nsupdate uses srand seeded with the time, which gives a high probability of collision given that all nodes are booted at almost the same time and follow the exact same provisioning steps.

@cmd-ntrf
Member Author

cmd-ntrf commented Dec 3, 2020

The latest approach is to wrap ipa-client-install in a small helper script, mc-ipa-client-install, that verifies there is no BADNAME error in /var/log/ipaclient-install.log when the ipa-client-install exit code is 0. If an error is found, the client is uninstalled and the wrapper returns an exit code of 1.

The exit code will tell Puppet that the client was not installed, and Puppet will try again after some sleep time.

@cmd-ntrf
Member Author

cmd-ntrf commented Dec 4, 2020

Bug summary

ipa-client-install sometimes fails to create records in the DNS. Sometimes the A records are not registered, sometimes it is the SSHFP records, and sometimes none of the records are registered. Yet the installation returns an exit code of 0, meaning a successful installation.

Culprit research

ipa-client-install

FreeIPA ipa-client-install updates the DNS A, AAAA and SSHFP records using the update_dns and do_nsupdate functions. do_nsupdate uses the nsupdate command line tool to delete existing records and create new ones with the new client information.

These functions do not raise exceptions and return nothing. A failure of nsupdate for any reason is logged, but it is not fatal to the execution of ipa-client-install. This is why Puppet is unaware that the installation was partly unsuccessful and never tries to enroll the node again.

nsupdate

nsupdate uses the GSS-TSIG protocol to communicate with the DNS server. In this protocol, the client (nsupdate) and the server exchange TKEY records. The first field of a TKEY record is a key name, which must be unique on the client and server.

nsupdate 9.11 uses the following code to generate the key:
bin/nsupdate/nsupdate.c:start_gssrequest():

        isc_random_get(&val);
        result = isc_string_printf(mykeystr, sizeof(mykeystr), "%u.sig-%s",
                                   val, namestr);

where lib/isc/random.c:isc_random_get():

        initialize();
        *val = ((((unsigned int)rand()) & 0xffff0) >> 4) |
               ((((unsigned int)rand()) & 0xffff0) << 12);

and initialize() -> initialize_rand():

        pid = ((pid << 16) & 0xffff0000) | ((pid >> 16) & 0xffff);
        srand((unsigned)time(NULL) ^ pid);

Because every compute node is provisioned the same way at almost the same time, seeding the random-number generator with a combination of time and pid has a non-zero probability of collision. Collisions have been observed on multiple occasions.
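
To make the collision condition concrete, the seed arithmetic quoted above can be reproduced in shell. This is only an illustration: nsupdate_seed is a hypothetical helper, and the 32-bit mask is an assumption standing in for the C unsigned conversion. Since rand() produces the same sequence after srand() is called with the same seed, two nodes that compute the same seed generate the same key name.

```shell
#!/bin/bash
# Recompute the srand() seed from the quoted initialize_rand() code for a
# given epoch second and process ID. Hypothetical helper for illustration.
nsupdate_seed() {
  local now="$1" pid="$2"
  # Swap the two 16-bit halves of the pid, as the quoted C code does.
  local mixed=$(( ((pid << 16) & 0xffff0000) | ((pid >> 16) & 0xffff) ))
  # XOR with the time; the mask keeps the result within 32 bits (assumption).
  echo $(( (now ^ mixed) & 0xffffffff ))
}
```

Two nodes that run nsupdate with the same pid during the same second obtain the same seed, for example `nsupdate_seed 1607010243 2345` evaluated on both. The seed has one-second resolution, so starting haveged does not change the outcome.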

To verify the existence of collisions, we looked in /var/log/ipaclient-install.log for matching values of the integer in the question section of the GSS-TSIG (TKEY) query, like this:

;3964912905.sig-mgmt1.int.vetoed.calculquebec.cloud. ANY        TKEY

Furthermore, collisions generate a BADNAME answer from the DNS server when trying to update the records, which can also be found in /var/log/ipaclient-install.log.
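
The comparison of key integers can be automated once the logs of several nodes have been copied to one place. The sketch below is one possible way to do it; colliding_tkey_ids is a hypothetical name, and each argument is assumed to be one node's copy of ipaclient-install.log.

```shell
#!/bin/bash
# Print every TKEY key-name integer that appears in more than one log file.
# Each argument is assumed to be a copy of one node's ipaclient-install.log.
colliding_tkey_ids() {
  local log
  for log in "$@"; do
    # Extract "<integer>" from ";<integer>.sig-<server>" and keep one copy
    # per log so that repeats within a single node are not counted.
    grep -oE ';[0-9]+\.sig-' "$log" | tr -d ';' | cut -d. -f1 | sort -u
  done | sort | uniq -d
}
```

Any integer printed by `colliding_tkey_ids node*.log` was used by at least two nodes, which is the situation that leads to the BADNAME answer.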

Solutions

Fix ipa-client-install to make sure DNS update failure is fatal

Since we can check the log for errors after running ipa-client-install, we can wrap the installer in a script that looks for the BADNAME TKEY error in /var/log/ipaclient-install.log after calling ipa-client-install. If the error is present, the client is uninstalled and an exit code of 1 is returned. Otherwise, the wrapper returns the exit code of ipa-client-install unchanged.

The wrapper script would look like this:

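#!/bin/bash
# Wrapper around ipa-client-install that treats a BADNAME TKEY error as fatal.
LOG=/var/log/ipaclient-install.log
# keep previous installation logs, numbered by how many already exist
if [ -e "${LOG}" ]; then
  mv "${LOG}" "${LOG}.$(ls ${LOG}* | wc -l)"
fi
/sbin/ipa-client-install "${@}"
rc=$?
if [ "${rc}" -eq 0 ] && grep -q "BADNAME" "${LOG}"; then
  # the DNS update failed silently: uninstall so that Puppet retries
  /sbin/ipa-client-install --uninstall -U
  exit 1
fi
# otherwise, pass the installer's exit code through unchanged
exit "${rc}"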

Upgrade nsupdate

According to the BIND9 changelog, since BIND 9.13.1 the use of isc_random in start_gssrequest has been replaced by isc_nonce_buf, which appears to be more robust at generating unique key names.

Since the protocol to update the DNS records has not changed between BIND 9.11 and the most recent version, it would be possible to update the BIND9 utils (dig, nsupdate, and nslookup) to a version >= 9.13.1.
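
Whether a node already has a recent enough nsupdate can be checked by comparing version strings. The helper below is a sketch that relies on GNU sort's version ordering (sort -V); version_ge is a hypothetical name, and extracting the version string from the installed tool is left to the caller.

```shell
#!/bin/bash
# Return 0 when version $1 is greater than or equal to version $2,
# using the version ordering of GNU sort (-V).
version_ge() {
  # The smaller of the two versions sorts first; $1 >= $2 exactly when
  # that smallest entry is $2.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}
```

For instance, `version_ge 9.11.4 9.13.1` fails, while `version_ge 9.16.1 9.13.1` succeeds. The installed version could be taken from `nsupdate -V`, although its exact output format is not shown in this issue and would need to be checked on the target system.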

@cmd-ntrf cmd-ntrf linked a pull request Dec 4, 2020 that will close this issue
cmd-ntrf added a commit that referenced this issue Dec 18, 2020
Implementation of ipa-client-install solution for issue #79