
Concurrent local extraction of binaries fails sometimes #457

Open

travisdowns opened this issue Nov 8, 2024 · 5 comments · May be fixed by #498

Comments

@travisdowns
Contributor

travisdowns commented Nov 8, 2024

When the node_exporter role is used to install the exporter on more than one host, we see occasional failures of binary extraction after download, like so:

TASK [prometheus.prometheus._common : Unpack binary archive node_exporter-1.8.2.linux-arm64.tar.gz] ***
changed: [52.27.162.6 -> localhost]
changed: [34.212.107.78 -> localhost]
changed: [35.160.16.87 -> localhost]
changed: [34.211.30.255 -> localhost]
fatal: [35.91.204.184 -> localhost]: FAILED! => {"changed": true, "dest": "/tmp/node_exporter-linux-arm64/1.8.2", "extract_results": {"cmd": ["/usr/bin/tar", "--extract", "-C", "/tmp/node_exporter-linux-arm64/1.8.2", "-z", "--show-transformed-names", "--strip-components=1", "-f", "/root/.ansible/tmp/ansible-tmp-1730343705.1365232-1179-87748719038915/source"], "err": "", "out": "", "rc": 0}, "gid": 0, "group": "root", "handler": "TgzArchive", "mode": "0755", "msg": "Unexpected error when accessing exploded file: [Errno 2] No such file or directory: b'/tmp/node_exporter-linux-arm64/1.8.2/node_exporter'", "owner": "root", "size": 81, "src": "/root/.ansible/tmp/ansible-tmp-1730343705.1365232-1179-87748719038915/source", "state": "directory", "uid": 0}
changed: [54.212.243.52 -> localhost]
changed: [35.91.104.132 -> localhost]

The failure rate is probably under 1% of extraction executions, but with many hosts the overall chance of at least one failure is quite high.

We delegate the download and extraction to localhost, so for N hosts this runs N times in parallel, but the paths are the same in every case. That means we are concurrently extracting to the same destination path, which I guess causes this, e.g., from the tar manual:

When extracting files, if tar discovers that the extracted file already exists, it normally replaces the file by removing it before extracting it, to prevent confusion in the presence of hard or symbolic links.

So there will be brief periods where an already-written file (in this case, the top-level dir) is deleted and replaced by a concurrent extraction, and if an earlier writer tries to list it, it will see it missing.

Besides the race, it's also inefficient to extract the same file N times.
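For illustration, the delegated task has roughly this shape (a hypothetical sketch; the variable and path names are mine, not the role's actual ones):

```yaml
# Hypothetical sketch of the racy pattern, not the role's literal task.
- name: Unpack binary archive
  ansible.builtin.unarchive:
    src: "{{ archive_tmp_path }}"                 # hypothetical: the downloaded tarball on localhost
    dest: "/tmp/node_exporter-linux-arm64/1.8.2"  # identical for every host of the same arch
    remote_src: true
    extra_opts: ["--strip-components=1"]
  delegate_to: localhost  # runs on the controller once per host, all in parallel
```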

I guess run_once would fix this if it weren't for the complication of multi-arch. Maybe serial execution plus a skip-if-already-exists check would work with little refactoring of the existing approach?

I think this race existed before the big refactor in 0.20.0, but the refactor added `list_files`, which may be what triggers the specific failure above.

@travisdowns
Contributor Author

Actually, maybe run_once would work? The TASK text above indicates that it is invoked for a single binary (arm64 here), so maybe all the per-host invocations would have the same result?

@gardar
Member

gardar commented Nov 13, 2024

The problem with run_once is that it will take the ansible_architecture fact from the first host in the play and then skip the rest of the hosts. So if you have multiple architectures in your play, you will not get the binaries for all of them.

But a possible solution to this issue might be to use the `creates` parameter of the unarchive module, as it would make the extraction skip completely if the binaries already exist.
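Something like this, perhaps (a sketch; `unpack_dir` and the `node_exporter` path inside it are assumptions about the layout, not the role's actual variables):

```yaml
# Sketch of the `creates` idea; paths and variables are illustrative.
- name: Unpack binary archive
  ansible.builtin.unarchive:
    src: "{{ archive_tmp_path }}"   # hypothetical variable
    dest: "{{ unpack_dir }}"        # hypothetical variable
    remote_src: true
    creates: "{{ unpack_dir }}/node_exporter"  # task is skipped entirely if this path exists
  delegate_to: localhost
```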

@travisdowns
Contributor Author

The problem with run_once is that it will take the ansible_architecture fact from the first host in the play and then skip the rest of the hosts. So if you have multiple architectures in your play, you will not get the binaries for all of them.

Right, but what I'm confused about is that the task title already embeds the architecture, as in my log above:

TASK [prometheus.prometheus._common : Unpack binary archive node_exporter-1.8.2.linux-arm64.tar.gz] ***

It already says arm64, so is the arch not already baked in? If I had a mix of host types, would this title include multiple binaries? I'm only a casual observer though, so don't pay this too much heed.

But a possible solution to this issue might be to use the `creates` parameter of the unarchive module, as it would make the extraction skip completely if the binaries already exist.

This still seems racy, maybe? The first host starts unpacking, so the root directory specified in `creates` exists almost immediately, but the unpacking is still ongoing; a second host can come in right after, see the root existing, and skip the task. I guess that's fine with the default execution strategy ("linear"), which holds all hosts from proceeding until every host has finished the task, but if someone is using strategy: free, the second host could proceed and use incompletely unpacked binaries.

Maybe it should be `creates` + `serial: 1`?
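As a sketch of that combination (`serial` is a play-level keyword, so it would have to come from the playbook author rather than the role; the group name here is made up):

```yaml
# Illustrative playbook; `serial: 1` runs the play one host at a time,
# so only one unpack is ever in flight, at the cost of serializing everything.
- hosts: exporters   # made-up group name
  serial: 1
  roles:
    - prometheus.prometheus.node_exporter
```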

@travisdowns
Contributor Author

It already says arm64, so is the arch not already baked in? If I had a mix of host types, would this title include multiple binaries? I'm only a casual observer though, so don't pay this too much heed.

To answer my own question: yes, this task does run with different variable values for different architectures, but the task name in the output just uses the variable values from the first host, so it merely appears that the binaries for one architecture are downloaded while, under the covers, we are downloading for all architectures.

travisdowns added a commit to travisdowns/prometheus-ansible that referenced this issue Dec 20, 2024
During the common install tasks, we download and unpack the binary to
install on localhost and eventually upload it to each host (among many
other things). Currently the "unpack" step, which extracts the gzipped
tar archive, is performed N times if there are N hosts in the inventory,
but the target directory (something like
/tmp/node_exporter-linux-arm64/1.8.2) is the same for every host with
the same architecture.

This means that all the unarchive tasks are extracting in parallel in an
unsynchronized manner to the same directory. It's a miracle that this
works, and usually it seems like it does, but sometimes tar will
complain about a missing file, failing the install. I've confirmed
locally that this is due to the race described above.

To fix this, we use `throttle: 1` on the unpack task. This means that
the first task (for each architecture) will do the unpack and the other
tasks are effectively no-ops; they are reported as "ok" rather than
changed in the output.

Fixes prometheus-community#457.

Signed-off-by: Travis Downs <travis.downs@gmail.com>
@travisdowns travisdowns linked a pull request Dec 20, 2024 that will close this issue
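For reference, the shape of that fix on the unpack task would be roughly as follows (a sketch with illustrative variables; see the linked PR for the actual change):

```yaml
# Sketch of the fix; `throttle: 1` allows at most one worker to run this
# task at a time across all hosts, so extractions to the same dest cannot overlap.
- name: Unpack binary archive
  ansible.builtin.unarchive:
    src: "{{ archive_tmp_path }}"  # hypothetical variable
    dest: "{{ unpack_dir }}"       # hypothetical variable
    remote_src: true
  delegate_to: localhost
  throttle: 1
```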
@travisdowns
Contributor Author

I have pushed a patch that fixes this issue.
