Synchronize is using a lot of memory during sync of many files #377

Open
lmm-git opened this issue Jul 4, 2022 · 2 comments
Labels
synchronize Issue and PR for synchronize module

Comments


lmm-git commented Jul 4, 2022

SUMMARY

The synchronize module uses a lot of memory while syncing many files (some might call it a leak).

ISSUE TYPE
  • Bug Report
COMPONENT NAME

Synchronize

ANSIBLE VERSION
ansible [core 2.13.1]
  config file = None
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.9/dist-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible
  python version = 3.9.5 (default, Nov 18 2021, 16:00:48) [GCC 10.3.0]
  jinja version = 3.1.2
  libyaml = True

COLLECTION VERSION
# /usr/local/lib/python3.9/dist-packages/ansible_collections
Collection    Version
------------- -------
ansible.posix 1.4.0  

OS / ENVIRONMENT

Linux, but this should affect every OS

STEPS TO REPRODUCE
- name: Synchronization of OS image
  ansible.posix.synchronize:
    src: /imager/image/
    dest: "{{ imager_mount_dir_new_image }}/"
    archive: yes
    checksum: yes
    verify_host: yes
    delay_updates: no
    rsync_timeout: 60
    rsync_opts:
      # use rsync `--delete-during` instead of the default `delete`, which results in `--delete-after` and higher RAM usage
      - '--delete-during'
      # use --ignore-times to calculate the checksum for all files even if size and time is equal
      - '--ignore-times'
EXPECTED RESULTS

No significant RAM increase

ACTUAL RESULTS

RAM usage rises to a few hundred megabytes for ~240k files.

All of the RAM is freed once the task finishes without error or, on error, once the log output (more than 50 MB in some cases) has been printed.

Technical Details

I suspect the memory growth comes from

cmd.append('--out-format=%s' % shlex_quote(changed_marker + '%i %n%L'))
as this tells rsync to print every changed file (which by itself should be fine). In this line
(rc, out, err) = module.run_command(cmdstr)
rsync is called synchronously, so the whole output is written to the variable out or err. Since the output can get really large for many files, it might be better to stream it to a temporary file or to process it line by line as it arrives.
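To illustrate the line-by-line idea, here is a minimal sketch that uses plain `subprocess` rather than `module.run_command`. The function name `run_streaming`, the `err_limit` parameter, and the default marker string are assumptions made for illustration only; this is not how the module currently works. Standard output is consumed one line at a time, and standard error goes to a temporary file from which only a bounded tail is read back.

```python
import subprocess
import sys
import tempfile


def run_streaming(cmd, changed_marker="<<CHANGED>>", err_limit=65536):
    """Run cmd, count marked lines, and keep only a bounded tail of stderr.

    Hypothetical helper: the name, the marker default and err_limit are
    assumptions for this sketch, not part of ansible.posix.
    """
    changed = 0
    with tempfile.TemporaryFile() as err_file:
        with subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=err_file, text=True
        ) as proc:
            # Iterating the pipe yields one line at a time, so the full
            # output is never held in memory.
            for line in proc.stdout:
                if line.startswith(changed_marker):
                    changed += 1
        rc = proc.returncode
        # Read back at most err_limit bytes from the end of stderr.
        size = err_file.seek(0, 2)
        err_file.seek(max(0, size - err_limit))
        err_tail = err_file.read().decode(errors="replace")
    return rc, changed, err_tail


if __name__ == "__main__":
    demo = "print('<<CHANGED>>a'); print('b')"
    print(run_streaming([sys.executable, "-c", demo]))
```

With this shape the memory used by the caller stays roughly constant regardless of how many files rsync reports, at the cost of no longer having the full list of changed files.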

Furthermore, the processing of the output in the following lines might be quite inefficient depending on the number of empty lines:

while '' in out_lines:
    out_lines.remove('')
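Each pass of that loop scans the list once for the membership test and again for the removal, so the cost grows roughly quadratically with the number of empty lines. A single-pass alternative could look like the sketch below; the helper name `drop_empty_lines` is made up for this example.

```python
def drop_empty_lines(text):
    """Split text on newlines and keep only the non-empty lines, in one pass."""
    return [line for line in text.split("\n") if line]


if __name__ == "__main__":
    print(drop_empty_lines("a\n\nb\n\n\nc\n"))
```

The result is the same list the loop would produce, built with one linear scan.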

Simply adding a quiet switch to rsync does not help, as the task would then lose its changed status.


lmm-git commented Jul 5, 2022

One possible solution would be to stream the run_command output, but unfortunately this has not yet been merged into Ansible, see ansible/proposals#92


lmm-git commented Jul 6, 2022

I ran a few more tests on memory consumption, and the real culprit seems to be how Ansible itself handles the output:

The log output generated with my test sample is about 30 MB (the variable out written to a text file).

All tests were run in a container and numbers were gathered with GNU time.

| Run | Maximum memory consumption in kB |
| --- | --- |
| Original | 467700 |
| Omitting all output to Ansible (just do all the processing, but do not pass anything back to Ansible, see option 3) | 165776 |
| Streaming all rsync output to tail and just evaluating the last 50 lines, see option 2 | 113164 |
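For anyone who wants to reproduce this kind of measurement without GNU time, a rough equivalent in Python is sketched below. It is an assumption about how one might measure, not the method used for the numbers above. The helper name `peak_child_rss_kb` is hypothetical. `ru_maxrss` is reported in kilobytes on Linux (bytes on macOS) and covers the largest child waited for so far, so each measurement is best taken from a fresh interpreter.

```python
import resource
import subprocess
import sys


def peak_child_rss_kb(cmd):
    """Run cmd to completion and return the peak RSS of waited-for children.

    Unix only. The value is the maximum over all children of this process
    that have terminated, in kB on Linux.
    """
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


if __name__ == "__main__":
    # Demo child that allocates about 50 MB before exiting.
    child = [sys.executable, "-c", "x = bytearray(50_000_000)"]
    print(peak_child_rss_kb(child))
```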

To solve this issue I came up with three possible options (each combined with a new flag for omitting the list of changed files):

  1. Implement a proper streaming run_command call in Ansible (probably the cleanest option, but also the most work). With this solution neither Ansible nor the module should use a significant amount of memory. See "Provide mechanism for streaming logs from modules", ansible/proposals#92.
  2. Use tail to process only a sample of the changed files. As far as I can see, this is sufficient for every feature except returning the full list of changed files. The downside is that run_command must be called with use_unsafe_shell=True. As the name implies, this might introduce security issues, a risk I would rather not take.
  3. Process everything as it is done right now, but do not pass it to Ansible. This feels like a hacky solution, but it reduces the required memory significantly.
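As a variation on the tail idea, the last lines can also be kept inside Python with a bounded `collections.deque`, which avoids the shell entirely. This is only a sketch under the assumption that calling `subprocess` directly is acceptable; it does not go through `run_command`, and the helper name `tail_of_command` and its `keep` parameter are hypothetical.

```python
import collections
import subprocess
import sys


def tail_of_command(cmd, keep=50):
    """Run cmd without a shell and return (rc, last `keep` output lines)."""
    # A deque with maxlen discards the oldest entry on every append once
    # it is full, so at most `keep` lines are ever stored.
    last_lines = collections.deque(maxlen=keep)
    with subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    ) as proc:
        for line in proc.stdout:
            last_lines.append(line.rstrip("\n"))
    return proc.returncode, list(last_lines)


if __name__ == "__main__":
    demo = "for i in range(200): print('line', i)"
    rc, lines = tail_of_command([sys.executable, "-c", demo], keep=3)
    print(rc, lines)
```

Because the command is passed as an argument list, no shell quoting is involved, which sidesteps the concern about use_unsafe_shell=True while still bounding memory use.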

@saito-hideki saito-hideki added the synchronize Issue and PR for synchronize module label Dec 3, 2024