Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HashDiff never finishes when comparing two arrays #49

Open
nbarrientos opened this issue Feb 19, 2019 · 3 comments
Open

HashDiff never finishes when comparing two arrays #49

nbarrientos opened this issue Feb 19, 2019 · 3 comments

Comments

@nbarrientos
Copy link

nbarrientos commented Feb 19, 2019

Hi,

When debugging an issue with octocatalog-diff we narrowed down the problem to HashDiff unable to compare a couple of arrays when LCS is used. Here's a reproducer:

require 'hashdiff'

a = {x: []}
(0...17000).each{ |x|
        a[:x].push((0...11).map { ('a'..'z').to_a[rand(26)] }.push('.example.org').join)
        }

b = {x: a[:x]}
 
puts "Without LCS"
diff = HashDiff.diff(a, b, :use_lcs => false)
puts diff
puts "Done!"
puts "With LCS"
diff = HashDiff.diff(a, b)
puts diff
puts "This should never be printed"

When comparing those two arrays using LCS (which is on by default) the application starts to eat up memory very quickly with 100% CPU usage until, in our case, the process is killed by the kernel's OOM killer as the test machine does not have any swap:

$ ruby reproducer_synth.rb 
Without LCS
Done!
With LCS
...
Killed
kernel: Out of memory: Kill process 18783 (ruby) score 526 or sacrifice child
kernel: Killed process 18783 (ruby) total-vm:1192540kB, anon-rss:1013324kB, file-rss:0kB, shmem-rss:0kB

Yes, that's >1GiB!

# ruby --version
ruby 2.0.0p648 (2015-12-16) [x86_64-linux]
# gem list hashdiff

*** LOCAL GEMS ***

hashdiff (0.3.8)
# 

I understand that LCS is O(n2) but not sure this is the expected behaviour as the array size is moderate.

@liufengyun
Copy link
Owner

Thanks a lot for reporting the issue @nbarrientos . The memory behavior is to some extent interpretable, as 17000 * 17000 * 4.0 / 1024 / 1024 / 1024 = 1.07GB. A note should be added to readme that the lib is not situable for computing arrays larger than 10000 elements.

As far as I know, in genetics there are a lot of clever algorithms to avoid such explosion problems. But I am unaware of simple algorithms that do not resort to external storage.

@nbarrientos
Copy link
Author

Hi @liufengyun!

Thanks for replying.

Sorry, when I read the documentation of the library I did see indeed that the complexity of the algorithm was O(n2), but I naively assumed that it was the time complexity. I should have read further about the space complexity 😄

For the fun of it, I allowed a problem of input size 10000 (diffing two arrays of 10000 1-byte elements) to consume as much memory as it needed until it finished. Here's the memory footprint of the process.

image

Same plot but increasing the size to 15000:

image

I agree that perhaps adding a note clarifying this is a good idea, especially taking into account that LCS is enabled by default.

Bonne journée! 👋

@liufengyun
Copy link
Owner

@nbarrientos Cool benchmarks 👍 Merci beaucoup et bonne journée!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants