fix(cmSketch test) and add better cmSketch test #325

One Sketch test was intended to check that the seeds had caused each sketch row to be unique but the way the test was iterating, the failure wasn't going to be triggered. With this fix, the test passes seeming to indicate the seeds are working as intended on the row creation. But there is a problem. A new Sketch test is added that goes to the trouble of sorting the counters of each row to ensure that each row isn't just essentially a copy of another row, where the only difference is which position the counters occupy. And the new Sketch test is performed twice, once with high-entropy hashes, because they come from a PRNG, and once with low-entropy hashes, which in many cases will be normal, because they are all small numbers, good for indexing, but not good for getting a spread when simply applied to a bitwise XOR operation. These new tests show a problem with the counter increment logic within the cmSketch.Increment method which was most likely introduced by commit f823dc4. A subsequent commit addresses the problems surfaced. But as the discussion from issue dgraph-io#108 shows (discussion later moved to https://discuss.dgraph.io/t/cmsketch-not-benefitting-from-four-rows/8712 ), other ideas for incrementing the counters were considered by previous authors as well. Fix existing cmSketch test and add improved cmSketch test (Marked as draft because this introduces a test that fails for now. I can commit a fix to the cmSketch increment method to get the new test to pass - if a maintainer agrees there is a problem to be fixed. See dgraph-io#108. I tried a few years ago.) One Sketch test was intended to check that the seeds had caused each sketch row to be unique but the way the test was iterating, the failure wasn't going to be triggered. With this fix, the test passes seeming to indicate the seeds are working as intended on the row creation. But there is a problem with the actual increment method. A new Sketch test is added that goes further and sorts the counters of each row to ensure that each row isn't just essentially a copy of another row, where the only difference is which position the counters occupy. And the new Sketch test is performed twice, once with high-entropy hashes, because they come from a PRNG, and once with low-entropy hashes, which in many cases will be normal, because they are all small numbers, good for indexing, but not good for getting a spread when simply applied with a bitwise XOR operation. These new tests show a problem with the counter increment logic within the cmSketch.Increment method which was most likely introduced by commit f823dc4. A subsequent commit addresses the problems surfaced. But as the discussion from issue dgraph-io#108 shows (discussion later moved to https://discuss.dgraph.io/t/cmsketch-not-benefitting-from-four-rows/8712), other ideas for incrementing the counters were considered by previous authors of the original Java code as well.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cmSketch test) and add better cmSketch test #325

fix(cmSketch test) and add better cmSketch test #325

Commits on Oct 14, 2024

fix(cmSketch test) and add better cmSketch test #325

Are you sure you want to change the base?

fix(cmSketch test) and add better cmSketch test #325

Commits on Oct 14, 2024