Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(cmSketch test) and add better cmSketch test #325

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Commits on Oct 14, 2024

  1. fix(cmSketch test) and add better cmSketch test

    One Sketch test was intended to check that the seeds had caused each
    sketch row to be unique but the way the test was iterating, the failure
    wasn't going to be triggered.
    
        With this fix, the test passes seeming to indicate the seeds are
        working as intended on the row creation. But there is a problem.
    
    A new Sketch test is added that goes to the trouble of sorting the
    counters of each row to ensure that each row isn't just essentially a
    copy of another row, where the only difference is which position the
    counters occupy.
    
    And the new Sketch test is performed twice, once with high-entropy
    hashes, because they come from a PRNG, and once with low-entropy hashes,
    which in many cases will be normal, because they are all small numbers,
    good for indexing, but not good for getting a spread when simply applied
    to a bitwise XOR operation.
    
        These new tests show a problem with the counter increment logic
        within the cmSketch.Increment method which was most likely
        introduced by commit f823dc4.
    
        A subsequent commit addresses the problems surfaced. But as the
        discussion from issue dgraph-io#108 shows (discussion later moved to
        https://discuss.dgraph.io/t/cmsketch-not-benefitting-from-four-rows/8712 ),
        other ideas for incrementing the counters were considered
        by previous authors as well.
    
    Fix existing cmSketch test and add improved cmSketch test
    
    (Marked as draft because this introduces a test that fails for now. I can
    commit a fix to the cmSketch increment method to get the new test to
    pass - if a maintainer agrees there is a problem to be fixed. See dgraph-io#108.
    I tried a few years ago.)
    
    One Sketch test was intended to check that the seeds had caused each
    sketch row to be unique but the way the test was iterating, the failure
    wasn't going to be triggered.
    
        With this fix, the test passes seeming to indicate the seeds are
        working as intended on the row creation. But there is a problem with
        the actual increment method. A new Sketch test is added that goes
        further and sorts the counters of each row to ensure that each row
        isn't just essentially a copy of another row, where the only
        difference is which position the counters occupy.
    
    And the new Sketch test is performed twice, once with high-entropy
    hashes, because they come from a PRNG, and once with low-entropy hashes,
    which in many cases will be normal, because they are all small numbers,
    good for indexing, but not good for getting a spread when simply applied
    with a bitwise XOR operation.
    
        These new tests show a problem with the counter increment logic
        within the cmSketch.Increment method which was most likely
        introduced by commit f823dc4.
    
        A subsequent commit addresses the problems surfaced. But as the
        discussion from issue dgraph-io#108 shows (discussion later moved to
        https://discuss.dgraph.io/t/cmsketch-not-benefitting-from-four-rows/8712),
        other ideas for incrementing the counters were considered by
        previous authors of the original Java code as well.
    FrankReh authored and mangalaman93 committed Oct 14, 2024
    Configuration menu
    Copy the full SHA
    994e81b View commit details
    Browse the repository at this point in the history