Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CHT returning incorrect ranges on wiki dataset #4

Open
marcelmaltry opened this issue Jul 9, 2021 · 1 comment
Open

CHT returning incorrect ranges on wiki dataset #4

marcelmaltry opened this issue Jul 9, 2021 · 1 comment

Comments

@marcelmaltry
Copy link

Hi,

I noticed that CHT returns incorrect ranges on the wiki dataset. I suspect that this is due to the duplicates in wiki. If so, is support for datasets containing duplicates planned?

Code to reproduce:

#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

#include "cht/builder.h"
#include "cht/cht.h"


template<typename Key>
std::vector<Key> load_data(const std::string &filename) {
    using key_type = Key;

    /* Open file. */
    std::ifstream in(filename, std::ios::binary);
    if (!in.is_open())
        exit(EXIT_FAILURE);

    /* Read number of keys. */
    uint64_t n_keys;
    in.read(reinterpret_cast<char*>(&n_keys), sizeof(uint64_t));

    /* Initialize vector. */
    std::vector<key_type> data;
    data.resize(n_keys);

    /* Read keys. */
    in.read(reinterpret_cast<char*>(data.data()), n_keys * sizeof(key_type));
    in.close();

    return data;
}

int main(int argc, char *argv[])
{
    /* Load keys. */
    auto keys = load_data<uint64_t>("data/wiki_ts_200M_uint64");
    std::cout << "keys loaded" << std::endl;

    /* Build CHT. */
    const std::size_t num_bins = 128;
    const std::size_t max_error = 32;
    cht::Builder<uint64_t> chtb(keys.front(), keys.back(), num_bins, max_error);
    for (const auto &key : keys) chtb.AddKey(key);
    cht::CompactHistTree<uint64_t> cht = chtb.Finalize();
    std::cout << "cht built" << std::endl;

    /* Lookup all key. */
    for (std::size_t i = 0; i != keys.size(); ++i) {
        auto key = keys.at(i);
        auto range = cht.GetSearchBound(key);
        if (i < range.begin or i > range.end) {
            std::cout << "key not in range" << '\n'
                      << "key: " << key << '\n'
                      << "pos: " << i << '\n'
                      << "lo: " << range.begin << '\n'
                      << "hi: " << range.end << '\n'
                      << std::endl;
        }
    }
}
@stoianmihail
Copy link
Owner

Hi, Marcel! Right, this dataset has duplicates. The author told us that CHT has to support them, but as far as the current implementation concerns, this is not yet the case. We are planning an extension to support duplicates.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants