You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I noticed that CHT returns incorrect ranges on the wiki dataset. I suspect that this is due to the duplicates in wiki. If so, is support for datasets containing duplicates planned?
Code to reproduce:
#include<fstream>
#include<iostream>
#include<iterator>
#include<vector>
#include"cht/builder.h"
#include"cht/cht.h"template<typename Key>
std::vector<Key> load_data(const std::string &filename) {
using key_type = Key;
/* Open file. */
std::ifstream in(filename, std::ios::binary);
if (!in.is_open())
exit(EXIT_FAILURE);
/* Read number of keys. */uint64_t n_keys;
in.read(reinterpret_cast<char*>(&n_keys), sizeof(uint64_t));
/* Initialize vector. */
std::vector<key_type> data;
data.resize(n_keys);
/* Read keys. */
in.read(reinterpret_cast<char*>(data.data()), n_keys * sizeof(key_type));
in.close();
return data;
}
intmain(int argc, char *argv[])
{
/* Load keys. */auto keys = load_data<uint64_t>("data/wiki_ts_200M_uint64");
std::cout << "keys loaded" << std::endl;
/* Build CHT. */const std::size_t num_bins = 128;
const std::size_t max_error = 32;
cht::Builder<uint64_t> chtb(keys.front(), keys.back(), num_bins, max_error);
for (constauto &key : keys) chtb.AddKey(key);
cht::CompactHistTree<uint64_t> cht = chtb.Finalize();
std::cout << "cht built" << std::endl;
/* Lookup all key. */for (std::size_t i = 0; i != keys.size(); ++i) {
auto key = keys.at(i);
auto range = cht.GetSearchBound(key);
if (i < range.beginor i > range.end) {
std::cout << "key not in range" << '\n'
<< "key: " << key << '\n'
<< "pos: " << i << '\n'
<< "lo: " << range.begin << '\n'
<< "hi: " << range.end << '\n'
<< std::endl;
}
}
}
The text was updated successfully, but these errors were encountered:
Hi, Marcel! Right, this dataset has duplicates. The author told us that CHT has to support them, but as far as the current implementation concerns, this is not yet the case. We are planning an extension to support duplicates.
Hi,
I noticed that CHT returns incorrect ranges on the
wiki
dataset. I suspect that this is due to the duplicates inwiki
. If so, is support for datasets containing duplicates planned?Code to reproduce:
The text was updated successfully, but these errors were encountered: