Version 1.1
© 2015-2016 Darius Šidlauskas, Sean Chester, and Kenneth S. Bøgh
The SkyBench software suite contains software for efficient main-memory computation of skylines. The state-of-the-art sequential (i.e., single-threaded) and multi-core (i.e., multi-threaded) algorithms are included.
The skyline operator [1] identifies so-called Pareto-optimal points in a multi-dimensional dataset. In two dimensions, the problem is often presented as finding the silhouette of Manhattan: if one knows the position of the corner points of every building, what parts of which buildings are visible from across the river?
The two-dimensional case is trivial to solve and not the focus of SkyBench.
In higher dimensions, the problem is formalised with the concept of dominance: a point p is dominated by another point q if q has better or equal values for every attribute and the points are distinct. All points that are not dominated are part of the skyline. For example, if the points correspond to hotels, then any hotel that is more expensive, farther from anything of interest, and lower-rated than another choice would not be in the skyline. In the table below, Marge's Motel is dominated by Happy Hostel, because it is more expensive, farther from Central Station, and lower-rated, so it is not in the skyline. On the other hand, The Grand has the best rating and Happy Hostel has the best price, so both are in the skyline. Lovely Lodge does not have the best value for any one attribute, but neither The Grand nor Happy Hostel outperforms it on every attribute, so it too is in the skyline and represents a good balance of the attributes.
Name | Price per Night | Rating | Distance to Central Station | In skyline? |
---|---|---|---|---|
The Grand | $325 | ⋆⋆⋆⋆⋆ | 1.2km | ✓ |
Marge's Motel | $55 | ⋆⋆ | 3.6km | |
Happy Hostel | $25 | ⋆⋆⋆ | 0.4km | ✓ |
Lovely Lodge | $100 | ⋆⋆⋆⋆ | 8.2km | ✓ |
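The dominance definition and the hotel table above can be checked with a small, quadratic-time reference computation. This is only an illustrative sketch, not one of the algorithms shipped in SkyBench: the `Hotel` type and the functions `dominates`, `naive_skyline`, and `example_hotels` are hypothetical names introduced here, and ratings are negated on the assumption that every attribute is oriented so that smaller is better.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record type for the hotel example; not part of SkyBench.
struct Hotel {
    std::string name;
    std::vector<double> attrs;  // oriented so that smaller is better
};

// True if q dominates p: q is better than or equal to p on every attribute
// and the two points differ on at least one attribute.
bool dominates(const std::vector<double>& q, const std::vector<double>& p) {
    assert(q.size() == p.size());
    bool strictly_better = false;
    for (std::size_t i = 0; i < q.size(); ++i) {
        if (q[i] > p[i]) return false;          // q is worse somewhere
        if (q[i] < p[i]) strictly_better = true;
    }
    return strictly_better;
}

// Naive reference skyline: keep every hotel that no other hotel dominates.
std::vector<std::string> naive_skyline(const std::vector<Hotel>& hotels) {
    std::vector<std::string> result;
    for (const Hotel& p : hotels) {
        bool dominated = false;
        for (const Hotel& q : hotels) {
            if (dominates(q.attrs, p.attrs)) {
                dominated = true;
                break;
            }
        }
        if (!dominated) result.push_back(p.name);
    }
    return result;
}

// The four hotels from the table as {price, -rating, distance in km}.
std::vector<Hotel> example_hotels() {
    return {
        {"The Grand",     {325.0, -5.0, 1.2}},
        {"Marge's Motel", { 55.0, -2.0, 3.6}},
        {"Happy Hostel",  { 25.0, -3.0, 0.4}},
        {"Lovely Lodge",  {100.0, -4.0, 8.2}},
    };
}
```

Calling `naive_skyline(example_hotels())` returns The Grand, Happy Hostel, and Lovely Lodge, matching the last column of the table. The pairwise comparison is quadratic in the number of points, which is the cost that the partitioning-based algorithms in this suite aim to reduce.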
As the number of dimensions/attributes increases, so do the size of the skyline and the difficulty of computing it. Parallel algorithms, such as those implemented here, quickly become necessary.
SkyBench is released in conjunction with our recent ICDE paper [2]. All of the code and scripts necessary to repeat experiments from that paper are available in this software suite. To the best of our knowledge, this is also the first publicly released C++ skyline software, which will hopefully be a useful resource for the academic and industry research communities.
The following algorithms have been implemented in SkyBench:
- Hybrid [2]: Located in src/hybrid. It is the state-of-the-art multi-core algorithm, based on two-level quad-tree partitioning of the data and memoisation of point-to-point relationships.
- Q-Flow [2]: Located in src/qflow. It is a simplification of Hybrid to demonstrate control flow.
- PSkyline [3]: Located in src/pskyline. It was the previous state-of-the-art multi-core algorithm, based on a divide-and-conquer paradigm.
- BSkyTree [4]: Located in src/bskytree. It is the state-of-the-art sequential algorithm, based on a quad-tree partitioning of the data and memoisation of point-to-point relationships.
All four algorithms implement the common interface defined in common/skyline_i.h and use common dominance tests from common/common.h and common/dt_avx.h (the latter when vectorisation is enabled).
For reproducibility of the experiments in [2], we include three datasets. The WEATHER dataset was originally obtained from The University of East Anglia Climatic Research Unit and preprocessed for skyline computation. We also include two classic skyline datasets, exactly as used in [2]: NBA and HOUSE.
The synthetic workloads can be generated with the standard benchmark skyline data generator [1] hosted on pgfoundry.
SkyBench depends on the following applications:
- A C++ compiler that supports C++11 and OpenMP (e.g., the newest GNU compiler)
- The GNU make program
- AVX or AVX2 if vectorised dominance tests are to be used
To run, the code needs to be compiled with the given number of dimensions.^
For example, to compute the skyline of the 8-dimensional NBA data set located in workloads/nba-U-8-17264.csv, do:
make all DIMS=8
./bin/SkyBench -f workloads/nba-U-8-17264.csv
By default, it will compute the skyline with all algorithms. Running ./bin/SkyBench without parameters will provide more details about the supported options.
You can make use of the provided shell script (/script/runExp.sh) that does all of the above automatically. For details, execute:
./script/runExp.sh
To reproduce the experiment with real datasets (Table II in [2]), do (assuming a 16-core machine):
./scripts/realTest.sh 16 T "bskytree pbskytree pskyline qflow hybrid"
^For performance reasons, skyline implementations that we obtained from other authors compile their code for a specific number of dimensions. For a fair comparison, we adopted the same approach.
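To illustrate why fixing the number of dimensions at compile time can help, the sketch below writes a dominance test whose loop bound is a compile-time constant. It is an assumption-laden illustration, not SkyBench's code: the macro `EXAMPLE_DIMS`, the function `dominates_fixed`, and the alias `ExamplePoint` are hypothetical names, and the mechanism by which the Makefile's DIMS setting reaches the source is not shown in this document.

```cpp
#include <array>
#include <cstddef>

// Hypothetical macro name for this sketch only. A build could set it with a
// compiler flag such as -DEXAMPLE_DIMS=8; SkyBench's real mechanism may differ.
#ifndef EXAMPLE_DIMS
#define EXAMPLE_DIMS 8
#endif

// With D known at compile time, a point is a fixed-size array and the loop
// below has a constant trip count that the compiler is free to unroll.
// Assumes smaller values are better on every attribute.
template <std::size_t D>
bool dominates_fixed(const std::array<float, D>& q,
                     const std::array<float, D>& p) {
    bool strictly_better = false;
    for (std::size_t i = 0; i < D; ++i) {
        if (q[i] > p[i]) return false;          // q is worse somewhere
        if (q[i] < p[i]) strictly_better = true;
    }
    return strictly_better;
}

// Point type for the dimensionality chosen at build time.
using ExamplePoint = std::array<float, EXAMPLE_DIMS>;
```

The trade-off is the one the footnote describes: the binary has to be rebuilt whenever the dimensionality of the input changes.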
This software is subject to the terms of The MIT License, which has been included in this repository.
This software suite will be expanded soon with new algorithms; so, you are encouraged to ensure that this is still the latest version. Please do not hesitate to contact the authors if you have comments, questions, or bugs to report.
1. S. Börzsönyi, D. Kossmann, and K. Stocker. (2001) "The Skyline Operator." In Proceedings of the 17th International Conference on Data Engineering (ICDE 2001), 421--432. http://infolab.usc.edu/csci599/Fall2007/papers/e-1.pdf
2. S. Chester, D. Šidlauskas, I. Assent, and K. S. Bøgh. (2015) "Scalable parallelization of skyline computation for multi-core processors." In Proceedings of the 31st IEEE International Conference on Data Engineering (ICDE 2015), 1083--1094. http://cs.au.dk/~schester/publications/chester_icde2015_mcsky.pdf
3. H. Im, J. Park, and S. Park. (2011) "Parallel skyline computation on multicore architectures." Information Systems 36(4): 808--823. http://dx.doi.org/10.1016/j.is.2010.10.005
4. J. Lee and S. Hwang. (2014) "Scalable skyline computation using a balanced pivot selection technique." Information Systems 39: 1--21. http://dx.doi.org/10.1016/j.is.2013.05.005