Frovedis is high-performance middleware for data analytics. It is written in C++ and utilizes MPI for communication between the servers.
It provides
- Spark-like API for distributed processing
- Matrix library using above API
- Machine learning algorithm library
- Dataframe for preprocessing
- Spark/Python interface for easy utilization
Our primary target architecture is SX-Aurora TSUBASA, which is NEC's vector computer; these libraries are carefully written to support vectorization. However, they are just standard C++ programs and can run efficiently on other architectures like x86.
The machine learning algorithm library performs really well on sparse datasets, especially on SX-Aurora TSUBASA. In the case of logistic regression, it performed 10x faster on x86, and 100x faster on SX-Aurora TSUBASA, compared to Spark on x86.
In addition, it provides Spark/Python interface that is mostly compatible with Spark MLlib and Python scikit-learn. If you are using these libraries, you can easily utilize it. In the case of SX-Aurora TSUBASA, Spark/Python runs on x86 side and the middleware runs on VE (Vector Engine); therefore, users can enjoy the high-performance without noticing the hardware details.
If you want prebuilt binary, please check "releases" (https://github.com/frovedis/frovedis/releases), which includes rpm file. If your environment is supported, using rpm is the easiest way to install.
If you want to build the framework on SX-Aurora TSUBASA, we recommend to utilize our build tools (https://github.com/frovedis/packaging) together with VE version of boost (https://github.com/frovedis/boost-ve). Please follow the instructions in README.md of the build tools.
On other platforms, please follow the instructions below.
The system depends on
- C++11 compiler
- Boost library
- MPI
- Numerial libraries: LAPACK, BLAS, Parallel ARPACK, ScaLAPACK
In Linux environment, it is easy to install them using package manager like apt or yum. On SX-Aurora TSUBASA, numerical libraries and MPI are provided by the system. Boost library is provided separately.
Alternatively, the distribution contains BLAS, LAPACK, Parallel ARPACK, and ScaLAPACK source code. You can build them by yourself during the build process.
First, please edit Makefile.conf in this directory to specify the
architecture. If TARGET := x86
, target architecture is Linux/x86.
If TARGET := sx
, target architecture is SX-ACE (cross compiler on Linux).
TARGET := icpc
is for using Intel compiler on Linux (depends on OpenMPI).
If TARGET := ve
, target architecture is SX-Aurora TSUBASA (cross
compiler on Linux).
If BUILD_FOREIGN_IF := true
, foreign interface to Spark and Python
will be compiled. The interface itself makes sense only on x86 (of
course, the server can run on SX-ACE or VE of SX-Aurora TSUBASA,
though). To build this, you need to install Spark and Java beforehand
on x86.
Next, please edit Makefile.in.[x86, sx, icpc, ve]. You would want to
change INSTALLPATH
. You might need to change LIBS_COMMON
if you
use different Linux distribution where the library names are
different. On SX-ACE or SX-Aurora TSUBASA, you would need to change
BOOST_INC
and BOOST_LIB
, which will be defined after installing
Boost.
If you build foreign interface on x86, you also need to specify JAVA_HOME, SCALA_HOME, SPARK_HOME, HDPATH (path for hadoop jar file), and PYTHON_INC in Makefile.in.x86 (even if TARGET is icpc, you need to edit Makefile.in.x86).
After editing Makefile.in.[x86, sx, icpc, ve]
, please type
$ make
Then, libraries will be compiled.
To copy the compiled libraries and headers to the specified directory, please type
$ make install
If the specified directory is not your home directory, you might need
to sudo
.
If you have installed Frovedis from prebuilt binary, you can find /opt/nec/nosupport/frovedis/getting_started.md; please also refer to it.
In this case, if you want to use VE version, you can set up your environment variables by
$ source /opt/nec/nosupport/frovedis/ve/bin/veenv.sh
If you want to use x86 version, please use following:
$ source /opt/nec/nosupport/frovedis/x86/bin/x86env.sh
In the case of prebuilt binary, ${INSTALLPATH}
is
/opt/nec/nosupport/frovedis/ve/
in the case of VE, and
/opt/nec/nosupport/frovedis/x86/
in the case of x86.
Tutorial is in ${INSTALLPATH}
/doc/tutorial/tutorial.[md,pdf].
The directory also contains small programs that are explained in the
tutorial. You can copy the source files into your home directory and
compile them by yourself. The Makefile and Makefile.in.[x86, etc.]
contains configurations for compilation, like compilation options,
path to include files, libraries, etc.
You can re-use it for your own programs.
The small programs in the tutorial directory looks like this:
#include <frovedis.hpp>
int two_times(int i) {return i*2;}
int main(int argc, char* argv[]){
frovedis::use_frovedis use(argc, argv);
std::vector<int> v;
for(size_t i = 1; i <= 8; i++) v.push_back(i);
auto d1 = frovedis::make_dvector_scatter(v);
auto d2 = d1.map(two_times);
auto r = d2.gather();
for(auto i: r) std::cout << i << std::endl;
}
This program creates distributed vector from std::vector, and doubles its elements in a distribited way; then gathers to std::vector again. As you can see, you can write distributed program quite easily and consicely compared to MPI program.
In addition, there are also sample programs in ${INSTALLPATH}
/samples
directory. You can also use them as reference when you write your own
programs.
Manuals are in ${INSTALLPATH}
/doc/manual/manual.pdf, which are more in
detail than the tutorial. Manuals for man
command is in
${INSTALLPATH}
/man; you can do like man dvector
after setting the
directory into your MANPATH.
You can utilize the predefined functionalities from Spark/Python, which includes machine learning algorithms, matrix operations, and dataframe operations.
This is implemented as a server; the server accepts RPC (remote procedure call) to provide the above functionalities from Spark or Python. The server can run on both x86 and VE.
To run the server, the network should be configured properly. Plaese make sure if your hostname (which can be obtained by hostname command) is in /etc/hosts or registered in DNS. (myhostname in /etc/nsswitch.conf does not work for VE.)
Even if you want to use VE, you need to build x86 version, because it builds dynamic link libraries for Spark/Python interface.
Tutorial is in ${INSTALLPATH}
/doc/tutorial_spark/tutorial.[md,pdf],
and ${INSTALLPATH}
/doc/tutorial_python/tutorial.[md,pdf].
The directory also contains small programs that are explained in the
tutorial. You can copy the source files into your home directory and
run them by yourself.
If the insall path of x86 version is ${X86_INSTALLPATH}
,
${X86_INSTALLPATH}
/foreign_if_demo/[spark,python] includes
another demo.
Please copy these directories into your home directory (since it creates files). The scripts ./foreign_if_demo/spark/run_demo.sh and ./foreign_if_demo/python/run_demo.sh run demos.
As for Spark, compiled jar files are used; as for Python, Python scripts are used. The server is invoked from the script and shutdown at the end of the program.
To run the demo, you need to set following environment variables (if you are using prebuilt binary, veenv.sh or x86env.sh sets these variables):
export INSTALLPATH=...
export FROVEDIS_SERVER=${INSTALLPATH}/bin/frovedis_server
export PYTHONPATH=${INSTALLPATH}/lib/python
export LD_LIBRARY_PATH=${INSTALLPATH}/lib
FROVEDIS_SERVER
is used in the both run_demo.sh scripts.
INSTALLPATH
is used to specify options in spark-submit
in the
script. PYTHONPATH
is needed to load the interface library written
in Python. LD_LIBRARY_PATH
is used to load dynamic link library that
is used in the Python library.
You can modify the Spark programs and Python scripts; there is a Makefile to build jar file from Spark/Scala programs.
License of this software is in LICENSE file.
This software includes third party software. The third_party directory includes third party software together with their licenses. For convenience, src/frovedis/core and src/foreign_if/exrpc inclues a few third party files whose license is Boost Software License (shown in the header of the files), and src/frovedis/ml/w2v is based on the software whose license is Apache License Version 2.0.