To use Arrow with the Intel IAA accelerator, you need to build both Arrow and QPL separately. The Arrow build locates QPL through `QPL_HOME`, so build and install QPL first (instructions further below).
Arrow build instructions:

```bash
git clone https://github.com/illinoisdata/arrow-qpl.git
mv arrow-qpl arrow
pushd arrow
git submodule update --init
export PARQUET_TEST_DATA="${PWD}/cpp/submodules/parquet-testing/data"
export ARROW_TEST_DATA="${PWD}/testing/data"
popd
mkdir dist
export ARROW_HOME=$(pwd)/dist
export LD_LIBRARY_PATH=$(pwd)/dist/lib:$LD_LIBRARY_PATH
export CMAKE_PREFIX_PATH=$ARROW_HOME:$CMAKE_PREFIX_PATH
export QPL_HOME=/home/<USER>/qpl_library   # path to your QPL installation (see QPL build below)
export CMAKE_PREFIX_PATH=$QPL_HOME:$CMAKE_PREFIX_PATH
mkdir arrow/cpp/build
pushd arrow/cpp/build
cmake -DCMAKE_PREFIX_PATH=$CMAKE_PREFIX_PATH \
-DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
-DCMAKE_INSTALL_LIBDIR=lib \
-DCMAKE_BUILD_TYPE=Debug \
-DARROW_BUILD_TESTS=ON \
-DARROW_COMPUTE=ON \
-DARROW_CSV=ON \
-DARROW_DATASET=ON \
-DARROW_FILESYSTEM=ON \
-DARROW_HDFS=ON \
-DARROW_JSON=ON \
-DARROW_PARQUET=ON \
-DARROW_WITH_BROTLI=ON \
-DARROW_WITH_BZ2=ON \
-DARROW_WITH_LZ4=ON \
-DARROW_WITH_SNAPPY=ON \
-DARROW_WITH_ZLIB=ON \
-DARROW_WITH_ZSTD=ON \
-DARROW_WITH_QPL=ON \
-DPARQUET_REQUIRE_ENCRYPTION=ON \
-DARROW_EXTRA_ERROR_CONTEXT="ON" \
..
make -j8
sudo make install
popd
```

If you want to build the Python bindings (pyarrow) as well:

```bash
python3 -m venv pyarrow-dev
source ./pyarrow-dev/bin/activate
pip install -r arrow/python/requirements-build.txt
pip install ipykernel
pushd arrow/python
export PYARROW_WITH_PARQUET=1
export PYARROW_WITH_DATASET=1
export PYARROW_WITH_SNAPPY=1
export PYARROW_WITH_ZLIB=1
export PYARROW_WITH_QPL=1
export PYARROW_WITH_ZSTD=1
export PYARROW_WITH_BZ2=1
export PYARROW_WITH_BROTLI=1
export PYARROW_WITH_LZ4=1
export PYARROW_WITH_HDFS=1
export PYARROW_WITH_CSV=1
export PYARROW_WITH_JSON=1
export PYARROW_PARALLEL=8
export PYARROW_WITH_PARQUET_ENCRYPTION=1
python setup.py build_ext --inplace
popd
```

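Once the extension modules are built, a quick sanity check is to import the in-place build and do a small Parquet roundtrip. This is a minimal sketch using a stock codec (snappy); it assumes only that the build above succeeded and that you run it from `arrow/python`:

```python
# Minimal smoke test for the in-place pyarrow build.
# Run from arrow/python so the locally built extension modules are picked up.
import pyarrow as pa
import pyarrow.parquet as pq

print(pa.__version__)

table = pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]})
pq.write_table(table, "/tmp/smoke.parquet", compression="snappy")
assert pq.read_table("/tmp/smoke.parquet").equals(table)
print("pyarrow build OK")
```
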
For building QPL:

```bash
git clone --recursive https://github.com/intel/qpl.git ./qpl_library
cd qpl_library
mkdir build
cd build
mkdir ../qpl_installation
cmake -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=../qpl_installation ..
cmake --build . --target install
# To configure the IAA device (in case we are using hardware path):
sudo python3 /home/<USER>/qpl_library/qpl_installation/share/QPL/scripts/accel_conf.py --load=/home/<USER>/qpl_library/qpl_installation/share/QPL/configs/1n1d1e1w-s-n2.conf
```

Testing

Before testing the arrow-qpl integration, it makes sense to check that QPL runs correctly on its own. You can do this by building and running one of its bundled examples:

```bash
cd ~/qpl_library/examples/low-level-api
g++ -I/home/<USER>/qpl_library/qpl_installation/include -o compression_example compression_example.cpp /home/<USER>/qpl_library/qpl_installation/lib/libqpl.a -ldl
sudo ./compression_example software_path
```

Any issues in the above step need to be fixed before moving forward.
The main test file for the integration is `arrow/cpp/examples/parquet/parquet_arrow/reader-writer.cc`. It creates a table, writes it to disk as a Parquet file compressed with QPL, and then reads the file back and decompresses it (also using QPL). This currently works with both the software path (no accelerator) and the hardware path (IAA accelerator).
To build and run it (note that if you change any source code in the main Arrow repository, you need to rebuild Arrow before running the following):

```bash
cd arrow/cpp/examples/parquet/parquet_arrow
mkdir qpl_build
cd qpl_build
cmake ..
make
./parquet-arrow-example
```

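If you built pyarrow as described above, the same write-compress-read roundtrip can be sketched from Python. This is illustrative only: the codec name "qpl" is an assumption about how the fork registers its codec with pyarrow, not something this README confirms; substitute whatever name the fork actually exposes.

```python
# Hypothetical pyarrow equivalent of reader-writer.cc.
# "qpl" as a compression name is an ASSUMPTION about the fork's codec
# registration; stock pyarrow will reject it.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": list(range(1000)),
                  "val": [i * 0.5 for i in range(1000)]})

pq.write_table(table, "/tmp/qpl_example.parquet", compression="qpl")
assert pq.read_table("/tmp/qpl_example.parquet").equals(table)
```
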
Testing compression/decompression performance on TPCH data:

Details are given in `arrow/python/TPCH_README.md`.
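For a rough idea of what such a benchmark measures, here is a hedged buffer-level timing sketch using `pyarrow.Codec`. Again, the "qpl" codec name is an assumption about the fork, and synthetic bytes stand in for real TPCH column data:

```python
# Hedged micro-benchmark sketch: per-codec compress/decompress timing.
# "qpl" is an ASSUMED codec name for the fork; snappy/zstd are stock codecs.
import time
import pyarrow as pa

def bench(codec_name, data, repeat=10):
    codec = pa.Codec(codec_name)
    t0 = time.perf_counter()
    for _ in range(repeat):
        compressed = codec.compress(data, asbytes=True)
    compress_s = (time.perf_counter() - t0) / repeat
    t0 = time.perf_counter()
    for _ in range(repeat):
        codec.decompress(compressed, decompressed_size=len(data), asbytes=True)
    decompress_s = (time.perf_counter() - t0) / repeat
    print(f"{codec_name}: compress {compress_s:.4f}s, "
          f"decompress {decompress_s:.4f}s, "
          f"ratio {len(compressed) / len(data):.3f}")

data = b"synthetic stand-in for a TPCH column " * 100_000
for name in ("snappy", "zstd", "qpl"):
    try:
        if not pa.Codec.is_available(name):
            print(f"{name}: not built into this pyarrow")
            continue
    except ValueError:
        print(f"{name}: unknown codec name in this build")
        continue
    bench(name, data)
```
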
What's done

- Compression/decompression with QPL. The new compression codec is in `cpp/src/arrow/util/compression_qpl.cc`, and the relevant files elsewhere in the repository have been modified accordingly. Testing details are given above.
- Testing performance speedup on TPCH data for compression/decompression
- Loading TPCH data in C++ for the filtering step

Work in progress

- Testing filtering speedup in QPL compared to Arrow
---------------- Original Apache Arrow README continues from this point onwards ----------------
Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast.
Major components of the project include:
- The Arrow Columnar In-Memory Format: a standard and efficient in-memory representation of various datatypes, plain or nested
- The Arrow IPC Format: an efficient serialization of the Arrow format and associated metadata, for communication between processes and heterogeneous environments
- The Arrow Flight RPC protocol: based on the Arrow IPC format, a building block for remote services exchanging Arrow data with application-defined semantics (for example a storage server or a database)
- C++ libraries
- C bindings using GLib
- C# .NET libraries
- Gandiva: an LLVM-based Arrow expression compiler, part of the C++ codebase
- Go libraries
- Java libraries
- JavaScript libraries
- Python libraries
- R libraries
- Ruby libraries
- Rust libraries
Arrow is an Apache Software Foundation project. Learn more at arrow.apache.org.
The reference Arrow libraries contain many distinct software components:
- Columnar vector and table-like containers (similar to data frames) supporting flat or nested types
- Fast, language agnostic metadata messaging layer (using Google's Flatbuffers library)
- Reference-counted off-heap buffer memory management, for zero-copy memory sharing and handling memory-mapped files
- IO interfaces to local and remote filesystems
- Self-describing binary wire formats (streaming and batch/file-like) for remote procedure calls (RPC) and interprocess communication (IPC)
- Integration tests for verifying binary compatibility between the implementations (e.g. sending data from Java to C++)
- Conversions to and from other in-memory data structures
- Readers and writers for various widely-used file formats (such as Parquet, CSV)
The official Arrow libraries in this repository are in different stages of implementing the Arrow format and related features. See our current feature matrix on git main.
Please read our latest project contribution guide.
Even if you do not plan to contribute to Apache Arrow itself or Arrow integrations in other projects, we'd be happy to have you involved:
- Join the mailing list: send an email to dev-subscribe@arrow.apache.org. Share your ideas and use cases for the project.
- Follow our activity on GitHub issues
- Learn the format
- Contribute code to one of the reference implementations