ARROW-11362: [Rust][DataFusion] Use iterator APIs in to_array_of_size to improve performance #9305

Closed
wants to merge 62 commits into from
Changes from 52 commits
Commits
69a9a1c
ARROW-11303: [Release][C++] Enable mimalloc in the windows verificati…
kszucs Jan 18, 2021
903b41c
ARROW-11309: [Release][C#] Use .NET 3.1 for verification
kou Jan 19, 2021
19e9559
ARROW-11315: [Packaging][APT][arm64] Add missing gir1.2 files
kou Jan 19, 2021
17a3fab
ARROW-11314: [Release][APT][Yum] Add support for verifying arm64 pack…
kou Jan 19, 2021
275fda1
ARROW-7633: [C++][CI] Create fuzz targets for tensors and sparse tensors
mrkn Jan 19, 2021
2d3e8f9
ARROW-11246: [Rust] Add type to Unexpected accumulator state error
ovr Jan 19, 2021
e20f439
ARROW-11254: [Rust][DataFusion] Add SIMD and snmalloc flags as option…
Dandandan Jan 19, 2021
18dc62c
ARROW-11074: [Rust][DataFusion] Implement predicate push-down for par…
yordan-pavlov Jan 19, 2021
127961a
ARROW-10489: [C++] Add Intel C++ compiler options for different warni…
jcmuel Jan 19, 2021
0e5d646
ARROW-9128: [C++] Implement string space trimming kernels: trim, ltri…
maartenbreddels Jan 19, 2021
f63cffa
ARROW-11305 Skip first argument (which is the program name) in parque…
jhorstmann Jan 19, 2021
7e0cb0a
ARROW-11108: [Rust] Fixed performance issue in mutableBuffer.
jorgecarleitao Jan 19, 2021
b448de7
ARROW-11216: [Rust] add doc example for StringDictionaryBuilder
alamb Jan 19, 2021
4a6eb19
ARROW-11268: [Rust][DataFusion] MemTable::load output partition support
Dandandan Jan 19, 2021
a4266a1
ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error
Dandandan Jan 19, 2021
bbc9029
ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join
Dandandan Jan 19, 2021
8e218e0
ARROW-11313: [Rust] Fixed size_hint
jorgecarleitao Jan 19, 2021
35053fe
ARROW-11222: [Rust] Catch up with flatbuffers 0.8.1 which had some UB…
mqy Jan 19, 2021
50ba534
ARROW-11277: [C++] Workaround macOS 10.11: don't default construct co…
bkietz Jan 19, 2021
a7633c7
ARROW-11322: [Rust] Re-opening `memory` module as public
maxburke Jan 20, 2021
555643a
ARROW-11269: [Rust] [Parquet] Preserve timezone in int96 reader
nevi-me Jan 20, 2021
e7c69e6
ARROW-11279: [Rust][Parquet] ArrowWriter Definition Levels Memory Usage
Jan 20, 2021
71572bd
ARROW-11318: [Rust] Support pretty printing timestamp, date, and time…
alamb Jan 20, 2021
ed709e0
ARROW-11311: [Rust] Fixed unset_bit
jorgecarleitao Jan 20, 2021
01c5aec
ARROW-11265: [Rust] Made bool not ArrowNativeType
jorgecarleitao Jan 20, 2021
6912869
ARROW-11290: [Rust][DataFusion] Address hash aggregate performance is…
Dandandan Jan 20, 2021
23550c2
ARROW-11149: [Rust] DF Support List/LargeList/FixedSizeList in create…
ovr Jan 20, 2021
a0e1244
ARROW-11329: [Rust] Don't rerun build.rs on every file change
mbrubeck Jan 20, 2021
8b56f85
ARROW-11220: [Rust] Implement GROUP BY support for Boolean
ovr Jan 21, 2021
4601c02
ARROW-11330: [Rust][DataFusion] add ExpressionVisitor to encode expre…
alamb Jan 21, 2021
84126d5
ARROW-11323: [Rust][DataFusion] Allow sort queries to return no results
alamb Jan 21, 2021
bd90043
ARROW-10831: [C++][Compute] Implement quantile kernel
cyb70289 Jan 21, 2021
72bf95a
ARROW-11334: [Python][CI] Fix failing pandas nightly tests
jorisvandenbossche Jan 21, 2021
bc5d8bf
ARROW-11320: [C++] Try to strengthen temporary dir creation
pitrou Jan 21, 2021
c413566
ARROW-11141: [Rust] Add basic Miri checks to CI pipeline
vertexclique Jan 21, 2021
6959e46
ARROW-11337: [C++] Compilation error with ThreadSanitizer
westonpace Jan 22, 2021
499b6d0
ARROW-11333: [Rust] Generalized creation of empty arrays.
jorgecarleitao Jan 22, 2021
629a6fd
ARROW-10299: [Rust] Use IPC Metadata V5 as default
nevi-me Jan 22, 2021
457fa91
ARROW-11343: [Rust][DataFusion] Simplified example with UDF.
jorgecarleitao Jan 22, 2021
251ecac
ARROW-10766: [Rust] [Parquet] Compute nested list definitions
nevi-me Jan 22, 2021
262bbdc
ARROW-11332: [Rust] Use MutableBuffer in take_string instead of Vec
Dandandan Jan 22, 2021
37c70fb
Add from_iter_values to create arrays from (non null) values
Dandandan Jan 22, 2021
8cd118d
Remove borrow (they are primitive types anyway)
Dandandan Jan 22, 2021
3a63974
Fix comment
Dandandan Jan 22, 2021
b44a4ad
ARROW-11299: [Python] Fix invalid-offsetof warnings
cyb70289 Jan 22, 2021
13e2134
ARROW-11291: [Rust] Add extend to MutableBuffer (-20% for arithmetic,…
jorgecarleitao Jan 23, 2021
67d0c2e
ARROW-11319: [Rust] [DataFusion] Improve test comparisons to record b…
alamb Jan 23, 2021
79c92aa
Merge branch 'master' of github.com:apache/arrow into array_iter_non_…
Dandandan Jan 23, 2021
941ee5d
Use extend
Dandandan Jan 23, 2021
a37941c
Use .collect() api
Dandandan Jan 23, 2021
e448fcc
Use iterators in `to_array_of_size`
Dandandan Jan 23, 2021
69c298e
Add benchmark
Dandandan Jan 23, 2021
10f4ada
ARROW-11317: [Rust] Include the prettyprint feature in CI Coverage
alamb Jan 24, 2021
d612b0f
Use none
Dandandan Jan 24, 2021
6144a23
Use None for Microsecond as well
Dandandan Jan 24, 2021
f2c4e26
Use `None`
Dandandan Jan 24, 2021
cf7638f
ARROW-11349: [Rust] Add from_iter_values to create arrays from (non n…
Dandandan Jan 25, 2021
e07f7e5
Merge remote-tracking branch 'upstream/master' into to_array_of_size_…
Dandandan Jan 25, 2021
eddf021
Use None for strings
Dandandan Jan 25, 2021
555eb1d
fmt
Dandandan Jan 25, 2021
8a20338
Merge remote-tracking branch 'upstream/master' into to_array_of_size_…
Dandandan Jan 26, 2021
32a9e0e
Add license
Dandandan Jan 29, 2021
35 changes: 35 additions & 0 deletions .github/workflows/rust.yml
@@ -226,6 +226,41 @@ jobs:
          cd rust
          cargo clippy --all-targets --workspace -- -D warnings -A clippy::redundant_field_names

  miri-checks:
    name: Miri Checks
    runs-on: ubuntu-latest
    strategy:
      matrix:
        arch: [amd64]
        rust: [nightly-2021-01-19]
    steps:
      - uses: actions/checkout@v2
        with:
          submodules: true
      - uses: actions/cache@v2
        with:
          path: |
            ~/.cargo/registry
            ~/.cargo/git
            target
          key: ${{ runner.os }}-cargo-miri-${{ hashFiles('**/Cargo.lock') }}
      - name: Setup Rust toolchain
        run: |
          rustup toolchain install ${{ matrix.rust }}
          rustup default ${{ matrix.rust }}
          rustup component add rustfmt clippy miri
      - name: Run Miri Checks
        env:
          RUST_BACKTRACE: full
          RUST_LOG: 'trace'
        run: |
          export MIRIFLAGS="-Zmiri-disable-isolation"
          cd rust
          cargo miri setup
          cargo clean
          # Ignore MIRI errors until we can get a clean run
          cargo miri test || true

  coverage:
    name: Coverage
    runs-on: ubuntu-latest
4 changes: 4 additions & 0 deletions cpp/build-support/fuzzing/generate_corpuses.sh
@@ -42,6 +42,10 @@ rm -rf ${CORPUS_DIR}
${OUT}/arrow-ipc-generate-fuzz-corpus -file ${CORPUS_DIR}
${ARROW_CPP}/build-support/fuzzing/pack_corpus.py ${CORPUS_DIR} ${OUT}/arrow-ipc-file-fuzz_seed_corpus.zip

rm -rf ${CORPUS_DIR}
${OUT}/arrow-ipc-generate-tensor-fuzz-corpus -stream ${CORPUS_DIR}
${ARROW_CPP}/build-support/fuzzing/pack_corpus.py ${CORPUS_DIR} ${OUT}/arrow-ipc-tensor-stream-fuzz_seed_corpus.zip

rm -rf ${CORPUS_DIR}
${OUT}/parquet-arrow-generate-fuzz-corpus ${CORPUS_DIR}
cp ${ARROW_CPP}/submodules/parquet-testing/data/*.parquet ${CORPUS_DIR}
25 changes: 23 additions & 2 deletions cpp/cmake_modules/SetupCxxFlags.cmake
@@ -267,6 +267,16 @@ if("${BUILD_WARNING_LEVEL}" STREQUAL "CHECKIN")
    set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -Wno-deprecated-declarations")
    set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -Wno-sign-conversion")
    set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -Wno-unused-variable")
  elseif(CMAKE_CXX_COMPILER_ID STREQUAL "Intel")
    if(WIN32)
      set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} /Wall")
      set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} /Wno-deprecated")
      set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} /Wno-unused-variable")
    else()
      set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -Wall")
      set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -Wno-deprecated")
      set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -Wno-unused-variable")
    endif()
  else()
    message(FATAL_ERROR "${UNKNOWN_COMPILER_MESSAGE}")
  endif()
@@ -289,6 +299,12 @@ elseif("${BUILD_WARNING_LEVEL}" STREQUAL "EVERYTHING")
    set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -Wpedantic")
    set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -Wextra")
    set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -Wno-unused-parameter")
  elseif(CMAKE_CXX_COMPILER_ID STREQUAL "Intel")
    if(WIN32)
      set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} /Wall")
    else()
      set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -Wall")
    endif()
  else()
    message(FATAL_ERROR "${UNKNOWN_COMPILER_MESSAGE}")
  endif()
@@ -304,9 +320,14 @@ else()
     set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} /W3")
   elseif(CMAKE_CXX_COMPILER_ID STREQUAL "AppleClang"
          OR CMAKE_CXX_COMPILER_ID STREQUAL "Clang"
-         OR CMAKE_CXX_COMPILER_ID STREQUAL "GNU"
-         OR CMAKE_CXX_COMPILER_ID STREQUAL "Intel")
+         OR CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
     set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -Wall")
+  elseif(CMAKE_CXX_COMPILER_ID STREQUAL "Intel")
+    if(WIN32)
+      set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} /Wall")
+    else()
+      set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -Wall")
+    endif()
   else()
     message(FATAL_ERROR "${UNKNOWN_COMPILER_MESSAGE}")
   endif()
1 change: 1 addition & 0 deletions cpp/src/arrow/CMakeLists.txt
@@ -365,6 +365,7 @@ if(ARROW_COMPUTE)
compute/registry.cc
compute/kernels/aggregate_basic.cc
compute/kernels/aggregate_mode.cc
compute/kernels/aggregate_quantile.cc
compute/kernels/aggregate_var_std.cc
compute/kernels/codegen_internal.cc
compute/kernels/scalar_arithmetic.cc
5 changes: 5 additions & 0 deletions cpp/src/arrow/compute/api_aggregate.cc
@@ -63,5 +63,10 @@ Result<Datum> Variance(const Datum& value, const VarianceOptions& options,
  return CallFunction("variance", {value}, &options, ctx);
}

Result<Datum> Quantile(const Datum& value, const QuantileOptions& options,
                       ExecContext* ctx) {
  return CallFunction("quantile", {value}, &options, ctx);
}

}  // namespace compute
}  // namespace arrow
41 changes: 41 additions & 0 deletions cpp/src/arrow/compute/api_aggregate.h
@@ -100,6 +100,33 @@ struct ARROW_EXPORT VarianceOptions : public FunctionOptions {
  int ddof = 0;
};

/// \brief Control Quantile kernel behavior
///
/// By default, returns the median value.
struct ARROW_EXPORT QuantileOptions : public FunctionOptions {
  /// Interpolation method to use when quantile lies between two data points
  enum Interpolation {
    LINEAR = 0,
    LOWER,
    HIGHER,
    NEAREST,
    MIDPOINT,
  };

  explicit QuantileOptions(double q = 0.5, enum Interpolation interpolation = LINEAR)
      : q{q}, interpolation{interpolation} {}

  explicit QuantileOptions(std::vector<double> q,
                           enum Interpolation interpolation = LINEAR)
      : q{std::move(q)}, interpolation{interpolation} {}

  static QuantileOptions Defaults() { return QuantileOptions{}; }

  /// quantile must be between 0 and 1 inclusive
  std::vector<double> q;
  enum Interpolation interpolation;
};

/// @}

/// \brief Count non-null (or null) values in an array.
@@ -229,5 +256,19 @@ Result<Datum> Variance(const Datum& value,
                       const VarianceOptions& options = VarianceOptions::Defaults(),
                       ExecContext* ctx = NULLPTR);

/// \brief Calculate the quantiles of a numeric array
///
/// \param[in] value input datum, expecting Array or ChunkedArray
/// \param[in] options see QuantileOptions for more information
/// \param[in] ctx the function execution context, optional
/// \return resulting datum as an array
///
/// \since 4.0.0
/// \note API not yet finalized
ARROW_EXPORT
Result<Datum> Quantile(const Datum& value,
                       const QuantileOptions& options = QuantileOptions::Defaults(),
                       ExecContext* ctx = NULLPTR);

} // namespace compute
} // namespace arrow
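
To make the five `Interpolation` values of `QuantileOptions` concrete, here is a minimal standalone sketch. It is not the kernel added in `aggregate_quantile.cc`. It assumes the common convention that the quantile sits at fractional index `q * (n - 1)` of the sorted data, and it breaks `NEAREST` ties toward the higher point; the diff confirms neither. `QuantileSketch` and the `enum class` below are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical mirror of QuantileOptions::Interpolation (illustration only).
enum class Interpolation { LINEAR, LOWER, HIGHER, NEAREST, MIDPOINT };

// Sketch: quantile q in [0, 1] of non-empty data, computed on a sorted copy.
// Assumption: the quantile position is q * (n - 1), as in NumPy-style quantiles.
double QuantileSketch(std::vector<double> data, double q, Interpolation interp) {
  assert(!data.empty() && q >= 0.0 && q <= 1.0);
  std::sort(data.begin(), data.end());
  const double pos = q * static_cast<double>(data.size() - 1);
  const std::size_t lo = static_cast<std::size_t>(std::floor(pos));
  const std::size_t hi = static_cast<std::size_t>(std::ceil(pos));
  const double frac = pos - static_cast<double>(lo);
  switch (interp) {
    case Interpolation::LOWER:
      return data[lo];
    case Interpolation::HIGHER:
      return data[hi];
    case Interpolation::NEAREST:
      // Assumption: an exact tie (frac == 0.5) goes to the higher point.
      return frac < 0.5 ? data[lo] : data[hi];
    case Interpolation::MIDPOINT:
      return (data[lo] + data[hi]) / 2.0;
    case Interpolation::LINEAR:
      return data[lo] + frac * (data[hi] - data[lo]);
  }
  return data[lo];  // not reached; keeps compilers quiet
}
```

For the data `{1, 2, 3, 4}` and `q = 0.5` the position is 1.5, so under these assumptions `LOWER` picks 2, `HIGHER` picks 3, and `LINEAR` and `MIDPOINT` both give 2.5.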
7 changes: 7 additions & 0 deletions cpp/src/arrow/compute/api_scalar.h
@@ -92,6 +92,13 @@ struct ARROW_EXPORT StrptimeOptions : public FunctionOptions {
  TimeUnit::type unit;
};

struct ARROW_EXPORT TrimOptions : public FunctionOptions {
  explicit TrimOptions(std::string characters) : characters(std::move(characters)) {}

  /// The individual characters that can be trimmed from the string.
  std::string characters;
};

enum CompareOperator : int8_t {
  EQUAL,
  NOT_EQUAL,
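
The `characters` field of `TrimOptions` is a set of individual characters, not a prefix or suffix string. A minimal standalone sketch of that semantics follows. It works byte by byte on `std::string`, which is an assumption: how the actual trim kernels treat multi-byte UTF-8 characters is not shown in this diff. `TrimSketch` is a hypothetical helper, not an Arrow function.

```cpp
#include <cstddef>
#include <string>

// Sketch: strip from both ends every byte that appears in `characters`.
// Assumption: byte-wise matching; not taken from the Arrow kernels.
std::string TrimSketch(const std::string& input, const std::string& characters) {
  const std::size_t first = input.find_first_not_of(characters);
  if (first == std::string::npos) {
    return "";  // empty input, or every byte is in the trim set
  }
  const std::size_t last = input.find_last_not_of(characters);
  return input.substr(first, last - first + 1);
}
```

With `characters` set to `"xy"`, the input `"xxabcxy"` trims to `"abc"`: each leading and trailing byte is tested for membership in the set, so the order of `x` and `y` at the ends does not matter.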
23 changes: 9 additions & 14 deletions cpp/src/arrow/compute/exec.cc
@@ -688,10 +688,9 @@ Status PackBatchNoChunks(const std::vector<Datum>& args, ExecBatch* out) {
     switch (arg.kind()) {
       case Datum::SCALAR:
       case Datum::ARRAY:
+      case Datum::CHUNKED_ARRAY:
         length = std::max(arg.length(), length);
         break;
-      case Datum::CHUNKED_ARRAY:
-        return Status::Invalid("Kernel does not support chunked array arguments");
       default:
         DCHECK(false);
         break;
@@ -722,19 +721,15 @@ class VectorExecutor : public KernelExecutorImpl<VectorKernel> {
                     const std::vector<Datum>& outputs) override {
     // If execution yielded multiple chunks (because large arrays were split
     // based on the ExecContext parameters, then the result is a ChunkedArray
-    if (kernel_->output_chunked) {
-      if (HaveChunkedArray(inputs) || outputs.size() > 1) {
-        return ToChunkedArray(outputs, output_descr_.type);
-      } else if (outputs.size() == 1) {
-        // Outputs have just one element
-        return outputs[0];
-      } else {
-        // XXX: In the case where no outputs are omitted, is returning a 0-length
-        // array always the correct move?
-        return MakeArrayOfNull(output_descr_.type, /*length=*/0).ValueOrDie();
-      }
-    } else {
+    if (kernel_->output_chunked && (HaveChunkedArray(inputs) || outputs.size() > 1)) {
+      return ToChunkedArray(outputs, output_descr_.type);
+    } else if (outputs.size() == 1) {
+      // Outputs have just one element
       return outputs[0];
+    } else {
+      // XXX: In the case where no outputs are omitted, is returning a 0-length
+      // array always the correct move?
+      return MakeArrayOfNull(output_descr_.type, /*length=*/0).ValueOrDie();
     }
   }
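
The flattened condition in the `VectorExecutor` hunk above can be read as a three-way decision. The sketch below restates only that branching with plain booleans so the cases are easy to check. `OutputShape` and `ChooseOutputShape` are hypothetical names, and the three labels stand in for `ToChunkedArray(...)`, `outputs[0]`, and the zero-length `MakeArrayOfNull(...)` result.

```cpp
#include <cstddef>

// Hypothetical labels for the three return paths; these are not Arrow types.
enum class OutputShape { kChunkedArray, kSingleArray, kEmptyArray };

// Sketch of the branching in the new code: chunked output requires both a
// kernel that allows it and either a chunked input or more than one output.
OutputShape ChooseOutputShape(bool output_chunked, bool have_chunked_input,
                              std::size_t num_outputs) {
  if (output_chunked && (have_chunked_input || num_outputs > 1)) {
    return OutputShape::kChunkedArray;
  } else if (num_outputs == 1) {
    return OutputShape::kSingleArray;
  }
  return OutputShape::kEmptyArray;
}
```

One visible difference from the removed code: when `output_chunked` is false and there are no outputs, the new branching reaches the empty-array path, whereas the old `else` branch returned `outputs[0]` unconditionally.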
