Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lots of code duplication in array_functions (array_replace_all, array_replace_n, array_replace, etc) #7988

Closed
alamb opened this issue Oct 30, 2023 · 22 comments
Labels
bug Something isn't working

Comments

@alamb
Copy link
Contributor

alamb commented Oct 30, 2023

Describe the bug

There is a significant amount of code generated for array functions.

This both bloats binaries built with DataFusion as well as makes compile times slow.

To Reproduce

cd datafusion/datafusion-cli
cargo bloat
 File  .text     Size                    Crate Name
 0.1%   0.2% 151.2KiB datafusion_physical_expr datafusion_physical_expr::array_expressions::array_replace_all
 0.1%   0.2% 151.2KiB datafusion_physical_expr datafusion_physical_expr::array_expressions::array_replace_n
 0.1%   0.2% 151.2KiB datafusion_physical_expr datafusion_physical_expr::array_expressions::array_replace
 0.1%   0.2% 150.3KiB                  parquet brotli::enc::prior_eval::PriorEval<Alloc>::update_cost_base
 0.1%   0.2% 124.6KiB datafusion_physical_expr datafusion_physical_expr::array_expressions::array_repeat
 0.1%   0.2% 121.5KiB                   blake2 blake2::Blake2bVarCore::compress
 0.0%   0.1%  81.5KiB                   blake2 blake2::Blake2sVarCore::compress
 0.0%   0.1%  73.2KiB                   blake3 blake3::portable::compress_in_place
 0.0%   0.1%  65.2KiB                chrono_tz <chrono_tz::timezones::Tz as chrono_tz::timezone_impl::TimeSpans>::timespans
 0.0%   0.1%  61.1KiB                sqlparser <sqlparser::ast::Statement as core::fmt::Display>::fmt
 0.0%   0.1%  61.0KiB datafusion_physical_expr datafusion_physical_expr::array_expressions::array_append
 0.0%   0.1%  61.0KiB datafusion_physical_expr datafusion_physical_expr::array_expressions::array_prepend
 0.0%   0.1%  60.5KiB                       h2 h2::codec::framed_read::decode_frame
 0.0%   0.1%  59.3KiB               datafusion datafusion::physical_planner::DefaultPhysicalPlanner::create_initial_plan::{{closure}}
 0.0%   0.1%  56.4KiB datafusion_physical_expr datafusion_physical_expr::array_expressions::array_remove_all
 0.0%   0.1%  56.4KiB datafusion_physical_expr datafusion_physical_expr::array_expressions::array_remove_n
 0.0%   0.1%  56.4KiB datafusion_physical_expr datafusion_physical_expr::array_expressions::array_remove
 0.0%   0.1%  52.4KiB                       h2 h2::frame::headers::HeaderBlock::load::{{closure}}
 0.0%   0.1%  51.4KiB     datafusion_optimizer <datafusion_optimizer::simplify_expressions::expr_simplifier::Simplifier<S> as datafusion_common::tree_node::...
 0.0%   0.1%  48.9KiB datafusion_physical_expr datafusion_physical_expr::datetime_expressions::date_part
35.4%  97.6%  67.1MiB                          And 290367 smaller methods. Use -n N to show more.
36.3% 100.0%  68.7MiB                          .text section size, the file size is 189.3MiB

Expected behavior

I would like the array_replace_all, array_replace_n, array_replace functions to be implemented in terms of arrow kernels (such as eq, and take) and manipulations of offset buffers rather than directly creating new lists.

For example, the large macro expansion here:
https://github.com/apache/arrow-datafusion/blob/bb1d7f9343532d5fa8df871ff42000fbe836d7d7/datafusion/physical-expr/src/array_expressions.rs#L1431-L1437

I believe generates a bunch of specialized code for each different list element data type 😢

Additional context

No response

@alamb
Copy link
Contributor Author

alamb commented Oct 30, 2023

@jayzhan211 here is another one for you to consider

@jayzhan211
Copy link
Contributor

jayzhan211 commented Nov 7, 2023

Since this issue is not closed maybe we can extend this to all others array_function, lazy to open issue for each one :)

sorted by the size of cargo bloat

@Veeupup
Copy link
Contributor

Veeupup commented Nov 7, 2023

@jayzhan211 Thanks for your kind advice! I'll try array_append/prepend first!

It seems that the above functions have been all implemented.

@alamb
Copy link
Contributor Author

alamb commented Nov 7, 2023

Thank you both 🙏

@alamb alamb changed the title Lots of code duplication in array_functions (array_replace_all, array_replace_n, array_replace) Lots of code duplication in array_functions (array_replace_all, array_replace_n, array_replace, etc) Nov 7, 2023
@jayzhan211
Copy link
Contributor

@jayzhan211 Thanks for your kind advice! I'll try array_append/prepend first!

It seems that the above functions have been all implemented.

Yeah, all implemented but need to change to general type approach without downcast on each data type. See #8050 #8054 and #8071

@alamb
Copy link
Contributor Author

alamb commented Nov 8, 2023

I think there are still several examples of macros that could/should be functions:

In general, I think we should be aiming for the removal of macros in the array functions as the macros are unecessary, harder to maintain, and generate more code than is necessary

@Veeupup
Copy link
Contributor

Veeupup commented Nov 8, 2023

@alamb I'm now making code duplication in #8081 , I'll come back later to remove duplicate codes in theses functions

@jayzhan211
Copy link
Contributor

Work on array_remove

@Veeupup
Copy link
Contributor

Veeupup commented Nov 9, 2023

I can help with array_append

@Veeupup
Copy link
Contributor

Veeupup commented Nov 9, 2023

and prepend seems can be done in one PR ..

@jayzhan211
Copy link
Contributor

working on array_positions

@jayzhan211
Copy link
Contributor

jayzhan211 commented Nov 16, 2023

TODO:

Screenshot 2023-11-16 at 08 06 10

@alamb
Copy link
Contributor Author

alamb commented Nov 16, 2023

Its looking really nice @jayzhan211 -- thank you

@Veeupup
Copy link
Contributor

Veeupup commented Nov 17, 2023

@jayzhan211 hi, I can help with array_array

@jayzhan211
Copy link
Contributor

@jayzhan211 hi, I can help with array_array

Great!

@Veeupup
Copy link
Contributor

Veeupup commented Nov 18, 2023

hi @jayzhan211 I'm going to work on array_has : )

@alamb
Copy link
Contributor Author

alamb commented Nov 20, 2023

It is great to see how we are chipping away at this project. Very nice 👍

@jayzhan211
Copy link
Contributor

Work on define_array_slice

@jayzhan211
Copy link
Contributor

Screenshot 2023-12-03 at 12 00 55

datafusion_common::scalar::ScalarValue::iter_to_array is probably not the target function to cleanup ? Since it need to check the data types in Scalar and deal with them specifically.

@alamb
Copy link
Contributor Author

alamb commented Dec 3, 2023

Screenshot 2023-12-03 at 12 00 55 `datafusion_common::scalar::ScalarValue::iter_to_array` is probably not the target function to cleanup ? Since it need to check the data types in Scalar and deal with them specifically.

I think macros that generate specialized for for each scalar type could potentially be replaced

https://github.com/apache/arrow-datafusion/blob/075ff3ddfc78680d5da424ed63ffea1e38a6c57d/datafusion/common/src/scalar.rs#L1371-L1468

This then makes a massive about of duplication:
https://github.com/apache/arrow-datafusion/blob/075ff3ddfc78680d5da424ed63ffea1e38a6c57d/datafusion/common/src/scalar.rs#L1544-L1765

Specifically, rather than having specialized code for each possible list element type, perhaps we could have one code that accumulated the values (perhaps recursively calling iter_to_array) and one function that computes the offsets

@alamb
Copy link
Contributor Author

alamb commented Dec 6, 2023

@jayzhan211 and @Weijun-H and @Veeupup your efforts appear to be paying off!

The size of the datafusion-cli binary has shrunk by 2MB (from 63MB to 61MB) since 33.0.0 🎉

Size on main after merging https://github.com/apache/arrow-datafusion/pull/8414/files

andrewlamb@Andrews-MacBook-Pro:~/Software/arrow-datafusion/datafusion-cli$ ls -l  `which datafusion-cli`
-rwxr-xr-x@ 1 andrewlamb  staff    63M Dec  5 14:07 /Users/andrewlamb/.cargo/bin/datafusion-cli*

Size of binary from 33.0.0

andrewlamb@Andrews-MacBook-Pro:~/Software/arrow-datafusion/datafusion-cli$ ls -l target/release/datafusion-cli
-rwxr-xr-x@ 1 andrewlamb  staff    61M Dec  6 16:48 target/release/datafusion-cli*

@jayzhan211
Copy link
Contributor

I think we have done all the tasks in this topic

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants