Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Make DataFrame a Vec of Column instead of Series #18664

Merged
merged 42 commits into from
Sep 14, 2024

Conversation

coastalwhite
Copy link
Collaborator

No description provided.

@github-actions github-actions bot added internal An internal refactor or improvement python Related to Python Polars rust Related to Rust Polars labels Sep 10, 2024
@coastalwhite coastalwhite force-pushed the df-column-enum branch 2 times, most recently from a7b5dd6 to c7a070f Compare September 11, 2024 14:54
Copy link

codecov bot commented Sep 12, 2024

Codecov Report

Attention: Patch coverage is 86.52299% with 469 lines in your changes missing coverage. Please review.

Project coverage is 79.85%. Comparing base (a8e18e6) to head (015ca79).
Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
crates/polars-core/src/frame/column/mod.rs 80.48% 161 Missing ⚠️
crates/polars-core/src/datatypes/any_value.rs 42.10% 55 Missing ⚠️
...tream/src/nodes/parquet_source/row_group_decode.rs 0.00% 30 Missing ⚠️
crates/polars-expr/src/expressions/apply.rs 68.49% 23 Missing ⚠️
crates/polars-core/src/frame/mod.rs 91.98% 19 Missing ⚠️
crates/polars-ops/src/series/ops/fused.rs 20.00% 12 Missing ⚠️
crates/polars-ops/src/frame/pivot/positioning.rs 65.62% 11 Missing ⚠️
crates/polars-plan/src/dsl/mod.rs 80.00% 10 Missing ⚠️
...ipe/src/executors/sinks/group_by/generic/global.rs 0.00% 9 Missing ⚠️
crates/polars-plan/src/dsl/expr_dyn_fn.rs 12.50% 7 Missing ⚠️
... and 52 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18664      +/-   ##
==========================================
+ Coverage   79.82%   79.85%   +0.03%     
==========================================
  Files        1513     1517       +4     
  Lines      203828   205542    +1714     
  Branches     2892     2892              
==========================================
+ Hits       162705   164142    +1437     
- Misses      40575    40852     +277     
  Partials      548      548              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

crates/polars-core/src/series/any_value.rs Outdated Show resolved Hide resolved
crates/polars-core/src/series/any_value.rs Outdated Show resolved Hide resolved
crates/polars-core/src/chunked_array/struct_/frame.rs Outdated Show resolved Hide resolved
crates/polars-core/src/chunked_array/struct_/mod.rs Outdated Show resolved Hide resolved
@@ -32,8 +32,8 @@ pub(crate) unsafe fn compare_df_rows2(
join_nulls: bool,
) -> bool {
for (l, r) in left.get_columns().iter().zip(right.get_columns()) {
let l = l.get_unchecked(left_idx);
let r = r.get_unchecked(right_idx);
let l = l.as_materialized_series().get_unchecked(left_idx);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be very expensive. I think we must materialize before and pass &[Series] to left and right.

crates/polars-ops/src/frame/pivot/mod.rs Outdated Show resolved Hide resolved
let a: &ChunkedArray<$T> = a.as_ref().as_ref().as_ref();
let b: &ChunkedArray<$T> = b.as_ref().as_ref().as_ref();
let c: &ChunkedArray<$T> = c.as_ref().as_ref().as_ref();
let a: &ChunkedArray<$T> = a.as_materialized_series().as_ref().as_ref().as_ref();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can materialize the scalar to a series of length 1, we will have very large speed ups here as the arithmetic can deal with broadcasting already.

let a: &ChunkedArray<$T> = a.as_ref().as_ref().as_ref();
let b: &ChunkedArray<$T> = b.as_ref().as_ref().as_ref();
let c: &ChunkedArray<$T> = c.as_ref().as_ref().as_ref();
let a: &ChunkedArray<$T> = a.as_materialized_series().as_ref().as_ref().as_ref();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idem

let a: &ChunkedArray<$T> = a.as_ref().as_ref().as_ref();
let b: &ChunkedArray<$T> = b.as_ref().as_ref().as_ref();
let c: &ChunkedArray<$T> = c.as_ref().as_ref().as_ref();
let a: &ChunkedArray<$T> = a.as_materialized_series().as_ref().as_ref().as_ref();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idem

@@ -141,7 +141,8 @@ impl Container for DataFrame {
}

fn chunk_lengths(&self) -> impl Iterator<Item = usize> {
self.get_columns()[0].chunk_lengths()
// @scalar-correctness?
self.columns[0].as_materialized_series().chunk_lengths()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems wasteful to materialize here. I think we should create e special chunk_lengths iterator for Column. For Column::Scalar it returns 1 time len.

@@ -78,7 +78,8 @@ pub(super) unsafe fn call_plugin(
.get(format!("_polars_plugin_{}", symbol).as_bytes())
.unwrap();

let input = s.iter().map(export_series).collect::<Vec<_>>();
// @scalar-correctness?
let input = s.iter().map(export_column).collect::<Vec<_>>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should materialize here. 👍

let file_path_series = self.include_file_paths.clone().map(|file_path_col| {
StringChunked::full(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from a quick look - IIUC this does mean that (if) we materialize these they won't be shared anymore, but I don't particularly see that as a showstopper

@coastalwhite coastalwhite added the needs-bench Needs a benchmark run label Sep 13, 2024
@ritchie46
Copy link
Member

Alright. Great work @coastalwhite. Huge effort in just 4 days. 🙈

Let's quickly get it in, now all is green! :)

@ritchie46 ritchie46 merged commit 962b576 into pola-rs:main Sep 14, 2024
30 checks passed
@coastalwhite coastalwhite deleted the df-column-enum branch September 14, 2024 11:04
@c-peters c-peters added the accepted Ready for implementation label Sep 16, 2024
nameexhaustion pushed a commit to nameexhaustion/polars that referenced this pull request Sep 16, 2024
coastalwhite added a commit to coastalwhite/polars that referenced this pull request Sep 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation internal An internal refactor or improvement needs-bench Needs a benchmark run python Related to Python Polars rust Related to Rust Polars
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants