feat(`COPY TO`): hive partitioning support #2634

melbourne2991 · 2024-02-11T11:47:59Z

Addresses (#2462)

Provides hive partitioning support for Parquet and JSON.

Missing from this PR:

Remaining formats (lance, csv, bson)
Reading hive partitioned files

CLAassistant · 2024-02-11T11:48:05Z

All committers have signed the CLA.

tychoish

this all looks really good to me.

I'd love to see some tests on "what happens when you partition by something other than a date".
we should definitely open another ticket for "reading from hive-partitioned files." right now you can use our glob function, and that helps, but there is a push down projection that this might not be able to do. Clearly out of scope for this ticket, but it'd be a killer feature either way.
I'd love to see a test with another file format (json or bson?) just to make sure that it's generic enough and doesn't rely on something Parquet-specific.
I think it'd be good to be explicit about whether the partitioned field remains in the output data or is elided because it's in the partition, so a test there would be good.

melbourne2991 · 2024-02-17T03:13:12Z

> this all looks really good to me.
>
> I'd love to see some tests on "what happens when you partition by something other than a date".
>
> we should definitely open another ticket for "reading from hive-partitioned files." right now you can use our glob function, and that helps, but there is a push down projection that this might not be able to do. Clearly out of scope for this ticket, but it'd be a killer feature either way.
>
> I'd love to see a test with another file format (json or bson?) just to make sure that it's generic enough and doesn't rely on something Parquet-specific.
>
> I think it'd be good to be explicit about whether the partitioned field remains in the output data or is elided because it's in the partition, so a test there would be good.

Agree on all these points - thanks for the feedback. (Just a note: the PR isn't in its final form yet. The current test was primarily for development ease - more comprehensive tests are on the way!).

tychoish · 2024-02-26T13:51:23Z

crates/datasources/src/common/sink/write/demux.rs

+//! -- NOTE --
+//! This code was originally sourced from:
+//! Repo: https://github.com/apache/arrow-datafusion
+//! Commit: ae882356171513c9d6c22b3bd966898fb4e8cac0
+//! Path: datafusion/core/src/datasource/file_format/write/demux.rs
+//! Date: 10 Feb 2024


Would like a line here that indicates what change we needed to make.

tychoish · 2024-02-26T13:58:29Z

crates/datasources/src/common/sink/write/demux.rs

+        // This is implemented in the DF repo.
+        None => unimplemented!(),


It sort of feels like this is the wrong thing, and it is essentially a panic regardless. We should decide whether partition_by can ever be None here (it seems like it should never reach this point in a non-error state). At the same time, I can see arguments for making this an error, a straight .expect(), or just a no-op/continue.
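
For illustration, the "make this an error" option could look roughly like the sketch below. `DemuxError` and `partition_columns` are hypothetical names invented for this example and are not taken from the PR or from DataFusion; the real code would presumably reuse an existing error type.

```rust
use std::fmt;

/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq)]
pub enum DemuxError {
    MissingPartitionBy,
}

impl fmt::Display for DemuxError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DemuxError::MissingPartitionBy => {
                write!(f, "partitioned write requested without partition columns")
            }
        }
    }
}

impl std::error::Error for DemuxError {}

/// Returns the partition columns, or an error (rather than a panic via
/// `unimplemented!()`) when no partition columns were supplied.
pub fn partition_columns(partition_by: Option<&[String]>) -> Result<&[String], DemuxError> {
    partition_by.ok_or(DemuxError::MissingPartitionBy)
}
```

With this shape the caller can propagate the failure with `?` and the user sees a message instead of a crashed query.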

tychoish · 2024-02-26T14:13:44Z

crates/protogen/src/sqlexec/copy_to.rs

@@ -90,12 +90,18 @@ pub struct CopyToFormatOptionsCsv {
 pub struct CopyToFormatOptionsJson {
    #[prost(bool, tag = "1")]
    pub array: bool,
+
+    #[prost(string, repeated, tag = "2")]
+    pub partition_columns: Vec<String>,


I think we need to do this for CSV, Lance, and BSON.

I realized that we missed having one of these options structs for bson which I added in #2704

I merged in #2704 which should make it easier to add partitioning support there too.

tychoish · 2024-02-26T14:15:46Z

crates/sqlbuiltins/src/validation.rs

@@ -22,6 +22,9 @@ pub enum ValidationError {

    #[error("Format '{format}' not supported by datasource '{datasource}'")]
    FormatNotSupportedByDatasource { format: String, datasource: String },
+
+    #[error("Partitioning is not supported by format '{format}'")]
+    PartitionByNotSupportedByFormat { format: String },


I'd think this could use the enum that we already have for the format?

tychoish · 2024-02-26T14:18:40Z

crates/sqlexec/src/parser.rs

+    #[test]
+    fn copy_to_partition_by() {
+        let test_cases = ["COPY table TO './dest' FORMAT parquet PARTITION BY (year, quarter)"];
+
+        for test_case in test_cases {
+            let stmt = CustomParser::parse_sql(test_case)
+                .unwrap()
+                .pop_front()
+                .unwrap();
+            assert_eq!(test_case, stmt.to_string().as_str());
+        }
+    }


can we test the error conditions:

PARTITION without BY

an unsupported format?

an unknown format?

tychoish · 2024-02-26T14:42:07Z

crates/sqlbuiltins/src/validation.rs

@@ -165,3 +168,13 @@ pub fn validate_copyto_dest_format_support(dest: &str, format: &str) -> Result<(
        })
    }
 }
+
+pub fn validate_copyto_format_partition_support(format: &str) -> Result<()> {
+    if matches!(format, |"parquet"| "json") {


I feel like we could have a method on the enum (or something similar) do this validation rather than matching on a string. Is this the extension? I worry about this falling out of sync with other variations of the enum.
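
A minimal sketch of the enum-method idea, assuming a format enum along these lines exists. `CopyToFormat`, its variants, `name`, and `supports_partition_by` are hypothetical here; the variant list is just the set of formats mentioned in this PR.

```rust
/// Hypothetical stand-in for the existing format enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CopyToFormat {
    Csv,
    Json,
    Parquet,
    Bson,
    Lance,
}

impl CopyToFormat {
    /// Lowercase name used in error messages.
    pub fn name(self) -> &'static str {
        match self {
            CopyToFormat::Csv => "csv",
            CopyToFormat::Json => "json",
            CopyToFormat::Parquet => "parquet",
            CopyToFormat::Bson => "bson",
            CopyToFormat::Lance => "lance",
        }
    }

    /// Exhaustive match with no wildcard arm: adding a new variant fails to
    /// compile until someone decides whether it supports partitioning.
    pub fn supports_partition_by(self) -> bool {
        match self {
            CopyToFormat::Parquet | CopyToFormat::Json => true,
            CopyToFormat::Csv | CopyToFormat::Bson | CopyToFormat::Lance => false,
        }
    }
}

/// Validation keyed on the enum rather than on a format string.
pub fn validate_partition_support(format: CopyToFormat) -> Result<(), String> {
    if format.supports_partition_by() {
        Ok(())
    } else {
        Err(format!(
            "Partitioning is not supported by format '{}'",
            format.name()
        ))
    }
}
```

Because the match is exhaustive, the compiler handles the "falling out of sync" concern, which a string comparison cannot do.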

tychoish · 2024-02-26T14:42:55Z

crates/sqlexec/src/planner/physical_plan/copy_to.rs

+        CopyToFormatOptions::Json(json_opts) => {
+            let schema = source.schema();
+
+            println!("location: {}", location);


should probably omit this

tychoish · 2024-02-26T14:42:58Z

crates/sqlexec/src/planner/physical_plan/copy_to.rs

+                ),
+                json_opts.partition_columns,
+                location.into(),
+                ".json".to_string(),


Why does this need to be a string rather than using the enum or another property?
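
As a rough sketch of the alternative, the extension could be derived from a format enum at the point where the output path is built. `SinkFormat`, `file_extension`, and `partition_file_path` are hypothetical names for this example; only the hive-style `column=value` directory layout is standard.

```rust
/// Hypothetical format enum for this sketch only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SinkFormat {
    Json,
    Parquet,
}

impl SinkFormat {
    /// File extension (including the leading dot) for this format.
    pub fn file_extension(self) -> &'static str {
        match self {
            SinkFormat::Json => ".json",
            SinkFormat::Parquet => ".parquet",
        }
    }
}

/// Builds a hive-style path such as `base/year=2024/quarter=1/part-0.parquet`.
pub fn partition_file_path(
    base: &str,
    partitions: &[(&str, &str)],
    file_stem: &str,
    format: SinkFormat,
) -> String {
    // Drop any trailing slash so the joined path has no empty segment.
    let mut path = base.trim_end_matches('/').to_string();
    for (column, value) in partitions {
        path.push('/');
        path.push_str(column);
        path.push('=');
        path.push_str(value);
    }
    path.push('/');
    path.push_str(file_stem);
    path.push_str(format.file_extension());
    path
}
```

The call site then passes the enum value through, and the `".json".to_string()` literal disappears.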

universalmind303 · 2024-03-05T14:39:48Z

marking as draft as it's not actively waiting on review.

@melbourne2991 please feel free to ping us when it is ready.

tychoish · 2024-03-20T12:34:43Z

@melbourne2991 wanted to check in on this. Is there anything I can do to help you on this?

melbourne2991 · 2024-03-22T11:55:47Z

hey @tychoish, apologies, I've been swamped lately - I'm not sure if I'll have time to get around to this in any reasonable time frame - happy for someone else to pick it up, there shouldn't be too much effort left on it I hope

melbourne2991 mentioned this pull request Feb 11, 2024

Hive partitioning when using COPY TO #2462

Open

universalmind303 self-requested a review February 11, 2024 17:27

tychoish self-requested a review February 12, 2024 23:14

tychoish reviewed Feb 12, 2024

melbourne2991 force-pushed the feat-copy-to-hive-partitioning branch 2 times, most recently from eb83dba to 8317fbc Compare February 16, 2024 04:25

tychoish reviewed Feb 16, 2024

melbourne2991 force-pushed the feat-copy-to-hive-partitioning branch 3 times, most recently from dc48449 to d87ea65 Compare February 20, 2024 11:13

melbourne2991 marked this pull request as ready for review February 20, 2024 11:15

melbourne2991 changed the title ~~[WIP] Feat copy to hive partitioning~~ [WIP] " hive partitioning Feb 20, 2024

melbourne2991 changed the title ~~[WIP] " hive partitioning~~ Hive partitioning for 'COPY TO' Feb 20, 2024

melbourne2991 force-pushed the feat-copy-to-hive-partitioning branch from d87ea65 to 2dea866 Compare February 20, 2024 11:17

melbourne2991 changed the title ~~Hive partitioning for 'COPY TO'~~ feat: add hive partitioning support for 'COPY TO' Feb 20, 2024

melbourne2991 force-pushed the feat-copy-to-hive-partitioning branch 2 times, most recently from fadcd2e to 192642b Compare February 22, 2024 12:22

feat: add hive partitioning support for COPY TO (GlareDB#2462)

1c618b5

melbourne2991 force-pushed the feat-copy-to-hive-partitioning branch from 192642b to 1c618b5 Compare February 25, 2024 01:57

greyscaled changed the title ~~feat: add hive partitioning support for 'COPY TO'~~ feat(copy to): hive partitioning support Feb 26, 2024

greyscaled changed the title ~~feat(copy to): hive partitioning support~~ feat(COPY TO): hive partitioning support Feb 26, 2024

greyscaled linked an issue Feb 26, 2024 that may be closed by this pull request

Hive partitioning when using COPY TO #2462

Open

tychoish mentioned this pull request Feb 26, 2024

chore: make bson copy to options more consistent #2704

Merged

tychoish reviewed Feb 26, 2024

universalmind303 marked this pull request as draft March 5, 2024 14:39
