Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(COPY TO): hive partitioning support #2634

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

melbourne2991
Copy link

@melbourne2991 melbourne2991 commented Feb 11, 2024

Addresses (#2462)

Provides hive partitioning support for Parquet & Json.

Missing from this PR:

  • Remaining formats (lance, csv, bson)
  • Reading hive partitioned files

@CLAassistant
Copy link

CLAassistant commented Feb 11, 2024

CLA assistant check
All committers have signed the CLA.

@melbourne2991 melbourne2991 force-pushed the feat-copy-to-hive-partitioning branch 2 times, most recently from eb83dba to 8317fbc Compare February 16, 2024 04:25
Copy link
Collaborator

@tychoish tychoish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this all looks really good to me.

  • I'd love to see some tests on "what happens when you partition by something other than a date"
  • we should definitely open another ticket for "reading from hive-partitioned files." right now you can use our glob, function, and that helps, but there is a push down projection that this might not be able to do. Clearly out of scope for this ticket, but it'd be killer feature either way.
  • I'd love to see a test with another file format (json or bson?) just just to make sure that it's generic enough and doesn't rely on something parquet specific.
  • I think it'd be good to be explicit about the expectation that the partitioned field remains in the output data or is elided because it's in the partition, so a test there would be good.

crates/datasources/Cargo.toml Outdated Show resolved Hide resolved
crates/datasources/src/common/sink/write/demux.rs Outdated Show resolved Hide resolved
@melbourne2991
Copy link
Author

this all looks really good to me.

  • I'd love to see some tests on "what happens when you partition by something other than a date"
  • we should definitely open another ticket for "reading from hive-partitioned files." right now you can use our glob, function, and that helps, but there is a push down projection that this might not be able to do. Clearly out of scope for this ticket, but it'd be killer feature either way.
  • I'd love to see a test with another file format (json or bson?) just just to make sure that it's generic enough and doesn't rely on something parquet specific.
  • I think it'd be good to be explicit about the expectation that the partitioned field remains in the output data or is elided because it's in the partition, so a test there would be good.

Agree on all these points - thanks for the feedback. (Just a note: the PR isn't in its final form yet. The current test was primarily for development ease - more comprehensive tests are on the way!).

@melbourne2991 melbourne2991 force-pushed the feat-copy-to-hive-partitioning branch 3 times, most recently from dc48449 to d87ea65 Compare February 20, 2024 11:13
@melbourne2991 melbourne2991 marked this pull request as ready for review February 20, 2024 11:15
@melbourne2991 melbourne2991 changed the title [WIP] Feat copy to hive partitioning [WIP] " hive partitioning Feb 20, 2024
@melbourne2991 melbourne2991 changed the title [WIP] " hive partitioning Hive partitioning for 'COPY TO' Feb 20, 2024
@melbourne2991 melbourne2991 changed the title Hive partitioning for 'COPY TO' feat: add hive partitioning support for 'COPY TO' Feb 20, 2024
@melbourne2991 melbourne2991 force-pushed the feat-copy-to-hive-partitioning branch 2 times, most recently from fadcd2e to 192642b Compare February 22, 2024 12:22
@greyscaled greyscaled changed the title feat: add hive partitioning support for 'COPY TO' feat(copy to): hive partitioning support Feb 26, 2024
@greyscaled greyscaled changed the title feat(copy to): hive partitioning support feat(COPY TO): hive partitioning support Feb 26, 2024
@greyscaled greyscaled linked an issue Feb 26, 2024 that may be closed by this pull request
Comment on lines +22 to +27
//! -- NOTE --
//! This code was originally sourced from:
//! Repo: https://github.com/apache/arrow-datafusion
//! Commit: ae882356171513c9d6c22b3bd966898fb4e8cac0
//! Path: datafusion/core/src/datasource/file_format/write/demux.rs
//! Date: 10 Feb 2024
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would like a line that indicates what the change is that we needed to make.

Comment on lines +99 to +100
// This is implemented in the DF repo.
None => unimplemented!(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it sort of feels like this is the wrong thing, and this is essentially a panic, regardless... it feels like we should decide if partition_by can ever be None (it seems kind of like it shouldn't ever get to this point, with that being the case in a non-error state.) but at the same time, I can see arguments for making this an error, a straight .expect(), or just a noop/continue.

@@ -90,12 +90,18 @@ pub struct CopyToFormatOptionsCsv {
pub struct CopyToFormatOptionsJson {
#[prost(bool, tag = "1")]
pub array: bool,

#[prost(string, repeated, tag = "2")]
pub partition_columns: Vec<String>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to do this for CSV, Lance, and BSON.

I realized that we missed having one of these options structs for bson which I added in #2704

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I merged in #2704 which should make it easier to add partitioning support there too.

@@ -22,6 +22,9 @@ pub enum ValidationError {

#[error("Format '{format}' not supported by datasource '{datasource}'")]
FormatNotSupportedByDatasource { format: String, datasource: String },

#[error("Partitioning is not supported by format '{format}'")]
PartitionByNotSupportedByFormat { format: String },
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd think that we could have this be an enum that we already have for the format?

Comment on lines +1403 to +1414
#[test]
fn copy_to_partition_by() {
let test_cases = ["COPY table TO './dest' FORMAT parquet PARTITION BY (year, quarter)"];

for test_case in test_cases {
let stmt = CustomParser::parse_sql(test_case)
.unwrap()
.pop_front()
.unwrap();
assert_eq!(test_case, stmt.to_string().as_str());
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we test the error conditions:

  • PARTITION without BY
  • an unsupported format?
  • an unknown format?

@@ -165,3 +168,13 @@ pub fn validate_copyto_dest_format_support(dest: &str, format: &str) -> Result<(
})
}
}

pub fn validate_copyto_format_partition_support(format: &str) -> Result<()> {
if matches!(format, |"parquet"| "json") {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we could have a method on the enum or something that could do this validation rather than on a string. Is this extension? I worry about this falling out of sync with other variations of the enum?

CopyToFormatOptions::Json(json_opts) => {
let schema = source.schema();

println!("location: {}", location);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should probably omit this

),
json_opts.partition_columns,
location.into(),
".json".to_string(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this need to be a string and not using the enum or another property?

@universalmind303
Copy link
Contributor

marking as draft as it's not actively waiting on review.

@melbourne2991 please feel free to ping us when it is ready.

@universalmind303 universalmind303 marked this pull request as draft March 5, 2024 14:39
@tychoish
Copy link
Collaborator

@melbourne2991 wanted to check in on this. Is there anything I can do to help you on this?

@melbourne2991
Copy link
Author

hey @tychoish, apologies, I've been swamped lately - I'm not sure if I'll have time to get around to this in any reasonable time frame - happy for someone else to pick it up, there shouldn't be too much effort left on it I hope

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Hive partitioning when using COPY TO
4 participants