Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support dimensions in conclusion feed #535

Merged
merged 4 commits into from
Sep 19, 2024
Merged

Conversation

nathanielc
Copy link
Collaborator

@nathanielc nathanielc commented Sep 18, 2024

Adds support from the db out the flight API to handle dimensions. There is a mapping from existing events to dimensions on conclusion events.

The olap aggregator also preserves the dimensions during aggregation.

@nathanielc nathanielc requested a review from a team as a code owner September 18, 2024 15:19
@nathanielc nathanielc requested review from ukstv, dav1do and samika98 and removed request for a team September 18, 2024 15:19
Copy link
Collaborator Author

@nathanielc nathanielc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In live walk through we also discovered querying the dimensions in their current form is difficult. We should explore better options.

flight/src/types.rs Show resolved Hide resolved
flight/src/types.rs Outdated Show resolved Hide resolved
flight/src/types.rs Outdated Show resolved Hide resolved
olap/src/aggregator/mod.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@samika98 samika98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! The cid_string_list really simplified the pretty_feed. Reading the unnest and aggregate was confusing this is way easier to read!

Expr::ScalarFunction(ScalarFunction::new_udf(
cid_string.clone(),
vec![col("event_cid")],
))
.alias("event_cid"),
Expr::Cast(Cast::new(Box::new(col("data")), DataType::Utf8)).alias("data"),
Expr::ScalarFunction(ScalarFunction::new_udf(cid_string, vec![col("previous")]))
.alias("previous"),
Expr::ScalarFunction(ScalarFunction::new_udf(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!!

dimensions: vec![
("model".to_string(), init_payload.header().model().to_vec()),
(
"controller".to_string(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about should_index and unique fields? Do we want to include those in the dimension?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should Index is not a dimension as it is mutable. We will need to add it to a wrapper object over the data. I'll create an issue to handle that. Unique is an interesting case. We discussed at length let me summarize.

Dimensions are by definition the fields that users can query over. Unique is not useful in querying for streams. Rather its only purpose is to ensure the cid of the init event is unique compared to other streams with the same dimensions. As a result while it exists as a header on the init event it does not need to be a dimension.

The sep concept is being replaced by dimensions and is now redundant so its not needed either.

Copy link
Contributor

@dav1do dav1do left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. A couple questions about datafusion things I don't quite grok but the code looks good.

}
let _ = self.dimensions.append(!init.dimensions.is_empty());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to go look at the MapBuilder docs to understand this.. makes sense now.

Field::new(
"value",
DataType::Dictionary(
Box::new(DataType::Int8),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, will we support any other types? I'm curious why you choose Int8 instead of 32 or 64 and just binary strings instead of utf8.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Int8 is about how many unique values we can have. Int8 gives us 256. And as I write this I realize this is not enough. I was thinking about the number of unique dimension values for a single stream. However this needs to cover many streams combined. I'll likely pick i32/i64 as its plenty large.

let all_cids = as_list_array(&args[0])?;
let mut strs = ListBuilder::new(StringBuilder::new());
for cids in ArrayIter::new(all_cids) {
if let Some(cids) = cids {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this odd (that we have values that are None) but I guess it's expected based on the Nullable definition. Does the DataType need to be nullable (and when would it be in practice)? I know this for testing so it's not a big deal, just trying to understand for when I have to add something.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah not real use case for when a cid would be null. For example the event_cid column is not nullable. However a this is a function that can be used on any function its best to not assume data is not null. And just pass nulls along.

@nathanielc nathanielc added this pull request to the merge queue Sep 19, 2024
Merged via the queue into main with commit 8a44b03 Sep 19, 2024
5 checks passed
@nathanielc nathanielc deleted the feat/dimensions branch September 19, 2024 21:54
@smrz2001 smrz2001 mentioned this pull request Sep 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants