-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support dimensions in conclusion feed #535
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In live walk through we also discovered querying the dimensions in their current form is difficult. We should explore better options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! The cid_string_list really simplified the pretty_feed. Reading the unnest and aggregate was confusing this is way easier to read!
Expr::ScalarFunction(ScalarFunction::new_udf( | ||
cid_string.clone(), | ||
vec![col("event_cid")], | ||
)) | ||
.alias("event_cid"), | ||
Expr::Cast(Cast::new(Box::new(col("data")), DataType::Utf8)).alias("data"), | ||
Expr::ScalarFunction(ScalarFunction::new_udf(cid_string, vec![col("previous")])) | ||
.alias("previous"), | ||
Expr::ScalarFunction(ScalarFunction::new_udf( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!!
dimensions: vec![ | ||
("model".to_string(), init_payload.header().model().to_vec()), | ||
( | ||
"controller".to_string(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about should_index and unique fields? Do we want to include those in the dimension?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should Index is not a dimension as it is mutable. We will need to add it to a wrapper object over the data. I'll create an issue to handle that. Unique is an interesting case. We discussed at length let me summarize.
Dimensions are by definition the fields that users can query over. Unique is not useful in querying for streams. Rather its only purpose is to ensure the cid of the init event is unique compared to other streams with the same dimensions. As a result while it exists as a header on the init event it does not need to be a dimension.
The sep concept is being replaced by dimensions and is now redundant so its not needed either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. A couple questions about datafusion things I don't quite grok but the code looks good.
} | ||
let _ = self.dimensions.append(!init.dimensions.is_empty()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to go look at the MapBuilder docs to understand this.. makes sense now.
flight/src/server.rs
Outdated
Field::new( | ||
"value", | ||
DataType::Dictionary( | ||
Box::new(DataType::Int8), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, will we support any other types? I'm curious why you choose Int8 instead of 32 or 64 and just binary strings instead of utf8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Int8 is about how many unique values we can have. Int8 gives us 256. And as I write this I realize this is not enough. I was thinking about the number of unique dimension values for a single stream. However this needs to cover many streams combined. I'll likely pick i32/i64 as its plenty large.
let all_cids = as_list_array(&args[0])?; | ||
let mut strs = ListBuilder::new(StringBuilder::new()); | ||
for cids in ArrayIter::new(all_cids) { | ||
if let Some(cids) = cids { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this odd (that we have values that are None) but I guess it's expected based on the Nullable definition. Does the DataType need to be nullable (and when would it be in practice)? I know this for testing so it's not a big deal, just trying to understand for when I have to add something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah not real use case for when a cid would be null. For example the event_cid column is not nullable. However a this is a function that can be used on any function its best to not assume data is not null. And just pass nulls along.
Adds support from the db out the flight API to handle dimensions. There is a mapping from existing events to dimensions on conclusion events.
The olap aggregator also preserves the dimensions during aggregation.