Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hll_add_agg() should return an hll_empty() when aggregating over an empty set #2

Closed
ghost opened this issue Mar 11, 2013 · 8 comments
Closed

Comments

@ghost
Copy link

ghost commented Mar 11, 2013

hll_add_agg currently returns NULL when it aggregates over an empty data set, which can have the undesirable effect of NULLing out the target when the aggregate is used to update another hll column (with hll_union).

The issue is avoidable with some NULL detection (as is shown in this blog post) but the post is right in that the real fix should be in postgresql-hll.

@dimitri
Copy link

dimitri commented Mar 12, 2013

I think the return value for hll_add_agg() should be hll_empty() when called with an empty input.

@ghost
Copy link
Author

ghost commented Mar 14, 2013

@dimitri After looking at the other PG aggregate functions, all of them return NULL over an empty data set. Ditto if any of the aggregated values are NULL in a non-empty data set. Is it worth breaking consistency here?

@dimitri
Copy link

dimitri commented Mar 14, 2013

I think it's worth it, and explained why in the reference blog article. Basically when you union an existing value over the aggregate on an empty set, you lose your current HLL value. That can't be what you want to happen.

@ozgune
Copy link

ozgune commented May 14, 2013

@TimonK I don't think we'd be breaking consistency here. The PostgreSQL manual says that "except for count, aggregate functions return a null value when no rows are selected." This is consistent with count(expression)'s definition where the return value is "the number of input rows for which the value of expression is not null."

The higher level implications for us is, when we run count(Distinct null_returning_expression), we get 0 in PostgreSQL. When we use postgresql-hll functions, we get back Null.

@luben
Copy link

luben commented May 21, 2013

@ozgune hll is estimating counts, so i think it should follow "count" semantics and return hll_empty()

@ozgune
Copy link

ozgune commented Jun 11, 2013

I'd like to submit a pull request for this if everyone thinks it makes sense. Here's what I had in mind.

First, I thought I'd modify hll_pack(), and modify code paths that returned NULL such that they return hll_empty(). However, I realized that we don't have enough information in this function to construct an hll_empty with the correct ms_bits, ms_nregs, and so forth. I could use the defaults, but then we'd run into issues when we call hll_union_agg.

The proper way to do this is to modify hll_add_trans functions such that we initialize an empty multi set the very first time the function is called. This way, instead of creating an uninitialized multi set and then initializing it later on if need be, we initialize the multi set from the beginning.

The proposed change modifies hll_add_trans functions. It simply moves the code block underneath if (msap->ms_type == MST_UNINIT) to if (PG_ARGISNULL(0)). Do you guys think this makes sense?

ozgune pushed a commit to ozgune/postgresql-hll that referenced this issue Jun 18, 2013
… when aggregating over empty sets. This makes hll_add_agg() conform to PostgreSQL's count() semantics over empty sets, and closes issue citusdata#2. Also added two small regression tests.
@ghost ghost closed this as completed Jul 16, 2013
@ghost
Copy link
Author

ghost commented Jul 16, 2013

This is fixed in the newest version, 2.8.0.

@dimitri
Copy link

dimitri commented Aug 4, 2013

Thanks!

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants