Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid sinking a row if this new row is the same as the old row #10362

Open
lmatz opened this issue Jun 16, 2023 · 4 comments
Open

Avoid sinking a row if this new row is the same as the old row #10362

lmatz opened this issue Jun 16, 2023 · 4 comments
Labels
no-issue-activity priority/high type/enhancement Improvements to existing implementation.

Comments

@lmatz
Copy link
Contributor

lmatz commented Jun 16, 2023

Take an example:

dev=> create table t(user int, log varchar);
CREATE_TABLE

dev=> create sink sk1 as select user, (case when count(*) > 1 then true else false end) as status from t group by user 
dev-> with (
dev(>   connector = 'kafka',
dev(>   properties.bootstrap.server = '127.0.0.1:29092',
dev(>   topic = 'test',
dev(>   type = 'append-only',
dev(>   force_append_only = 'true'
dev(> );
CREATE_SINK

dev=> insert into t values (1, 'abc');
INSERT 0 1
dev=> insert into t values (1, 'bcd');
INSERT 0 1
dev=> insert into t values (1, 'cde');
INSERT 0 1
dev=> insert into t values (1, 'def');
INSERT 0 1

Right the output in the Kafka is:

{"status":false,"user":1}
{"status":true,"user":1}
{"status":true,"user":1}
{"status":true,"user":1}

But the user wants:

{"status":false,"user":1}
{"status":true,"user":1}

The argument is that in alerting use case, the following two events provide no information and makes further processing even harder: we need to identify that these three true events are the same, as we only want to alert the user once.

In general, if the entire row is the same as the old row (this is an update op), we expect to sink only once.

Is it possible that there exists a case that the user does want to have multiple same events sunk? I don't know, but I think this can always be achieved by adding the column that is indeed changing, by using the case above:

create sink sk1 as select user, count(*), (case when count(*) > 1 then true else false end) as status from t group by user
wee add another count(*) column.

Tag this issue as high-priority as it has been requested by two users already, and it makes a lot of sense for alerting use cases, which Risingwave is good at.

@github-actions github-actions bot added this to the release-0.20 milestone Jun 16, 2023
@lmatz lmatz added type/enhancement Improvements to existing implementation. priority/high labels Jun 16, 2023
@BugenZhao
Copy link
Member

BugenZhao commented Jun 16, 2023

In general, if the entire row is the same as the old row (this is an update op), we expect to sink only once.

It's worth noting that not all records with Update semantics have the op U- and U+. For example, Join will rewrite inputs of U- and U+ into normal - and + (#1811, #8578).

Thus to handle these general cases, I guess we have to at least buffer records in a single chunk, i.e., do a "compaction" before sinking them to external systems.


BTW, I'm currently evaluating whether the force_append_only feature is being misused beyond its intended purpose. In Sink: User Behavior Guideline, I said ...

I think it’ll be better to make this an “assertion” or “hint”. If the stream does not have the property of append-only, we should warn the users but maybe still allow it to happen. 🤔 For example, there’s some constraints that only the business knows about and we don’t.

Where I mean "if the user believes the output is append-only but our optimizer does not find this due to the lack of some features, we can still allow the sink to run". Apparently, the stream is not append-only in the given case of this issue, so force-append-only does not solve the problem and even makes things more confusing.

I agree that we should find another clear way to provide correct semantics of altering, possibly by exposing the raw change-log (with the op column in data) and allowing users to directly perform some transformations on that (proposed by @st1page). But perhaps we should not pursue the force-append-only approach any further.

@st1page
Copy link
Contributor

st1page commented Jun 16, 2023

I think the "force-append-only" exactly supports many features requested by the user before we implement “get the changelog from a table”. I think we can rethink that after we introduce that.

@lmatz
Copy link
Contributor Author

lmatz commented Jun 17, 2023

Just try to add some more information extracted from the conversation in the slack,
there are totally 3 cases that users want (all streaming queries below):

dev=> create table t(user int, log varchar);
CREATE_TABLE
dev=> create sink sk1 as select user, count(*), (case when count(*) > 1 then true else false end) as status from t group

output multiple times even after count(*) > 1 already and keep increasing, as count(*) is still changing.
Very make sense to output, well-defined.

dev=> create table t(user int, log varchar);
CREATE_TABLE
dev=> create sink sk1 as select user, (case when count(*) > 1 then true else false end) as status from t group

output only once even after count(*) > 1 already and keep increasing, as true stays the same.
Very make sense not to output, well-defined. (But right now RW will output multiple ones).

dev=> create table t(user int, log varchar);
CREATE_TABLE
dev=> create sink sk1 as select user, count(*) as status from t group having count(*) > 1

output only once when count(*) > 1, does not output anymore even after count(*) keeps increasing.
Not well-defined semantics that can be represented by standard SQL.

@lmatz lmatz modified the milestones: release-1.0, release-1.1 Jul 14, 2023
@lmatz lmatz removed this from the release-1.1 milestone Aug 8, 2023
Copy link
Contributor

github-actions bot commented Jul 3, 2024

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean.
Don't worry if you think the issue is still valuable to continue in the future.
It's searchable and can be reopened when it's time. 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-issue-activity priority/high type/enhancement Improvements to existing implementation.
Projects
None yet
Development

No branches or pull requests

3 participants