Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement lazy Distinct operation #1558

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

RobinTF
Copy link
Collaborator

@RobinTF RobinTF commented Oct 15, 2024

Allow the Distinct operation to deal with lazy values.

Copy link

codecov bot commented Oct 15, 2024

Codecov Report

Attention: Patch coverage is 81.42857% with 13 lines in your changes missing coverage. Please review.

Project coverage is 88.44%. Comparing base (81a9350) to head (5210c7c).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
src/engine/Distinct.cpp 80.00% 0 Missing and 13 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1558      +/-   ##
==========================================
- Coverage   88.45%   88.44%   -0.02%     
==========================================
  Files         362      362              
  Lines       27455    27492      +37     
  Branches     3705     3717      +12     
==========================================
+ Hits        24285    24314      +29     
- Misses       1938     1939       +1     
- Partials     1232     1239       +7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link

sonarcloud bot commented Oct 15, 2024

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial review on everything but the tests.


// Removes all duplicates from input with regards to the columns
// in keepIndices. The input needs to be sorted on the keep indices,
// otherwise the result of this function is undefined.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also document the previousRow argument.


VariableToColumnMap computeVariableToColumnMap() const override;

template <size_t WIDTH>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

short docstring please.

@@ -36,17 +33,107 @@ VariableToColumnMap Distinct::computeVariableToColumnMap() const {
return subtree_->getVariableColumns();
}

template <size_t WIDTH>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
template <size_t WIDTH>
// ____________________________________________________________________
template <size_t WIDTH>

@@ -36,17 +33,107 @@ VariableToColumnMap Distinct::computeVariableToColumnMap() const {
return subtree_->getVariableColumns();
}

template <size_t WIDTH>
cppcoro::generator<IdTable> Distinct::lazyDistinct(
cppcoro::generator<IdTable> originalGenerator,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cppcoro::generator<IdTable> originalGenerator,
cppcoro::generator<IdTable> input,

(simpler name,)

cppcoro::generator<IdTable> Distinct::lazyDistinct(
cppcoro::generator<IdTable> originalGenerator,
std::vector<ColumnIndex> keepIndices,
std::optional<IdTable> aggregateTable) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a good reason to not use the bool yieldOnce pattern from other operations?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly for simplicity reasons. This function is currently static just like the regular distinct function and if a bool was passed instead you'd have to pass the width and the allocator to construct the IdTable within the generator. But conceptually it does the same thing

auto last = result.end();

auto dest = result.begin();
if (first == dest) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I see it,there are two optimizations:

  1. Do this columnwise (see the Engine.cpp for a variation of this algorithm that only returns the count of unique elements)

  2. Use std::ranges::unique (with your matches row you can build something)

  3. What about cancellation? (see 1.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In total I think there might be a version with less code possible here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This would work, but it would require O(numRows()) bytes of additional memory (the current algorithm is in-place, no additional memory required, we're just freeing memory that has been allocated by the previous operation).
  2. After implementing this I'm fairly confident std::ranges::unique (as in this exact function) doesn't help here whatsoever, I really tried using it, but it just doesn't work for this use-case (with previousRow considered), but of course there might be other algorithms that might help making this shorter.
  3. Yeah, cancellation would be nice, but that should be added once we settled on a general approach. For the non-lazy approach admittedly it would be better to not clone the whole thing at all and just copy the entries that are actually required, but this would require even more code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants