Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] stop using mergeAndSetValidity for any nested type #7485

Closed
revans2 opened this issue Jan 9, 2023 · 5 comments
Closed

[BUG] stop using mergeAndSetValidity for any nested type #7485

revans2 opened this issue Jan 9, 2023 · 5 comments
Assignees
Labels
bug Something isn't working

Comments

@revans2
Copy link
Collaborator

revans2 commented Jan 9, 2023

Describe the bug

I filed this as a bug, but it really might just be a task.

There are multiple places where we are using mergeAndSetValidity on potentially nested types.

$ find . -type f -iname \*.scala | xargs grep mergeAndSetValidity
./sql-plugin/src/main/scala/org/apache/spark/sql/rapids/complexTypeExtractors.scala:            val hasValidEntryCV = hasNegativeIndicesCV.mergeAndSetValidity(BinaryOp.BITWISE_AND,
./sql-plugin/src/main/scala/org/apache/spark/sql/rapids/mathExpressions.scala:            zero.mergeAndSetValidity(BinaryOp.BITWISE_AND, lhs)
./sql-plugin/src/main/scala/org/apache/spark/sql/rapids/mathExpressions.scala:        repVal.mergeAndSetValidity(BinaryOp.BITWISE_AND, lhs)
./sql-plugin/src/main/scala/org/apache/spark/sql/rapids/collectionOperations.scala:            val hasValidEntryCV = indices.mergeAndSetValidity(BinaryOp.BITWISE_AND,
./sql-plugin/src/main/scala/org/apache/spark/sql/rapids/collectionOperations.scala:      list.mergeAndSetValidity(BinaryOp.BITWISE_AND, rhs.getBase)
./sql-plugin/src/main/scala/org/apache/spark/sql/rapids/collectionOperations.scala:            equals.mergeAndSetValidity(BinaryOp.BITWISE_AND, equals, step)
./sql-plugin/src/main/scala/org/apache/spark/sql/rapids/collectionOperations.scala:        ret.mergeAndSetValidity(BinaryOp.BITWISE_AND, size)
./sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala:        withResource(strs.mergeAndSetValidity(BinaryOp.BITWISE_AND, strs, ends)) { rets =>
./sql-plugin/src/main/scala/org/apache/spark/sql/rapids/HashFunctions.scala:        fullResult.mergeAndSetValidity(BinaryOp.BITWISE_AND, normalized)
./sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuCast.scala:        _.mergeAndSetValidity(BinaryOp.BITWISE_AND, input)
./sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuCast.scala:              spaceCol.mergeAndSetValidity(BinaryOp.BITWISE_AND, strValue)
./sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuCast.scala:        _.mergeAndSetValidity(BinaryOp.BITWISE_AND, input)
./sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuCast.scala:              columns += spaceColumn.mergeAndSetValidity(BinaryOp.BITWISE_AND, nonFirstColumnView)
./sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuCast.scala:          _.mergeAndSetValidity(BinaryOp.BITWISE_AND, input) // original whole row is null

This operation will overwrite the null mask on a column.

https://github.com/rapidsai/cudf/blob/d24bf11838863739da014d828156e8bf638430ff/java/src/main/native/src/ColumnViewJni.cpp#L1762

This is not safe for any nested type, including strings. CUDF does not want to support inputs for LISTS or STRINGS where the value is NULL, but the offsets are not empty. Calling mergeAndSetValidity does not guarantee this and can cause some very subtile bugs in the future with CUDF.

We really should change the API in CUDF to throw an exception if we try to do this on anything that is not a fixed width type. We should also switch all of the code to that calls this to do the right thing in these case. That does not typically mean calling into a ColumnView or other similar API and trying to do something similar manually. If we want to do this correctly we need to do an ifElse call or something similar. It should not be too bad because that API already makes a copy of the input, so it may not actually be any slower.

@revans2 revans2 added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jan 9, 2023
@ttnghia
Copy link
Collaborator

ttnghia commented Jan 9, 2023

Do you want to fix mergeAndSetValidity and continue using it, instead of "stop using" it? The line in https://github.com/rapidsai/cudf/blob/d24bf11838863739da014d828156e8bf638430ff/java/src/main/native/src/ColumnViewJni.cpp#L1762 (+L1767) just sets the null mask but keep the offsets intact. We can just call purge_nonempty_nulls to make the output column clean (i.e., empty out the offsets).

@ttnghia
Copy link
Collaborator

ttnghia commented Jan 9, 2023

Tried to search for set_null_mask in ColumnViewJni.cpp and found several other instances. So we may want to do a more extensive fix for all the callers to set_null_mask in JNI.

@revans2
Copy link
Collaborator Author

revans2 commented Jan 10, 2023

Do you want to fix mergeAndSetValidity and continue using it, instead of "stop using" it? The line in https://github.com/rapidsai/cudf/blob/d24bf11838863739da014d828156e8bf638430ff/java/src/main/native/src/ColumnViewJni.cpp#L1762 (+L1767) just sets the null mask but keep the offsets intact. We can just call purge_nonempty_nulls to make the output column clean (i.e., empty out the offsets).

How we do it is not that important. It would be nice to be able to think about performance too. The code as it is written right now makes a copy of the input column and sets the null mask on it. If we were to call purge_nonemtpy_nulls it is likely to make another copy. If we optimized the first part to just make a column_view that would be fine.

Yes generally we have a lot of places where we might be doing really bad things and we need to actually start putting in place checks to detect these and avoid them before they happen. Because purge_nonempoty_nulls is not free it would be nice to find these and avoid them.

@ttnghia
Copy link
Collaborator

ttnghia commented Jan 10, 2023

In C++, we allow potential pass-through for APIs like this (set null mask). That means we only make a copy if we have to (i.e., when we have to call purge_nonempoty_nulls with input lists/strings). Otherwise, just set null mask to the input column_view and return. We can optimize this mergeAndSetValidity in a similar way.

@razajafri
Copy link
Collaborator

This is fixed because mergeAndSetValidity now purges non-empty nulls and also doesn't trample on already null values. This was fixed in rapidsai/cudf#13335.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants