Support reading ANSI day time interval type from CSV source #4927
Conversation
Signed-off-by: Chong Gao <res_life@163.com>
build
Filed 2 Spark issues: The second range is not [0, 59] in the day time ANSI interval
def supportCsvRead(dt: DataType): Boolean = false

def csvRead(cv: ColumnVector, dt: DataType): ColumnVector =
Suggested change:
def csvRead(cv: ColumnVector, dt: DataType): ColumnVector =
def csvRead(cv: cudf.ColumnVector, dt: DataType): cudf.ColumnVector =
Otherwise it will conflict with PR #4926, which imports org.apache.spark.sql.vectorized.ColumnVector.
Same as above: #4926 has been merged, so there should be a conflict here now.
@@ -15,6 +15,7 @@
 */
package com.nvidia.spark.rapids.shims

import ai.rapids.cudf.ColumnVector
Suggested change:
import ai.rapids.cudf.ColumnVector
import ai.rapids.cudf
done
I see nothing has changed. And the PR I mentioned has been merged, so there should be a conflict here now.
build
Performance test result
The test file is 1.7 GB and has 12,320,000 rows and 10 columns; it was built by duplicating day-time-interval.csv (which is in the tests module) many times. The SQL is in the collapsed details.
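The benchmark query itself is not shown above. Purely as a hypothetical sketch of this kind of read (assuming Spark 3.3+, where CSV can read ANSI intervals; the path and column name below are illustrative, not taken from the actual test):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DayTimeIntervalType

spark = SparkSession.builder.appName("csv-day-time-interval-sketch").getOrCreate()

# One day-time interval column for illustration; the real test file has 10 columns.
schema = StructType([StructField("c0", DayTimeIntervalType(), True)])

df = spark.read.schema(schema).csv("/path/to/duplicated/day-time-interval.csv")
df.selectExpr("count(c0)").show()  # a simple scan is enough to exercise the CSV interval parsing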
build
seconds = 0
microseconds = 0

if (start_field, end_field) == ("day", "day"):
I think it would be better to use the structure below, which should give better performance.
if:
...
elif:
...
elif:
...
else:
...
Can we do this totally differently? There is just so much copy/paste between each part of the code. We know that the Python code is really only converting it to microseconds. So can we just generate a random number for the microseconds, up to max_micros, and then truncate it to the proper time? Also, can we convert the "string" start/end fields into something simpler to write code for?
# in __init__
DAY = 0
HOUR = 1
MIN = 2
SEC = 3
fields_to_look_at = ["day", "hour", "minute", "second"]
si = fields_to_look_at.index(start_field)
ei = fields_to_look_at.index(end_field)
assert si <= ei
self.hasDays = si <= DAY and ei >= DAY
self.hasHours = si <= HOUR and ei >= HOUR
self.hasMin = si <= MIN and ei >= MIN
self.hasSec = si <= SEC and ei >= SEC
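A rough sketch of that suggestion (illustrative only, not the plugin's actual integration-test generator; gen_day_time_interval and max_micros are hypothetical names here): generate one signed microsecond value and truncate it to the granularity of the end field.

import random
from datetime import timedelta

MICROS_PER_SECOND = 1_000_000
MICROS_PER_MINUTE = 60 * MICROS_PER_SECOND
MICROS_PER_HOUR = 60 * MICROS_PER_MINUTE
MICROS_PER_DAY = 24 * MICROS_PER_HOUR

# Granularity of the end field: anything finer is truncated away.
END_FIELD_UNIT = {"day": MICROS_PER_DAY, "hour": MICROS_PER_HOUR,
                  "minute": MICROS_PER_MINUTE, "second": 1}

def gen_day_time_interval(end_field, max_micros):
    micros = random.randint(-max_micros, max_micros)
    unit = END_FIELD_UNIT[end_field]
    # Truncate toward zero so e.g. DAY TO HOUR never carries minutes/seconds/micros.
    truncated = (abs(micros) // unit) * unit
    return timedelta(microseconds=truncated if micros >= 0 else -truncated)

Whether this matches the exact truncation semantics Spark applies for each start/end field combination would still need to be checked against the behavior discussed later in this thread (SPARK-38577).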
sql-plugin/src/main/330+/scala/com/nvidia/spark/rapids/shims/GpuIntervalUtils.scala (outdated review comments, resolved)
@revans2 please help review
sql-plugin/src/main/330+/scala/com/nvidia/spark/rapids/shims/GpuIntervalUtils.scala (review comment resolved)
build
The last comment: #4927 (comment)
Will resolve the conflict after #4946 is merged, because #4946 also conflicts with this PR.
Filed a Spark issue: Interval types are not truncated to the expected endField when creating a DataFrame via Duration. https://issues.apache.org/jira/browse/SPARK-38577
build
}

/**
 * TODO: Blocked by Spark overflow issue: https://issues.apache.org/jira/browse/SPARK-38520
Do we do the right thing with the overflow? If so, then we need to document it.
Yes, we should update the doc; that is covered by another follow-up PR: #4981
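For reference, a minimal sketch of where that overflow boundary sits, assuming the standard representation of Spark's DayTimeIntervalType as a signed 64-bit microsecond count:

MAX_MICROS = (1 << 63) - 1  # Long.MaxValue microseconds
MICROS_PER_SECOND = 1_000_000
MICROS_PER_MINUTE = 60 * MICROS_PER_SECOND
MICROS_PER_HOUR = 60 * MICROS_PER_MINUTE
MICROS_PER_DAY = 24 * MICROS_PER_HOUR

days, rem = divmod(MAX_MICROS, MICROS_PER_DAY)
hours, rem = divmod(rem, MICROS_PER_HOUR)
minutes, rem = divmod(rem, MICROS_PER_MINUTE)
seconds, micros = divmod(rem, MICROS_PER_SECOND)
print(days, hours, minutes, seconds, micros)
# 106751991 4 0 54 775807, i.e. INTERVAL '106751991 04:00:54.775807' DAY TO SECOND;
# any CSV value beyond this range cannot be represented and has to take the overflow path.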
Contributes to #4146.
Support reading the ANSI day-time interval type from the CSV source.
This PR is currently an implementation on the plugin side.
If cuDF supports this in the future, this should be updated; see rapidsai/cudf#10356.
Signed-off-by: Chong Gao res_life@163.com