[BUG] Different CSV parsing behavior between 22.04 and 22.02 #5035
I suspect this is actually a difference in the CSV parsing of floating point values rather than a difference in the max aggregation.
I am also not sure that it is a bug at all. #4637 fixed a number of things with floating point parsing. It is still not perfect, but it is a lot better.
I was testing with the latest 22.04 snapshot (March 24), which should already include #4637, but it does not seem to help here. Hmm, so is our suggestion to let users fall back to the CPU, or to fix this? getNumClasses is the common API for classification ML; if we get a different value, the training process will fail directly.
Sorry, I think you misunderstood me. I think that #4637 caused the issue you are seeing. I didn't think that it fixed it. Did you check that you can just read in the data from CSV with no problems, forgetting the max aggregation?
Yeah, I just tested the CSV reading. It's not a max issue.

```scala
scala> val schema = new StructType(Array(StructField("class", DoubleType, true)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(class,DoubleType,true))

scala> val rawInput = spark.read.schema(schema).csv("/tmp/test.csv")
rawInput: org.apache.spark.sql.DataFrame = [class: double]

scala> rawInput.show()
22/03/28 13:16:42 WARN GpuOverrides:
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
!Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced
  @Expression <Alias> cast(class#0 as string) AS class#3 could run on GPU
    !Expression <Cast> cast(class#0 as string) cannot run on GPU because the GPU will use different precision than Java's toString method when converting floating point data types to strings and this can produce results that differ from the default behavior in Spark. To enable this operation on the GPU, set spark.rapids.sql.castFloatToString.enabled to true.
    @Expression <AttributeReference> class#0 could run on GPU
*Exec <FileSourceScanExec> will run on GPU

+------------------+
|             class|
+------------------+
|               0.0|
|0.9999999999999999|
|1.9999999999999998|
+------------------+
```
Now things get interesting, because before #4637 we were using the built-in cuDF CSV parser to convert the strings to numbers. Now we use cuDF's built-in to_floats: https://github.com/rapidsai/cudf/blob/0d78007adc3bc4988b7424c726c984e81df4f25a/cpp/src/strings/convert/convert_floats.cu#L44-L153

I'm not sure if this is technically a bug or not, but we should file something with cuDF about it, because at a minimum it is very confusing. Yes, they each have different requirements, so having the code be different is fine, but if they produce different results it feels a bit odd. I don't consider this a P1, as the result is within the error bounds we expect for floating point, but it is not great.
If the exact same sequence of characters is parsed into two different floating point values (other than NaNs), that smells like a bug to me. Just from a library quality perspective, it also smells like a DRY violation to have two different functions for parsing a string into a floating-point value.
I filed an issue against cuDF - rapidsai/cudf#10599 |
It is not a bug. It is a feature 😏 |
How is it round-off error on an input as simple as "1.0", though?
In C++, casting from double to int is just round-toward-zero. If you have 1.9999999999999998, the cast produces 1 instead of the 2 you wanted.
So maybe we can mitigate this by fixing the
It seems that Scala
Output:
So our issue here is tricky to fix indeed: if we have values like these, the only way I can think of to overcome this issue (not to fix it) is explicitly telling the users to use
@ttnghia I think we are talking about different things here. Java has its own complicated way of parsing floats and doubles that does not always match C/C++ or really anyone else out there. Eventually we will need to replicate this logic, but my problem is not with C++. My problem is how you get round-off error when parsing something as simple as "1.0". I wrote a quick program to show what C++ does...
When I run it I get...
I then wrote the same thing in Java.
and got essentially identical results.
Looking at how floating point is represented, I am really at a loss as to how you get a round-off error here: a "1", or a "10" for that matter, fits entirely within the significand of either 32-bit or 64-bit values. That feels like a bug to me.
You got identical results here because your C++ values were parsed by the standard C++ library. How does the cudf parser end up with round-off error? The string-to-double parser in cudf is implemented very differently from the standard C++ one: it was implemented in-house and involves many floating-point multiplications and divisions. Even something as simple as "1.0" can come out one step away from the exact value.
Yes, exactly. What you are saying is that cudf does not match C++, and that this is okay because it is within rounding error, and I am asking "is it really okay?", especially when "1.0" and "2.0" are going to be really common and a reasonable person would wonder why they do not parse to clear, exact values.
I have a test.csv file with the content below.
I got different results between 22.04 and 22.02.
The test code below is copied from https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala#L142; if we get a different result, some ML applications will fail on the 22.04 version.
22.04 version
22.02