[BUG] Different CSV parsing behavior between 22.04 and 22.02 #5035

Closed
wbo4958 opened this issue Mar 24, 2022 · 16 comments · Fixed by #5078

Labels
bug: Something isn't working
cudf_dependency: An issue or PR with this label depends on a new feature in cudf
P1: Nice to have for release

Comments


wbo4958 commented Mar 24, 2022

I have a test.csv file with the following content:

0.0
1.0
2.0

I got different results between 22.04 and 22.02.

The test code below is copied from https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala#L142. If the results differ, some ML applications will fail on the 22.04 version.

22.04 version

scala>     import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala>     import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala>     import org.apache.spark.sql.{Row}
import org.apache.spark.sql.Row

scala>     val schema = new StructType(Array(StructField("class", DoubleType, true)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(class,DoubleType,true))

scala>     val rawInput = spark.read.schema(schema).csv("/home/bobwang/tmp/max/test.csv")
rawInput: org.apache.spark.sql.DataFrame = [class: double]

scala>     val maxLabelRow: Array[Row] = rawInput.select(max("class")).take(1)
maxLabelRow: Array[org.apache.spark.sql.Row] = Array([1.9999999999999998])

scala>     val maxDoubleLabel: Double = maxLabelRow.head.getDouble(0)
maxDoubleLabel: Double = 1.9999999999999998

scala>     maxDoubleLabel.toInt + 1
res3: Int = 2

22.02 version

scala>     import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala>     import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala>     import org.apache.spark.sql.{Row}
import org.apache.spark.sql.Row

scala>     val schema = new StructType(Array(StructField("class", DoubleType, true)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(class,DoubleType,true))

scala>     val rawInput = spark.read.schema(schema).csv("/home/bobwang/tmp/max/test.csv")
rawInput: org.apache.spark.sql.DataFrame = [class: double]

scala>     val maxLabelRow: Array[Row] = rawInput.select(max("class")).take(1)
maxLabelRow: Array[org.apache.spark.sql.Row] = Array([2.0])

scala>     val maxDoubleLabel: Double = maxLabelRow.head.getDouble(0)
maxDoubleLabel: Double = 2.0

scala>     maxDoubleLabel.toInt + 1
res0: Int = 3
wbo4958 added the "bug" and "? - Needs Triage" labels Mar 24, 2022
jlowe added the "P0" (Must have for release) label Mar 24, 2022

jlowe commented Mar 24, 2022

I suspect this is actually a difference with CSV parsing of floating point values rather than a difference in the max function. The max function should only ever reference existing values in the DataFrame rather than construct new values. I assume this problem goes away (or is at least consistent between 22.02 and 22.04) if the data is Parquet or ORC rather than CSV, but it would be good to verify that.
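
Something like the following spark-shell snippet (the Parquet path is just an example) would be one way to check: build the same three values directly, write them as Parquet, and compare the max against the CSV-backed read.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

// Known-good values written as Parquet (example path), then read back.
Seq(0.0, 1.0, 2.0).toDF("class").write.mode("overwrite").parquet("/tmp/test_parquet")
spark.read.parquet("/tmp/test_parquet").select(max("class")).show(false)

// The same values read through the CSV path for comparison.
val schema = new StructType(Array(StructField("class", DoubleType, true)))
spark.read.schema(schema).csv("/home/bobwang/tmp/max/test.csv").select(max("class")).show(false)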


revans2 commented Mar 24, 2022

I am also not sure that it is a bug at all.

#4637

fixed a number of things with floating point parsing. It is still not perfect, but it is a lot better.


wbo4958 commented Mar 25, 2022

I was testing with the latest 22.04 snapshot (March 24), which should already include #4637, but it does not seem to help here.

Hmm, so is our suggestion to have users fall back to the CPU, or to fix it? getNumClasses is the common API for classification ML; if we get a different value, the training process will fail directly.



revans2 commented Mar 25, 2022

Sorry, I think you misunderstood me. I think that #4637 caused the issue you are seeing; I didn't mean that it fixed it. Did you check that you can just read the data in from CSV with no problems, forgetting the max aggregation?


wbo4958 commented Mar 28, 2022

Yeah, I just tested the CSV reading. It's not a max issue.

scala> val schema = new StructType(Array(StructField("class", DoubleType, true)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(class,DoubleType,true))

scala> val rawInput = spark.read.schema(schema).csv("/tmp/test.csv")
rawInput: org.apache.spark.sql.DataFrame = [class: double]

scala> rawInput.show()
22/03/28 13:16:42 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  !Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced
    @Expression <Alias> cast(class#0 as string) AS class#3 could run on GPU
      !Expression <Cast> cast(class#0 as string) cannot run on GPU because the GPU will use different precision than Java's toString method when converting floating point data types to strings and this can produce results that differ from the default behavior in Spark.  To enable this operation on the GPU, set spark.rapids.sql.castFloatToString.enabled to true.
        @Expression <AttributeReference> class#0 could run on GPU
    *Exec <FileSourceScanExec> will run on GPU

+------------------+
|             class|
+------------------+
|               0.0|
|0.9999999999999999|
|1.9999999999999998|
+------------------+

jlowe changed the title from '[BUG] Different behavior about "max" between 22.04 and 22.02' to '[BUG] Different CSV parsing behavior between 22.04 and 22.02' Mar 28, 2022

revans2 commented Mar 28, 2022

Now things are interesting, because before #4637 we were using the built-in cuDF CSV parser to convert the strings to numbers. After it, we are now using cuDF's built-in cudf::strings::to_floats. It appears that each one does things slightly differently for 1.0 and 2.0, which is really odd to me.

to_floats: https://github.com/rapidsai/cudf/blob/0d78007adc3bc4988b7424c726c984e81df4f25a/cpp/src/strings/convert/convert_floats.cu#L44-L153
CSV: https://github.com/rapidsai/cudf/blob/0d78007adc3bc4988b7424c726c984e81df4f25a/cpp/src/io/utilities/parsing_utils.cuh#L156-L230

I'm not sure whether this is technically a bug or not, but we should file something with cuDF about it, because at a minimum it is very confusing. Yes, they each have different requirements, so having the code be different is fine, but at the same time, if they produce different results it feels a bit odd.

I don't consider this a P1 as the result is within the error bounds we expect for floating point, but it is not great.

revans2 added the "P1" (Nice to have for release) and "cudf_dependency" labels and removed the "P0" (Must have for release) label Mar 28, 2022
sameerz removed the "? - Needs Triage" label Apr 5, 2022
@jrhemstad

If the exact same sequence of characters is being parsed into two different floating-point values (other than NaNs), that smells like a bug to me.

Just from a library quality perspective, it also smells like a DRY violation to have two different functions for parsing a string into floating-point values.

@andygrove

I filed an issue against cuDF - rapidsai/cudf#10599


ttnghia commented Apr 5, 2022

If the exact same sequence of characters is being parsed into two different floating-point values (other than NaNs), that smells like a bug to me.

It is not a bug. It is a feature 😏
strings::to_floats calls stod (string-to-double) and then static casts the double to a float. There is round-off error, or conversion error, there.
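
For example, the same kind of narrowing on the JVM (just an illustration, not the cuDF code):

// Parse to a double, then narrow to a float and widen back to see the conversion error.
val d: Double = "0.1".toDouble   // nearest double to 0.1
val f: Float  = d.toFloat        // nearest float, a slightly different value
println(d)                       // 0.1
println(f.toDouble)              // 0.10000000149011612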


revans2 commented Apr 8, 2022

How is it round-off error on 1.0 and 2.0? I can see what was filed in the cuDF bug with 0.3, but not with 1.0.


ttnghia commented Apr 8, 2022

In C++, casting from double to int is just round-toward-zero. If you have a 1.99999999999 double, it will be 1 after the cast.


ttnghia commented Apr 8, 2022

So maybe we can mitigate this by changing the toInt API from round-toward-zero to round-to-nearest? I'm not sure about the intended (expected) behavior of toInt here.


ttnghia commented Apr 8, 2022

It seems that Scala's toInt is the same as C++'s:

val x: Double = 1.9999999999999998
val y: Int = x.toInt

println(y)

Output:

1

So our issue here is indeed tricky to fix: if we have 2.0 in the input file, we must not have 1.9999999999999998 in memory after parsing, and this is very difficult to guarantee.

The only way I can think of to work around this issue (not to fix it) is to explicitly tell users to use round.toInt instead of just toInt to get the expected, consistent result.
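
For example, with the value from the repro above:

val maxDoubleLabel: Double = 1.9999999999999998

// toInt truncates toward zero, so a slightly-low parse of 2.0 becomes 1.
println(maxDoubleLabel.toInt)        // 1
// round goes to the nearest integer (returning a Long), which masks the parsing round-off.
println(maxDoubleLabel.round.toInt)  // 2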


revans2 commented Apr 8, 2022

@ttnghia I think we are talking about different things here. Java has its own complicated way of parsing floats and doubles that does not always match C/C++ or really anyone else out there. Eventually we will need to replicate this logic, but my problem is not with C++. My problem is how you get round-off error when parsing something as simple as "1.0". I wrote a quick program to show what C++ does...

#include <iostream>   // std::cout
#include <string>     // std::string, std::stod
#include <cstring>    // memcpy
#include <cstdint>    // uint32_t, uint64_t

void parse_n_print_it(std::string input) {
  double a = std::stod(input);
  uint64_t a_bits = 0;
  memcpy(&a_bits, &a, sizeof(a));

  float fa = (float)a;
  uint32_t fa_bits = 0;
  memcpy(&fa_bits, &fa, sizeof(fa)); 

  std::cout << input << " " << a << " " << a_bits << " " << fa << " " << fa_bits <<"\n";
}

int main ()
{
  parse_n_print_it("0.0");
  parse_n_print_it("1.0");
  parse_n_print_it("2.0");
  parse_n_print_it("3.0");
  parse_n_print_it("0.1");
  parse_n_print_it("0.2");
  parse_n_print_it("0.3");
  return 0;
}

When I run it I get...

0.0 0 0 0 0
1.0 1 4607182418800017408 1 1065353216
2.0 2 4611686018427387904 2 1073741824
3.0 3 4613937818241073152 3 1077936128
0.1 0.1 4591870180066957722 0.1 1036831949
0.2 0.2 4596373779694328218 0.2 1045220557
0.3 0.3 4599075939470750515 0.3 1050253722

I then wrote the same thing in java.

class Test {
  public static void parse_n_print_it(String input) {
    double a = Double.valueOf(input);
    long a_bits = Double.doubleToLongBits(a);

    float fa = (float) a;
    int fa_bits = Float.floatToRawIntBits(fa);

    System.out.println(input + " " + a + " " + a_bits + " " + fa + " " + fa_bits);
  }

  public static void main(String[] args) {
    parse_n_print_it("0.0");
    parse_n_print_it("1.0");
    parse_n_print_it("2.0");
    parse_n_print_it("3.0");
    parse_n_print_it("0.1");
    parse_n_print_it("0.2");
    parse_n_print_it("0.3");
  }
}

and got essentially identical results.

0.0 0.0 0 0.0 0
1.0 1.0 4607182418800017408 1.0 1065353216
2.0 2.0 4611686018427387904 2.0 1073741824
3.0 3.0 4613937818241073152 3.0 1077936128
0.1 0.1 4591870180066957722 0.1 1036831949
0.2 0.2 4596373779694328218 0.2 1045220557
0.3 0.3 4599075939470750515 0.3 1050253722

Looking at how floating point is represented, I am really at a loss about how you get a round-off error here; a "1", or a "10" for that matter, fits entirely within the significand for either 32-bit or 64-bit values. That feels like a bug to me.


ttnghia commented Apr 8, 2022

You got identical results here because your C++ results were parsed by the standard C++ library. The cuDF string parser is not implemented to do exactly the same thing as the standard C++ library (which may produce results very close to Java's). Instead, the cuDF parser may introduce some round-off error, and you would probably get slightly different results.

How does the cuDF parser have round-off error? The string-to-double parser in cuDF is implemented very differently from standard C++. It was implemented in-house, involves many +, -, *, / operations, and even calls log10 and exp10. Each of those operations has round-off.

Even for something as simple as 1.0, the intermediate result, going through the many different operations listed above, can be rounded off. Maybe the cuDF parser is not smart enough to recognize that such input is "simple" and terminate early.
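
As a rough illustration (not the cuDF code), a toy Scala parser that accumulates digits and then scales by a computed power of ten can already drift from the standard-library parse, for example on "0.3":

// Hypothetical digit-accumulating parser; each arithmetic step can round.
def naiveParse(s: String): Double = {
  var value = 0.0
  var decimals = 0
  var afterPoint = false
  for (c <- s) {
    if (c == '.') afterPoint = true
    else {
      value = value * 10.0 + (c - '0')   // accumulate digits in a double
      if (afterPoint) decimals += 1
    }
  }
  // Scaling by a computed power of ten is where round-off can creep in.
  value * math.pow(10.0, -decimals)
}

Seq("0.0", "1.0", "2.0", "0.3").foreach { s =>
  println(s"$s naive=${naiveParse(s)} stdlib=${s.toDouble}")
}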


revans2 commented Apr 8, 2022

You got identical results here because your C++ results were parsed by the standard C++ library. The cuDF string parser is not implemented to do exactly the same thing as the standard C++ library (which may produce results very close to Java's). Instead, the cuDF parser may introduce some round-off error, and you would probably get slightly different results.

Yes, exactly. What you are saying is that cuDF does not match C++ and that is okay because it is within rounding error, and I am asking "is it really okay?", especially when "1.0" and "2.0" are potentially going to be really common and a reasonable person would wonder why they do not come out as a clear and specific value.
