[BUG] Different CSV parsing behavior between 22.04 and 22.02 #5035

Closed
wbo4958 opened this issue Mar 24, 2022 · 16 comments · Fixed by #5078

Labels
bug: Something isn't working
cudf_dependency: An issue or PR with this label depends on a new feature in cudf
P1: Nice to have for release

Comments


wbo4958 commented Mar 24, 2022

I have a test.csv file with the following content:

0.0
1.0
2.0

I got different results between 22.04 and 22.02.

The test code below is copied from https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala#L142. If the results differ, some ML applications will fail on the 22.04 version.

22.04 version

scala>     import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala>     import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala>     import org.apache.spark.sql.{Row}
import org.apache.spark.sql.Row

scala>     val schema = new StructType(Array(StructField("class", DoubleType, true)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(class,DoubleType,true))

scala>     val rawInput = spark.read.schema(schema).csv("/home/bobwang/tmp/max/test.csv")
rawInput: org.apache.spark.sql.DataFrame = [class: double]

scala>     val maxLabelRow: Array[Row] = rawInput.select(max("class")).take(1)
maxLabelRow: Array[org.apache.spark.sql.Row] = Array([1.9999999999999998])

scala>     val maxDoubleLabel: Double = maxLabelRow.head.getDouble(0)
maxDoubleLabel: Double = 1.9999999999999998

scala>     maxDoubleLabel.toInt + 1
res3: Int = 2

22.02 version

scala>     import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala>     import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala>     import org.apache.spark.sql.{Row}
import org.apache.spark.sql.Row

scala>     val schema = new StructType(Array(StructField("class", DoubleType, true)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(class,DoubleType,true))

scala>     val rawInput = spark.read.schema(schema).csv("/home/bobwang/tmp/max/test.csv")
rawInput: org.apache.spark.sql.DataFrame = [class: double]

scala>     val maxLabelRow: Array[Row] = rawInput.select(max("class")).take(1)
maxLabelRow: Array[org.apache.spark.sql.Row] = Array([2.0])

scala>     val maxDoubleLabel: Double = maxLabelRow.head.getDouble(0)
maxDoubleLabel: Double = 2.0

scala>     maxDoubleLabel.toInt + 1
res0: Int = 3
wbo4958 added the "bug" and "? - Needs Triage" labels Mar 24, 2022
jlowe added the "P0" (Must have for release) label Mar 24, 2022

jlowe commented Mar 24, 2022

I suspect this is actually a difference with CSV parsing of floating point values rather than a difference in the max function. The max function should only ever reference existing values in the DataFrame rather than construct new values. I assume this problem goes away (or is at least consistent between 22.02 and 22.04) if the data is Parquet or ORC rather than CSV, but it would be good to verify that.
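
Something like the following spark-shell snippet (the Parquet path is just an example) would be one way to check: build the same three values directly, write them as Parquet, and compare the max against the CSV-backed read.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

// Known-good values written as Parquet (example path), then read back.
Seq(0.0, 1.0, 2.0).toDF("class").write.mode("overwrite").parquet("/tmp/test_parquet")
spark.read.parquet("/tmp/test_parquet").select(max("class")).show(false)

// The same values read through the CSV path for comparison.
val schema = new StructType(Array(StructField("class", DoubleType, true)))
spark.read.schema(schema).csv("/home/bobwang/tmp/max/test.csv").select(max("class")).show(false)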


revans2 commented Mar 24, 2022

I am also not sure that it is a bug at all.

#4637

fixed a number of things with floating point parsing. It is still not perfect, but it is a lot better.


wbo4958 commented Mar 25, 2022

I was testing with the latest 22.04 snapshot (March 24), which should already include #4637, but it does not seem to help here.

Hmm, so is our suggestion to have users fall back to the CPU, or to fix it? getNumClasses is the common API for classification ML; if we get a different value, the training process will fail directly.



revans2 commented Mar 25, 2022

Sorry, I think you misunderstood me. I think that #4637 caused the issue you are seeing; I didn't mean that it fixed it. Did you check that you can just read the data in from CSV with no problems, forgetting the max aggregation?


wbo4958 commented Mar 28, 2022

Yeah, I just tested the CSV reading. It's not a max issue.

scala> val schema = new StructType(Array(StructField("class", DoubleType, true)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(class,DoubleType,true))

scala> val rawInput = spark.read.schema(schema).csv("/tmp/test.csv")
rawInput: org.apache.spark.sql.DataFrame = [class: double]

scala> rawInput.show()
22/03/28 13:16:42 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  !Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced
    @Expression <Alias> cast(class#0 as string) AS class#3 could run on GPU
      !Expression <Cast> cast(class#0 as string) cannot run on GPU because the GPU will use different precision than Java's toString method when converting floating point data types to strings and this can produce results that differ from the default behavior in Spark.  To enable this operation on the GPU, set spark.rapids.sql.castFloatToString.enabled to true.
        @Expression <AttributeReference> class#0 could run on GPU
    *Exec <FileSourceScanExec> will run on GPU

+------------------+
|             class|
+------------------+
|               0.0|
|0.9999999999999999|
|1.9999999999999998|
+------------------+

jlowe changed the title from '[BUG] Different behavior about "max" between 22.04 and 22.02' to '[BUG] Different CSV parsing behavior between 22.04 and 22.02' Mar 28, 2022

revans2 commented Mar 28, 2022

Now things are interesting, because before #4637 we were using the built-in cuDF CSV parser to convert the strings to numbers. After it, we are now using cuDF's built-in cudf::strings::to_floats. It appears that each one does things slightly differently for 1.0 and 2.0, which is really odd to me.

to_floats: https://github.com/rapidsai/cudf/blob/0d78007adc3bc4988b7424c726c984e81df4f25a/cpp/src/strings/convert/convert_floats.cu#L44-L153
CSV: https://github.com/rapidsai/cudf/blob/0d78007adc3bc4988b7424c726c984e81df4f25a/cpp/src/io/utilities/parsing_utils.cuh#L156-L230

I'm not sure whether this is technically a bug or not, but we should file something with cuDF about it, because at a minimum it is very confusing. Yes, they each have different requirements, so having the code be different is fine, but at the same time, if they produce different results it feels a bit odd.

I don't consider this a P1 as the result is within the error bounds we expect for floating point, but it is not great.

revans2 added the "P1" (Nice to have for release) and "cudf_dependency" labels and removed the "P0" (Must have for release) label Mar 28, 2022
sameerz removed the "? - Needs Triage" label Apr 5, 2022
@jrhemstad

If the exact same sequence of characters is being parsed into two different floating-point values (other than NaNs), that smells like a bug to me.

Just from a library quality perspective, it also smells like a DRY violation to have two different functions for parsing a string into floating-point values.

@andygrove

I filed an issue against cuDF - rapidsai/cudf#10599


ttnghia commented Apr 5, 2022

If the exact same sequence of characters is being parsed into two different floating-point values (other than NaNs), that smells like a bug to me.

It is not a bug. It is a feature 😏
strings::to_floats calls stod (string-to-double) and then static casts the double to a float. There is round-off error, or conversion error, there.
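
For example, the same kind of narrowing on the JVM (just an illustration, not the cuDF code):

// Parse to a double, then narrow to a float and widen back to see the conversion error.
val d: Double = "0.1".toDouble   // nearest double to 0.1
val f: Float  = d.toFloat        // nearest float, a slightly different value
println(d)                       // 0.1
println(f.toDouble)              // 0.10000000149011612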


revans2 commented Apr 8, 2022

How is it round-off error on 1.0 and 2.0? I can see what was filed in the cuDF bug with 0.3, but not with 1.0.


ttnghia commented Apr 8, 2022

In C++, casting from double to int is just round-toward-zero. If you have a 1.99999999999 double, it will be 1 after the cast.


ttnghia commented Apr 8, 2022

So maybe we can mitigate this by changing the toInt API from round-toward-zero to round-to-nearest? I'm not sure about the intended (expected) behavior of toInt here.


ttnghia commented Apr 8, 2022

It seems that Scala's toInt is the same as C++'s:

val x: Double = 1.9999999999999998
val y: Int = x.toInt

println(y)

Output:

1

So our issue here is indeed tricky to fix: if we have 2.0 in the input file, we must not have 1.9999999999999998 in memory after parsing, and this is very difficult to guarantee.

The only way I can think of to work around this issue (not to fix it) is to explicitly tell users to use round.toInt instead of just toInt to get the expected, consistent result.
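
For example, with the value from the repro above:

val maxDoubleLabel: Double = 1.9999999999999998

// toInt truncates toward zero, so a slightly-low parse of 2.0 becomes 1.
println(maxDoubleLabel.toInt)        // 1
// round goes to the nearest integer (returning a Long), which masks the parsing round-off.
println(maxDoubleLabel.round.toInt)  // 2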


revans2 commented Apr 8, 2022

@ttnghia I think we are talking about different things here. Java has its own complicated way of parsing floats and doubles that does not always match C/C++ or really anyone else out there. Eventually we will need to replicate this logic, but my problem is not with C++. My problem is how you get round-off error when parsing something as simple as "1.0". I wrote a quick program to show what C++ does...

#include <iostream>   // std::cout
#include <string>     // std::string, std::stod
#include <cstring>    // memcpy
#include <cstdint>    // uint32_t, uint64_t

void parse_n_print_it(std::string input) {
  double a = std::stod(input);
  uint64_t a_bits = 0;
  memcpy(&a_bits, &a, sizeof(a));

  float fa = (float)a;
  uint32_t fa_bits = 0;
  memcpy(&fa_bits, &fa, sizeof(fa)); 

  std::cout << input << " " << a << " " << a_bits << " " << fa << " " << fa_bits <<"\n";
}

int main ()
{
  parse_n_print_it("0.0");
  parse_n_print_it("1.0");
  parse_n_print_it("2.0");
  parse_n_print_it("3.0");
  parse_n_print_it("0.1");
  parse_n_print_it("0.2");
  parse_n_print_it("0.3");
  return 0;
}

When I run it I get...

0.0 0 0 0 0
1.0 1 4607182418800017408 1 1065353216
2.0 2 4611686018427387904 2 1073741824
3.0 3 4613937818241073152 3 1077936128
0.1 0.1 4591870180066957722 0.1 1036831949
0.2 0.2 4596373779694328218 0.2 1045220557
0.3 0.3 4599075939470750515 0.3 1050253722

I then wrote the same thing in java.

class Test {
  public static void parse_n_print_it(String input) {
    double a = Double.valueOf(input);
    long a_bits = Double.doubleToLongBits(a);

    float fa = (float) a;
    int fa_bits = Float.floatToRawIntBits(fa);

    System.out.println(input + " " + a + " " + a_bits + " " + fa + " " + fa_bits);
  }

  public static void main(String[] args) {
    parse_n_print_it("0.0");
    parse_n_print_it("1.0");
    parse_n_print_it("2.0");
    parse_n_print_it("3.0");
    parse_n_print_it("0.1");
    parse_n_print_it("0.2");
    parse_n_print_it("0.3");
  }
}

and got essentially identical results.

0.0 0.0 0 0.0 0
1.0 1.0 4607182418800017408 1.0 1065353216
2.0 2.0 4611686018427387904 2.0 1073741824
3.0 3.0 4613937818241073152 3.0 1077936128
0.1 0.1 4591870180066957722 0.1 1036831949
0.2 0.2 4596373779694328218 0.2 1045220557
0.3 0.3 4599075939470750515 0.3 1050253722

Looking at how floating point is represented, I am really at a loss about how you get a round-off error here; a "1", or a "10" for that matter, fits entirely within the significand for either 32-bit or 64-bit values. That feels like a bug to me.


ttnghia commented Apr 8, 2022

You got identical results here because your C++ results were parsed by the standard C++ library. The cuDF string parser is not implemented to do exactly the same thing as the standard C++ library (which may produce results very close to Java's). Instead, the cuDF parser may introduce some round-off error, and you would probably get slightly different results.

How does the cuDF parser have round-off error? The string-to-double parser in cuDF is implemented very differently from standard C++. It was implemented in-house, involves many +, -, *, / operations, and even calls log10 and exp10. Each of those operations has round-off.

Even for something as simple as 1.0, the intermediate result, going through the many different operations listed above, can be rounded off. Maybe the cuDF parser is not smart enough to recognize that such input is "simple" and terminate early.
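
As a rough illustration (not the cuDF code), a toy Scala parser that accumulates digits and then scales by a computed power of ten can already drift from the standard-library parse, for example on "0.3":

// Hypothetical digit-accumulating parser; each arithmetic step can round.
def naiveParse(s: String): Double = {
  var value = 0.0
  var decimals = 0
  var afterPoint = false
  for (c <- s) {
    if (c == '.') afterPoint = true
    else {
      value = value * 10.0 + (c - '0')   // accumulate digits in a double
      if (afterPoint) decimals += 1
    }
  }
  // Scaling by a computed power of ten is where round-off can creep in.
  value * math.pow(10.0, -decimals)
}

Seq("0.0", "1.0", "2.0", "0.3").foreach { s =>
  println(s"$s naive=${naiveParse(s)} stdlib=${s.toDouble}")
}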


revans2 commented Apr 8, 2022

You got identical results here because your C++ results were parsed by the standard C++ library. The cuDF string parser is not implemented to do exactly the same thing as the standard C++ library (which may produce results very close to Java's). Instead, the cuDF parser may introduce some round-off error, and you would probably get slightly different results.

Yes, exactly. What you are saying is that cuDF does not match C++ and that is okay because it is within rounding error, and I am asking "is it really okay?", especially when "1.0" and "2.0" are potentially going to be really common and a reasonable person would wonder why they do not come out as a clear and specific value.
