
parser_csv: Improve the performance for typical cases #2535

Merged
4 commits merged on Aug 7, 2019

Conversation

@repeatedly repeatedly commented Jul 31, 2019

Add parser_type parameter to switch the parser.

Signed-off-by: Masahiro Nakagawa repeatedly@gmail.com

Which issue(s) this PR fixes:
None

What this PR does / why we need it:
Parser version try of #2529.
I implemented an original method because the CSV module doesn't provide a way to parse a single line without recreating a CSV object.
This fast implementation supports typical patterns, so it is useful. See the test case for the supported patterns:

def test_compatibility_between_normal_and_fast_parser(param)

Here is the benchmark result:

non quote: value1,value1,value1,value1,value1,value1,value1,value1
quoted   : "fo,o","fo,o","fo,o","fo,o","fo,o","fo,o","fo,o","fo,o"
escaped  : "aa","b""b""b","c"," ",e,f,"""""",
Warming up --------------------------------------
  now with non quote     1.388k i/100ms
  new with non quote    30.281k i/100ms
     now with quoted     1.175k i/100ms
     new with quoted    10.304k i/100ms
     now with escape     1.142k i/100ms
     new with escape    11.327k i/100ms
Calculating -------------------------------------
  now with non quote     14.631k (± 2.2%) i/s -     73.564k in   5.030352s
  new with non quote    329.169k (± 3.2%) i/s -      1.665M in   5.065852s
     now with quoted     11.950k (± 2.8%) i/s -     59.925k in   5.019060s
     new with quoted    103.813k (± 7.2%) i/s -    525.504k in   5.096997s
     now with escape     11.920k (± 2.5%) i/s -     60.526k in   5.080920s
     new with escape    117.268k (± 1.9%) i/s -    589.004k in   5.024571s
Benchmark code:
require 'benchmark/ips'
require 'csv'

class CP
  def initialize
    @keys = $keys.dup
    @delimiter = ','
    @quote_char = '"'
    @escape_pattern = Regexp.compile(@quote_char * 2)
  end

  def parse1(text, &block)
    values = CSV.parse_line(text, col_sep: @delimiter)
    r = Hash[@keys.zip(values)]
    yield r
  end

  def parse2(text, &block)
    r = fast_parse(text)
    yield r
  end

  def fast_parse(text)
    record = {}
    text.chomp!

    return record if text.empty?

    # use while because it is faster than each_with_index
    columns = text.split(@delimiter, -1)
    num_columns = columns.size
    i = 0
    j = 0
    while j < num_columns
      column = columns[j]

      case column.count(@quote_char)
      when 0
        if column.empty?
          column = nil
        end
      when 1
        if column.start_with?(@quote_char)
          to_merge = [column]
          j += 1
          while j < num_columns
            merged_col = columns[j]
            to_merge << merged_col
            break if merged_col.end_with?(@quote_char)
            j += 1
          end
          column = to_merge.join(@delimiter)[1..-2]
        end
      when 2
        if column.start_with?(@quote_char) && column.end_with?(@quote_char)
          column = column[1..-2]
        end
      else
        if column.start_with?(@quote_char) && column.end_with?(@quote_char)
          column = column[1..-2]
        end
        column.gsub!(@escape_pattern, @quote_char)
      end

      record[@keys[i]] = column
      j += 1
      i += 1
    end
    record
  end
end

$keys = ["key1","key2","ke,y3","ke y4","key5","key6","key7","key8"]
keys = $keys.dup
text1 = keys.size.times.map { "value1" }.join(",")
text2 = keys.size.times.map { '"fo,o"' }.join(",")
text3 = '"aa","b""b""b","c"," ",e,f,"""""",'

puts "Ruby version: #{RUBY_VERSION}"

cp = CP.new

puts "non quote: #{text1}"
puts "quoted   : #{text2}"
puts "escaped  : #{text3}"

Benchmark.ips do |x|
  x.report('now with non quote') do
    cp.parse1(text1) { |r| }
  end

  x.report('new with non quote') do
    cp.parse2(text1) { |r| }
  end

  x.report('now with quoted') do
    cp.parse1(text2) { |r| }
  end

  x.report('new with quoted') do
    cp.parse2(text2) { |r| }
  end

  x.report('now with escape') do
    cp.parse1(text3) { |r| }
  end

  x.report('new with escape') do
    cp.parse2(text3) { |r| }
  end
end
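For reference, here is what the stdlib `CSV` parser returns for the three benchmark inputs (plain Ruby, independent of the plugin code); this is the behavior the fast parser aims to match for these typical patterns:

```ruby
require 'csv'

# Unquoted columns parse to plain strings.
p CSV.parse_line('value1,value1')
# => ["value1", "value1"]

# Quoted columns may contain the delimiter.
p CSV.parse_line('"fo,o","fo,o"')
# => ["fo,o", "fo,o"]

# Doubled quote chars inside a quoted column are unescaped,
# and a trailing empty column parses to nil.
p CSV.parse_line('"aa","b""b""b","c"," ",e,f,"""""",')
# => ["aa", "b\"b\"b", "c", " ", "e", "f", "\"\"", nil]
```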

Docs Changes:
Add parser_type to parser_csv article

Release Note:
Same as title

@repeatedly repeatedly added the enhancement Feature request or improve operations label Jul 31, 2019
@repeatedly repeatedly requested a review from ganmacs July 31, 2019 03:51
@repeatedly repeatedly self-assigned this Jul 31, 2019
Add parser_type parameter to switch the parser.
Here is the benchmark result:

non quote: value1,value1,value1,value1,value1,value1,value1,value1
quoted   : "fo,o","fo,o","fo,o","fo,o","fo,o","fo,o","fo,o","fo,o"
Warming up --------------------------------------
  now with non quote     1.462k i/100ms
  new with non quote    31.351k i/100ms
  now with quoted        1.223k i/100ms
  new with quoted       10.241k i/100ms
Calculating -------------------------------------
  now with non quote     15.207k (± 2.1%) i/s -     76.024k in   5.001425s
  new with non quote    338.989k (± 1.1%) i/s -      1.724M in   5.087178s
  now with quoted        12.440k (± 1.3%) i/s -     62.373k in   5.014981s
  new with quoted       105.291k (± 2.5%) i/s -    532.532k in   5.061225s

Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>
@repeatedly repeatedly force-pushed the improve-parser-csv branch from a121057 to cd11f49 Compare July 31, 2019 04:18
# This method avoids the overhead of CSV.parse_line for typical patterns
def parse_fast_internal(text)
record = {}
text.chomp!
Member:

Is executing chomp! okay here?
If the row value is `a,b, ,` (the user expects the last column to be a space), should we keep the value as it is?

p CSV.parse_line('a,b, ,c, ')  #=> ["a", "b", " ", "c", " "]

Member Author:

chomp! doesn't remove space characters; it only removes the trailing record separator.
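To illustrate the point (plain Ruby, not the plugin code): `chomp!` strips only the trailing newline, so a column consisting of a space survives:

```ruby
line = "a,b, ,c, \n"
line.chomp!            # removes only the trailing newline, not spaces
p line                 # => "a,b, ,c, "
p line.split(',', -1)  # => ["a", "b", " ", "c", " "]
```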

assert_equal(event_time("28/Feb/2013:12:00:00 +0900", format: '%d/%b/%Y:%H:%M:%S %z'), time)
assert_equal expected, record
end
end
Member:

Probably we need to add a test where the fast CSV parser receives a value that it can't parse but the normal one can.
I don't yet understand the difference between the fast and normal parsers.

Member Author:

I added a simple test to check the points of difference, but it is hard to write a detailed test because I don't know precisely how the CSV module parses a CSV line.

Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>
Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>
assert_raise(CSV::MalformedCSVError) {
normal.instance.parse(text) { |t, r| }
}
assert_nothing_raised {
Member:

How do users detect invalid records?

Member Author (@repeatedly) commented Aug 2, 2019:

There is no way; this fast parser doesn't handle invalid records. Users need to check beforehand that their format fits the fast parser.
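As a minimal illustration of the difference (plain Ruby, hypothetical input): the strict stdlib parser raises `CSV::MalformedCSVError` on a broken line, while a split-based fast path has no notion of malformed input and just returns whatever the split yields:

```ruby
require 'csv'

malformed = '"a,b'  # unclosed quoted field

# The strict parser rejects the line.
begin
  CSV.parse_line(malformed)
  strict_raised = false
rescue CSV::MalformedCSVError
  strict_raised = true
end

# A split-based fast path silently produces columns.
fast_columns = malformed.split(',', -1)

p strict_raised  # => true
p fast_columns   # => ["\"a", "b"]
```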

@ganmacs (Member) left a comment:

I think we need to add detailed documentation about the difference between the fast parser and the normal one, and how to handle invalid records.

def configure(conf)
super

@quote_char = '"'
Member:

Should this line and L37 be moved into L39's if expression?
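The suggestion could be sketched roughly like this (hypothetical class and names; the real plugin code differs): initialize the quote/escape state only when the fast parser is selected.

```ruby
# Hypothetical sketch: the quote/escape state is only used by the fast
# parser, so it can be initialized inside the parser_type branch.
class CsvParserConfig
  attr_reader :quote_char, :escape_pattern

  def initialize(parser_type)
    @parser_type = parser_type
    if @parser_type == :fast
      @quote_char = '"'
      @escape_pattern = Regexp.compile(@quote_char * 2)
    end
  end
end
```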

Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>
@repeatedly repeatedly merged commit 6e486aa into master Aug 7, 2019
Member Author (@repeatedly):

Apply reviews. I will send a patch to fluentd-docs-gitbook.

@repeatedly repeatedly deleted the improve-parser-csv branch August 7, 2019 16:09
284km added a commit to 284km/csv that referenced this pull request Sep 18, 2019
I used benchmark script as below:
fluent/fluentd#2535

Warming up --------------------------------------
                 now     5.553k i/100ms
                 new    10.626k i/100ms
            instance     9.009k i/100ms
Calculating -------------------------------------
                 now     57.255k (± 4.1%) i/s -    288.756k in   5.051981s
                 new    114.090k (± 7.1%) i/s -    573.804k in   5.062333s
            instance     95.062k (± 4.1%) i/s -    477.477k in   5.031413s
284km added a commit to 284km/csv that referenced this pull request Sep 18, 2019
I'm still thinking...

I used benchmark script as below:
fluent/fluentd#2535
(same benchmark results as the previous commit above)