
parser_csv: Improve the performance for typical cases #2535

Merged
4 commits merged on Aug 7, 2019

Conversation

@repeatedly repeatedly commented Jul 31, 2019

Add parser_type parameter to switch the parser.

Signed-off-by: Masahiro Nakagawa repeatedly@gmail.com

Which issue(s) this PR fixes:
None

What this PR does / why we need it:
Parser version try of #2529.
I implemented an original method because the CSV module doesn't provide a way to parse a single line without recreating a CSV object.
This fast implementation supports typical patterns, so it is useful. See the test case for the supported patterns:

def test_compatibility_between_normal_and_fast_parser(param)

Here is the benchmark result:

non quote: value1,value1,value1,value1,value1,value1,value1,value1
quoted   : "fo,o","fo,o","fo,o","fo,o","fo,o","fo,o","fo,o","fo,o"
escaped  : "aa","b""b""b","c"," ",e,f,"""""",
Warming up --------------------------------------
  now with non quote     1.388k i/100ms
  new with non quote    30.281k i/100ms
     now with quoted     1.175k i/100ms
     new with quoted    10.304k i/100ms
     now with escape     1.142k i/100ms
     new with escape    11.327k i/100ms
Calculating -------------------------------------
  now with non quote     14.631k (± 2.2%) i/s -     73.564k in   5.030352s
  new with non quote    329.169k (± 3.2%) i/s -      1.665M in   5.065852s
     now with quoted     11.950k (± 2.8%) i/s -     59.925k in   5.019060s
     new with quoted    103.813k (± 7.2%) i/s -    525.504k in   5.096997s
     now with escape     11.920k (± 2.5%) i/s -     60.526k in   5.080920s
     new with escape    117.268k (± 1.9%) i/s -    589.004k in   5.024571s
Benchmark code:
require 'benchmark/ips'
require 'csv'

class CP
  def initialize
    @keys = $keys.dup
    @delimiter = ','
    @quote_char = '"'
    @escape_pattern = Regexp.compile(@quote_char * 2)
  end

  def parse1(text, &block)
    values = CSV.parse_line(text, col_sep: @delimiter)
    r = Hash[@keys.zip(values)]
    yield r
  end

  def parse2(text, &block)
    r = fast_parse(text)
    yield r
  end

  def fast_parse(text)
    record = {}
    text.chomp!

    return record if text.empty?

    # use while because it is faster than each_with_index
    columns = text.split(@delimiter, -1)
    num_columns = columns.size
    i = 0
    j = 0
    while j < num_columns
      column = columns[j]

      case column.count(@quote_char)
      when 0
        if column.empty?
          column = nil
        end
      when 1
        if column.start_with?(@quote_char)
          to_merge = [column]
          j += 1
          while j < num_columns
            merged_col = columns[j]
            to_merge << merged_col
            break if merged_col.end_with?(@quote_char)
            j += 1
          end
          column = to_merge.join(@delimiter)[1..-2]
        end
      when 2
        if column.start_with?(@quote_char) && column.end_with?(@quote_char)
          column = column[1..-2]
        end
      else
        if column.start_with?(@quote_char) && column.end_with?(@quote_char)
          column = column[1..-2]
        end
        column.gsub!(@escape_pattern, @quote_char)
      end

      record[@keys[i]] = column
      j += 1
      i += 1
    end
    record
  end
end

$keys = ["key1","key2","ke,y3","ke y4","key5","key6","key7","key8"]
keys = $keys.dup
text1 = keys.size.times.map { "value1" }.join(",")
text2 = keys.size.times.map { '"fo,o"' }.join(",")
text3 = '"aa","b""b""b","c"," ",e,f,"""""",'

puts "Ruby version: #{RUBY_VERSION}"

cp = CP.new

puts "non quote: #{text1}"
puts "quoted   : #{text2}"
puts "escaped  : #{text3}"

Benchmark.ips do |x|
  x.report('now with non quote') do
    cp.parse1(text1) { |r| }
  end

  x.report('new with non quote') do
    cp.parse2(text1) { |r| }
  end

  x.report('now with quoted') do
    cp.parse1(text2) { |r| }
  end

  x.report('new with quoted') do
    cp.parse2(text2) { |r| }
  end

  x.report('now with escape') do
    cp.parse1(text3) { |r| }
  end

  x.report('new with escape') do
    cp.parse2(text3) { |r| }
  end
end
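For reference, here is what the stdlib `CSV` parser returns for the three benchmark inputs (plain Ruby, independent of the plugin code); this is the behavior the fast parser aims to match for these typical patterns:

```ruby
require 'csv'

# Unquoted columns parse to plain strings.
p CSV.parse_line('value1,value1')
# => ["value1", "value1"]

# Quoted columns may contain the delimiter.
p CSV.parse_line('"fo,o","fo,o"')
# => ["fo,o", "fo,o"]

# Doubled quote chars inside a quoted column are unescaped,
# and a trailing empty column parses to nil.
p CSV.parse_line('"aa","b""b""b","c"," ",e,f,"""""",')
# => ["aa", "b\"b\"b", "c", " ", "e", "f", "\"\"", nil]
```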

Docs Changes:
Add parser_type to parser_csv article

Release Note:
Same as title

@repeatedly repeatedly added the enhancement Feature request or improve operations label Jul 31, 2019
@repeatedly repeatedly requested a review from ganmacs July 31, 2019 03:51
@repeatedly repeatedly self-assigned this Jul 31, 2019
Add parser_type parameter to switch the parser.
Here is the benchmark result:

non quote: value1,value1,value1,value1,value1,value1,value1,value1
quoted   : "fo,o","fo,o","fo,o","fo,o","fo,o","fo,o","fo,o","fo,o"
Warming up --------------------------------------
  now with non quote     1.462k i/100ms
  new with non quote    31.351k i/100ms
  now with quoted        1.223k i/100ms
  new with quoted       10.241k i/100ms
Calculating -------------------------------------
  now with non quote     15.207k (± 2.1%) i/s -     76.024k in   5.001425s
  new with non quote    338.989k (± 1.1%) i/s -      1.724M in   5.087178s
  now with quoted        12.440k (± 1.3%) i/s -     62.373k in   5.014981s
  new with quoted       105.291k (± 2.5%) i/s -    532.532k in   5.061225s

Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>
@repeatedly repeatedly force-pushed the improve-parser-csv branch from a121057 to cd11f49 Compare July 31, 2019 04:18
# This method avoids the overhead of CSV.parse_line for typical patterns
def parse_fast_internal(text)
record = {}
text.chomp!
Member:

Is executing chomp! okay here?
If the row value is `a,b, ,` (the user expects the last column to be a space), should we keep the value as it is?

p CSV.parse_line('a,b, ,c, ')  #=> ["a", "b", " ", "c", " "]

Member Author:

chomp! doesn't remove space characters; it only removes the trailing record separator.
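To illustrate the point (plain Ruby, not the plugin code): `chomp!` strips only the trailing newline, so a column consisting of a space survives:

```ruby
line = "a,b, ,c, \n"
line.chomp!            # removes only the trailing newline, not spaces
p line                 # => "a,b, ,c, "
p line.split(',', -1)  # => ["a", "b", " ", "c", " "]
```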

assert_equal(event_time("28/Feb/2013:12:00:00 +0900", format: '%d/%b/%Y:%H:%M:%S %z'), time)
assert_equal expected, record
end
end
Member:

Probably we need to add a test where the fast CSV parser receives a value that it can't parse but the normal one can.
I don't yet understand the difference between the fast and normal parsers.

Member Author:

I added a simple test to check the points of difference, but it is hard to write a detailed test because I don't know precisely how the CSV module parses a CSV line.

Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>
Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>
assert_raise(CSV::MalformedCSVError) {
normal.instance.parse(text) { |t, r| }
}
assert_nothing_raised {
Member:

How do users detect invalid records?

Member Author (@repeatedly) commented Aug 2, 2019:

There is no way; this fast parser doesn't handle invalid records. Users need to check beforehand that their format fits the fast parser.
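As a minimal illustration of the difference (plain Ruby, hypothetical input): the strict stdlib parser raises `CSV::MalformedCSVError` on a broken line, while a split-based fast path has no notion of malformed input and just returns whatever the split yields:

```ruby
require 'csv'

malformed = '"a,b'  # unclosed quoted field

# The strict parser rejects the line.
begin
  CSV.parse_line(malformed)
  strict_raised = false
rescue CSV::MalformedCSVError
  strict_raised = true
end

# A split-based fast path silently produces columns.
fast_columns = malformed.split(',', -1)

p strict_raised  # => true
p fast_columns   # => ["\"a", "b"]
```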

@ganmacs (Member) left a comment:

I think we need to add detailed documentation about the difference between the fast parser and the normal one, and how to handle invalid records.

def configure(conf)
super

@quote_char = '"'
Member:

Should this line and L37 be moved into L39's if expression?
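The suggestion could be sketched roughly like this (hypothetical class and names; the real plugin code differs): initialize the quote/escape state only when the fast parser is selected.

```ruby
# Hypothetical sketch: the quote/escape state is only used by the fast
# parser, so it can be initialized inside the parser_type branch.
class CsvParserConfig
  attr_reader :quote_char, :escape_pattern

  def initialize(parser_type)
    @parser_type = parser_type
    if @parser_type == :fast
      @quote_char = '"'
      @escape_pattern = Regexp.compile(@quote_char * 2)
    end
  end
end
```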

Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>
@repeatedly repeatedly merged commit 6e486aa into master Aug 7, 2019
Member Author (@repeatedly):

Apply reviews. I will send a patch to fluentd-docs-gitbook.

@repeatedly repeatedly deleted the improve-parser-csv branch August 7, 2019 16:09
284km added a commit to 284km/csv that referenced this pull request Sep 18, 2019
I used benchmark script as below:
fluent/fluentd#2535

Warming up --------------------------------------
                 now     5.553k i/100ms
                 new    10.626k i/100ms
            instance     9.009k i/100ms
Calculating -------------------------------------
                 now     57.255k (± 4.1%) i/s -    288.756k in   5.051981s
                 new    114.090k (± 7.1%) i/s -    573.804k in   5.062333s
            instance     95.062k (± 4.1%) i/s -    477.477k in   5.031413s
284km added a commit to 284km/csv that referenced this pull request Sep 18, 2019
I'm still thinking...

I used benchmark script as below:
fluent/fluentd#2535
(same benchmark results as the previous commit above)