-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
parser_csv: Improve the performance for typical cases #2535
Conversation
Add parser_type parameter to switch the parser. Here is the benchmark result: non quote: value1,value1,value1,value1,value1,value1,value1,value1 quoted : "fo,o","fo,o","fo,o","fo,o","fo,o","fo,o","fo,o","fo,o" Warming up -------------------------------------- now with non quote 1.462k i/100ms new with non quote 31.351k i/100ms now with quoted 1.223k i/100ms new with quoted 10.241k i/100ms Calculating ------------------------------------- now with non quote 15.207k (± 2.1%) i/s - 76.024k in 5.001425s new with non quote 338.989k (± 1.1%) i/s - 1.724M in 5.087178s now with quoted 12.440k (± 1.3%) i/s - 62.373k in 5.014981s new with quoted 105.291k (± 2.5%) i/s - 532.532k in 5.061225s Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>
a121057
to
cd11f49
Compare
# This method avoids the overhead of CSV.parse_line for typical patterns | ||
def parse_fast_internal(text) | ||
record = {} | ||
text.chomp! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is executing chomp!
okay here?
If row value is a,b, ,
(User expects that the last column is space), should we remain the value as it is?
p CSV.parse_line('a,b, ,c, ') #=> ["a", "b", " ", "c", " "]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
chomp!
don't remove space character.
assert_equal(event_time("28/Feb/2013:12:00:00 +0900", format: '%d/%b/%Y:%H:%M:%S %z'), time) | ||
assert_equal expected, record | ||
end | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably, it needs to add the test which fast CSV parser receives the value which it can't parse but normal one can do.
I don't understand now the difference between fast and normal parser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added simple test for different point check but hard to write detailed test because I don't know how CSV module parse the CSV line precisely.
Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>
Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>
assert_raise(CSV::MalformedCSVError) { | ||
normal.instance.parse(text) { |t, r| } | ||
} | ||
assert_nothing_raised { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do users detect invalid records?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No way. This fast
parser doesn't consider it. Users need to check their format meets fast parser before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to add the detail document about the difference between the fast parser and the normal one and how to handing if invalid records contain.
lib/fluent/plugin/parser_csv.rb
Outdated
def configure(conf) | ||
super | ||
|
||
@quote_char = '"' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should This line and L37 be moved into L39's if
expression?
Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>
Apply reviews. I will send a patch to fluentd-docs-gitbook. |
I used benchmark script as below: fluent/fluentd#2535 Warming up -------------------------------------- now 5.553k i/100ms new 10.626k i/100ms instance 9.009k i/100ms Calculating ------------------------------------- now 57.255k (± 4.1%) i/s - 288.756k in 5.051981s new 114.090k (± 7.1%) i/s - 573.804k in 5.062333s instance 95.062k (± 4.1%) i/s - 477.477k in 5.031413s
I'm still thinking... I used benchmark script as below: fluent/fluentd#2535 Warming up -------------------------------------- now 5.553k i/100ms new 10.626k i/100ms instance 9.009k i/100ms Calculating ------------------------------------- now 57.255k (± 4.1%) i/s - 288.756k in 5.051981s new 114.090k (± 7.1%) i/s - 573.804k in 5.062333s instance 95.062k (± 4.1%) i/s - 477.477k in 5.031413s
Add parser_type parameter to switch the parser.
Signed-off-by: Masahiro Nakagawa repeatedly@gmail.com
Which issue(s) this PR fixes:
None
What this PR does / why we need it:
Parser version try of #2529.
I implemented original method because CSV module doesn't provide the way to parse single line without recreate CSV object.
This fast implementation supports typical patterns so it is useful. See test case for supported patterns:
fluentd/test/plugin/test_parser_csv.rb
Line 107 in 8b1cb4a
Here is the benchmark result:
Docs Changes:
Add
parser_type
toparser_csv
articleRelease Note:
Same as title