Optimize postgresql ingest pipeline #7269
Conversation
Force-pushed from 4de6c89 to 869b326
The postgresql ingest pipeline was not performing well. This PR uses the following rules to improve the situation:
- Anchor the regular expression at the beginning of the string.
- Merge the multiple statements into a single RE.
- Do not use a backreference for the user/host delimiter.

Fixes: elastic#7201
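The three rules can be illustrated with a small Python sketch. The log lines and the pattern below are simplified, hypothetical stand-ins for the module's grok definitions, not the actual pipeline:

```python
import re

# Hypothetical sample lines (not real module test data).
lines = [
    "2018-06-06 09:15:12 CEST [12345] admin@mydb LOG: duration: 37.5 ms",
    "2018-06-06 09:15:13 CEST [12345] admin@mydb LOG: connection received",
]

# Rule 1: anchor with ^ so non-matching lines fail immediately.
# Rule 2: one merged pattern with an alternation instead of several patterns.
# Rule 3: explicit character classes around the "user@db" delimiter,
#         no backreference.
pat = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<tz>\w+) "
    r"\[(?P<tid>\d+)\] (?P<user>[\w.]+)@(?P<db>[\w.]+) (?P<level>\w+): "
    r"(?:duration: (?P<ms>[\d.]+) ms|(?P<msg>.+))$"
)

results = [pat.match(line).groupdict() for line in lines]
```

With the alternation, the "duration" branch has a static prefix, so lines that do not carry a duration fall through to the generic message branch cheaply.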
Force-pushed from a03b504 to ccc274f
I am working on a bigger test scenario, but it's already better :)
@@ -6,16 +6,11 @@
        "field": "message",
        "ignore_missing": true,
        "patterns": [
-         "%{LOCALDATETIME:postgresql.log.timestamp} %{WORD:postgresql.log.timezone} \\[%{NUMBER:postgresql.log.thread_id}\\] %{USERNAME:postgresql.log.user}@%{POSTGRESQL_DB_NAME:postgresql.log.database} %{WORD:postgresql.log.level}: duration: %{NUMBER:postgresql.log.duration} ms statement: %{MULTILINEQUERY:postgresql.log.query}",
Can we take this as a general rule, that one pattern is better than multiple?
Should we perhaps, in some cases, introduce processing in steps: first extract the common parts, then do the more complex things in a second processor. This would make things more readable and maintainable. I don't know what the effect on performance would be.
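The two-step idea can be sketched in Python, with each `re.match` standing in for one processor. The patterns and the sample line are simplified and hypothetical, not the module's actual grok expressions:

```python
import re

# Hypothetical sample line.
line = ("2018-06-06 09:15:12 CEST [12345] admin@mydb LOG: "
        "duration: 37.5 ms statement: SELECT 1")

# Step 1: a cheap anchored pattern extracts the parts every line shares
# (timestamp, timezone, thread id, user@db, level).
common = re.match(
    r"^(?P<ts>\S+ \S+) (?P<tz>\w+) \[(?P<tid>\d+)\] "
    r"(?P<user>[\w.]+)@(?P<db>[\w.]+) (?P<level>\w+): (?P<rest>.*)$",
    line,
)

# Step 2: a second, simpler pattern runs only on the remainder,
# like a follow-up processor handling the complex cases.
duration = re.match(
    r"^duration: (?P<ms>[\d.]+) ms(?: statement: (?P<query>.*))?$",
    common.group("rest"),
)
```

Each pattern stays small and readable, at the cost of running two processors per document.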
It depends. The main idea is to make sure that we can escape fast from the regular expression if it doesn't match, which is why anchoring is so important; I presume most of the string matches every time. Since in the postgres logs we always have timestamp + log level + thread id, it will be faster to have only one regexp, because of the way the state machine is created: the second part of the broader expression has static elements (duration / ms), which makes it easy to short-circuit.
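The fail-fast effect of anchoring can be shown with a toy Python example (Python's `re` engine here, not the ingest node's regex engine; the sample text is made up):

```python
import re

# A long line that is clearly not a postgres log line (made-up sample).
bad = "some free-form text that is not a postgres log line " * 50

anchored   = re.compile(r"^\d{4}-\d{2}-\d{2} ")  # can only fail at offset 0
unanchored = re.compile(r"\d{4}-\d{2}-\d{2} ")   # retried at every offset

# Both fail to match, but the anchored pattern gives up after a single
# attempt at the start of the string, while search() re-tries the
# unanchored one at every position before giving up.
anchored_result = anchored.match(bad)
unanchored_result = unanchored.search(bad)
```

The work saved grows with the length of the non-matching input, which matters when a pipeline sees many lines that a given pattern will never match.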
Changed label from in progress to review
@ph On your note about dissect: it seems that dissect could be used to extract parts of it?
I wonder if we should backport this to 6.3?
@ruflin we should; in a real-world context the ingest pipeline is mostly unusable and could hurt your cluster.
The postgresql ingest pipeline was not performing well. This PR uses the following rules to improve the situation:
- Anchor the regular expression at the beginning of the string.
- Merge the multiple statements into a single RE.
- Do not use a backreference for the user/host delimiter.

Fixes: elastic#7201

Master branch ingests 18 documents in 26 ms; this PR, in 11 ms. Note: I did not use dissect, since there are variations in the tokens.

(cherry picked from commit 4a41587)
(cherry picked from commit e801aaa)
The postgresql ingest pipeline was not performing well.
This PR uses the following rules to improve the situation.
Fixes: #7201
Note to reviewers:
Master branch ingests 18 documents in 26 ms; this PR, in 11 ms.
Note: I did not use dissect, since there are variations in the tokens.
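The token-variation problem with a dissect-style approach can be sketched in Python. This is a rough analogy using naive string splitting, not the actual Elasticsearch dissect processor, and the log lines are made up:

```python
def dissect_like(line):
    # Naive fixed-template split, analogous to a dissect template like
    # "%{ts_date} %{ts_time} %{tz} [%{tid}] %{level}: %{msg}".
    ts_date, ts_time, tz, tid, rest = line.split(" ", 4)
    level, msg = rest.split(": ", 1)
    return {"tid": tid.strip("[]"), "level": level, "msg": msg}

# A line matching the assumed layout parses fine.
ok = dissect_like("2018-06-06 09:15:12 CEST [1] LOG: connection received")

# But a line carrying an extra user@db token shifts every later field,
# so the fixed template mis-parses it: "admin@mydb LOG" lands in the
# level slot.
shifted = dissect_like(
    "2018-06-06 09:15:12 CEST [1] admin@mydb LOG: duration: 37.5 ms"
)
```

A fixed token layout only works when every line shares it, which these logs do not, so an anchored regular expression with an alternation covers the variations instead.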