Optimize postgresql ingest pipeline #7269
Conversation
Force-pushed from 4de6c89 to 869b326
The postgresql ingest pipeline was not performing well. This PR uses the following rules to improve the situation:
- Anchor the regular expression at the beginning of the string.
- Merge the multiple statements into a single RE.
- Do not use a backreference for the user/host delimiter.

Fixes: elastic#7201
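The three rules can be illustrated with a small Python sketch. The log lines and the pattern below are simplified, hypothetical stand-ins for the module's grok definitions, not the actual pipeline:

```python
import re

# Hypothetical sample lines (not real module test data).
lines = [
    "2018-06-06 09:15:12 CEST [12345] admin@mydb LOG: duration: 37.5 ms",
    "2018-06-06 09:15:13 CEST [12345] admin@mydb LOG: connection received",
]

# Rule 1: anchor with ^ so non-matching lines fail immediately.
# Rule 2: one merged pattern with an alternation instead of several patterns.
# Rule 3: explicit character classes around the "user@db" delimiter,
#         no backreference.
pat = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<tz>\w+) "
    r"\[(?P<tid>\d+)\] (?P<user>[\w.]+)@(?P<db>[\w.]+) (?P<level>\w+): "
    r"(?:duration: (?P<ms>[\d.]+) ms|(?P<msg>.+))$"
)

results = [pat.match(line).groupdict() for line in lines]
```

With the alternation, the "duration" branch has a static prefix, so lines that do not carry a duration fall through to the generic message branch cheaply.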
Force-pushed from a03b504 to ccc274f
I am working on a bigger test scenario, but it's already better :)
@@ -6,16 +6,11 @@
        "field": "message",
        "ignore_missing": true,
        "patterns": [
-         "%{LOCALDATETIME:postgresql.log.timestamp} %{WORD:postgresql.log.timezone} \\[%{NUMBER:postgresql.log.thread_id}\\] %{USERNAME:postgresql.log.user}@%{POSTGRESQL_DB_NAME:postgresql.log.database} %{WORD:postgresql.log.level}: duration: %{NUMBER:postgresql.log.duration} ms statement: %{MULTILINEQUERY:postgresql.log.query}",
Can we take this as a general rule, that one pattern is better than multiple?
Should we perhaps, in some cases, introduce processing in steps: first extract the common parts, then do the more complex things in a second processor. This would make things more readable and maintainable. I don't know what the effect on performance would be.
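The two-step idea can be sketched in Python, with each `re.match` standing in for one processor. The patterns and the sample line are simplified and hypothetical, not the module's actual grok expressions:

```python
import re

# Hypothetical sample line.
line = ("2018-06-06 09:15:12 CEST [12345] admin@mydb LOG: "
        "duration: 37.5 ms statement: SELECT 1")

# Step 1: a cheap anchored pattern extracts the parts every line shares
# (timestamp, timezone, thread id, user@db, level).
common = re.match(
    r"^(?P<ts>\S+ \S+) (?P<tz>\w+) \[(?P<tid>\d+)\] "
    r"(?P<user>[\w.]+)@(?P<db>[\w.]+) (?P<level>\w+): (?P<rest>.*)$",
    line,
)

# Step 2: a second, simpler pattern runs only on the remainder,
# like a follow-up processor handling the complex cases.
duration = re.match(
    r"^duration: (?P<ms>[\d.]+) ms(?: statement: (?P<query>.*))?$",
    common.group("rest"),
)
```

Each pattern stays small and readable, at the cost of running two processors per document.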
It depends. The main idea is to make sure that we can escape fast from the regular expression if it doesn't match, which is why anchoring is so important; I presume most of the string matches every time. Since in the postgres logs we always have timestamp + log level + thread id, it will be faster to have only one regexp, because of the way the state machine is created: the second part of the broader expression has static elements (duration / ms), which makes it easy to short-circuit.
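The fail-fast effect of anchoring can be shown with a toy Python example (Python's `re` engine here, not the ingest node's regex engine; the sample text is made up):

```python
import re

# A long line that is clearly not a postgres log line (made-up sample).
bad = "some free-form text that is not a postgres log line " * 50

anchored   = re.compile(r"^\d{4}-\d{2}-\d{2} ")  # can only fail at offset 0
unanchored = re.compile(r"\d{4}-\d{2}-\d{2} ")   # retried at every offset

# Both fail to match, but the anchored pattern gives up after a single
# attempt at the start of the string, while search() re-tries the
# unanchored one at every position before giving up.
anchored_result = anchored.match(bad)
unanchored_result = unanchored.search(bad)
```

The work saved grows with the length of the non-matching input, which matters when a pipeline sees many lines that a given pattern will never match.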
Changed label from in progress to review
@ph On your note about dissect: it seems that dissect could be used to extract parts of it?
I wonder if we should backport this to 6.3?
@ruflin we should; in a real-world context the ingest pipeline is mostly unusable and could hurt your cluster.
The postgresql ingest pipeline was not performing well. This PR uses the following rules to improve the situation:
- Anchor the regular expression at the beginning of the string.
- Merge the multiple statements into a single RE.
- Do not use a backreference for the user/host delimiter.

Fixes: elastic#7201

Master branch ingests 18 documents in 26 ms; this PR, in 11 ms. Note: I did not use dissect, since there are variations in the tokens.

(cherry picked from commit 4a41587)
(cherry picked from commit e801aaa)
The postgresql ingest pipeline was not performing well.
This PR uses the following rules to improve the situation.
Fixes: #7201
Note to reviewers:
Master branch ingests 18 documents in 26 ms; this PR, in 11 ms.
Note: I did not use dissect, since there are variations in the tokens.
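The token-variation problem with a dissect-style approach can be sketched in Python. This is a rough analogy using naive string splitting, not the actual Elasticsearch dissect processor, and the log lines are made up:

```python
def dissect_like(line):
    # Naive fixed-template split, analogous to a dissect template like
    # "%{ts_date} %{ts_time} %{tz} [%{tid}] %{level}: %{msg}".
    ts_date, ts_time, tz, tid, rest = line.split(" ", 4)
    level, msg = rest.split(": ", 1)
    return {"tid": tid.strip("[]"), "level": level, "msg": msg}

# A line matching the assumed layout parses fine.
ok = dissect_like("2018-06-06 09:15:12 CEST [1] LOG: connection received")

# But a line carrying an extra user@db token shifts every later field,
# so the fixed template mis-parses it: "admin@mydb LOG" lands in the
# level slot.
shifted = dissect_like(
    "2018-06-06 09:15:12 CEST [1] admin@mydb LOG: duration: 37.5 ms"
)
```

A fixed token layout only works when every line shares it, which these logs do not, so an anchored regular expression with an alternation covers the variations instead.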