Add GC fileset to Elasticsearch Filebeat module #7305
Conversation
- Update ingest pipeline
- Add Java 8 GC log
- JDK -> JVM logs
- Add default gc.log path
- Add jvm fields group; GC test log update
- Add test data
Inadvertently left out the multiline config.
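For context, a minimal sketch of what such a multiline setup could look like in the fileset's input config; the paths and the timestamp pattern are illustrative, not taken from this PR:

type: log
paths:
  - /var/log/elasticsearch/gc.log.[0-9]*
exclude_files: [".gz$"]
multiline:
  # Treat any line starting with an ISO-style date as the beginning of a new
  # GC event; all other lines are appended to the previous event.
  pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
  negate: true
  match: after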
Thanks for kicking this off.
I'm still confused whether this is a JDK or a JVM log. One more general thought: is the GC log the same regardless of where it comes from? Meaning, should we perhaps have a `jvm` module with a `gc` metricset? What log files are there from the JVM?
Ignoring the above for now: happy to have it directly in ES. If we have it there, from a user perspective it is probably more about the `gc` data itself, as the user does not think too much about where it comes from but about what it is.
@@ -0,0 +1,702 @@
2018-06-11 01:51:14 GC log file created /var/log/elasticsearch/gc.log.4
As this log contains `jdk` in the name, I'm now wondering whether this is actually a JDK log and not a JVM log. I'm confused :-)
This log came from a running JDK (as opposed to a JRE) and should be renamed to `gc.log`.
Note: Let's discuss the above quickly, but we will also have a second review on the metricset itself so we can get CI green.
Left some minor comments.
For the CI failure, we need to check it out. I assume there is a very small diff in the expected log which is hard to spot.
"field": "message", | ||
"patterns": [ | ||
"%{TIMESTAMP_ISO8601:timestamp}", | ||
"%{GREEDYMULTILINE:elasticsearch.gc.message}" |
I would suggest putting this under just `message`. It means we will overwrite what is in `message` at the moment. If we want to keep it, we could use `log.message`.
I think the missing `elasticsearch.gc.message` part in the expected log file is the reason CI is failing.
@@ -1,2 +1,4 @@
dashboards:
- id: Filebeat-elasticsearch-gc-Dashboard
This should be removed as there are no dashboards yet.
var:
  - name: paths
    default:
      - /var/log/elasticsearch/gc.log.[0-9]*
What is the name of the first log? If it is called `gc.log`, I think the pattern will not match it.
GC logs are weird. In the Elasticsearch `jvm.options`, the default configuration looks like this:
8:-Xloggc:/var/log/elasticsearch/gc.log
When Elasticsearch is started for the first time without any GC logs, the first one is named `gc.log.0.current`. When rolled over by the JVM, the file is renamed to `gc.log.0` and `gc.log.1.current` begins. Here are the files from a running cluster node:
ls -1 /var/log/elasticsearch/gc.*
/var/log/elasticsearch/gc.log.0
/var/log/elasticsearch/gc.log.0.current
/var/log/elasticsearch/gc.log.1
/var/log/elasticsearch/gc.log.2
/var/log/elasticsearch/gc.log.3
/var/log/elasticsearch/gc.log.4.current
On this system, the active file is `/var/log/elasticsearch/gc.log.0.current`. `/var/log/elasticsearch/gc.log.4.current` was left behind from a restart.
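As a side note, a hedged sketch of how the paths could be overridden when enabling the module in modules.d (the `gc` fileset name is taken from this PR; the plain gc.log entry is only needed for setups that log to a non-rotating file):

- module: elasticsearch
  # gc fileset from this PR; var.paths overrides the default glob
  gc:
    enabled: true
    var.paths:
      - /var/log/elasticsearch/gc.log
      - /var/log/elasticsearch/gc.log.[0-9]*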
Got it, then the pattern makes sense.
Left some minor comments. I think the `fields` part needs fixing; besides that, I'm good with merging and fixing things in a follow-up PR.
type: group
description: >
fields:
  - name: relative_process_timestamp_secs
These should adhere to our naming conventions: https://www.elastic.co/guide/en/beats/devguide/current/event-conventions.html. We can also tackle this in a follow-up PR.
Sure, we can fix that in a follow-up PR.
negate: true
match: after
fields:
Here the indentation seems to be off; `fields` and `fields_under_root` are probably not applying because of this.
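For reference, a sketch of the intended layout, assuming `fields` and `fields_under_root` should sit at the same level as `multiline` rather than nested under it; the key under fields is a hypothetical placeholder, not one from this PR:

multiline:
  pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
  negate: true
  match: after
fields:
  example_key: example_value  # hypothetical placeholder
fields_under_root: true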
"grok": { | ||
"field": "message", | ||
"patterns": [ | ||
"%{GCTIMESTAMP}: %{GCPROCRUNTIME}: Total time for which application threads were stopped: %{BASE10NUM:elasticsearch.gc.threads_total_stop_time_secs} seconds, Stopping threads took: %{BASE10NUM:elasticsearch.gc.stopping_threads_time_secs} seconds", |
@ph Looks like we could do this with dissect?
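A hedged sketch of how that line could be tokenized with the Beats dissect processor instead of grok; the processor placement and the key names in the tokenizer are assumptions, only the fixed message wording comes from the grok pattern above:

processors:
  - dissect:
      field: "message"
      tokenizer: "%{timestamp}: %{proc_runtime}: Total time for which application threads were stopped: %{stop_time_secs} seconds, Stopping threads took: %{stopping_threads_secs} seconds"
      # By default the extracted keys land under the "dissect." prefix and
      # would still need renaming to the elasticsearch.gc.* fields.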
Fixed two indentation issues @ruflin. Should be good?
@inqueue Thanks a lot for the PR. Now let's start to iterate on top of it :-) Really happy to have this in.
type: float
description: >
  Garbage collection threads total stop time seconds.
@inqueue Just one clarification on the meaning of those: the metrics captured here are not necessarily tied to garbage collection. These times are actually related to safepoints. At a safepoint all application threads are halted and the JVM can do any administrative work that requires that the application not interfere. That might include such things as deoptimization, biased locking, or garbage collection.
The first metric tells how long the application threads have been halted (the duration of the pause due to the safepoint), and the second one tells how long it took from the point in time the JVM requested a safepoint until all application threads could be halted.
Thanks @danielmitterdorfer, that makes sense. I did want to capture that metric in this initial commit as it is probably the most frequently logged entry in gc.log. Can we work with you to isolate the most meaningful metrics? We also intend to incorporate the GC metrics captured in the Kafka Filebeat module: https://www.elastic.co/blog/monitoring-kafka-with-elastic-stack-1-filebeat.
Can we work with you to isolate the most meaningful metrics?
Yes. The most meaningful ones in my opinion are:
- Absolute timestamp of an event
- Type of pause (young GC, old GC)
- Pause time (how long was the application stopped?), GC (phase) duration (might happen concurrently)
- Amount of memory cleaned up
Although I think it will get very tricky to parse all those variations:
- The log file format has changed with JDK 9 (so-called "unified logging"); see the illustrative lines below.
- There are different garbage collection algorithms (e.g. CMS, G1).
https://www.youtube.com/watch?v=M9AYK5PtX40 should give a good idea about this.
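To make the format differences concrete, here are two illustrative young-GC pause lines (not taken from this PR); the first is in the pre-JDK 9 style, the second in JDK 9+ unified logging:

2018-06-11T01:51:14.123+0000: 12.345: [GC (Allocation Failure) [PSYoungGen: 65536K->10720K(76288K)] 120000K->65200K(251392K), 0.0135788 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
[12.345s][info][gc] GC(42) Pause Young (Allocation Failure) 117M->63M(245M) 13.578ms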
@danielmitterdorfer Thanks a lot for chiming in here.
What is going to be really valuable for our users is the level of detail with which you described the metrics above. We should make sure these descriptions end up in our docs, as it happens very frequently on Discuss or in support that users ask what a field means in detail.
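As a sketch of how that level of detail could be reflected in the fileset's fields.yml (the field names are the ones used in the grok pattern in this PR; the descriptions paraphrase the explanation above):

- name: threads_total_stop_time_secs
  type: float
  description: >
    Total time in seconds for which all application threads were stopped
    at a safepoint (not necessarily caused by garbage collection).
- name: stopping_threads_time_secs
  type: float
  description: >
    Time in seconds from the JVM requesting a safepoint until all
    application threads were actually halted.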
You're welcome. Glad I could help.
Adding a fileset for JVM GC logs to the Elasticsearch Filebeat module. This baseline fileset does not do much yet; more to come.
#5301.