
Tail input plugin does not read line-by-line and sends partial line data #11372

Open
dr3amville opened this issue Jun 22, 2022 · 19 comments
Labels: bug (unexpected problem or unintended behavior)

@dr3amville

dr3amville commented Jun 22, 2022

Relevant telegraf.conf

# Configuration for telegraf agent
[agent]
    interval = "60s"
    debug = false
    hostname = "<hostname>"
    round_interval = true
    flush_interval = "60s"
    flush_jitter = "0s"
    collection_jitter = "0s"
    metric_batch_size = 3000
    metric_buffer_limit = 15000
    quiet = false
    logtarget = "file"
    logfile = "/opt/logs/telegraf.log"
    logfile_rotation_interval = "24h"
    logfile_rotation_max_size = "10MB"
    logfile_rotation_max_archives = 5
    omit_hostname = false

[[outputs.influxdb]]
    urls = ["https://<influxdb>:8085"]
    database = "monitoring"
    precision = "60s"
    insecure_skip_verify = true

[[inputs.tail]]
    files = ["/path/to/file/telegraf-metrics.log*"]
    data_format = "influx"
    path_tag = ""
[inputs.tail.tags]
    metric_type = "manager"

Logs from Telegraf

2022-06-22T19:26:00Z E! [inputs.tail] Malformed log line in "/path/to/file/telegraf-metrics.log_2022-06-22-15-22": ["e.partitioncount=2048,offheap.backupcount=27,offheap.entrycount=36,offheap.primarycount=9,offheap.size=0,onheap.entrycount=0,put.rate=1111,rebalance.keyleft=0,rebalance.keyrate=0,rebalance.partitioncount=0,tx.commit.size=0,tx.commit.time=141250,tx.rollback.size=0 1655925960896000000"]: metric parse error: expected field at 1:284: "e.partitioncount=2048,offheap.backupcount=27,offheap.entrycount=36,offheap.primarycount=9,offheap.size=0,onheap.entrycount=0,put.rate=1111,rebalance.keyleft=0,rebalance.keyrate=0,rebalance.partitioncount=0,tx.commit.size=0,tx.commit.time=141250,tx.rollback.size=0 1655925960896000000"

System info

Telegraf 1.18.3 (git: HEAD 6a94f65), RHEL 7.9 (Maipo) 3.10.0-1160.31.1.el7.x86_64 and RHEL 6.10 (Santiago) 2.6.32-754.35.1.el6.x86_64

Docker

No response

Steps to reproduce

  1. Application writes out InfluxDB line protocol formatted metrics to file.
  2. Tail plugin reads and publishes to configured InfluxDB output.
  3. Malformed-line messages appear in the logs and bad measurements show up in InfluxDB.

Expected behavior

Full lines should be read and sent, rather than partial lines.

Actual behavior

Lines are being partially read and sent, leading to a large number of bad measurements in InfluxDB.

Additional info

No response

@dr3amville added the bug label Jun 22, 2022
@powersj
Contributor

powersj commented Jun 23, 2022

Hi,

Some questions for you:

  • Can you provide this log file to help reproduce the issue?
  • If not, can you at least provide a few of the lines before and after the failing line?
  • Have you verified that the log file does not have any extra newlines within this line? Influx line protocol records are separated by newlines. It makes me wonder what comes right before the e.partitioncount=2048.

Thanks!

@powersj added the waiting for response label Jun 23, 2022
@dr3amville
Author

Sure. I'd rather not provide the file itself because it contains a good amount of sensitive data, and the logs I mentioned have already rotated out, but here's another example:

Telegraf error log:

2022-06-23T14:05:00Z E! [inputs.tail] Malformed log line in "/path/to/file/telegraf-metrics.log_2022-06-23-09-59": ["sPercentage=1,pagesFillFactor=1,pagesReplaceRate=1,physicalMemoryPages=1,totalAllocatedPages=1 1655993100897000000"]: metric parse error: expected field at 1:115: "sPercentage=1,pagesFillFactor=1,pagesReplaceRate=1,physicalMemoryPages=1,totalAllocatedPages=1 1655993100897000000"

Lines from the metric log file (the error corresponds to the 4th line):

timer,plugins=UpdateCommunityStringPlugIn,profile=uptime_profile,agg=samples execution.succeeded=0 1655993100897000000
timer,plugins=UpdateCommunityStringPlugIn,profile=uptime_profile,agg=total execution.succeeded=0 1655993100897000000
timer,plugins=UpdateCommunityStringPlugIn,profile=uptime_profile,agg=average execution.succeeded=0 1655993100897000000
memory,igniteRegion=sysMemPlc,agg=samples allocationRate=1,dirtyPages=1,evictionRate=1,largeEntriesPagesPercentage=1,pagesFillFactor=1,pagesReplaceRate=1,physicalMemoryPages=1,totalAllocatedPages=1 1655993100897000000
memory,igniteRegion=sysMemPlc,agg=total allocationRate=0,dirtyPages=0,evictionRate=0,largeEntriesPagesPercentage=0,pagesFillFactor=0,pagesReplaceRate=0,physicalMemoryPages=200,totalAllocatedPages=200 1655993100897000000
memory,igniteRegion=sysMemPlc,agg=average allocationRate=0,dirtyPages=0,evictionRate=0,largeEntriesPagesPercentage=0,pagesFillFactor=0,pagesReplaceRate=0,physicalMemoryPages=200,totalAllocatedPages=200 1655993100897000000

Have you verified that the log file does not have any extra newlines within this line?

Yes, I actually checked over weeks' worth of data to make sure, because I was paranoid it was a writer issue, but the log files are correct: no hidden characters or stray line breaks.

@telegraf-tiger bot removed the waiting for response label Jun 23, 2022
@powersj
Contributor

powersj commented Jun 23, 2022

Thanks for the example data! However, copying that data and running it through the tail and/or file input plugins did not reproduce the issue.

You do not have from_beginning set to true in your config, so I assume you are appending to this file rather than re-writing it?

@dr3amville
Author

dr3amville commented Jun 23, 2022

Yeah, unfortunately it's not clear what causes it to do this; it's totally random. But I feel like it should not read and flush metrics unless it is collecting the log line by line. Correct, I do not have from_beginning set to true, so the application just appends and Telegraf picks up the new metrics and pushes them. I also tried to make the file explicitly show hidden characters, but nothing showed up besides line endings where they belong.

It was suggested by @srebhan in the InfluxDB Community Slack that an explicit line-by-line vs. all-at-once reading mode might be worth implementing to try to resolve this, similar to what was done here:
#11234

@powersj
Contributor

powersj commented Jun 23, 2022

@srebhan @reimda thoughts on where to go from here?

@dr3amville
Author

A few additional notes: it looks to be happening far more often on RHEL 6 than on RHEL 7, and I'm not sure why. Also, for reference, the largest batch of writes to the file that I could find was approximately 621 lines, 1863 words, 88710 bytes.

I'm guessing the only way to emulate this is to set up a simulated Influx line protocol metric generator that writes to a file (maybe even have Telegraf itself do it via a file output?), and then configure the tail plugin to read that file and output to InfluxDB. That way one process generates line-protocol metric data and writes it to a file while another tails that file and ingests it into InfluxDB.

@powersj
Contributor

powersj commented Jun 23, 2022

Did this suddenly start happening recently or is this a new problem after some sort of upgrade?

We have plenty of users who are using the tail plugin with influx line protocol. While we could start generating a bunch of metrics in a VM, this is sounding more and more like a system-specific issue if you are seeing differences between OSes. Are these systems under heavy load, or do they have lots of disk writes going on at the same time?

edit: more questions:

  • Are these systems baremetal, VMs or containers?
  • What type of underlying storage is used? HDD, SSD, NVMe, cloud storage?
  • Over what period of time are those 621 metrics written: every second? Every minute?

@dr3amville
Author

dr3amville commented Jun 23, 2022

It has been this way since we adopted this method of publishing the metrics. We previously used a different stack that leveraged Filebeat to reformat the metrics (they were in Graphite format at the time) before publishing, and we never saw any issues there.

It is a pretty basic VM setup, nothing crazy; disk IO (SSD) is extremely light (<2% almost all the time) and CPU is 90-95% idle on all of the nodes that constantly see this issue.

CPU (usage): [screenshot]

Disk IO (SSDs and plenty of diskspace): [screenshot]

The 621 metrics are written once per minute, almost instantaneously. I have tried both the inotify and poll watch methods for the tail plugin and see the same issues.

We run our Java applications on these nodes, but since it's a really light test system it doesn't see much usage; it's really just a test environment for this observability work.

@powersj
Contributor

powersj commented Jun 24, 2022

OK, thanks for those metrics; it doesn't sound like a load issue then. I guess I could try to reproduce in a CentOS VM next.
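
Something along these lines could drive that attempt (just a sketch; the file path, measurement names and batch size are invented, with the batch size loosely matching the ~621-line batches described above):

    package main

    import (
        "fmt"
        "os"
        "time"
    )

    // Append a batch of influx line protocol metrics to a log file once per
    // minute, while a separate Telegraf instance tails the same file.
    func main() {
        // Hypothetical path; point the tail plugin's files option at the same file.
        f, err := os.OpenFile("/tmp/telegraf-metrics.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
        if err != nil {
            panic(err)
        }
        defer f.Close()

        for {
            ts := time.Now().UnixNano()
            for i := 0; i < 621; i++ {
                fmt.Fprintf(f, "memory,igniteRegion=region%d,agg=samples allocationRate=1,dirtyPages=1 %d\n", i, ts)
            }
            time.Sleep(time.Minute)
        }
    }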

@srebhan
Member

srebhan commented Jun 27, 2022

I think the OS does not flush line-wise but just flushes the buffer, and that buffer might contain a partial line/line fragment. The only way I see to prevent this is to enforce waiting for a complete line (i.e. wait for the newline to appear) and hold back all partial writes... To do this I suggest implementing #11234 here... Maybe we should have a common place for this kind of "scanner"...
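
As a minimal illustration of that hypothesis (plain Go, not Telegraf or the application's writer): a buffered writer whose buffer gets flushed mid-line leaves a partial line on disk, and a tail-style reader reaching end-of-file at that moment sees only the fragment:

    package main

    import (
        "bufio"
        "os"
    )

    func main() {
        // Hypothetical file; stands in for the application's metric log.
        f, _ := os.Create("/tmp/partial-line-demo.log")
        defer f.Close()

        w := bufio.NewWriter(f)
        w.WriteString("memory,igniteRegion=sysMemPlc,agg=samples allocationRate=1,dirtyPage")
        w.Flush() // buffer flushed mid-line (buffer full, timer, etc.)

        // A reader reaching EOF here sees "...dirtyPage" with no newline and no
        // timestamp, i.e. exactly the kind of fragment the parse errors show.

        w.WriteString("s=1,evictionRate=1 1655993100897000000\n")
        w.Flush() // the rest of the line only lands on disk with a later write
    }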

@dr3amville
Author

@powersj - any chance you were able to reproduce this? Do we feel like implementing #11234 might be worth a shot? I could potentially test it out.

@dr3amville
Author

Do we know how the scanner decides it has finished reading a line? Does it look for a specific line-ending character? It could be good to enforce terminating the read on a newline instead of EOF, if that is what's affecting the current read position in the file.

Just another data point, though: I added namepass under the tail input to try to filter out the bad partial-line metrics, and so far so good, though it may take some more soak time to really prove it out.

@srebhan
Member

srebhan commented Jul 6, 2022

@dr3amville, the scanner takes a split function that decides when to return a chunk. #11234 uses ScanLines, which returns lines (i.e. waits for the newline character), and a whole-file mode (no splitting), but we are free to implement more behaviors with a custom SplitFunc...

Does that help?
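
For illustration, such a custom SplitFunc could look roughly like this (a generic bufio sketch, not the #11234 code): unlike bufio.ScanLines it never emits a trailing fragment at EOF, so a half-written last line is held back until its newline shows up:

    package main

    import (
        "bufio"
        "bytes"
        "fmt"
        "strings"
    )

    // completeLinesOnly emits only newline-terminated lines; a trailing
    // fragment is held back instead of being returned, even at EOF.
    func completeLinesOnly(data []byte, atEOF bool) (advance int, token []byte, err error) {
        if i := bytes.IndexByte(data, '\n'); i >= 0 {
            // Complete line found: return it without the newline (and any '\r').
            return i + 1, bytes.TrimRight(data[:i], "\r"), nil
        }
        // No newline yet: request more data rather than emit a fragment.
        return 0, nil, nil
    }

    func main() {
        // Input ends mid-line, mimicking a file caught between two partial writes.
        input := "memory,agg=total physicalMemoryPages=200 1655993100897000000\n" +
            "memory,agg=samples physicalMemoryPa"
        sc := bufio.NewScanner(strings.NewReader(input))
        sc.Split(completeLinesOnly)
        for sc.Scan() {
            fmt.Println("complete line:", sc.Text())
        }
        // Only the first, complete line is printed; the fragment is never emitted.
    }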

@dr3amville
Author

@srebhan I think just the #11234 implementation is plenty if that could be ported over.

@srebhan
Member

srebhan commented Jul 6, 2022

Just give it a shot! ;-P

@dr3amville
Author

I will have to pick up some Go 😔

@powersj
Contributor

powersj commented Jul 7, 2022

@srebhan maybe I'm not following, but tail seems to be a different animal. For example, we aren't using a scanner, but instead waiting on the external tail library to send us data via our receiver function.

In the upstream tail library, which is our fork of hpcloud/tail, the data is already read newline-delimited.
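
Consuming that library looks roughly like this (names as in hpcloud/tail; our fork's interface is similar but may differ in details), with whole lines delivered over a channel:

    package main

    import (
        "fmt"

        "github.com/hpcloud/tail"
    )

    func main() {
        // Hypothetical path; the library follows the file and splits on newlines.
        t, err := tail.TailFile("/tmp/telegraf-metrics.log", tail.Config{
            Follow: true,  // keep reading as the file grows
            ReOpen: true,  // survive log rotation
            Poll:   false, // inotify-based watching; watch_method = "poll" maps to true
        })
        if err != nil {
            panic(err)
        }
        for line := range t.Lines {
            fmt.Println("received:", line.Text)
        }
    }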

Did I follow the right path, or did I miss something? Thanks!

@srebhan
Member

srebhan commented Jul 8, 2022

@powersj ok, I didn't look into tail when I should have... :-( If the underlying library already splits on the delimiter, we should use that.

@jeme-0-jyskebank-dk

I had the same issue.
This fixed it for me:

[[inputs.tail]]
    watch_method = "poll"
