Tail input plugin does not read line-by-line and sends partial line data #11372

dr3amville · 2022-06-22T22:11:15Z

Relevant telegraf.conf

# Configuration for telegraf agent
[agent]
    interval = "60s"
    debug = false
    hostname = "<hostname>"
    round_interval = true
    flush_interval = "60s"
    flush_jitter = "0s"
    collection_jitter = "0s"
    metric_batch_size = 3000
    metric_buffer_limit = 15000
    quiet = false
    logtarget = "file"
    logfile = "/opt/logs/telegraf.log"
    logfile_rotation_interval = "24h"
    logfile_rotation_max_size = "10MB"
    logfile_rotation_max_archives = 5
    omit_hostname = false

[[outputs.influxdb]]
    urls = ["https://<influxdb>:8085"]
    database = "monitoring"
    precision = "60s"
    insecure_skip_verify = true

[[inputs.tail]]
    files = ["/path/to/file/telegraf-metrics.log*"]
    data_format = "influx"
    path_tag = ""
[inputs.tail.tags]
    metric_type = "manager"

Logs from Telegraf

2022-06-22T19:26:00Z E! [inputs.tail] Malformed log line in "/path/to/file/telegraf-metrics.log_2022-06-22-15-22": ["e.partitioncount=2048,offheap.backupcount=27,offheap.entrycount=36,offheap.primarycount=9,offheap.size=0,onheap.entrycount=0,put.rate=1111,rebalance.keyleft=0,rebalance.keyrate=0,rebalance.partitioncount=0,tx.commit.size=0,tx.commit.time=141250,tx.rollback.size=0 1655925960896000000"]: metric parse error: expected field at 1:284: "e.partitioncount=2048,offheap.backupcount=27,offheap.entrycount=36,offheap.primarycount=9,offheap.size=0,onheap.entrycount=0,put.rate=1111,rebalance.keyleft=0,rebalance.keyrate=0,rebalance.partitioncount=0,tx.commit.size=0,tx.commit.time=141250,tx.rollback.size=0 1655925960896000000"

System info

Telegraf 1.18.3 (git: HEAD 6a94f65), RHEL 7.9 (Maipo) 63.10.0-1160.31.1.el7.x86_64 and RHEL 6.10 (Santiago) 2.6.32-754.35.1.el6.x86_64

Docker

No response

Steps to reproduce

Application writes out InfluxDB line protocol formatted metrics to file.
Tail plugin reads and publishes to configured InfluxDB output.
Malformed lines messages in logs and bad measurements in InfluxDB.

Expected behavior

Full lines should be sent instead of partially reading lines and sending them.

Actual behavior

Lines are partially being read and sent leading to large number of bad measurements in InfluxDB.

Additional info

No response

The text was updated successfully, but these errors were encountered:

powersj · 2022-06-23T13:26:03Z

Hi,

Some questions for you:

Can you provide this log file to help reproduce the issue?
If not, can you at least provide a few of the lines before and after the failing line?
Have you verified that the log file does not have any extra new lines in this line? Influx line protocol breaks on new lines. It makes me wonder what comes right before the e.partitioncount=2048.

Thanks!

dr3amville · 2022-06-23T14:46:47Z

Sure, the file I'd rather not provide because there's a good amount of sensitive data in there, and the logs I mentioned have rotated out but here's another example:

Telegraf error log:

2022-06-23T14:05:00Z E! [inputs.tail] Malformed log line in "/path/to/file/telegraf-metrics.log_2022-06-23-09-59": ["sPercentage=1,pagesFillFactor=1,pagesReplaceRate=1,physicalMemoryPages=1,totalAllocatedPages=1 1655993100897000000"]: metric parse error: expected field at 1:115: "sPercentage=1,pagesFillFactor=1,pagesReplaceRate=1,physicalMemoryPages=1,totalAllocatedPages=1 1655993100897000000"

Lines from the metric log file (4th line):

timer,plugins=UpdateCommunityStringPlugIn,profile=uptime_profile,agg=samples execution.succeeded=0 1655993100897000000
timer,plugins=UpdateCommunityStringPlugIn,profile=uptime_profile,agg=total execution.succeeded=0 1655993100897000000
timer,plugins=UpdateCommunityStringPlugIn,profile=uptime_profile,agg=average execution.succeeded=0 1655993100897000000
memory,igniteRegion=sysMemPlc,agg=samples allocationRate=1,dirtyPages=1,evictionRate=1,largeEntriesPagesPercentage=1,pagesFillFactor=1,pagesReplaceRate=1,physicalMemoryPages=1,totalAllocatedPages=1 1655993100897000000
memory,igniteRegion=sysMemPlc,agg=total allocationRate=0,dirtyPages=0,evictionRate=0,largeEntriesPagesPercentage=0,pagesFillFactor=0,pagesReplaceRate=0,physicalMemoryPages=200,totalAllocatedPages=200 1655993100897000000
memory,igniteRegion=sysMemPlc,agg=average allocationRate=0,dirtyPages=0,evictionRate=0,largeEntriesPagesPercentage=0,pagesFillFactor=0,pagesReplaceRate=0,physicalMemoryPages=200,totalAllocatedPages=200 1655993100897000000

Have you verified that the log file does not have any extra new lines in this line?

Yes, checked over the span of weeks of data to make sure actually because I was paranoid it was a writer issue but the log files are correct, no hidden characters or line breaks.

powersj · 2022-06-23T15:22:06Z

Thanks for some example data! However, copying that data and using that with the tail and/or file did not reproduce.

You do not have from_beginning as true in your config, so I assume you are appending to this line and not re-writing the file?

dr3amville · 2022-06-23T17:13:06Z

Yeah unfortunately it's not clear what causes it to do this, it is totally random but I feel like it should not read and flush metrics unless it is collecting the log line by line. I do not have from_beginning as true, correct, so just append and get new metrics and push. I also tried to get the file to explicitly show hidden characters but nothing showed besides line endings where they belong, etc.

It was suggested by @srebhan in the InfluxDB Community Slack that maybe a line-by-line vs at-once explicit reading mode definition might be worth implementing to try and resolve this, similar to what was done here:
#11234

powersj · 2022-06-23T18:06:12Z

@srebhan @reimda thoughts on where to go from here?

dr3amville · 2022-06-23T20:59:36Z

Just a few notes as well, it looks to be happening way more commonly with RHEL 6 than 7, not exactly sure why. Also just for reference, the largest batch of writes to the file that happens that I could find was approx 621 lines, 1863 words, 88710 bytes.

I'm guessing the only way to emulate this is to set up a simulated influx line protocol metric generator and write it out to file (maybe even just have telegraf itself do it and to a file output?) and then also configure the tail plugin to read that file and output to Influx. This way you can have it generating InfluxDB Line protocol formatted metric data writing to file and then the reading of that file and ingesting to InfluxDB.

powersj · 2022-06-23T21:12:17Z

Did this suddenly start happening recently or is this a new problem after some sort of upgrade?

We have plenty of users who are using the tail plugin with influx protocol. While we could start generating a bunch of metrics in a VM this is sounding more and more like a system-specific issue if you are seeing differences between OSes. Are these systems under heavy load or have lots of disk writes going on at the same time?

edit: more questions:

Are these systems baremetal, VMs or containers?
What type of underlying storage is used? HDD, SDD, NVMe, cloud storage?
Over what period of time are those 621 metrics written in: every second? minute?

dr3amville · 2022-06-23T21:30:35Z

It has been this way since we started adopting this method of publishing the metrics. We used to use a different stack before and leveraged filebeat to format the metrics a bit (it was in graphite format before) before publishing and we never saw any issues there.

It is a pretty basic VM setup, nothing crazy and disk IO (SSD) is extremely light (<2% almost all the time) and CPU usage is 90-95% idle for all nodes that see this issue all the time.

CPU (usage):

Disk IO (SSDs and plenty of diskspace):

The 621 metrics would be once per minute almost instantaneously. I have tried using the inotify and poll methods alike for the tail plugin and see the same issues.

We run our java applications on these nodes but since it's a really light test system, it doesn't see much usage, it's really just a test env for this observability work.

powersj · 2022-06-24T14:03:07Z

ok thanks for those metrics, so it doesn't sound like a load issue. I guess I could try to reproduce in a CentOS VM next.

srebhan · 2022-06-27T12:15:39Z

I think, the OS does not flush line-wise, but just flushes the buffer and this buffer might contain a partial line/line-fragment. The only way I see to prevent this is by enforce waiting for a complete line (i.e. wait for the newline to appear) and hold back all partial writes... To do this I suggest to implement #11234 here... Maybe we should have a common place for this kind of "scanner"...

dr3amville · 2022-07-05T18:56:37Z

@powersj - any chance you were you able to reproduce this? Do we feel like implementing #11234 might be worth a shot? I could test it out potentially.

dr3amville · 2022-07-06T14:38:26Z

Do we know how the scanner completes its reading of the line? Does it look for a specific line-ending character when reading? Could be good to enforce terminating the read on a newline instead of an EOF if that may be impacting the current read point index for the file.

Just another datapoint tho - I added namepass under the tail input to test preventing any bad partial line writes and so far so good, though it may take some more soak time to really prove it out.

srebhan · 2022-07-06T16:36:59Z

@dr3amville, the scanner takes a split function deciding when to return a chunk. #11234 uses ScanLines returning lines (e.g. wait for newline char) and the whole file (i.e. no split function), but we are free to implement more by implementing a custom SplitFunc...

Does that help?

dr3amville · 2022-07-06T18:39:40Z

@srebhan I think just the #11234 implementation is plenty if that could be ported over.

srebhan · 2022-07-06T19:28:47Z

Just give it a shot! ;-P

dr3amville · 2022-07-06T21:26:18Z

I will have to pick up some Go 😔

powersj · 2022-07-07T20:07:01Z

@srebhan maybe I'm not following, but tail seems to be a different animal. For example, we aren't using a scanner, but instead waiting on the external tail library to send us data via our receiver function.

In the upstream tail library, which is our fork of hpcloud/tail, data is read new line delimited already.

Did I follow the right path or missed something else? Thanks!

srebhan · 2022-07-08T16:37:39Z

@powersj ok, I didn't look into tail when I should have done it... :-( If the underlying library has a delimiter already, we should use it.

jeme-0-jyskebank-dk · 2023-01-19T15:17:59Z

I had same issue,
This fixed it for me:
[[inputs.tail]] watch_method = "poll"

dr3amville added the bug unexpected problem or unintended behavior label Jun 22, 2022

powersj added the waiting for response waiting for response from contributor label Jun 23, 2022

telegraf-tiger bot removed the waiting for response waiting for response from contributor label Jun 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tail input plugin does not read line-by-line and sends partial line data #11372

Tail input plugin does not read line-by-line and sends partial line data #11372

dr3amville commented Jun 22, 2022 •

edited

Loading

powersj commented Jun 23, 2022

dr3amville commented Jun 23, 2022

powersj commented Jun 23, 2022

dr3amville commented Jun 23, 2022 •

edited

Loading

powersj commented Jun 23, 2022

dr3amville commented Jun 23, 2022

powersj commented Jun 23, 2022 •

edited

Loading

dr3amville commented Jun 23, 2022 •

edited

Loading

powersj commented Jun 24, 2022

srebhan commented Jun 27, 2022

dr3amville commented Jul 5, 2022

dr3amville commented Jul 6, 2022

srebhan commented Jul 6, 2022

dr3amville commented Jul 6, 2022

srebhan commented Jul 6, 2022

dr3amville commented Jul 6, 2022

powersj commented Jul 7, 2022

srebhan commented Jul 8, 2022

jeme-0-jyskebank-dk commented Jan 19, 2023

Tail input plugin does not read line-by-line and sends partial line data #11372

Tail input plugin does not read line-by-line and sends partial line data #11372

Comments

dr3amville commented Jun 22, 2022 • edited Loading

Relevant telegraf.conf

Logs from Telegraf

System info

Docker

Steps to reproduce

Expected behavior

Actual behavior

Additional info

powersj commented Jun 23, 2022

dr3amville commented Jun 23, 2022

powersj commented Jun 23, 2022

dr3amville commented Jun 23, 2022 • edited Loading

powersj commented Jun 23, 2022

dr3amville commented Jun 23, 2022

powersj commented Jun 23, 2022 • edited Loading

dr3amville commented Jun 23, 2022 • edited Loading

powersj commented Jun 24, 2022

srebhan commented Jun 27, 2022

dr3amville commented Jul 5, 2022

dr3amville commented Jul 6, 2022

srebhan commented Jul 6, 2022

dr3amville commented Jul 6, 2022

srebhan commented Jul 6, 2022

dr3amville commented Jul 6, 2022

powersj commented Jul 7, 2022

srebhan commented Jul 8, 2022

jeme-0-jyskebank-dk commented Jan 19, 2023

dr3amville commented Jun 22, 2022 •

edited

Loading

dr3amville commented Jun 23, 2022 •

edited

Loading

powersj commented Jun 23, 2022 •

edited

Loading

dr3amville commented Jun 23, 2022 •

edited

Loading