Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pkg/stanza] support gzip compressed log files for file log receiver #33406

Conversation

bacherfl
Copy link
Contributor

@bacherfl bacherfl commented Jun 6, 2024

Description: This PR adds support for reading gzip compressed log files for the file log receiver. This is done by, if enabled via the gzip_file_suffix parameter, creating a gzip.Reader on top of the file handle of a compressed file.

Link to tracking Issue: #2328

Testing: Added unit tests for the new functionality. Manually tested using the following configuration for the filelog receiver:

  filelog:
    include: [ ./simple.log*.gz ]
    start_at: beginning
    gzip_file_suffix: ".gz"
    operators:
      - type: regex_parser
        regex: '^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<sev>[A-Z]*) (?P<msg>.*)$'
        timestamp:
          parse_from: attributes.time
          layout: '%Y-%m-%d %H:%M:%S'
        severity:
          parse_from: attributes.sev

Documentation: Added documentation in the readme of the file log receiver

bacherfl added 3 commits June 5, 2024 07:00
Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
…ed reader

Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
Copy link
Member

@djaglowski djaglowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this PR together. It's very helpful towards understanding how this might look.

The main thing I'm wondering about is whether it's really necessary to have such a separate implementation. I see a lot of functionality that was written in a different way, but may not actually be necessary. For example, we currently track changes in non-compressed files a certain way. It's not yet clear to me why exactly we would need to change this to support gzip (or other appendable compression formats).

If I can ask for one major direction change - is it possible that we could just our current Reader struct by changing its internal file field to an io.Reader? Then, when we build a Reader for a non-compressed file, we assign an os.File. And when we build a Reader for a gzip file, we assign a gzip.Reader.

It's possible this doesn't work for a reason I'm overlooking (maybe tokenization or offset updating) but I'd really like to start from this idea and clearly identify why it doesn't work so we can address those needs specifically without adding a lot of other changes.

pkg/stanza/fileconsumer/internal/reader/factory.go Outdated Show resolved Hide resolved
}

initalRead := true
if lastReadFileSize, ok := r.FileAttributes["lastReadFileSize"]; ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to adding such an attribute but it should be proposed in another issue.

In another comment I question whether we need to track this for the sake of this implementation. If we don't, let's remove it. If we do, let's use an unexported field on the reader struct.

@bacherfl
Copy link
Contributor Author

bacherfl commented Jun 7, 2024

Thanks for putting this PR together. It's very helpful towards understanding how this might look.

The main thing I'm wondering about is whether it's really necessary to have such a separate implementation. I see a lot of functionality that was written in a different way, but may not actually be necessary. For example, we currently track changes in non-compressed files a certain way. It's not yet clear to me why exactly we would need to change this to support gzip (or other appendable compression formats).

If I can ask for one major direction change - is it possible that we could just our current Reader struct by changing its internal file field to an io.Reader? Then, when we build a Reader for a non-compressed file, we assign an os.File. And when we build a Reader for a gzip file, we assign a gzip.Reader.

It's possible this doesn't work for a reason I'm overlooking (maybe tokenization or offset updating) but I'd really like to start from this idea and clearly identify why it doesn't work so we can address those needs specifically without adding a lot of other changes.

Thanks a lot for the detailed review @djaglowski! I agree that the number of modifications to make this work with gzip files should be as small as possible, so I'll look into how this can be best achieved today. The main challenge here though is to figure out how to track the current offset within a gzip compressed file, as it does not seem to be possible to start reading from an offset within such a file. However I might be missing something here, so I appreciate any ideas regarding how this could be done.

Although, one question just to clarify: For this use case, i.e. ingesting logs from gzip compressed files, does it actually make sense to keep watching for incoming changes in the compressed file, or would it make sense to emit all the logs of a compressed file, as soon as we get a new one that matches the file name pattern? The reason i'm asking this is because I would assume that in a usual log file rotation workflow the compressed file is created at the end of a rotation period with the delta to the previous interval being stored, and not being updated inbetween.
If we can rely on this assumption this would make the implementation much simpler with most of the existing code being able to be reused (most likely changing from *os.File to io.Reader in the existing reader would be sufficient then)

bacherfl added 8 commits June 10, 2024 10:32
Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
bacherfl added 4 commits June 10, 2024 13:46
Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
…eiver-compressed-files

# Conflicts:
#	pkg/stanza/fileconsumer/file_test.go
#	pkg/stanza/go.mod
Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
bacherfl added 2 commits June 11, 2024 08:52
Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
@bacherfl bacherfl marked this pull request as ready for review June 11, 2024 08:07
@bacherfl bacherfl requested a review from a team June 11, 2024 08:07
@bacherfl
Copy link
Contributor Author

@djaglowski this should be ready for review now :)

Copy link
Member

@djaglowski djaglowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great progress on this. It's looking much cleaner.

I was trying to clarify my understanding of the changes so ended up pulling your branch and tinkering a bit. I came up with what seems like a simpler way to handle gzip readers and made a PR in bacherfl#1. Let me know what you think.

pkg/stanza/fileconsumer/config.go Outdated Show resolved Hide resolved
pkg/stanza/fileconsumer/internal/reader/factory.go Outdated Show resolved Hide resolved
pkg/stanza/fileconsumer/internal/reader/factory.go Outdated Show resolved Hide resolved
Leave internal reader nil if gzip EOF
@bacherfl
Copy link
Contributor Author

bacherfl commented Jun 12, 2024

Great progress on this. It's looking much cleaner.

I was trying to clarify my understanding of the changes so ended up pulling your branch and tinkering a bit. I came up with what seems like a simpler way to handle gzip readers and made a PR in bacherfl#1. Let me know what you think.

Thanks a lot for the review and the suggestions @djaglowski! The changes you proposed with your PR all made sense to me, so I merged them into this branch now .
One update though: I had to re-add the offset override via the deferFunc again - without that I ran into invalid gzip header errors after the initial read of a gzip file

bacherfl added 2 commits June 12, 2024 07:35
Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
@djaglowski djaglowski merged commit 01efffd into open-telemetry:main Jun 14, 2024
154 checks passed
@github-actions github-actions bot added this to the next release milestone Jun 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants