Support Compression of Rolled Log Files #48

Closed

lajoie opened this issue Jan 30, 2018 · 30 comments
@lajoie

lajoie commented Jan 30, 2018

It would be nice if the File sink supported compressing files as they're rolled, to help save disk space.

@nblumhardt
Member

Thanks for the suggestion! We don't have any plans to attempt this at present - the sink is heavily text-oriented. It would be interesting to see if there are any extension points we could provide that would enable it, but I suspect they would need to be quite intricate.

@cocowalla
Contributor

If I'm not mistaken, all the actual writing is centralised in SharedFileSink - if we put rolling aside for the sake of simplifying this discussion, compressed output can be achieved simply by wrapping the stream in a GZipStream and a BufferedStream (a minimal sketch follows at the end of this comment).

A couple of notes:

  1. You have no control over when data is flushed to disk - flushing is handled by the GZIP stream
  2. In the event of a crash, using compression is still fairly 'safe' - as GZIP is a streaming algorithm, you're not going to lose much data. Recovery is a simple gunzip < mylog.gz > mylog.recovered.txt

An alternative is to have some kind of hook for when files are rolled, allowing the application to do something like ZIP the file. But I'm not sure if there is any 'precedent' for hooks in any of the other Serilog sinks? In any case, I'd personally favour a streaming implementation for its more predictable CPU and disk usage.
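
To illustrate the streaming approach, here's a minimal sketch, independent of the sink's actual code; the file name, compression level and buffer size are arbitrary assumptions:

using System.IO;
using System.IO.Compression;

// Appending creates a new GZIP member each time the file is re-opened;
// gunzip concatenates members transparently on decompression.
Stream output = File.Open("mylog.gz", FileMode.Append, FileAccess.Write, FileShare.Read);
var gzip = new GZipStream(output, CompressionLevel.Fastest);

// Buffer writes so the compressor sees larger blocks, improving the ratio
var buffered = new BufferedStream(gzip, 32 * 1024);

using (var writer = new StreamWriter(buffered))
{
    writer.WriteLine("log events go here");
}
// Disposing the writer flushes and disposes the buffered, GZIP and file streams in turn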

@cocowalla
Contributor

@nblumhardt have you any thoughts on this?

@nblumhardt
Member

Thanks for the nudge! Personally, and this might not be an informed point of view :-), I don't find compressed-file logging appealing. On Windows at least, it's likely easier and more ergonomic just to mark a folder as "compressed" and let the OS/filesystem deal with it, if you're stuck with low-disk-space local file logging.

With disk being at least as cheap to expand as CPU, it seems like adding storage is also probably a better bang-for-the-buck solution.

Adding something like a Stream-to-Stream extension point is an interesting possibility, though; by plugging in a factory from Stream to Stream, wrapper streams could be created, and these might be useful for implementing other cases like headers/footers (write the header if length == 0 on wrapping, write the footer inside Dispose(), etc.; a rough sketch follows below), as well as this one.

There are likely quite a few gory details to figure out, though - it'd require some dedicated time and effort. If anyone's interested in enumerating the use cases, how they'd be achieved, and the rough API/internal design, it'd be a worthwhile discussion.
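
To make the headers/footers case concrete, a rough sketch of a wrapper stream that writes a footer as it's disposed (FooterStream is hypothetical, not part of the sink):

using System;
using System.IO;

// Hypothetical write-only wrapper: emits a footer just before the underlying stream closes
class FooterStream : Stream
{
    readonly Stream _inner;
    readonly byte[] _footer;

    public FooterStream(Stream inner, byte[] footer)
    {
        _inner = inner;
        _footer = footer;
    }

    public override bool CanRead => false;
    public override bool CanSeek => false;
    public override bool CanWrite => _inner.CanWrite;
    public override long Length => _inner.Length;
    public override long Position
    {
        get => _inner.Position;
        set => throw new NotSupportedException();
    }

    public override void Flush() => _inner.Flush();
    public override void Write(byte[] buffer, int offset, int count) => _inner.Write(buffer, offset, count);
    public override int Read(byte[] buffer, int offset, int count) => throw new NotSupportedException();
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing)
        {
            // The footer goes out last, then the wrapped stream is closed
            _inner.Write(_footer, 0, _footer.Length);
            _inner.Dispose();
        }
        base.Dispose(disposing);
    }
}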

@cocowalla
Contributor

My particular use case is for logging audit logs straight to an archive - the expectation is that these will be needed only rarely, so it makes sense to compress them from the start. I also need it to work on both Windows and Linux. While storage is cheap, we also expect to log a lot of data, and with excellent compression ratios achievable with almost no impact on CPU or memory, it's hard to think of a reason not to compress files.

Have you ever used Windows compressed folders before? I ask because, if you had, you likely wouldn't have suggested it ;) They have unfortunately always had terrible compression ratios and awful performance.

If I get some time, I'll come up with a couple of suggestions, for both the streaming (some kind of Stream extension point) and static (likely some kind of rolling callback) scenarios.

@nblumhardt
Member

> this might not be an informed point of view

Don't say I didn't warn you ;-). We use Snappy compression in Seq 👍. I should have further qualified:

> if you're stuck with low-disk-space local file logging

It's been a long time since I thought of a local log file as anything more than a temporary destination while logs make their way off to a central archive, where compression is pretty normal. Auditing directly to the archive is an interesting twist on this. Will be interesting to see what you come up with!

@AdamUdovich

AdamUdovich commented Oct 17, 2018

I have started to work on resolving this issue with @lajoie for our specific use case.

We are basing the compression configuration parameters on Java's logback library.

As of now, I am planning on checking parameters at the end of a roll period for compression, then compressing the log file using .NET's ZipFile to create a zip for each file. Once the compressed version is created, the original file will be deleted. I am working on gzip compression as well.

Do you have any suggestions related to this method or things I should consider?

@cocowalla
Contributor

@AdamUdovich I've just got round to thinking about this again too. I'm interested to see what you do, and was wondering if it's extensible?

In particular, it would be nice to easily switch out the compression algorithm - for example by specifying your own implementation of an IFileRolledHandler (or even just an Action<string>, where the string is the filename). As a bonus, this would also allow performing arbitrary actions on rolled files, such as moving them to another folder, encrypting them, signing them or writing them to read-only media.

@nblumhardt what do you think about this? If we put aside the option of streaming compression and focus on compressing rolled files, these are the options I had been considering:

  1. Build support for popular compression algorithms (e.g. ZIP, GZIP), with the ability to configure compression parameters
  2. Build a user-extensible way to plug in any compression algorithm (or indeed any action on a rolled file), such as providing an implementation of an interface (a rough sketch follows this list)
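
For illustration, option 2 might look something like this; the interface name and members are hypothetical, not an existing Serilog API:

using System.IO;
using System.IO.Compression;

// Hypothetical extension point: called after the sink has finished with a rolled file
public interface IFileRolledHandler
{
    void Handle(string filePath);
}

// Example implementation: gzip the rolled file, then delete the original
public class GZipRolledHandler : IFileRolledHandler
{
    public void Handle(string filePath)
    {
        using (var input = File.OpenRead(filePath))
        using (var output = File.Create(filePath + ".gz"))
        using (var gzip = new GZipStream(output, CompressionLevel.Optimal))
        {
            input.CopyTo(gzip);
        }

        File.Delete(filePath);
    }
}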

@AdamUdovich

AdamUdovich commented Nov 3, 2018

@cocowalla I do not currently have plans to make the compression system extensible, but I can check with @lajoie to see if it is something we would want to work on.

One question you might be able to answer is how unit tests should be structured for testing the rolling compression. Do you have specific testing goals for a basic Zip and GZip system that Serilog would like to see?

@cocowalla
Contributor

@AdamUdovich I think there are 2 parts:

  1. Test that the compression code produces compressed files. So you're checking for the existence of the expected file, and checking that it has the correct magic byte sequence for whichever algorithm is in use (see the sketch below)

  2. Test that compressed files are created when files are rolled (I guess you also optionally delete the plaintext file?)

There are existing tests for file rolling, so it makes sense to build on them.
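
As a sketch of point 1, a magic-byte assertion could look like this, assuming an xunit test in the style of the existing suite and a hypothetical already-rolled file; the bytes 0x1F 0x8B are the GZIP magic number (ZIP files instead begin with 0x50 0x4B, "PK"):

using System.IO;
using Xunit;

public class CompressionTests
{
    [Fact]
    public void RolledFileIsGZipCompressed()
    {
        var path = "log-20181103.txt.gz"; // hypothetical rolled-and-compressed file

        using (var stream = File.OpenRead(path))
        {
            var header = new byte[2];
            var read = stream.Read(header, 0, 2);

            // A GZIP file must start with the two-byte magic number 0x1F 0x8B
            Assert.Equal(2, read);
            Assert.Equal(new byte[] { 0x1F, 0x8B }, header);
        }
    }
}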

Also, just so it's clear: I don't own this repo, I'm just a Serilog user who has written a few sinks, and would like compression to be added :)

@nblumhardt, it would be great to get your thoughts on compression, especially with regard to the last few comments here.

@AdamUdovich

Ok, thank you @cocowalla! I do not have an option to not delete non-compressed files at the moment. I can check if this is something I will build in, or if a repo admin says it should be included.

@cocowalla
Contributor

@AdamUdovich Linux's logrotate is probably the most well-known log rotating tool, and it has this feature. I can't think why you wouldn't want to delete the original if you now have a compressed copy? :)

@AdamUdovich

@cocowalla do you think Serilog would like to see functionality based on logrotate for this type of compression? Or is this a question more for @nblumhardt, on what the repo would like to see done structure-wise in general?

@cocowalla
Contributor

@AdamUdovich yes, I'd consider the option to delete files once they've been compressed as essential for any log rolling feature.

Would be good, though, to get some input from @nblumhardt on the questions of extensibility, algorithm support and the like.

@nblumhardt
Member

@cocowalla 👍 I had in mind that the primary challenge here would be to provide an appropriate hook/set of hooks in the file sink to allow compression (and other odd cases such as perhaps "headers" and "footers") to be plugged in. At least at the outset, I think it'd be better to have different schemes and algorithms provided as additional packages/projects not under this repo.

The main challenges in doing this (from the File sink's perspective), IIRC, are synchronization (the sink currently has no means of knowing when the last writer has finished with the outgoing file, and we'd ideally not introduce more contention in fixing this) and state management (at startup, perhaps after a crash, do we have any outstanding "rolled" hooks to invoke?).

HTH; seems like getting a fork going and experimenting with different approaches might be a way to work through this all more smoothly. Cheers!

@cocowalla
Contributor

@nblumhardt do you mean that the file can still be written to after the call to RollingFileSink.CloseFile()? If so, could that be rectified by calling FlushToDisk() before disposing of the sink in CloseFile()?

@AdamUdovich would be great if you could push your fork to GitHub so we can see how you're approaching some of the challenges?

@cocowalla
Contributor

cocowalla commented Nov 12, 2018

@nblumhardt as an experiment, I just added support for wrapping the output stream, which enables things like streaming GZIP compression. Turns out this is much simpler than adding a 'rolling hook' 😏

My fork: https://github.com/cocowalla/serilog-sinks-file/tree/feature/streaming-compression
Diff: here

I've only added this to FileSink for now, but it would be trivial to add the same to the SharedFileSink implementations.

This is just an experiment, so I've kept it as bare-bones as possible. One thing of note is that the Func<Stream, Stream> wrapper parameter is not going to play nicely with JSON configuration - it's only going to work when programmatically building the sink. There are ways around that, such as changing the parameter to something like an IFileSinkStreamWrapper and having callers provide their own implementation (I think the JSON config extension can new up classes if a type is specified?)
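
For context, programmatic configuration might look roughly like this; the streamWrapper parameter name is an assumption for the sketch, not necessarily what the fork exposes:

using System.IO;
using System.IO.Compression;
using Serilog;

// Hypothetical: wrap the sink's output stream in a GZIP compressor
Log.Logger = new LoggerConfiguration()
    .WriteTo.File("log.txt.gz",
        streamWrapper: underlying => new GZipStream(underlying, CompressionLevel.Fastest))
    .CreateLogger();

Log.Information("Hello, compressed world!");
Log.CloseAndFlush();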

@AdamUdovich

AdamUdovich commented Nov 12, 2018

@cocowalla I will be able to push my fork in the next couple of days. We are reviewing it internally first.

My GZip compression uses a stream like your test; however, I do not use sink.Emit(event). Can you touch on what Emit() does in your test on lines 162-163?

@cocowalla
Contributor

@AdamUdovich you had earlier said you would be implementing compression after rolling, which is why I created a fork in the other direction, using streaming compression.

Emit just writes a LogEvent to the sink.

@nblumhardt have you had any further thoughts about streaming vs rolling compression? If you think streaming compression is a goer, I'm happy to develop my fork a bit further.

@nblumhardt
Member

No opinions here, currently, @cocowalla :-)

Sorry to not be more help right now but watching with interest.

@AdamUdovich

@cocowalla I opened a pull request with my changes to support compression on roll. If I should not have opened a pull request and should just share the branch instead, let me know.

I have included a few unit tests as well, which check the magic byte numbers.

@cocowalla
Contributor

@AdamUdovich I've added some questions/comments on a few of your most recent commits, but I confess I'm finding your changes a little difficult to follow.

Perhaps you could talk through exactly what is happening here:

var prevLog = Directory.GetFiles(_roller.LogFileDirectory, _roller.DirectorySearchPattern)
						 .Select(Path.GetFileName).LastOrDefault();
var logDirectory = _roller.LogFileDirectory;

CloseFile();

if (_compressionType != CompressionType.None)
{
	Compress(prevLog, logDirectory, _compressionType);

	var directoryFiles = Directory.GetFiles(_roller.LogFileDirectory, _roller.DirectorySearchPattern);

	foreach (var directoryFile in directoryFiles)
	{
		var directoryFileName = Path.GetFileName(directoryFile);

		// Not sure if we can assume log files will always be .txt?
		if (System.IO.Path.GetExtension(directoryFileName) == ".txt")
		{
			Compress(directoryFileName, logDirectory, _compressionType);
		}
	}
}

...

public void ZipCompress(string logFile, string logDirectory)
{
	var readDirectory = Path.Combine(logDirectory, "new_dir");
	Directory.CreateDirectory(readDirectory);
	System.IO.File.Move(
		Path.Combine(logDirectory, logFile), Path.Combine(readDirectory, logFile)
	);

	var zipName = Path.GetFileNameWithoutExtension(logFile);
	var zip_path = Path.Combine(logDirectory, $"{zipName}.zip");
	System.IO.Compression.ZipFile.CreateFromDirectory(readDirectory, zip_path);

	Directory.Delete(readDirectory, true);
}
        
public void GZipCompress(string logFile, string logDirectory)
{
	var inFile = Path.Combine(logDirectory, logFile);                  

	var logName = Path.GetFileNameWithoutExtension(logFile);
	var outFile = Path.Combine(logDirectory, $"{logName}.gz");

	using (var outputStream = System.IO.File.Create(outFile))
	using (var inputStream = System.IO.File.OpenRead(inFile))
	using (var gzipStream = new System.IO.Compression.GZipStream(outputStream, System.IO.Compression.CompressionMode.Compress))
	{
		int bufferSize = 1024;
		var buffer = new byte[bufferSize];

		inputStream.CopyTo(gzipStream, bufferSize);
	}       

	System.IO.File.Delete(inFile);
}

Also:

  1. It appears you are deleting the source files after compressing them?

    • Should this be configurable?
    • Are file locking issues a possibility here?
  2. Why are you overriding the default buffer size in CopyTo in GZipCompress (which I guess is 64K)? This will result in sub-par compression ratios

  3. Is there any possibility of synchronization issues (raised by @nblumhardt here)?

  4. Does this address the "state management" issues raised by @nblumhardt here?

  5. In ZipCompress, you move the rolled file and create a ZIP from the directory. Is this just to keep things simple for now?

@AdamUdovich

Hi @cocowalla sorry for the delay.

We did not think it was necessary to add an option to keep the original uncompressed files after compressing them. Is this something @nblumhardt would have to comment on, regarding what the final option should be?

I can adjust the buffer size if you think this would be best. The default buffer looks to be about 80 KB: https://docs.microsoft.com/en-us/dotnet/api/system.io.stream.copyto?view=netframework-4.7.2

I am not sure about addressing any synchronization or state issues. I asked @lajoie about these and we do not believe they are necessary for our project. I can address them if @nblumhardt is able to elaborate on them some more and time permits.

Finally, the ZipFile.CreateFromDirectory method provided by Microsoft only allows for compression of directories.

@cocowalla
Contributor

  1. Making deletion of the uncompressed files configurable should be pretty straightforward, and it still seems like an essential feature to me, and something that others would clearly want, even if you don't personally need it

  2. Regarding the buffer, yes, it seems the default buffer is 80KB. It's perhaps best to not explicitly set it, so the CLR can use its default.

  3. Regarding ZIP files, ZipFile.CreateFromDirectory is not the only way to work with them - there is also the more flexible ZipArchive:

using (var zip = ZipFile.Open($"{zipName}.zip", ZipArchiveMode.Create))
{
    // Set CompressionLevel to whatever is configured
    zip.CreateEntryFromFile(prevLog, Path.GetFileName(prevLog), CompressionLevel.Optimal);
}

@cocowalla
Contributor

@nblumhardt I'm still thinking that the best way forward here is to add 2 extensibility points that allow for either/both of:

  1. Wrapping the output stream
  2. Firing a callback when the output stream is disposed (e.g. when the log file is closed or rolled)

These 2 would enable the creation of a range of new sinks that extend FileSink - or simply enable callers to easily pass in their own behaviours at FileSink creation.

My fork demonstrated the ease with which the former can be added. Wrapping the output stream supports a variety of use cases, such as streaming compression and encryption (see #71).

For the latter, I personally don't think compression should be built in to Serilog.FileSink; I'd prefer it to be enabled by an extensibility point that allows the caller to do anything with the file after it's been closed by Serilog - that way you can plug in ZIP, GZIP or 7-ZIP compression, ship the logs to WORM storage, cryptographically sign them, etc. As for the external API, it could be something as simple as an Action<string> (where the filename is passed in), or a simple interface with a Handle(string filename) method on it.

@nblumhardt , you've been unapologetically quiet on this so far 😄! What are your thoughts on the question of either building-in specific ZIP compression functionality, or adding extensibility points that enable a range of scenarios?

Also, if you like the idea of extensibility points, would you accept a PR for my stream-wrapping extensibility point, if I first ensure it plays nicely with JSON/AppSettings configuration? (At the least, I guess I'd need to switch out the Func<Stream, Stream> for an interface.)

@nblumhardt
Member

Hey! Nice to see this rolling along ;-)

Regarding zip-specifics, no, I don't think we'd want to deal with compression, encryption, etc. in this repo. Some basic hooks like you've described sound best, as long as they're easy enough to use correctly (i.e. we're clear on what synchronization is required by implementers).

Is the stream wrapping alone enough for your use case, @cocowalla ? If so, perhaps we start on a PR that adds the extension point, in a non-breaking way, to FileSink? (FWIW, using an abstract base class rather than an interface will make future changes easier to make in a non-breaking way.)
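
To illustrate the non-breaking point, a sketch of what such a base class could look like; the name and member are assumptions, not necessarily the shape that ends up merged:

using System.IO;
using System.Text;

// Hypothetical hook base class: members are virtual with pass-through defaults,
// so new hooks can be added later without breaking existing subclasses
public abstract class FileSinkHooks
{
    // Called when a log file is opened, allowing the output stream to be wrapped
    public virtual Stream OnFileOpened(Stream underlyingStream, Encoding encoding) => underlyingStream;
}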

Ideally we'd take the same approach to the post-roll handler - there's more to work out for that one, but perhaps once the first PR is in, we can take a look at adding on top? Regarding concurrency and so on, @AdamUdovich - coming back around to this now, we could probably figure out most gotchas in the PR process.

Finally, to get it into a stable release, I think we'd need at least one usable implementation of each extension point published somewhere (a pair of extension NuGet packages?). Without some wider usage/feedback it will be hard to be sure we've got it right :-)

@cocowalla
Contributor

@nblumhardt excellent, then on my side I'll keep the ball rolling on the stream extensibility point 😆

@cocowalla
Contributor

@nblumhardt I rolled up my sleeves (OK, these are getting boring now 🙄) and got a PR in for you to review.

@nblumhardt
Member

#81 relates to this.

@nblumhardt
Member

#80 is merged and on NuGet, let's see how this goes 😎
