
[Bug]: FileDescriptors not properly released on file rotate #2100

Closed
kpalumbokitu opened this issue Mar 29, 2022 · 23 comments

@kpalumbokitu

🔎 Search Terms

file, disk usage

The problem

Environment: Ubuntu 18.04 LTS

When running the winston file transport with the tailable option set to true, we found that on log rotation the files are renamed up to maxFiles, but the old file descriptors remain open in the node process.

The result is that if you measure file space usage with a tool like "du", the deleted files no longer show up, but the disk space is still held and is reported as used by a tool like "df". If a node application using the winston file transport this way is left running long enough, the node process keeps accumulating these stale descriptors, eventually exhausting the available file descriptors (and the partition's space along with them), and new files can no longer be created.
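
To see this discrepancy directly (the path below is just a placeholder for wherever the logs live), compare the two tools while the process is running:

  • df -h /path/to/logs — space held by deleted-but-open files is still counted as used
  • du -sh /path/to/logs — the deleted files are no longer visible, so this number stays small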

What version of Winston presents the issue?

v3.2.1 and v3.6.0

What version of Node are you using?

v14.19.0

If this worked in a previous version of Winston, which was it?

unknown (but works in Node v12 and Node v10)

Minimum Working Example

This example sets up file rotation so that you will quickly see the issue. You can adjust MAX_SIZE, MAX_FILES, and TIMEOUT_MS if things are happening too fast.

'use strict';

const MAX_SIZE = 2000;
const MAX_FILES = 10;
const TIMEOUT_MS = 1000;

const winston = require('winston');
const {format} = require('logform');
const path = require('path');
const {combine, timestamp, json} = format;

// Set up file transport (create the logs folder if it does not already exist)
const logsDir = path.resolve(process.cwd(), './logs');
require('fs').mkdirSync(logsDir, {recursive: true});
const filePath = path.join(logsDir, 'example.log');
const fileTransport = {
  tailable: true,
  maxsize: MAX_SIZE,
  maxFiles: MAX_FILES,
  filename: filePath,
  handleExceptions: true,
  level: 'debug',
};
fileTransport.format = combine(
  timestamp({format: 'YYYY-MM-DD HH:mm:ss.SSS'}),
  json(),
);

// Create logger with file transport
const logger = winston.createLogger({
  transports: [new winston.transports.File(fileTransport)],
  exitOnError: false,
});
logger.log('info', 'Created file logger');

// Add periodic logs so that files are rotated
let i = 0;
setTimeout(timeoutLogger, TIMEOUT_MS);
function timeoutLogger() {
  i++;
  logger.log('info', 'another log message ' + i);
  logger.log('error', 'error message ' + i);
  logger.log('http', 'http log message ' + i);
  logger.log('verbose', 'verbose log message ' + i);
  logger.log('debug', 'debug log message ' + i);
  setTimeout(timeoutLogger, TIMEOUT_MS);
}

While running this example, use a terminal to check the file descriptor status using:

  • lsof | grep example | grep deleted
  • ps aux | grep node (to find the node PID) and then ls -al /proc/${PID}/fd

You'll see "deleted" files associated with open file descriptors.
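
If it is more convenient to watch the leak from inside the process than with lsof, a small helper along these lines can be appended to the MWE. This is a Linux-only sketch (it reads /proc/self/fd) and is not part of the original reproduction script; the 'example' filter simply matches the MWE's log file name.

const fs = require('fs');

function countDeletedLogFds() {
  // Each entry in /proc/self/fd is a symlink to the underlying file;
  // files that were unlinked while still open get a "(deleted)" suffix.
  return fs.readdirSync('/proc/self/fd').filter((fd) => {
    try {
      const target = fs.readlinkSync(`/proc/self/fd/${fd}`);
      return target.includes('example') && target.endsWith('(deleted)');
    } catch (err) {
      return false; // the descriptor was closed between readdir and readlink
    }
  }).length;
}

setInterval(() => {
  console.log('deleted-but-open log fds:', countDeletedLogFds());
}, 5000);

On an affected setup, the count reported here should climb as rotations happen.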

Additional information

As I mentioned above, this problem is occurring in Node 14 and we did NOT see it in Node 12 or Node 10. A check through the Node 14 documentation showed a change to the fd garbage collection that may be what is triggering this issue now:
nodejs/node#28396

I poked around the winston file transport code, and the problem seems to be tied to _rotateFile() being called multiple times (multiple log writes each triggering the check that the maximum file size has been reached and the file needs to be rotated). That function then calls _incFile() multiple times, and the asynchronous handling of these overlapping calls appears to leave the stale file descriptors behind.
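
To make that failure mode concrete, here is a deliberately simplified standalone sketch (this is not winston's actual code, and the path is only illustrative) of what happens when rotation runs once per pending write instead of once per rotation:

const fs = require('fs');

// Make sure the illustrative target directory exists.
fs.mkdirSync('./logs', {recursive: true});

// Pretend several pending log writes each independently decide
// "the file is full, rotate now". Every call opens a fresh descriptor
// for the rotated target, but only the newest stream is ever used,
// and the earlier ones are never ended, so their descriptors stay open.
function naiveRotate(target) {
  return fs.createWriteStream(target, {flags: 'a'});
}

const streams = [];
for (let i = 0; i < 5; i++) {
  streams.push(naiveRotate('./logs/example1.log')); // 5 open descriptors for one file
}

The guard in the change below collapses those overlapping rotations into one.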

Just to gain some proof, I modified the line lib/winston/transports/file.js:172 as follows, and that does fix the problem.

// Original line:
this._endStream(() => this._rotateFile());

// My change:
this._endStream(() => {
  if (this._rotate === true) {
    this._rotate = false;
    this._rotateFile();
  }
});

I don't claim that is a good and robust fix, but maybe it points toward a solution. I'm happy to try out potential solutions if the experts have thoughts on that. Please let me know if you need any other information or if there is anything I can do to help resolve this.

@wbt
Contributor

wbt commented Mar 29, 2022

I wonder if this might be the cause of #2065?

@kpalumbokitu
Author

We are not seeing that error with our implementation. I'm not able to run the MWE in that issue description to see if we see the same problem.

@sn-tm

sn-tm commented Apr 11, 2022

Can confirm we are seeing the same issue on 3.2.1. lsof reports hundreds of file descriptors holding onto the *1.log and *9.log files. The issue only seems to happen in applications that log in rapid bursts (that's a separate issue which we'll address!). I haven't had a chance to troubleshoot yet.

@gcagle3

gcagle3 commented Apr 12, 2022

Can confirm we are seeing the same issue, 3.2.1. Using lsof reports hundreds of file descriptors holding onto the *1.log and *9.log files. The issue only seems to happen to applications which log in rapid bursts (that's a separate issue which we'll address!) I haven't had a chance to troubleshoot yet.

I'm also seeing similar behavior. I've noticed that when tailable is set to true this happens very frequently, but changing tailable to false reduces the occurrence quite a bit. It's important to note that it still happens when tailable is false if the log writing bursts enough.

Testing with the proposed fix in the original post, both tailable true and false work correctly (as expected) without leaving stale file handles open. @wbt, what is the possibility of getting the proposed fix merged in?

@andrewscfc

Hi folks,

We recently upgraded our service from Node 12 to Node 16 and experienced this issue in production. We will shortly deploy a patched version of Winston to production with the changes in this PR: #2231

I see the fix very much as a workaround; it would be good if any maintainers/contributors could advise whether you'd like this brought into your mainline or whether you would prefer to investigate a proper fix.

We are happy to make whatever changes are required to get this into the mainline, but we have exhausted the time we can spend investigating the bug further, given that we have a workaround.

In any case, I hope this helps others to get around this issue!

Thanks
Andrew

@wbt
Contributor

wbt commented Nov 15, 2022

I appreciate the efforts here to report and diagnose the issue - it does seem like something that should be fixed.
#2115 is still marked as a draft and fails checks, and might not be a root cause fix. With no funding for maintainers on the project, a deep dive into finding the root cause will have to come from the community, unless a core maintainer gets blocked by the same issue.

@esoltys

esoltys commented Jun 6, 2023

This issue might have been fixed in 3.9.0 by the changes in #2301

@andrewscfc

This issue might have been fixed in 3.9.0 by the changes in #2301

Thanks for the heads-up @esoltys. I believe I tried something similar before ending up with this rather extreme workaround: https://github.com/winstonjs/winston/pull/2231/files

We'll give the upgrade a go at some point.

@Harsh062

Harsh062 commented Jul 5, 2023

@andrewscfc @esoltys We have upgraded to winston version 3.9.0. The issue is, however, not resolved, and we are still seeing "deleted" files associated with open file descriptors.

@andrewscfc

Thanks @Harsh062. FWIW, our workaround above definitely worked for us, if you have an urgent need to get around the issue.

@JohnXLivingston

Hello.

Any update on this bug?

We can confirm that we are experiencing the same issue in the Peertube project.
In my case it is with Node v18, and I think I already had it with Node v16.

@DABH
Contributor

DABH commented Jan 3, 2024

@andrewscfc is #2231 still working for you?

@JohnXLivingston I'm assuming #2301 did not fix things for you?

This solution #2115 was also proposed earlier, but it's untested and apparently breaks an existing test. It sounds like it might be an attempt at an actual fix, though, as opposed to #2231, which does seem to bandaid the issue but doesn't fix whatever the root cause is here.

Actually, I guess #2115 is what was mentioned in the issue report at the top by @kpalumbokitu, but it was noted that this might not be robust. Does anyone have thoughts or opinions on whether that's the case? If it's robust but ugly, and fixes things for everyone, perhaps we should go ahead and merge it for now [if the failing tests can be fixed]? Does anyone have time to volunteer to look a bit deeper?

Let me know - I'm happy to merge whatever folks think is the best solution here. Unfortunately the maintainers lack funding to investigate this more deeply, so we'll have to take the approach of merging whatever solution the community thinks is best, and going from there -- assuming there is at least something half-decent for us to merge :) Thanks folks!

@JohnXLivingston

@JohnXLivingston I'm assuming #2301 did not fix things for you?

I just checked: when I was experiencing this bug, Peertube was still using winston 3.8.2. It now uses 3.11.0 (and this fix is present).
So I have to do some testing (I need to set the log level to debug and wait some hours or days).

I will let you know.

@JohnXLivingston

Update: after 30 minutes, the log files rotated, but Peertube was still writing to the first log file (peertube1.log):

[screenshot: file listing showing modification times of the rotated logs]

(Notice the modification time, 10:32 for both peertube1.log and peertube2.log, even though the file rotated several minutes ago.)

And after a few more minutes, it rotated again. This time it is logging to peertube1.log and peertube3.log. The file peertube2.log is properly closed:

[screenshots: file listing and open file descriptors after the second rotation]

So I assume that peertube1.log will remain open, and that after some time it will still be held open even once it has been deleted from disk. I will confirm this in a few days.

Winston version: 3.11.0

@DABH
Contributor

DABH commented Jan 3, 2024

Thanks so much for the detailed follow up! 🙏 I’ll look for your update in a few days, and hopefully we can close in on a fix from there.

@JohnXLivingston

I can confirm that the bug is still here with Winston 3.11.0 :(

[screenshot: lsof output showing peertube1.log open multiple times]

Is it normal that the peertube1.log file is open multiple times?

@DABH
Contributor

DABH commented Jan 7, 2024

@JohnXLivingston What OS are you on? When I try to run this (the MWE in the original post) it looks like it's working fine (only one open fd). Going to try on a linux vm...

@DABH
Contributor

DABH commented Jan 7, 2024

I tried on Debian 12 with Node v19.3.0, and the original MWE seems to work fine for me... if I run your command
watch -n.5 "lsof -p 3841298 | grep -P 'log'"
while I'm running that script (and I increased some of those parameters to make it more aggressive), I only ever see one file handle open. (And in fact, I never see anything marked as deleted...)

@DABH
Contributor

DABH commented Jan 7, 2024

I looked at the proposed fix from the original post, but I feel like https://github.com/winstonjs/winston/blob/master/lib/winston/transports/file.js#L136-L142 is supposed to address what that proposed fix is doing, so I'm not sure what's going on here. Anyway, if anyone can help me reproduce this, I can try to help debug further...

@JohnXLivingston

@JohnXLivingston What OS are you on? When I try to run this (the MWE in the original post) it looks like it's working fine (only one open fd). Going to try on a linux vm...

Debian, with Node.js v18.
In case it was unclear, I encountered this issue in the Peertube project. My tests were not on the example given in this issue.
Maybe it is not exactly the same bug.
Do you want me to open another issue with more specific information? (So you can close this one.)

@DABH
Contributor

DABH commented Jan 8, 2024

Yeah, that sounds great, thank you.

FWIW there was some good analysis done here #2301 (comment)

If there is any specific info you can provide on what you're doing so I can reproduce things with peertube on my machine, that would be helpful. Like if there's any way to make the log rotation happen faster so that I don't have to wait a long time to see if things are working or not, that would be helpful.

From what I can tell, #2301 solved the issue mentioned here though (at least I couldn't reproduce it anymore on Mac or Debian with two different node versions), so I'm going to go ahead and close this one out. Please feel free to @ me in the new issue you open.

@joeythelantern

Although I cannot post my code since it is from my work, I can confirm this is happening to me right now on Node 18 with winston 3.11.0.

@DABH
Contributor

DABH commented Apr 3, 2024

I was under the impression this one was fixed… if there is any non-sensitive MWE that can be shared, that would be helpful, along with platform details about where you're experiencing this (OS, specific node version, etc.).
