Revert log level changes #3913
Merged 1 commit into master from bkchr-revert-logging on Mar 31, 2024

Conversation

@bkchr (Member) commented Mar 30, 2024

Closes: #3906

@bkchr bkchr added the R0-silent Changes should not be mentioned in any release notes label Mar 30, 2024
@bkchr bkchr requested a review from alexggh March 30, 2024 20:20
@bkchr bkchr added this pull request to the merge queue Mar 31, 2024
Merged via the queue into master with commit 256d5ae Mar 31, 2024
137 of 139 checks passed
@bkchr bkchr deleted the bkchr-revert-logging branch March 31, 2024 21:08
@alexggh (Contributor) commented Apr 1, 2024

Did this cause any problems, @bkchr?

If not, why wouldn't we want node operators to understand whether their validators are properly connected or not?

I guess we kind of disagree, here and on #3677, about whether this information is relevant for nodes.

> Debug output is not to be printed using info logs. If people come to you complaining, you should either tell them which CLI flags to change or, for example, come up with a dedicated RPC endpoint that generates debug information.

Let me try to see if I can change your mind :). From my perspective, part of operating a system with minimal downtime is the ability to understand what happened in production after it happened. Metrics are part of the story, but we also need the default log level to give us enough information to pinpoint problems, or at least to know for sure that certain things are going correctly.

A lot of our bug reports include just the Import #<block_number> log; it is really hard to figure out from that alone what happened, and for certain categories of bugs you can't go and tell people to enable the DEBUG level, because they don't reproduce systematically or we don't know how to reproduce them.

Hence, I was thinking we should actually go in the opposite direction and rethink a bit the information our INFO level prints, so that we make it easier to understand what is going on.

I'm aware that too much logging might cause other problems, but I don't think that was the issue with this line, because it is only output every 10 minutes.

@bkchr (Member, Author) commented Apr 1, 2024

I get what you want to achieve. However, printing all the authorities that you are not connected to is debug information. If you go ahead and just print the connectivity to all authorities as a percentage every 10 minutes, that is fine.

Having bad connectivity is almost never a result of what you are doing locally, unless you are having these issues because you regenerated your node key on every restart and authority discovery then didn't tell the others about your new key. However, that is a bug and nothing the operator can change. They also could not change anything for these missing authorities, as they are not required to open ports in their firewall for each outgoing connection. Yes, they could have configured their server in this way, but then they have already failed there.

All in all, printing all this information versus just printing the connectivity every 10 minutes as a percentage doesn't really make any difference.
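A minimal sketch of the split suggested above, assuming the Rust `log` crate; the `sub-authority-discovery` target name and the `report_connectivity` helper are illustrative, not the actual implementation. The aggregate percentage goes to INFO, while the per-authority detail stays at DEBUG:

```rust
use log::{debug, info};

// Assumed log target; the real authority-discovery target may differ.
const LOG_TARGET: &str = "sub-authority-discovery";

/// Hypothetical periodic connectivity report (e.g. run every 10 minutes):
/// a compact percentage at INFO, the full list of unreachable authorities at DEBUG.
fn report_connectivity(connected: usize, total: usize, unreachable: &[String]) {
    let percentage = if total == 0 {
        100.0
    } else {
        connected as f64 * 100.0 / total as f64
    };

    info!(
        target: LOG_TARGET,
        "Connected to {}/{} authorities ({:.0}%).",
        connected, total, percentage
    );

    debug!(
        target: LOG_TARGET,
        "Unreachable authorities: {:?}",
        unreachable
    );
}
```

An operator who actually needs the per-authority detail could then raise the level for just that target, e.g. via a Substrate node's `--log <target>=<level>` filter, without the default INFO output listing every unreachable authority.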

@alexggh (Contributor) commented Apr 1, 2024

> All in all, printing all this information versus just printing the connectivity every 10 minutes as a percentage doesn't really make any difference.

It doesn't make a difference for that particular node, but it does for the whole network and for anyone trying to find out which nodes are not properly connected, since those would show up repeatedly in the logs of all the others.

Anyway, this doesn't matter much here because we figured out the issue; the next one will probably be in another part of the system. That's why I think we may need to rethink our rules for what we output at INFO, since currently there isn't much information in there, or maybe recommend that people run with DEBUG, but that is probably dangerous and costly because it would fill their disks with logs.

@sandreim (Contributor) commented Apr 1, 2024

@bkchr @alexggh you both raise valid points, but I think this is what we actually need: #816

@alexggh (Contributor) commented Apr 1, 2024

> ... you both raise valid points, but I think this is what we actually need: #816

Yes, that would be golden :D.

@bkchr (Member, Author) commented Apr 1, 2024

> It doesn't make a difference for that particular node, but it does for the whole network and for anyone trying to find out which nodes are not properly connected, since those would show up repeatedly in the logs of all the others.

And then? You see some IDs and don't know who they are. Even if you resolve them to some validator names, it still doesn't solve any problem. You could just write a dedicated tool that queries the DHT and then tries to connect to all validators.

pgherveou pushed a commit that referenced this pull request Apr 2, 2024
Ank4n pushed a commit that referenced this pull request Apr 9, 2024
dharjeezy pushed a commit to dharjeezy/polkadot-sdk that referenced this pull request Apr 9, 2024
Labels
R0-silent Changes should not be mentioned in any release notes
Development

Successfully merging this pull request may close these issues.

Regular connectivity reports make log reading more difficult
5 participants