Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

errors in validation log , and session restarting every 120 seconds #52

Closed
AlexIT-FT opened this issue Aug 10, 2021 · 7 comments
Closed

Comments

@AlexIT-FT
Copy link

AlexIT-FT commented Aug 10, 2021

OS Debian 10.10
Linux rpki 4.19.0-17-amd64 #1 SMP Debian 4.19.194-3 (2021-07-18) x86_64 GNU/Linux
fort 1.5.1

config.json attached
config.json.txt

Every two minutes router sessions are restarted (max uptime is two minutes).
The first line is a fort-1.5.1 at about two minutes.
The second line is a fort-1.4.2 whose session is not restarted.


show validation session
me@mx>Session State Flaps Uptime #IPv4/IPv6 records
172.20.21.33 Up 55 00:01:58 227887/43730
172.30.37.33 Up 3 06:41:53 227887/43730

show validation session
me@mx>Session State Flaps Uptime #IPv4/IPv6 records
172.20.21.33 Up 55 00:01:59 227887/43730
172.30.37.33 Up 3 06:41:54 227887/43730

show validation session
me@mx>Session State Flaps Uptime #IPv4/IPv6 records
172.20.21.33 Up 56 00:00:00 227887/43730
172.30.37.33 Up 3 06:41:55 227887/43730

me@mx>show validation session
Session State Flaps Uptime #IPv4/IPv6 records
172.20.21.33 Up 56 00:00:01 227887/43730
172.30.37.33 Up 3 06:41:56 227887/43730


I had to set http transfer-timeout to 120, to avoid lots of timeout error in validation-log (not too bad)
Still have lots of errors in validation-log , in different flavors (at startup, and at each refresh)

fort.log attached
fort.log

@ydahhrk
Copy link
Member

ydahhrk commented Aug 10, 2021

You sure it isn't just a connection issue? It looks like a flapping link. (In fact, that's what the router output says.)

My environment is not reproducing it.

@AlexIT-FT
Copy link
Author

No it was not a connection issue. After some times of restarting every 2 minutes (about 20 times) it remained connected. Now it says
172.20.21.33 Up 98 1d 10:07:20 227785/43686

What about validation errors?

  • serial numbers not unique
  • certificate validation failes
  • could not open file
  • crl is stale/expired
  • expected vcard property 'VERSION:4.0', got 'VERSION:3.0'
    and others ...

Is it normal?

@ydahhrk
Copy link
Member

ydahhrk commented Aug 11, 2021

serial numbers not unique
expected vcard property 'VERSION:4.0', got 'VERSION:3.0'
certificate validation failes

I've seen these in the wild. It just means some RPKI objects are simply invalid.

For example, RFC 6493 clearly states that vCard version 3 is not valid, as far as the RPKI is concerned: https://datatracker.ietf.org/doc/html/rfc6493#section-5

I suppose it's possible Fort might be somehow misblaming these faults, in which case it would be helpful to analyze the objects, and see if they actually break the rules or not.

could not open file
crl is stale/expired

These suggest that the cache is not updating properly. This is why I suspect there might be some connection issue.


How about this:

Can you compress your cache directory, and somehow send it to me? If something is messing up Fort, I think the cache has a pretty high chance of being the culprit.

Then try stopping Fort, deleting the cache, then start Fort again. Fort should regenerate the cache from scratch. What happens then?

@AlexIT-FT
Copy link
Author

Here's a tgz of all FORT's stuff (I'll delete it in 24h)
https://drive.google.com/file/d/19XW_Falfbw-ylQbbj5uoARWTiTdnBtBk/view?usp=sharing

I also do this to clear cache :

  systemctl stop fort
  rm -fR /var/lib/fort/*
  systemctl start fort

but nothing changed , same errors in the syslog
and again , every 2 minutes session flaps and uptime restarts for a few times (three or four), and then it remains stable.

@ydahhrk
Copy link
Member

ydahhrk commented Aug 20, 2021

Ok, I think there's a misunderstanding here.

Fort is an RPKI Validator. The point isn't just to send ROAs to the routers; the whole reason why you want an RPKI validator is because you want someone to also verify said ROAs, before they are sent to the routers. This involves a large laundry list of checks.

The reason why Fort has two different types of logs (Operation logs and Validation logs) is because they're intended for different audiences. Operation logs are meant for the operator running Fort. Validation logs are meant for the people maintaining the RPKI repositories. I believe you are the former, but not the latter.

When Fort dumps an error or warning in the validation logs, all it means is that it had to drop some ROAs or Router Keys, because the relevant RPKI objects weren't up to snuff. In some cases, you might argue that Fort is being too strict, and that's why we sometimes create incidences.

But chances are, you don't care about that. If a ROA doesn't validate, it just doesn't validate. It needs to be thrown away. You don't want your routers to be making decisions off shit ROAs.

The way I see it, it does look like the validation errors you're getting are perfectly normal. Here are some examples I found:

Aug 12 07:47:46 rpki fort: ERR [Validation]: rsync://rpki-repo.registro.br/repo/7WDaXm36DeANkw22Zr4WTv56qNrwmMio6eUkBMifUm7b/0/D824C23121EFD846C683EDEED166C5C40784AE04.mft: CRL is stale/expired

This manifest's CRL is CD3EC2EB/rpki-repo.registro.br/repo/nicbr_repo/0/D824C23121EFD846C683EDEED166C5C40784AE04.crl, which, in the repository you posted, has the following valid dates:

    Last Update: Aug 11 05:39:00 2021 GMT
    Next Update: Aug 12 05:44:00 2021 GMT

(One way of querying a CRL's valid dates is openssl crl -inform DER -text -in /path/to/crl/file.)

It seems the validation is correct; according to the log line, the CRL expired two hours prior to the time at which the log states the object validation was performed.

(Just in case: You sure the node hosting Fort has its date set up properly? Maybe check the timezone too.)

Aug 12 07:47:53 rpki fort: ERR [Validation]: rsync://rpki.ripe.net/repository/DEFAULT/IBfMWA0nPFS6MGTNLNavObgEuIc.cer: Error requesting URL https://chloe.sobornost.net/rpki/news-public.xml (received HTTP code 404): The requested URL returned error: 404 Not Found

It's not a lie. My browser does return 404 on https://chloe.sobornost.net/rpki/news-public.xml.

IBfMWA0nPFS6MGTNLNavObgEuIc.cer does point to news-public.xml in its Subject Information Access Extension. (openssl x509 -inform DER -text -in '/path/to/IBfMWA0nPFS6MGTNLNavObgEuIc.cer'.) But it doesn't exist, so Fort can't really do anything about it.

Aug 12 07:48:11 rpki fort: ERR [Validation]: rsync://rpkica.mckay.com/rpki/MCnet/UEh2SAvdIgPsUFdv92RSSaNqBnY.mft: Could not open file '/var/lib/fort/246EB46A/rpkica.mckay.com/rpki/MCnet/UEh2SAvdIgPsUFdv92RSSaNqBnY.mft': No such file or directory

Also seems to be correct. It doesn't exist:

$ rsync rsync://rpkica.mckay.com/rpki/MCnet/UEh2SAvdIgPsUFdv92RSSaNqBnY.mft potato.mft
rsync: link_stat "/MCnet/UEh2SAvdIgPsUFdv92RSSaNqBnY.mft" (in rpki) failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1677) [Receiver=3.1.3]

UEh2SAvdIgPsUFdv92RSSaNqBnY.mft is pointed by certificate 246EB46A/rpki.ripe.net/repository/DEFAULT/UEh2SAvdIgPsUFdv92RSSaNqBnY.cer, which does exist.


I could go on and explain why I think the remaining error messages are probably fine, but you get the idea.

Because it seems like you don't care about the validation logs, feel free to remove them. That's why --validation-log.enabled exists.

On the other hand, if you think some of these errors should be turned into incidences, let's treat that as separate bugs.


Ok, so that's that problem out of the way. Let's now think about the router link flapping. I still can't reproduce it, so would you be willing to run a custom version of Fort, which I'll set up to print loads of hopefully relevant information?

@AlexIT-FT
Copy link
Author

Ok, just some errors can be ignored ... maybe they could be warning. So anyway it's working well. Thanks for the detailed explanation.
For the 120second restart ... it's not a problem for me, but it happens twice, every time I restart the service.
Maybe it's not an error too, you can see it here https://youtu.be/ps0xpBNf-c4

@ydahhrk
Copy link
Member

ydahhrk commented Aug 20, 2021

Hypothesis: Fort hasn't finished the first validation during those first two attempts. So those "flaps" are actually retries.

When Fort starts, it doesn't have any validation results to report, because it hasn't finished the first validation cycle.
Therefore, when the RTR client requests the data, Fort returns a message called No Data Available. Presumably, the router is set up to retry after 120 seconds. This happens twice, because that's what it takes Fort to download and validate the trees for the first time.

After the third retry, Fort is returning the validation data, so the link stays up.

And the reason why I can't replicate it is because I'm using a different client, which behaves differently.

Makes sense?

Fort drops the following messages in the operation logs before and after the first validation is completed:

First validation cycle has begun, wait until the next notification to connect your router(s)
First validation cycle successfully ended, now you can connect your router(s)

Do these line up with your router flaps?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants