Cymru whois expert: handle strange ASN symbols better #2132

Closed
monoidic opened this issue Dec 29, 2021 · 1 comment · Fixed by #2352
Labels
bug Indicates an unexpected problem or unintended behavior component: bots help wanted Indicates that a maintainer wants help on an issue or pull request

Comments

@monoidic
Contributor

I discovered a slight issue with the Cymru whois expert bot with some strange AS names. For instance, AS266522.

$ dig +short TXT AS266522.asn.cymru.com         
"266522 | BR | lacnic | 2017-01-20 | Realnet Telecomunicacoes e Multimidia ltda\226\128\143 , BR"

\226\128\143 is how dig prints the bytes E2 80 8F (each byte as a decimal escape), which is the UTF-8 encoding of U+200F, the right-to-left mark. When the IMQ bot attempts to decode it with result['as_name'] = items[4].encode('latin1').decode('utf8'), it fails with this error:
UnicodeEncodeError: 'latin-1' codec can't encode character '\u200f' in position 42: ordinal not in range(256)
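
The failure can be reproduced without a DNS lookup. This is a minimal sketch, assuming the bot receives the AS name as a Python str that already contains U+200F; the variable names are illustrative and not taken from the bot's source.

```python
# dig shows non-ASCII bytes as decimal escapes: \226\128\143 -> E2 80 8F
raw_mark = bytes([226, 128, 143])
mark = raw_mark.decode('utf8')  # the right-to-left mark, U+200F

# Assumed input: the AS name as a str with the mark already decoded.
as_name = 'Realnet Telecomunicacoes e Multimidia ltda' + mark + ' , BR'

error = None
try:
    # The same round trip the bot performs on items[4].
    as_name.encode('latin1').decode('utf8')
except UnicodeEncodeError as exc:
    # latin-1 only covers code points 0..255, so U+200F cannot be encoded.
    error = exc

print(repr(mark), error)
```

The exception comes from the encode step, not the decode step, which matters when choosing where to relax the error handling.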

A solution which just permits decoding special symbols like RTL marks might not be the best, though just discarding these events into the dump file is also not a good option.

Various online tools for looking up public data on ASes, found via a quick search, give varying results.

  • ipinfo.io displays it in 3 different ways ("injected" RTL, †and aâ��)
  • qrator displays it as &#8207
  • bigdatacloud displays it as 8207
  • bgpview displays it as Â�Â�
  • etc.

I suppose there's no existing expectation of invalid UTF-8 sequences or special symbols like RTL marks in AS names, nor any "standard" way of handling them, and I feel iffy about passing arbitrary symbols like that on to other systems.

Are there any ideas for how to handle cases like this? The simplest option would be to pass the undecoded string on if the decoding step fails.
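
The fallback described above could look roughly like this. It is a sketch only: decode_as_name is a hypothetical helper, not an existing function in the bot, and it assumes the input is a str.

```python
def decode_as_name(value: str) -> str:
    """Hypothetical helper: try the latin1 -> utf8 round trip and
    return the input unchanged if either step fails."""
    try:
        return value.encode('latin1').decode('utf8')
    except UnicodeError:
        # Covers both UnicodeEncodeError and UnicodeDecodeError.
        return value


# A name that cannot be encoded as latin-1 is passed through as-is.
unchanged = decode_as_name('ltda\u200f , BR')

# A name holding UTF-8 bytes misread as latin-1 is still repaired.
repaired = decode_as_name('Telef\u00c3\u00b3nica')
```

With this approach the RTL mark stays in the output, so the concern about passing arbitrary symbols on to other systems still applies.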

@sebix
Member

sebix commented Dec 30, 2021

I had some fun with the encoding issues in AS names in the past already (fff0b6e#diff-016f562585e3b1dd9bbc46a808574681f44f9f1f522890f9a8740182ea61e81f 3f0983d#diff-016f562585e3b1dd9bbc46a808574681f44f9f1f522890f9a8740182ea61e81f cb4948b#diff-016f562585e3b1dd9bbc46a808574681f44f9f1f522890f9a8740182ea61e81f #307) and I would now be in favour of doing .decode(errors='ignore') to decode everything we can and not fail with the other bogus data like this example.
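
A rough sketch of the errors='ignore' idea follows. Since the reported exception is raised by the encode step, this sketch applies errors='ignore' to both encode and decode. That choice is an assumption here, not necessarily what the eventual fix does, and lenient_as_name is a hypothetical name.

```python
def lenient_as_name(value: str) -> str:
    """Hypothetical helper: keep everything that survives the
    latin1 -> utf8 round trip and silently drop the rest."""
    as_bytes = value.encode('latin1', errors='ignore')  # drops U+200F etc.
    return as_bytes.decode('utf8', errors='ignore')     # drops invalid UTF-8


# The right-to-left mark is discarded, the rest of the name is kept.
cleaned = lenient_as_name('ltda\u200f , BR')

# A stray latin-1 byte that is not valid UTF-8 is discarded as well.
truncated = lenient_as_name('abc\u00e9')
```

Unlike the pass-through fallback, this never raises and never forwards the mark, at the cost of silently losing characters.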

@sebix sebix added bug Indicates an unexpected problem or unintended behavior component: bots labels Dec 30, 2021
@sebix sebix added the help wanted Indicates that a maintainer wants help on an issue or pull request label Feb 3, 2022