Always read data/list.txt as UTF-8 to avoid "ArgumentError: invalid byte sequence in US-ASCII" when parsing it #118

Open
dentarg opened this issue Sep 19, 2016 · 13 comments
@dentarg
Contributor

dentarg commented Sep 19, 2016

If your environment does not specify UTF-8, Ruby defaults to US-ASCII, and when public_suffix tries to parse the list data, it fails:

$ LANG= LANGUAGE= LC_ALL= LC_CTYPE= irb
irb(main):001:0> require 'public_suffix' ; list_data = File.read(PublicSuffix::List::DEFAULT_LIST_PATH) ; PublicSuffix::List.parse(list_data, private_domains: false) ; nil
ArgumentError: invalid byte sequence in US-ASCII
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:89:in `strip!'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:89:in `block (2 levels) in parse'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:88:in `each_line'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:88:in `block in parse'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:128:in `initialize'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:87:in `new'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:87:in `parse'
    from (irb):1
    from /Users/dentarg/.rubies/ruby-2.2.5/bin/irb:11:in `<main>'
irb(main):002:0> Encoding.default_external
=> #<Encoding:US-ASCII>
irb(main):003:0> RUBY_VERSION
=> "2.2.5"
irb(main):004:0>

Passing encoding: Encoding::UTF_8 to File.read makes it work, even if the default encoding isn't UTF-8:

$ LANG= LANGUAGE= LC_ALL= LC_CTYPE= irb
irb(main):001:0> require 'public_suffix' ; list_data = File.read(PublicSuffix::List::DEFAULT_LIST_PATH, encoding: Encoding::UTF_8) ; PublicSuffix::List.parse(list_data, private_domains: false) ; nil
=> nil
irb(main):002:0> RUBY_VERSION
=> "2.2.5"
irb(main):003:0> Encoding.default_external
=> #<Encoding:US-ASCII>

Related to #94 (maybe the list data has changed since?)

@weppos
Owner

weppos commented Oct 15, 2016

Thanks @dentarg, I'll investigate. Are you able to tell me which line in the definition file is causing the issue?

@weppos self-assigned this Oct 15, 2016
@weppos added the bug label Oct 15, 2016
@dentarg
Contributor Author

dentarg commented Oct 16, 2016

@weppos I hope this helps (I'm in a hurry now, so I haven't checked this too closely):

$ LANG= LANGUAGE= LC_ALL= LC_CTYPE= irb
irb(main):001:0> require 'public_suffix' ; list_data = File.read(PublicSuffix::List::DEFAULT_LIST_PATH) ; nil
=> nil
irb(main):002:0> list_data.class
=> String
irb(main):007:0> ctr = 0 ; outside_line = "" ; list_data.each_line { |line| ctr += 1 ; outside_line = line ; line.strip! } ; nil
ArgumentError: invalid byte sequence in US-ASCII
    from (irb):7:in `strip!'
    from (irb):7:in `block in irb_binding'
    from (irb):7:in `each_line'
    from (irb):7
    from /Users/dentarg/.rubies/ruby-2.2.5/bin/irb:11:in `<main>'
irb(main):008:0> ctr
=> 610
irb(main):009:0> outside_line
=> "\xE5\x85\xAC\xE5\x8F\xB8.cn\n"
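
To find the offending line without relying on an exception from `strip!`, one option is to check each line with `String#valid_encoding?`. The following is a minimal sketch, not code from public_suffix: the sample data is a hypothetical three-line stand-in for data/list.txt, and `force_encoding` is used to imitate what `File.read` returns when the default external encoding is US-ASCII.

```ruby
# Hypothetical stand-in for the list data: two ASCII rules followed by the
# UTF-8 encoded rule from the output above ("\xE5\x85\xAC\xE5\x8F\xB8.cn").
sample = "com\ncn\n\u516C\u53F8.cn\n"

# Tag the bytes as US-ASCII to imitate File.read under a non-UTF-8 locale.
# dup first so the original string keeps its UTF-8 tag.
ascii_tagged = sample.dup.force_encoding(Encoding::US_ASCII)

# Find the first line (1-based) whose bytes are invalid in its tagged encoding.
bad_line, bad_number = ascii_tagged.each_line.with_index(1).find do |line, _number|
  !line.valid_encoding?
end

puts "first invalid line: #{bad_number}" # prints "first invalid line: 3"
puts "bytes: #{bad_line.bytes.map { |b| format('%02X', b) }.join(' ')}"
```

Unlike the `strip!` loop, this check does not depend on which string methods happen to raise on invalid byte sequences in a given Ruby version.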

@dentarg
Contributor Author

dentarg commented Oct 16, 2016

This was with 2.0.3:

irb(main):010:0> PublicSuffix::List::DEFAULT_LIST_PATH
=> "/Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.3/lib/public_suffix/../../data/list.txt"

@dentarg
Contributor Author

dentarg commented Oct 16, 2016

Hmm... maybe I was naive to believe that everything would be fine with File.read and encoding: Encoding::UTF_8 just because it doesn't raise an exception. It seems that "网络.cn\n" is read as "\u7F51\u7EDC.cn\n". This is on OS X 10.11.6, Ruby 2.2.5, zsh 5.0.8, public_suffix-2.0.3. I don't think I fully understand all the LANG, LANGUAGE, and LC_* business.

$ LANG= LANGUAGE= LC_ALL= LC_CTYPE= irb
irb(main):001:0> require 'public_suffix'
=> true
irb(main):002:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH, encoding: Encoding::UTF_8).each_line.to_a[610]
=> "\u7F51\u7EDC.cn\n"
irb(main):003:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH, encoding: Encoding::UTF_8).each_line.to_a[610].strip!
=> "\u7F51\u7EDC.cn"
irb(main):004:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH).each_line.to_a[610]
=> "\xE7\xBD\x91\xE7\xBB\x9C.cn\n"
irb(main):005:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH).each_line.to_a[610].strip!
ArgumentError: invalid byte sequence in US-ASCII
    from (irb):5:in `strip!'
    from (irb):5
    from /Users/dentarg/.rubies/ruby-2.2.5/bin/irb:11:in `<main>'
irb(main):006:0> %w(LANG LANGUAGE LC_ALL LC_CTYPE).map { |v| ENV[v] }
=> ["", "", "", ""]
$ irb
irb(main):001:0> require 'public_suffix'
=> true
irb(main):002:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH).each_line.to_a[610]
=> "网络.cn\n"
irb(main):003:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH).each_line.to_a[610].strip!
=> "网络.cn"
irb(main):004:0> %w(LANG LANGUAGE LC_ALL LC_CTYPE).map { |v| ENV[v] }
=> ["en_US.UTF-8", "en_US.UTF-8", "en_US.UTF-8", "en_US.UTF-8"]

@weppos removed their assignment Mar 6, 2017
@tamoyal

tamoyal commented Sep 8, 2018

I'm having this problem with version 3.0.3

@SeanDunford

Bump. Is this project dead? Does anyone have a fork or alternate project where this is working? @weppos

@weppos
Owner

weppos commented Apr 4, 2019

> Bump. Is this project dead? Does anyone have a fork or alternate project where this is working? @weppos

It is not dead. If your operating environment sets the correct UTF-8 language value, the library will work perfectly.

@aleksandrs-ledovskis

FWIW, it would seem more correct for the gem not to depend on the environment setup (that is, to be agnostic to it) for normal operation.

@weppos
Owner

weppos commented Apr 4, 2019

@SeanDunford @aleksandrs-ledovskis feel free to provide a patch and I will review it. So far, the only one who provided practical help was @dentarg, but even he admitted the problem may not be that easy to solve.

Frankly, I am reluctant to put any effort into trying to make UTF-8 work because the real solution is to pre-process the list and have it stored in Punycode as this is how names should be managed and compared.

It's just not at the top of my priorities right now. PRs are always welcome.

@alexef

alexef commented Feb 5, 2021

This is still broken in 4.0.3 on the ruby:2.4-slim-buster Docker image.

A workaround is to set LANG=en_US.UTF-8 LANGUAGE=en_US.UTF-8 LC_ALL=en_US.UTF-8 before calling ruby.

@dentarg
Contributor Author

dentarg commented Feb 5, 2021

Looks like LANG=C.UTF-8 is enough; the Docker images for Ruby >= 2.5 set that:

$ docker run --rm ruby:2.4-slim-buster env
PATH=/usr/local/bundle/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=2ea0e1a03e36
RUBY_MAJOR=2.4
RUBY_VERSION=2.4.10
RUBY_DOWNLOAD_SHA256=d5668ed11544db034f70aec37d11e157538d639ed0d0a968e2f587191fc530df
RUBYGEMS_VERSION=3.0.3
GEM_HOME=/usr/local/bundle
BUNDLE_SILENCE_ROOT_WARNING=1
BUNDLE_APP_CONFIG=/usr/local/bundle
HOME=/root

vs

$ docker run --rm ruby:2.5-slim-buster env
PATH=/usr/local/bundle/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=7d11ed52a0af
LANG=C.UTF-8
RUBY_MAJOR=2.5
RUBY_VERSION=2.5.8
RUBY_DOWNLOAD_SHA256=0391b2ffad3133e274469f9953ebfd0c9f7c186238968cbdeeb0651aa02a4d6d
RUBYGEMS_VERSION=3.0.3
GEM_HOME=/usr/local/bundle
BUNDLE_SILENCE_ROOT_WARNING=1
BUNDLE_APP_CONFIG=/usr/local/bundle
HOME=/root

Running my initial example

# publicsuffix.rb
require 'bundler/inline'
gemfile do
  source 'https://rubygems.org'
  gem 'public_suffix'
end
puts RUBY_VERSION
puts PublicSuffix::List::DEFAULT_LIST_PATH
list_data = File.read(PublicSuffix::List::DEFAULT_LIST_PATH)
PublicSuffix::List.parse(list_data, private_domains: false)

In ruby:2.4-slim-buster

$ docker run --rm -it -v $(pwd):/app -w /app ruby:2.4-slim-buster bash
root@aa7eb67dce29:/app# gem install bundler
Fetching bundler-2.2.8.gem
Successfully installed bundler-2.2.8
1 gem installed
root@aa7eb67dce29:/app# ruby publicsuffix.rb
2.4.10
/usr/local/bundle/gems/public_suffix-4.0.6/data/list.txt
/usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:128:in `count': invalid byte sequence in US-ASCII (ArgumentError)
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:128:in `initialize'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:119:in `new'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:119:in `build'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:334:in `factory'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:94:in `block (2 levels) in parse'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:75:in `each_line'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:75:in `block in parse'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:108:in `initialize'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:74:in `new'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:74:in `parse'
	from publicsuffix.rb:9:in `<main>'
root@aa7eb67dce29:/app# LANG=C.UTF-8 ruby publicsuffix.rb
2.4.10
/usr/local/bundle/gems/public_suffix-4.0.6/data/list.txt

In ruby:2.5-slim-buster

$ docker run --rm -it -v $(pwd):/app -w /app ruby:2.5-slim-buster bash
root@b87a1b578bbf:/app# ruby publicsuffix.rb
2.5.8
/usr/local/bundle/gems/public_suffix-4.0.6/data/list.txt

The problematic code in public_suffix is PublicSuffix::List.default

# Gets the default rule list.
#
# Initializes a new {PublicSuffix::List} parsing the content
# of {PublicSuffix::List.default_list_content}, if required.
#
# @return [PublicSuffix::List]
def self.default(**options)
  @default ||= parse(File.read(DEFAULT_LIST_PATH), **options)
end
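
One possible direction, in line with the `File.read(..., encoding: Encoding::UTF_8)` call from the first comment, would be for the loader to request UTF-8 explicitly rather than inherit `Encoding.default_external`. The sketch below is not the gem's code or an accepted patch: `ListSketch`, `default_content`, and the temporary file are hypothetical stand-ins so the example runs without public_suffix installed.

```ruby
require 'tempfile'

# Hypothetical stand-in for PublicSuffix::List; only the file reading is shown.
module ListSketch
  # Reads the list content as UTF-8, regardless of the process-wide
  # Encoding.default_external, and memoizes it like `@default ||=` above.
  # Because of the memoization, later calls ignore the path argument.
  def self.default_content(path)
    @default_content ||= File.read(path, encoding: Encoding::UTF_8)
  end
end

# Hypothetical sample file containing one UTF-8 rule.
list_file = Tempfile.new('list')
list_file.close
File.binwrite(list_file.path, "\u7F51\u7EDC.cn\n")

content = ListSketch.default_content(list_file.path)
puts content.encoding         # UTF-8
puts content.valid_encoding?  # true

list_file.unlink
```

An explicit `encoding:` option takes precedence over the default external encoding for that one read, so the rest of the process is unaffected.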

$ docker run --rm -it ruby:2.4-slim-buster bash
root@31cd6631fcaa:/# gem install public_suffix
Fetching public_suffix-4.0.6.gem
Successfully installed public_suffix-4.0.6
1 gem installed
root@31cd6631fcaa:/# ruby -rpublic_suffix -e 'PublicSuffix::List.default'
/usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:128:in `count': invalid byte sequence in US-ASCII (ArgumentError)
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:128:in `initialize'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:119:in `new'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:119:in `build'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:334:in `factory'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:94:in `block (2 levels) in parse'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:75:in `each_line'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:75:in `block in parse'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:108:in `initialize'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:74:in `new'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:74:in `parse'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:51:in `default'
	from -e:1:in `<main>'
root@31cd6631fcaa:/# LANG=C.UTF-8 ruby -rpublic_suffix -e 'PublicSuffix::List.default'

@zavan

zavan commented Feb 19, 2021

I'm encountering an error that is probably related to this:

domain = PublicSuffix.domain(request.host)
Tenant.find_by!(domain: domain)

Raises:
ArgumentError (Cannot transliterate strings with ASCII-8BIT encoding)

Forcing UTF-8 works:

domain = PublicSuffix.domain(host).to_s.force_encoding('UTF-8')

Ruby: 3.0.0
Rails: 6.1.3
Gem: 4.0.6
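
The error message suggests the string reaching the transliteration step is tagged ASCII-8BIT (binary). As a rough sketch of what `force_encoding('UTF-8')` does in that situation, using a plain binary string in place of the value returned by `PublicSuffix.domain` (an assumption; the gem is not involved here):

```ruby
# Hypothetical stand-in for a host string that arrived tagged as binary.
binary_host = 'example.com'.b
puts binary_host.encoding == Encoding::BINARY  # true

# force_encoding only relabels the bytes; it does not convert them.
# dup first because force_encoding modifies its receiver.
utf8_host = binary_host.dup.force_encoding('UTF-8')
puts utf8_host.encoding                    # UTF-8
puts utf8_host.bytes == binary_host.bytes  # true

# Relabelling is only safe if the bytes really are valid UTF-8.
puts utf8_host.valid_encoding?             # true
```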

@mcarpenter

Two workarounds below.

  1. Set the encoding using the Ruby interpreter's -E flag:
ruby -E utf-8 ./foo.rb
  2. Set the external encoding programmatically:
require 'public_suffix'

Encoding.default_external = 'utf-8'
puts PublicSuffix.parse('example.com').inspect
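
To see why the second workaround has an effect, the sketch below changes `Encoding.default_external` inside one process and reads the same file before and after. The temporary file and variable names are hypothetical, and the US-ASCII assignment only simulates an environment without a UTF-8 locale.

```ruby
require 'tempfile'

# Hypothetical sample file containing one UTF-8 rule.
sample_file = Tempfile.new('list')
sample_file.close
File.binwrite(sample_file.path, "\u516C\u53F8.cn\n")

original_external = Encoding.default_external
begin
  # Simulate a process started without a UTF-8 locale.
  Encoding.default_external = Encoding::US_ASCII
  before = File.read(sample_file.path)

  # Workaround 2: set the external encoding before reading.
  Encoding.default_external = Encoding::UTF_8
  after = File.read(sample_file.path)
ensure
  # Restore the process-wide setting so the sketch has no lasting effect.
  Encoding.default_external = original_external
  sample_file.unlink
end

puts "#{before.encoding}: valid=#{before.valid_encoding?}"  # US-ASCII: valid=false
puts "#{after.encoding}: valid=#{after.valid_encoding?}"    # UTF-8: valid=true
```

Note that `Encoding.default_external` is a process-wide setting, so changing it also affects every other read in the same program that does not pass an explicit encoding.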


8 participants