feature request: Warn when reading BOM text with headers option #301
Comments
I can understand the motivation but I'm not sure that this is a proper approach. How about enabling BOM detection in |
Do you mean the BOM is removed automatically without specifying an encoding?
# like this
csv = CSV.open('with-bom.csv', headers: true)
csv[0]['id'] #=> 1
It would be much better than a warning! |
Yes. How about this?
diff --git a/lib/csv.rb b/lib/csv.rb
index b016b8f..b5a4fe6 100644
--- a/lib/csv.rb
+++ b/lib/csv.rb
@@ -1581,7 +1581,14 @@ class CSV
def open(filename, mode="r", **options)
# wrap a File opened with the remaining +args+ with no newline
# decorator
- file_opts = options.dup
+ file_opts = {}
+ have_encoding_options = (options.key?(:encoding) or
+ options.key?(:external_encoding) or
+ mode.include?(":"))
+ if not have_encoding_options and Encoding.default_external == Encoding::UTF_8
+ file_opts[:encoding] = "bom|utf-8"
+ end
+ file_opts.merge!(options)
unless file_opts.key?(:newline)
file_opts[:universal_newline] ||= false
end |
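The decision made in the diff above can be tried outside the library with a minimal sketch. The helper name `bom_aware_file_opts` is hypothetical and is not part of the csv gem; it only mirrors the option handling proposed in the patch: BOM detection is enabled when the caller gave no encoding and the default external encoding is UTF-8, and explicit options always win.

```ruby
# Hypothetical helper (not part of the csv gem): mirrors the option
# handling proposed in the diff above.
def bom_aware_file_opts(mode = "r", **options)
  file_opts = {}
  # An encoding can arrive as :encoding, :external_encoding, or inside
  # the mode string (for example "r:utf-8").
  have_encoding_options = options.key?(:encoding) ||
                          options.key?(:external_encoding) ||
                          mode.include?(":")
  # Only turn on BOM detection when nothing was requested explicitly.
  if !have_encoding_options && Encoding.default_external == Encoding::UTF_8
    file_opts[:encoding] = "bom|utf-8"
  end
  # Caller-supplied options are merged last so they take precedence.
  file_opts.merge(options)
end

p bom_aware_file_opts                              # depends on Encoding.default_external
p bom_aware_file_opts("r", encoding: "Shift_JIS")  # explicit encoding is kept as given
p bom_aware_file_opts("r:utf-8")                   # encoding in the mode string, nothing added
```

Merging the caller's options last keeps the change backward compatible for anyone who already passes an encoding.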
Great! That works fine. Is it possible to apply to
require './lib/csv'
text = "\u{feff}id,name\n1,Alice"
File.write('tmp/with-bom.csv', text)
csv = CSV.open('tmp/with-bom.csv', headers: true).read
p csv[0]['id']
p csv[0]['name']
csv = CSV.read('tmp/with-bom.csv', headers: true)
p csv[0]['id']
p csv[0]['name']
CSV.foreach('tmp/with-bom.csv', headers: true) do |row|
  # use the yielded row, not the table read earlier
  p row['id']
  p row['name']
end
csv = CSV.parse(File.read('tmp/with-bom.csv'), headers: true)
p csv[0]['id']
p csv[0]['name']
|
No. It's the user's responsibility to handle the BOM before calling Why do you want to use |
I used
require 'csv'
class MyClass
def do_something
CSV.parse(my_csv_text, headers: true).map do |row|
row['name']
end
end
# extract to a method to make it easier to stub
def my_csv_text
File.read('my_file.csv', encoding: 'bom|utf-8')
end
end
RSpec.describe MyClass do
let(:my_csv_text) do
<<~CSV
name,age
Alice,20
Bob,30
CSV
end
it 'does something' do
my_class = MyClass.new
allow(my_class).to receive(:my_csv_text).and_return(my_csv_text)
expect(my_class.do_something).to eq(['Alice', 'Bob'])
end
end |
The example works well because it uses In this case, both of the current |
Yes, but my first implementation was like this:
require 'csv'
class MyClass
def do_something
CSV.parse(my_csv_text, headers: true).map do |row|
row['name']
end
end
# extract to a method to make it easier to stub
def my_csv_text
# my first implementation (it didn't work)
File.read('my_file.csv')
end
end
RSpec.describe MyClass do
let(:my_csv_text) do
<<~CSV
name,age
Alice,20
Bob,30
CSV
end
it 'does something' do
my_class = MyClass.new
allow(my_class).to receive(:my_csv_text).and_return(my_csv_text)
expect(my_class.do_something).to eq(['Alice', 'Bob'])
end
end
So I was wondering why I couldn't get values by column name. |
I think that it's your program's problem. (In this case, I think that it's better that you don't use a stub for better testing.) |
OK, I understand your idea. But as a library user, it's hard to notice the behavior difference between |
Do you have a suggested documentation change? |
How about this?
Please check whether your text contains a BOM.
# remove the BOM when calling File.open
csv_table = File.open(path, encoding: 'bom|utf-8') do |file|
CSV.parse(file, headers: true)
end |
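For text that is already in memory as a String (the stubbed `my_csv_text` case discussed earlier), the BOM has to be removed by the caller before parsing. A minimal sketch, assuming the text is a UTF-8 string; `strip_bom` is a hypothetical helper name, not a csv API:

```ruby
require "csv"

# Hypothetical helper: drop a leading UTF-8 BOM (U+FEFF) if present.
# Assumes +text+ is a UTF-8 encoded String.
def strip_bom(text)
  text.delete_prefix("\u{feff}")
end

text = "\u{feff}id,name\n1,Alice\n"
table = CSV.parse(strip_bom(text), headers: true)

p table.headers   #=> ["id", "name"]
p table[0]["id"]  #=> "1"
```

Text without a BOM passes through `strip_bom` unchanged, so the same call works for both real files and test fixtures.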
It looks good to me. Could you open a PR that adds it to the Lines 1620 to 1731 in 73b877d
|
Thank you for your review. I created PR here: #305 |
Requested in #301 (comment) ![Screenshot 2024-05-17 at 10 48 23](https://github.com/ruby/csv/assets/1148320/ceeb5d01-c9b8-45c1-8f4a-64911f0a41bc)
I've applied #301 (comment) . |
Thank you! 😄 |
When a BOM exists in CSV text, you cannot read the first column by field name:
I understand there are some workarounds for this problem:
- Include the BOM in the field name (csv[0]["\u{feff}id"])
- Specify bom|utf-8 for CSV.foreach, CSV.read, CSV.open, etc.
But the biggest problem is that it's too hard to notice the existence of the BOM. Actually, when I came across this issue, I had no idea why I failed to get the id value. I guess most people never want to keep the BOM in a field name.
So I'd be happy if CSV.parse or some other reading methods would warn when the input text contains a BOM and headers: true is specified, like this:
By the way, why do we put a BOM so often? This is because Excel cannot open UTF-8 CSV correctly without a BOM!
Save CSV file in UTF-8 with BOM. Fix Korean text from being corrupted… | by Hyunbin | Medium
Related issue: #43
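The behavior described in this issue can be reproduced with a short sketch. It assumes a UTF-8 environment and uses a temporary file in place of a real CSV file; the sample data is made up for illustration.

```ruby
require "csv"
require "tempfile"

text = "\u{feff}id,name\n1,Alice\n"

# Parsing the raw string keeps the BOM as part of the first header.
table = CSV.parse(text, headers: true)
p table[0]["id"]            #=> nil
p table[0]["\u{feff}id"]    #=> "1"
p table[0]["name"]          #=> "Alice"

# Reading from a file with encoding "bom|utf-8" strips the BOM first.
Tempfile.create(["with-bom", ".csv"]) do |file|
  file.write(text)
  file.flush
  from_file = CSV.read(file.path, headers: true, encoding: "bom|utf-8")
  p from_file[0]["id"]      #=> "1"
end
```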