Undefined result when parsing tsv containing double quotes inside a field #1002

Open
alchemyst opened this issue Apr 30, 2022 · 1 comment

I came across this issue when I tried to analyse the IMDB dataset available here. I was seeing #undef in my dataframe after using CSV.jl to read it.

I have narrowed the problem down from the original 8 million lines and 9 columns to this 500-line, one-column file, which triggers the issue. The file is produced from the original IMDB file by the following command, which shows exactly which lines and fields of the original were used.

< title.basics.tsv | sed -n -e 1p -e 32035,32537p | cut -d $'\t' -f 3 > test.tsv

Deleting any line from this file causes the call to CSV.File("test.tsv") to fail with ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type String. With the file as it is, the call succeeds, but the last row contains an undefined value.

The code required to trigger this problem:

using CSV

titles = CSV.File("test.tsv");

titles[end] 

This results in

CSV.Row: Error showing value of type CSV.Row:
ERROR: UndefRefError: access to undefined reference

Full stacktrace shown at the end of this post

I've included the ; in case you want to run this in the REPL. This shows the actual read succeeds. Of course, the last line is previewed in the REPL and it also triggers the error.

I noticed that the line in question starts with a double quote (and is the first one which does that in this file), which led me to work around this issue by passing quoted=false to CSV.File which allowed me to read the file correctly.
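To locate such lines before parsing, a small scan using only Base Julia can help. This is a minimal sketch, not part of CSV.jl: the helper name `lines_with_leading_quote` is hypothetical, and the sample text is invented for illustration rather than taken from the IMDB file.

```julia
# Hypothetical helper: return the line numbers on which at least one
# delimiter-separated field starts with a double quote. A CSV parser
# may treat such a field as the opening of a quoted field.
function lines_with_leading_quote(io::IO; delim::Char = '\t')
    hits = Int[]
    for (n, line) in enumerate(eachline(io))
        if any(startswith(field, "\"") for field in split(line, delim))
            push!(hits, n)
        end
    end
    return hits
end

# Invented sample: a header line followed by three one-column rows.
sample = IOBuffer("primaryTitle\nCarmen\n\"Rome\" at Eleven\nZorro\n")
hits = lines_with_leading_quote(sample)
println(hits)
```

For a file on disk, `open(lines_with_leading_quote, "test.tsv")` runs the same scan. If it reports any lines, reading with `quoted=false` as described above is the workaround that worked here.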

This feels like a parse error to me and I think it should be reported as such while reading the file instead of silently succeeding and passing through undefined values. This is especially problematic because if you pass this through to DataFrame, you don't get any sense that there is something wrong until you try to do something with those particular rows.
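Until the parser reports this itself, the undefined entries can be detected up front with `isassigned` from Base. The sketch below is an assumption-laden illustration: `undefined_indices` is a hypothetical helper name, and the demonstration uses a plain `Vector{String}` with one slot left unassigned instead of a real CSV.jl column.

```julia
# Hypothetical helper: return the indices of a vector whose elements
# are unassigned (shown as #undef), without touching the elements.
function undefined_indices(col::AbstractVector)
    return [i for i in eachindex(col) if !isassigned(col, i)]
end

# Demonstration: allocate three String slots and assign only two.
col = Vector{String}(undef, 3)
col[1] = "a"
col[2] = "b"
bad = undefined_indices(col)
println(bad)
```

Assuming the column read from the file is a `Vector{String}`, as the stack trace below suggests, the same check could be applied to it (for example via `Tables.getcolumn(titles, 1)`) before handing the data to DataFrame.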

Weirdly, when I tried to read the .gz that I had to create for this upload directly with CSV.File("test.tsv.gz"), I saw lots of warnings, but these do not appear when reading the tsv itself.

Versions:

  • Julia 1.7.2 on macOS Monterey 12.3.1, installed via Homebrew
  • CSV.jl v0.10.4

Stacktrace promised earlier:

julia> titles[end]
CSV.Row: Error showing value of type CSV.Row:
ERROR: UndefRefError: access to undefined reference
Stacktrace:
  [1] getindex(A::Vector{String}, i1::Int64)
    @ Base ./array.jl:861
  [2] getcolumn
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:24 [inlined]
  [3] (::Tables.var"#1#2"{CSV.Row})(nm::Symbol)
    @ Tables ./none:0
  [4] iterate
    @ ./generator.jl:47 [inlined]
  [5] collect(itr::Base.Generator{Vector{Symbol}, Tables.var"#1#2"{CSV.Row}})
    @ Base ./array.jl:724
  [6] _totuple
    @ ./tuple.jl:349 [inlined]
  [7] Tuple
    @ ./tuple.jl:317 [inlined]
  [8] NamedTuple(r::CSV.Row)
    @ Tables ~/.julia/packages/Tables/PxO1m/src/Tables.jl:195
  [9] show(io::IOContext{Base.TTY}, x::CSV.Row)
    @ Tables ~/.julia/packages/Tables/PxO1m/src/Tables.jl:201
 [10] show(io::IOContext{Base.TTY}, #unused#::MIME{Symbol("text/plain")}, x::CSV.Row)
    @ Base.Multimedia ./multimedia.jl:47
 [11] (::REPL.var"#43#44"{REPL.REPLDisplay{REPL.LineEditREPL}, MIME{Symbol("text/plain")}, Base.RefValue{Any}})(io::Any)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:266
 [12] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:510
 [13] display(d::REPL.REPLDisplay, mime::MIME{Symbol("text/plain")}, x::Any)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:259
 [14] display(d::REPL.REPLDisplay, x::Any)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:271
 [15] display(x::Any)
    @ Base.Multimedia ./multimedia.jl:328
 [16] #invokelatest#2
    @ ./essentials.jl:716 [inlined]
 [17] invokelatest
    @ ./essentials.jl:714 [inlined]
 [18] print_response(errio::IO, response::Any, show_value::Bool, have_color::Bool, specialdisplay::Union{Nothing, AbstractDisplay})
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:293
 [19] (::REPL.var"#45#46"{REPL.LineEditREPL, Pair{Any, Bool}, Bool, Bool})(io::Any)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:277
 [20] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:510
 [21] print_response(repl::REPL.AbstractREPL, response::Any, show_value::Bool, have_color::Bool)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:275
 [22] (::REPL.var"#do_respond#66"{Bool, Bool, REPL.var"#77#87"{REPL.LineEditREPL, REPL.REPLHistoryProvider}, REPL.LineEditREPL, REPL.LineEdit.Prompt})(s::REPL.LineEdit.MIState, buf::Any, ok::Bool)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:846
 [23] #invokelatest#2
    @ ./essentials.jl:716 [inlined]
 [24] invokelatest
    @ ./essentials.jl:714 [inlined]
 [25] run_interface(terminal::REPL.Terminals.TextTerminal, m::REPL.LineEdit.ModalInterface, s::REPL.LineEdit.MIState)
    @ REPL.LineEdit /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/LineEdit.jl:2493
 [26] run_frontend(repl::REPL.LineEditREPL, backend::REPL.REPLBackendRef)
    @ REPL /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/REPL/src/REPL.jl:1232
 [27] (::REPL.var"#49#54"{REPL.LineEditREPL, REPL.REPLBackendRef})()
    @ REPL ./task.jl:423

test.tsv.gz

cocoa1231 commented Jan 10, 2024

+1, I am also facing the same issue. My tsv is around 250 MB, so I can't upload it, but it can be downloaded from athena.ohdsi.org (SNOMED dataset). Julia 1.10.0 on Debian 12 with CSV v0.10.12 and DataFrames v1.6.1.
