Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bounds error when sorting a column after select #3340

Closed
George9000 opened this issue Jun 13, 2023 · 12 comments
Closed

Bounds error when sorting a column after select #3340

George9000 opened this issue Jun 13, 2023 · 12 comments
Labels
Milestone

Comments

@George9000
Copy link

George9000 commented Jun 13, 2023

Using Julia 1.9.1 and DataFrames 1.5.0, a bounds error is thrown when attempting to sort a dataframe by a column following a select and reassignment operation. This is not reproducible in another column in the same dataframe. Despite scrutinizing the contents of the column throwing the error, no clues to source are found.
[version details in expandable section below code]

Example data to reproduce error is here.

using DataFrames, CSV

const datap = "."

# Bounds error
let
    df = CSV.read(joinpath(datap, "examp.csv"), DataFrame)
    df = select(df, [:sku, :units])
    sort(df, :sku)
end


#Success
let
    df = CSV.read(joinpath(datap, "examp.csv"), DataFrame)
    df = select(df, [:sku, :units])
    sort(df, :units)
end


#Success
let
    df = CSV.read(joinpath(datap, "examp.csv"), DataFrame)
    df = select(df, [:sku, :units])
    sort(df.sku)
end


#Success
let
    df = CSV.read(joinpath(datap, "examp.csv"), DataFrame)
    sort(df, :sku)
end



# CSV.read error: InexactError: trunc(Int64, NaN)
let
    df = CSV.read(joinpath(datap, "examp.csv"), DataFrame; skipto = 8000, limit = 5000)
end


##  investigate :sku column ##
function skucategory(sku)
    if ismissing(sku)
        return missing
    elseif occursin(r"^\d+$", sku)
        return "digits"
    elseif occursin(r"^[\dS]+$", sku)
        return "S_sku"
    elseif occursin(r"^[[:alnum:]]+$", sku)
        return "alphanumeric"
    elseif occursin(r"\/", sku)
        return "slash"
    else
        return "other"
    end
end

let
    df = CSV.read(joinpath(datap, "examp.csv"), DataFrame)
    transform!(df, :sku => ByRow(skucategory) => :skucat)
    gdf = groupby(df, :skucat)
    combine(gdf, nrow)
end   
version info, project.toml, error output
julia> versioninfo()
Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.6.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 5 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = 4
  JULIA_EDITOR = Emacs
  JULIA_PKG_DEVDIR = /Users/george/Documents/julia/dev
  
Status `~/Desktop/exampl/Project.toml`
  [336ed68f] CSV v0.10.11
  [a93c6f00] DataFrames v1.5.0 `https://github.com/JuliaData/DataFrames.jl.git#main`

# Bounds error
ERROR: BoundsError: attempt to access 217465-element Vector{Int64} at index [0]
Stacktrace:
  [1] setindex!
    @ ./array.jl:969 [inlined]
  [2] _sort!(v::Vector{Int64}, a::Base.Sort.MissingOptimization{Base.Sort.BoolOptimization{Base.Sort.Small{10, Base.Sort.InsertionSortAlg, Base.Sort.IEEEFloatOptimization{Base.Sort.IsUIntMappable{Base.Sort.Small{40, Base.Sort.InsertionSortAlg, Base.Sort.CheckSorted{Base.Sort.ComputeExtrema{Base.Sort.ConsiderCountingSort{Base.Sort.CountingSort, Base.Sort.ConsiderRadixSort{Base.Sort.RadixSort, Base.Sort.Small{80, Base.Sort.InsertionSortAlg, Base.Sort.ScratchQuickSort{Missing, Missing, Base.Sort.InsertionSortAlg}}}}}}}, Base.Sort.StableCheckSorted{Base.Sort.ScratchQuickSort{Missing, Missing, Base.Sort.InsertionSortAlg}}}}}}}, o::Base.Order.Perm{Base.Order.ForwardOrdering, Vector{Union{Missing, String31}}}, kw::NamedTuple{(:lo, :hi, :allow_legacy_dispatch), Tuple{Int64, Int64, Bool}})
    @ Base.Sort ./sort.jl:583
  [3] sort!
    @ ./sort.jl:2123 [inlined]
  [4] sort!(v::Vector{Int64}, lo::Int64, hi::Int64, #unused#::SortingAlgorithms.TimSortAlg, o::Base.Order.Perm{Base.Order.ForwardOrdering, Vector{Union{Missing, String31}}})
    @ SortingAlgorithms ~/.julia/packages/SortingAlgorithms/n1AWW/src/SortingAlgorithms.jl:488
  [5] sort!
    @ ./sort.jl:2121 [inlined]
  [6] _sortperm(df::DataFrame, a::SortingAlgorithms.TimSortAlg, o::Base.Order.Perm{Base.Order.ForwardOrdering, Vector{Union{Missing, String31}}})
    @ DataFrames ~/.julia/packages/DataFrames/4GaAs/src/abstractdataframe/sort.jl:592
  [7] sortperm(df::DataFrame, cols::Symbol; alg::Nothing, lt::typeof(isless), by::typeof(identity), rev::Bool, order::Base.Order.ForwardOrdering)
    @ DataFrames ~/.julia/packages/DataFrames/4GaAs/src/abstractdataframe/sort.jl:589
  [8] sortperm
    @ ~/.julia/packages/DataFrames/4GaAs/src/abstractdataframe/sort.jl:577 [inlined]
  [9] #sort#378
    @ ~/.julia/packages/DataFrames/4GaAs/src/abstractdataframe/sort.jl:510 [inlined]
 [10] sort(df::DataFrame, cols::Symbol)
    @ DataFrames ~/.julia/packages/DataFrames/4GaAs/src/abstractdataframe/sort.jl:503
 [11] top-level scope
    @ REPL[5]:4


# CSV inexact error
ERROR: InexactError: trunc(Int64, NaN)
Stacktrace:
 [1] trunc
   @ ./float.jl:893 [inlined]
 [2] ceil(#unused#::Type{Int64}, x::Float64)
   @ Base ./float.jl:383
 [3] CSV.Context(source::CSV.Arg, header::CSV.Arg, normalizenames::CSV.Arg, datarow::CSV.Arg, skipto::CSV.Arg, footerskip::CSV.Arg, transpose::CSV.Arg, comment::CSV.Arg, ignoreemptyrows::CSV.Arg, ignoreemptylines::CSV.Arg, select::CSV.Arg, drop::CSV.Arg, limit::CSV.Arg, buffer_in_memory::CSV.Arg, threaded::CSV.Arg, ntasks::CSV.Arg, tasks::CSV.Arg, rows_to_check::CSV.Arg, lines_to_check::CSV.Arg, missingstrings::CSV.Arg, missingstring::CSV.Arg, delim::CSV.Arg, ignorerepeated::CSV.Arg, quoted::CSV.Arg, quotechar::CSV.Arg, openquotechar::CSV.Arg, closequotechar::CSV.Arg, escapechar::CSV.Arg, dateformat::CSV.Arg, dateformats::CSV.Arg, decimal::CSV.Arg, groupmark::CSV.Arg, truestrings::CSV.Arg, falsestrings::CSV.Arg, stripwhitespace::CSV.Arg, type::CSV.Arg, types::CSV.Arg, typemap::CSV.Arg, pool::CSV.Arg, downcast::CSV.Arg, lazystrings::CSV.Arg, stringtype::CSV.Arg, strict::CSV.Arg, silencewarnings::CSV.Arg, maxwarnings::CSV.Arg, debug::CSV.Arg, parsingdebug::CSV.Arg, validate::CSV.Arg, streaming::CSV.Arg)
   @ CSV ~/.julia/packages/CSV/OnldF/src/context.jl:647
 [4] #File#32
   @ ~/.julia/packages/CSV/OnldF/src/file.jl:222 [inlined]
 [5] read(source::String, sink::Type; copycols::Bool, kwargs::Base.Pairs{Symbol, Int64, Tuple{Symbol, Symbol}, NamedTuple{(:skipto, :limit), Tuple{Int64, Int64}}})
   @ CSV ~/.julia/packages/CSV/OnldF/src/CSV.jl:117
 [6] top-level scope
   @ REPL[6]:2
@bkamins bkamins added the bug label Jun 14, 2023
@bkamins bkamins added this to the 1.6 milestone Jun 14, 2023
@bkamins
Copy link
Member

bkamins commented Jun 14, 2023

Thank you for reporting.

This is unrelated to DataFrames.jl. The fundamental problem is:

o = Base.Order.Perm(Base.Order.ForwardOrdering(), df.sku)
a = DataFrames.SortingAlgorithms.TimSortAlg()
x = collect(1:nrow(df))
sort!(x, a, o) # this errors

This is triggered in SortingAlgorithms.jl / Base.

@LilithHafner - maybe there were some recent changes to these algorithms? The column sorted is Vector{Union{String31, Missing}}.

@LilithHafner
Copy link
Contributor

@George9000, thank you for reporting, and @bkamins, thank you for @ ing me. I can confirm that this is my fault. I introduced a fast path for sorting unions with missing that is only safe when sorting an entire vector (a la sort!(v)) but TimSort uses an internal method of sort sort!(v::AbstractVector, lo::Integer, hi::Integer, alg::Alorithm, o::Ordering) that incorrectly dispatches to this fast path.

I will fix the restrict the dispatch to the fastpath in Base's sorting to unbreak this internal method. This fix should make it into 1.9.2 and 1.10.0.

I will also patch SortingAlgorithms so that TimSort calls a public method (which I did not break). That will come out soon.

@bkamins
Copy link
Member

bkamins commented Jun 14, 2023

@LilithHafner - thank you for a prompt response. You are a top expert in this field and your expertise is highly appreciated here.

Could you please x-link this issue to the PR you would make (so that I can close this issue when things are sorted out 😄 ).

@LilithHafner
Copy link
Contributor

I'm assuming you are referring to the field of "fixing bugs that Lilith introduced" 😄 Happy to help.

@bkamins
Copy link
Member

bkamins commented Jun 15, 2023

I really appreciate your efforts to improve sorting in Julia!

@LilithHafner
Copy link
Contributor

@George9000, would you please run ]up and try again? I think this is fixed in SortingAlgorithms v1.1.1

@bkamins
Copy link
Member

bkamins commented Jun 15, 2023

Works for me. Is JuliaLang/julia#50171 needed to fix the issue (or it is just another enhancement, but DataFrames.jl is already fixed without it?)

LilithHafner pushed a commit to LilithHafner/DataFrames.jl that referenced this issue Jun 15, 2023
@LilithHafner
Copy link
Contributor

These are Base.Sort bugs so JuliaLang/julia#50171 is the real fix. JuliaCollections/SortingAlgorithms.jl#78 is a temporary shim so that SortingAlgorithms (and DataFrames by extension) is not broken before JuliaLang/julia#50171 is released.

@bkamins
Copy link
Member

bkamins commented Jun 15, 2023

Ah - OK. Now I see it. Still - this issue can be closed then. Right?

@George9000
Copy link
Author

George9000 commented Jun 15, 2023

Late to the game here. Now works with Julia 1.9.1. Thank you.

Still fails with 1.10.0-DEV built from source a few days ago, but will wait for julia#50171 to merge, rebuild, and try again.

julia 1.10 error
julia> versioninfo()
Julia Version 1.10.0-DEV.1467
Commit c58e508fdd (2023-06-10 16:12 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.6.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
  Threads: 6 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = 4
  JULIA_EDITOR = Emacs
  JULIA_PKG_DEVDIR = /Users/george/Documents/julia/dev
  

julia> let
           df = CSV.read(joinpath(datap, "examp.csv"), DataFrame)
           df = select(df, [:sku, :units])
           sort(df, :sku)
       end

ERROR: ArgumentError: Base.Sort._sort!(::Vector{Int64}, ::SortingAlgorithms.TimSortAlg, ::Base.Order.Perm{Base.Order.ForwardOrdering, Base.Sort.WithoutMissingVector{String31, Vector{Union{Missing, String31}}}}) is not defined
Stacktrace:
  [1] _sort!
    @ Base.Sort ./sort.jl:2170 [inlined]
  [2] _sort!
    @ Base.Sort ./sort.jl:657 [inlined]
  [3] _sort!
    @ Base.Sort ./sort.jl:728 [inlined]
  [4] _sort!
    @ Base.Sort ./sort.jl:673 [inlined]
  [5] _sort!(v::Vector{…}, a::Base.Sort.MissingOptimization{…}, o::Base.Order.Perm{…}, kw::@NamedTuple{…})
    @ Base.Sort ./sort.jl:610
  [6] sort!
    @ Base.Sort ./sort.jl:2152 [inlined]
  [7] sort!
    @ Base.Sort ./sort.jl:2150 [inlined]
  [8] _sortperm(df::DataFrame, a::Base.Sort.MissingOptimization{Base.Sort.BoolOptimization{…}}, o::Base.Order.Perm{Base.Order.ForwardOrdering, Vector{…}})
    @ DataFrames ~/.julia/packages/DataFrames/LteEl/src/abstractdataframe/sort.jl:592
  [9] sortperm(df::DataFrame, cols::Symbol; alg::Nothing, lt::typeof(isless), by::typeof(identity), rev::Bool, order::Base.Order.ForwardOrdering)
    @ DataFrames ~/.julia/packages/DataFrames/LteEl/src/abstractdataframe/sort.jl:589
 [10] sort(df::DataFrame, cols::Symbol; alg::Nothing, lt::typeof(isless), by::typeof(identity), rev::Bool, order::Base.Order.ForwardOrdering, view::Bool)
    @ DataFrames ~/.julia/packages/DataFrames/LteEl/src/abstractdataframe/sort.jl:510 [inlined]
 [11] sort(df::DataFrame, cols::Symbol)
    @ DataFrames ~/.julia/packages/DataFrames/LteEl/src/abstractdataframe/sort.jl:503
 [12] top-level scope
    @ REPL[7]:4
   

Status `~/Documents/exampl/Project.toml`
  [336ed68f] CSV v0.10.11
  [861a8166] Combinatorics v1.0.2
  [a93c6f00] DataFrames v1.5.0

@LilithHafner
Copy link
Contributor

Should work on Julia master too as of a few days ago

@bkamins
Copy link
Member

bkamins commented Jun 20, 2023

Yes, it works. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants