Improve flatten (slightly breaking) #2767

bkamins · 2021-05-17T16:57:47Z

Users find it confusing that strings are iterable, see https://discourse.julialang.org/t/flatten-in-case-column-contains-string-and-array/61284/5. They are not iterable in broadcasting. On the other hand we should not follow the full semantics of Base.broadcastable as it will then eg. disallow Dicts (this was the original reason why I went for iterator interface).

I think the way to go is to create a list of special cases. Here is a collection of standard "suspects that I think we should not flatten by default:

AbstractString
Pair
Markdown.MD

the question is if we want them to be left "as is" or throw an error. If we leave them "as is" then probably we should also consider what other types we want to leave "as is" (they currently error):

Symbol
Missing
Nothing

(I have listed only the most common cases)

The general question in the first place is: do we find it useful to complicate the API of flatten. My fist instinct is to throw an error on the cases we do not want to allow being flattened as this will be simpler.

The text was updated successfully, but these errors were encountered:

bkamins · 2021-05-17T16:59:51Z

As proposed in #2766 we could:

error by default in the cases I have listed
add a kwarg that would leave "as is" all values that do not support length or form my list

pdeffebach · 2021-05-17T17:02:56Z

I am much more concerned about supporting missing than AbstractString. Though I understand both are important, I think the probability of having a mixture of Vector{String} and String is unlikely.

bkamins · 2021-05-17T17:21:54Z

Yeah - that is why I have listed all options to consider 😄. I also think missing will be much more common.

@sprmnt21 - thank you for raising this issue as it really is relevant.

knuesel · 2021-05-17T18:08:53Z

Maybe missing should be handled orthogonally to this issue, with an allowmissing keyword?

I think errors on scalars (for example symbols) are unexpected: since numbers (the most common scalars) work, I would expect that every value is treated either as scalar or as collection, and errors would only occur when there is a "real" issue like missing.

We could have either a blacklist (AbstractString, Pair, ...) or a whitelist (AbstractArray, AbstractDict, Tuple, Generator...). In either case, it seems a bit arbitrary to have a fixed list, so maybe we should also have a keyword argument for the user to supply their own list? For example the user could call flatten(df, :a, scalars=[AbstractString]) to solve the issue in the Discourse thread.

pdeffebach · 2021-05-17T18:24:16Z

I think we should take no methods for length seriously and not do a white-list. I'm not super sure about keyword arguments. This seems rate enough that people can do cleaning on their own before, i.e. put all Strings in [x].

Do we know if this problem came about "naturally" or was it trying to work through the docs? because the post on discourse certainly seems a bit contrived.

sprmnt21 · 2021-05-17T20:27:24Z

an example of origin, I don't know how "natural", of this situation is the dataframe originating from the following expressions

                  using DataFrames
        list1 = DataFrame(ID=["0001P:ABCD","0001R:ABCD","2324P:CDCL","5666TB:GHLA","789789TSB:ASLT","2324R:CDCL","79067P:OPRA","79067R:OPRA"])
        list2 = DataFrame(ID =["0001P:ABCD","0001R:ABCD","2324R:CDCL","5666TB:GHLA"], Pair = ["PairOne","PairOne","Two","Two"])
        
        function pairing(s1,s2)
                      if length(s1)==length(s2)
                          idx=findall(x->x!=0,cmp.(collect(s1),collect(s2)))
                          return Set([s1[idx],s2[idx]])
                      else
                          return Set([])
                      end
                  end
                  list = leftjoin(list1,list2, on = :ID)
                  nr=nrow(list)
                  list.twin=[[i, something(findfirst(x->x==1,[Set(["P","R"])==pairing(list.ID[i],list.ID[j]) for j in 1:nr]),i)] for i in 1:nr]
                  list.pair1=[coalesce(list.Pair[t[1]],list.Pair[t[2]])   for t in list.twin ]
                  list.pair_1=[coalesce(list.pair1[r],unique(list.ID[list.twin[r]])) for r in 1:nr]
                  sort(list,:Pair)

which was an attempt to resolve this problem

but, I also believe, that it is quite artificial and unlikely.

knuesel · 2021-05-18T08:17:11Z

Good point about taking no length seriously.

I also haven't seen a real-life case... I guess flattening a heterogeneous column is uncommon. So maybe not worth adding a scalars or similar keyword. On the other hand, cleaning the data by adding wrappers would probably have much worse performance?

ianqsong · 2021-05-19T09:59:13Z

Following the hint of Base.broadcastable, which uses axes, it leads to the implementation below.
Note that cell x would be flattened according to length(axes(x)[1]), i.e., only the outer most layer (bracket [ ] for vector, ( ) for tuple) would be removed (may be called flatten in depth level 1). Eg. the column with two items

[["aa", ["bb"]], "cc"]

is flattened to three items ["aa", ["bb"], "cc"].

function flatten_depth1(df::AbstractDataFrame,
                 cols::Union{ColumnIndex, MultiColumnIndex})
    _check_consistency(df)

    idxcols = index(df)[cols]
    isempty(idxcols) && return copy(df)
    col1 = first(idxcols)
    # lengths = length.(df[!, col1])
    flen(x) = try length(axes(x)[1]) catch; 1 end
    lengths = flen.(df[!, col1])
    if length(idxcols) > 1
        for col in idxcols[2:end]
            v = df[!, col]
            if any(x -> flen(x[1]) != x[2], zip(v, lengths))
                r = findfirst(x -> x != 0, length.(v) .- lengths)
                colnames = _names(df)
                throw(ArgumentError("Lengths of iterables stored in columns :$(colnames[col1]) " *
                                    "and :$(colnames[col]) are not the same in row $r"))
            end
	end
        sort!(idxcols)
    end

    new_df = similar(df[!, Not(cols)], sum(lengths))
    for name in _names(new_df)
        repeat_lengths!(new_df[!, name], df[!, name], lengths)
    end
    # length(idxcols) > 1 && sort!(idxcols)
    for col in idxcols
        # col_to_flatten = df[!, col]
        # flattened_col = col_to_flatten isa AbstractVector{<:AbstractVector} ?
        #     reduce(vcat, [i for i in j, j in df[!, col]]) :
        #     collect(Iterators.flatten(col_to_flatten))

        # [["aa", ["bb"]], "cc", ("dd", "ee")] flattened to ["aa", ["bb"], "cc", "dd", "ee"]
        flattened_col = mapreduce(vcat, df[!, col], lengths) do i,l
            l > 1 ? [j for j in i] : ifelse(isa(i, AbstractVector), i, [i])
        end
        insertcols!(new_df, col, _names(df)[col] => flattened_col)
    end

    return new_df
end

bkamins · 2022-10-14T21:15:43Z

I have thought about it and would add scalar kwarg (we could discuss the name). It would accept a type and values of this type would be repeated as many times as needed (aka broadcasting). By default no type would be treated as a scalar. But for example if you pass scalar=Missing then missings would be expanded. If you pass scalar=Union{Missing, AbstractString} then missings or strings would be expanded.

What do you think?

CC @nalimilan

nalimilan · 2022-10-17T20:27:35Z

I kind of regret we don't treat strings as scalars, but now that we do I agree that allowing to pass types to treat as scalars via an argument is a reasonable solution.

bkamins added decision breaking The proposed change is breaking. labels May 17, 2021

bkamins added this to the 1.x milestone May 17, 2021

bkamins mentioned this issue May 17, 2021

Flatten in case column contains string and array #2766

Closed

bkamins modified the milestones: 1.x, 1.5 Aug 15, 2022

bkamins mentioned this issue Aug 15, 2022

unnest #3116

Open

bkamins mentioned this issue Dec 28, 2022

Add nest, unnest, extract and extract! #3258

Draft

bkamins linked a pull request Dec 28, 2022 that will close this issue

Add nest, unnest, extract and extract! #3258

Draft

bkamins removed a link to a pull request Feb 5, 2023

Add nest, unnest, extract and extract! #3258

Draft

bkamins mentioned this issue Feb 5, 2023

Add scalar keyword argument to flatten #3283

Merged

bkamins closed this as completed in #3283 Feb 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve flatten (slightly breaking) #2767

Improve flatten (slightly breaking) #2767

bkamins commented May 17, 2021

bkamins commented May 17, 2021

pdeffebach commented May 17, 2021

bkamins commented May 17, 2021

knuesel commented May 17, 2021

pdeffebach commented May 17, 2021 •

edited

Loading

sprmnt21 commented May 17, 2021 •

edited

Loading

knuesel commented May 18, 2021

ianqsong commented May 19, 2021 •

edited

Loading

bkamins commented Oct 14, 2022

nalimilan commented Oct 17, 2022

Improve flatten (slightly breaking) #2767

Improve flatten (slightly breaking) #2767

Comments

bkamins commented May 17, 2021

bkamins commented May 17, 2021

pdeffebach commented May 17, 2021

bkamins commented May 17, 2021

knuesel commented May 17, 2021

pdeffebach commented May 17, 2021 • edited Loading

sprmnt21 commented May 17, 2021 • edited Loading

knuesel commented May 18, 2021

ianqsong commented May 19, 2021 • edited Loading

bkamins commented Oct 14, 2022

nalimilan commented Oct 17, 2022

pdeffebach commented May 17, 2021 •

edited

Loading

sprmnt21 commented May 17, 2021 •

edited

Loading

ianqsong commented May 19, 2021 •

edited

Loading