Add shuffle, shuffle! functions #2048

rana · 2019-12-09T00:43:16Z

Hi,

Would be helpful to see shuffle, shuffle! functions in DataFrames. Used in randomizing machine learning mini batches.

What do you think?

bkamins · 2019-12-09T07:31:33Z

Now you can do shuffle via df[shuffle(axes(df, 1)), :] but I agree we could add it.
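A minimal sketch of that indexing pattern, assuming a small made-up table (the column names `id` and `x` are hypothetical and only serve the illustration):

```julia
using DataFrames, Random

# Hypothetical toy data; column names are illustrative only.
df = DataFrame(id = 1:6, x = 10:10:60)

# axes(df, 1) is the row axis; shuffle returns a permuted copy of it.
perm = shuffle(axes(df, 1))

# Row indexing with `:` for the columns allocates a new DataFrame; df is untouched.
shuffled = df[perm, :]
```

For mini-batches, consecutive slices of `perm` could be used as row indices in the same way.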

@nalimilan - given we have settled to treat a DataFrame as a collection of rows I think it is OK to add it. If you agree, then I can make a PR.

rana · 2019-12-09T17:31:22Z

Thanks, I didn't know about df[shuffle(axes(df, 1)), :]. I will start using that in the meantime.

bkamins · 2019-12-09T17:43:25Z

A bit less efficient (but more aesthetic) way to do it is DataFrame(shuffle(eachrow(df))).

rana · 2019-12-10T22:11:47Z

Maybe also consider offering column shuffling?

shuffle(;cols=false)

shuffle!(;cols=false)

bkamins · 2019-12-10T22:24:38Z

We treat DataFrame as row oriented, so I would not implement column shuffling directly, rather this:

select(df, randperm(ncol(df)))

or this:

df[:, randperm(ncol(df))]

should be used
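As a rough sketch of the two column-permutation forms above (the data and column names are made up for illustration), both give the same column order when they share one permutation:

```julia
using DataFrames, Random

# Hypothetical toy data.
df = DataFrame(a = 1:3, b = 4:6, c = 7:9)

# One random permutation of the column positions 1:ncol(df).
perm = randperm(ncol(df))

# Integer column selectors work both in select and in plain indexing.
by_select = select(df, perm)
by_index = df[:, perm]
```

Both calls return a new data frame and leave `df` as it was.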

nalimilan · 2019-12-11T13:57:32Z

Reminds me of a similar discussion about sample. Maybe better leave this for post-1.0.

Shuffling columns doesn't sound too common, does it?

bkamins · 2019-12-11T15:25:55Z

Also another pattern that can be used to shuffle rows is df[randperm(nrow(df)), :].

An in-place operation is more challenging and will require a careful design.

OK - leaving this decision post 1.0 (mostly because it is easy to do this without this function).

rana · 2019-12-11T17:06:09Z

I haven't seen many column permutation examples, though I use it in my work. Appreciate the pointer on how to do it. When I'm deep in a language it is obvious. In this case I'm in multiple languages and frameworks and looking for convenience functions.

bkamins · 2019-12-11T17:17:49Z

Sure. I guess the point of @nalimilan is that we want to move towards 1.0 pretty soon.

In general - as we try to look at DataFrame as a collection of rows now I would be OK with adding shuffle and sample to it now. But @nalimilan is a kind of "ecosystem curator" (as it has to be consistent) so I prefer to delegate the final word to him 😄.

mahiki · 2020-10-30T05:36:16Z

I'd like to add a use case that is common in my work, for grouped dataframes. I want to shuffle the groups, which in my case consist of groups of items with time series of transactions. Then I want to take the first N groups after the shuffle (i.e. randomly select N groups).

Maybe there is a similarly simple way to shuffle the grouped df.

The following process demonstrates the steps I'm currently taking:

using DataFrames

df = DataFrame(time = [1, 2, 1, 2, 1, 2],
               amt = [19.00, 11.00, 35.50, 32.50, 5.99, 5.99],
               item = ["B001", "B001", "B020", "B020", "BX00", "BX00"])

6×3 DataFrame
│ Row │ time  │ amt     │ item   │
│     │ Int64 │ Float64 │ String │
├─────┼───────┼─────────┼────────┤
│ 1   │ 1     │ 19.0    │ B001   │
│ 2   │ 2     │ 11.0    │ B001   │
│ 3   │ 1     │ 35.5    │ B020   │
│ 4   │ 2     │ 32.5    │ B020   │
│ 5   │ 1     │ 5.99    │ BX00   │
│ 6   │ 2     │ 5.99    │ BX00   │

using StatsBase, Pipe
# draw one random number per group, sort by it, then rank the groups
res = @pipe df |> groupby(_, :item) |>
         combine(_, :time, :amt, :item, :item => (x -> rand()) => :rando) |>
         sort(_, :rando) |>
         transform(_, :rando => denserank => :rnk_rnd)

6×5 DataFrame
│ Row │ item   │ time  │ amt     │ rando    │ rnk_rnd │
│     │ String │ Int64 │ Float64 │ Float64  │ Int64   │
├─────┼────────┼───────┼─────────┼──────────┼─────────┤
│ 1   │ BX00   │ 1     │ 5.99    │ 0.241881 │ 1       │
│ 2   │ BX00   │ 2     │ 5.99    │ 0.241881 │ 1       │
│ 3   │ B001   │ 1     │ 19.0    │ 0.292468 │ 2       │
│ 4   │ B001   │ 2     │ 11.0    │ 0.292468 │ 2       │
│ 5   │ B020   │ 1     │ 35.5    │ 0.70816  │ 3       │
│ 6   │ B020   │ 2     │ 32.5    │ 0.70816  │ 3       │

# I only want the original columns, for the 2 lowest-ranked groups
@pipe filter(:rnk_rnd => <=(2), res) |>
         select(_, :item, :time, :amt)

4×3 DataFrame
│ Row │ item   │ time  │ amt     │
│     │ String │ Int64 │ Float64 │
├─────┼────────┼───────┼─────────┤
│ 1   │ BX00   │ 1     │ 5.99    │
│ 2   │ BX00   │ 2     │ 5.99    │
│ 3   │ B001   │ 1     │ 19.0    │
│ 4   │ B001   │ 2     │ 11.0    │

mahiki · 2020-10-30T05:49:27Z

Got it:

using Random  # provides shuffle

# take the first 2 shuffled groups
@pipe df |> groupby(_, :item) |>
    _[shuffle(1:end)] |>
    combine(_[1:2], :)

4×3 DataFrame
│ Row │ item   │ time  │ amt     │
│     │ String │ Int64 │ Float64 │
├─────┼────────┼───────┼─────────┤
│ 1   │ BX00   │ 1     │ 5.99    │
│ 2   │ BX00   │ 2     │ 5.99    │
│ 3   │ B001   │ 1     │ 19.0    │
│ 4   │ B001   │ 2     │ 11.0    │

I guess I'll put it up on Stack Overflow.
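The same idea can be sketched without Pipe. This is only an illustration, and it assumes that indexing a GroupedDataFrame with a vector of group positions and converting the result with DataFrame(...) is acceptable in place of combine(_, :); the variable names are made up:

```julia
using DataFrames, Random

df = DataFrame(time = [1, 2, 1, 2, 1, 2],
               amt = [19.00, 11.00, 35.50, 32.50, 5.99, 5.99],
               item = ["B001", "B001", "B020", "B020", "BX00", "BX00"])

gdf = groupby(df, :item)

# Positions of 2 randomly chosen groups (first 2 entries of a random permutation).
chosen = randperm(length(gdf))[1:2]

# Indexing with a vector of positions gives a GroupedDataFrame holding only those groups.
picked = gdf[chosen]

# Flatten the selected groups back into a single data frame.
result = DataFrame(picked)
```

Which two items end up in `result` depends on the random draw, so no particular output is shown here.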

bkamins · 2020-10-30T07:58:32Z

Adding this and sample is planned but after 0.22 release as it is non-breaking.

bkamins added this to the 1.x milestone Dec 11, 2019

bkamins mentioned this issue Dec 11, 2019

Row-wise vs. whole vector functions #1952

Closed

bkamins mentioned this issue Jan 22, 2020

Additional functions supported for DataFrame.jl #2088

Closed

bkamins mentioned this issue Feb 3, 2020

rand(::GroupedDataFrame) sampler? #2097

Closed

bkamins added the non-breaking label (the proposed change is not breaking) Feb 12, 2020

bkamins mentioned this issue May 14, 2020

Clarify position on iteration API #2254

Closed

bkamins modified the milestones: 1.x, 1.4 Feb 11, 2022

bkamins mentioned this issue Feb 18, 2022

add reverse!, shuffle, shuffle!, permute!, and invpermute! #3010

Merged

bkamins closed this as completed in #3010 Feb 20, 2022
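
For readers arriving later: a short sketch of how the functions named in #3010 can be called, assuming DataFrames.jl 1.4 or newer (the 1.4 milestone above). The toy data and the seed are arbitrary:

```julia
using DataFrames, Random

# Hypothetical toy data.
df = DataFrame(id = 1:5)

# shuffle returns a new data frame with the rows in random order.
shuffled = shuffle(df)

# A random number generator can be passed first for reproducible runs.
rng = MersenneTwister(42)   # arbitrary seed
reproducible = shuffle(rng, df)

# shuffle! permutes the rows of df in place.
shuffle!(df)
```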
