Add shuffle, shuffle! functions #2048

Closed
rana opened this issue Dec 9, 2019 · 12 comments · Fixed by #3010
Labels
non-breaking The proposed change is not breaking

@rana

rana commented Dec 9, 2019

Hi,

Would be helpful to see shuffle, shuffle! functions in DataFrames. Used in randomizing machine learning mini batches.

What do you think?

@bkamins
Member

bkamins commented Dec 9, 2019

Now you can do shuffle via df[shuffle(axes(df, 1)), :] but I agree we could add it.

@nalimilan - given we have settled to treat a DataFrame as a collection of rows I think it is OK to add it. If you agree, then I can make a PR.

@rana
Author

rana commented Dec 9, 2019

Thanks, I didn't know about df[shuffle(axes(df, 1)), :]. I will start using that in the meantime.

@bkamins
Member

bkamins commented Dec 9, 2019

A bit less efficient (but more aesthetic) way to do it is DataFrame(shuffle(eachrow(df))).
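
For readers following along, the indexing pattern suggested above can be sketched as follows. The example table, the column names id and x, and the seed are made up for illustration; only df[shuffle(axes(df, 1)), :] comes from the thread.

```julia
using DataFrames, Random

# Hypothetical example data (not from the thread).
df = DataFrame(id = 1:5, x = [10, 20, 30, 40, 50])

# A seeded RNG keeps the sketch reproducible; the seed value is arbitrary.
rng = MersenneTwister(42)

# Index the rows with a random permutation of the row axis.
# This returns a new DataFrame and leaves df untouched.
shuffled = df[shuffle(rng, axes(df, 1)), :]
```

Each row moves as a unit, so values in the same row stay together after the shuffle.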

@rana
Author

rana commented Dec 10, 2019

Maybe also consider offering column shuffling?

shuffle(;cols=false)

shuffle!(;cols=false)

@bkamins
Member

bkamins commented Dec 10, 2019

We treat DataFrame as row oriented, so I would not implement column shuffling directly, rather this:

select(df, randperm(ncol(df)))

or this:

df[:, randperm(ncol(df))]

should be used
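
A minimal sketch of the two column-shuffling patterns above, assuming a small made-up table with columns a, b, c. Drawing the permutation once and reusing it makes it easy to see that both forms give the same column order.

```julia
using DataFrames, Random

# Hypothetical example data (not from the thread).
df = DataFrame(a = 1:3, b = 4:6, c = 7:9)

# Draw one random column permutation; the seed value is arbitrary.
rng = MersenneTwister(1)
perm = randperm(rng, ncol(df))

# Both forms reorder the columns according to perm and return a new DataFrame.
by_select = select(df, perm)
by_index = df[:, perm]
```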

@nalimilan
Member

Reminds me of a similar discussion about sample. Maybe better leave this for post-1.0.

Shuffling columns doesn't sound too common, is it?

@bkamins
Member

bkamins commented Dec 11, 2019

Also another pattern that can be used to shuffle rows is df[randperm(nrow(df)), :].

An in-place operation is more challenging and will require a careful design.

OK - leaving this decision post 1.0 (mostly because it is easy to do this without this function).
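
To illustrate why the in-place variant needs care, here is a rough sketch of one naive approach: draw a single permutation and apply it to every column. The helper name shuffle_rows! is hypothetical and is not the DataFrames.jl API. The sketch assumes that no two columns share the same underlying vector; an aliased column would be permuted twice, which is one of the cases a real design would have to handle.

```julia
using DataFrames, Random

# Hypothetical helper, not part of DataFrames.jl.
# Assumes every column is a distinct, mutable vector (no aliasing).
function shuffle_rows!(rng::AbstractRNG, df::DataFrame)
    perm = randperm(rng, nrow(df))
    for col in eachcol(df)
        # col[perm] allocates a permuted copy, which is then written back in place.
        col .= col[perm]
    end
    return df
end

# Made-up example data; the seed value is arbitrary.
df = DataFrame(id = collect(1:6), x = collect(10:10:60))
shuffle_rows!(MersenneTwister(3), df)
```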

@bkamins bkamins added this to the 2.0 milestone Dec 11, 2019
@rana
Author

rana commented Dec 11, 2019

I haven't seen many column permutation examples, though I use it in my work. Appreciate the pointer on how to do it. When I'm deep in a language it is obvious. In this case I'm in multiple languages and frameworks and looking for convenience functions.

@bkamins
Member

bkamins commented Dec 11, 2019

Sure. I guess the point of @nalimilan is that we want to move towards 1.0 pretty soon.

In general - as we try to look at DataFrame as a collection of rows now I would be OK with adding shuffle and sample to it now. But @nalimilan is a kind of "ecosystem curator" (as it has to be consistent) so I prefer to delegate the final word to him 😄.

@mahiki

mahiki commented Oct 30, 2020

I'd like to add a use case that is common in my work, for grouped dataframes. I want to shuffle the groups, which in my case consist of groups of items with time series of transactions. Then I want to take the first N groups after the shuffle (i.e. randomly select N groups).

Maybe there is a similarly simple way to shuffle the grouped df.

The following process demonstrates the steps I'm currently taking:

df = DataFrame(time = [1, 2, 1, 2, 1, 2]
    , amt = [19.00, 11.00, 35.50, 32.50, 5.99, 5.99]
    , item = ["B001", "B001", "B020", "B020", "BX00", "BX00"])

6×3 DataFrame
│ Row │ time  │ amt     │ item   │
│     │ Int64 │ Float64 │ String │
├─────┼───────┼─────────┼────────┤
│ 1   │ 1     │ 19.0    │ B001   │
│ 2   │ 2     │ 11.0    │ B001   │
│ 3   │ 1     │ 35.5    │ B020   │
│ 4   │ 2     │ 32.5    │ B020   │
│ 5   │ 1     │ 5.99    │ BX00   │
│ 6   │ 2     │ 5.99    │ BX00   │

using StatsBase, Pipe
res = @pipe df |> groupby(_, :item) |>
         combine(_, :time, :amt, :item, :item => (x -> rand()) => :rando) |>
         sort(_, :rando) |>
         transform(_, :rando => denserank => :rnk_rnd)

6×5 DataFrame
│ Row │ item   │ time  │ amt     │ rando    │ rnk_rnd │
│     │ String │ Int64 │ Float64 │ Float64  │ Int64   │
├─────┼────────┼───────┼─────────┼──────────┼─────────┤
│ 1   │ BX00   │ 0     │ 5.99    │ 0.241881 │ 1       │
│ 2   │ BX00   │ 1     │ 5.99    │ 0.241881 │ 1       │
│ 3   │ B001   │ 0     │ 19.0    │ 0.292468 │ 2       │
│ 4   │ B001   │ 1     │ 11.0    │ 0.292468 │ 2       │
│ 5   │ B020   │ 0     │ 35.5    │ 0.70816  │ 3       │
│ 6   │ B020   │ 1     │ 32.5    │ 0.70816  │ 3       │

# I only want the original columns
@pipe filter(:rnk_rnd => <=(2), res) |>
    select(_, :item, :time, :amt)

4×3 DataFrame
│ Row │ item   │ time  │ amt     │
│     │ String │ Int64 │ Float64 │
├─────┼────────┼───────┼─────────┤
│ 1   │ BX00   │ 1     │ 5.99    │
│ 2   │ BX00   │ 2     │ 5.99    │
│ 3   │ B020   │ 1     │ 35.5    │
│ 4   │ B020   │ 2     │ 32.5    │

@mahiki

mahiki commented Oct 30, 2020

Got it:

using Random  # provides shuffle

# take the first 2 shuffled groups
@pipe df |> groupby(_, :item) |>
    _[shuffle(1:end)] |>
    combine(_[1:2], :)

4×3 DataFrame
│ Row │ item   │ time  │ amt     │
│     │ String │ Int64 │ Float64 │
├─────┼────────┼───────┼─────────┤
│ 1   │ BX00   │ 0     │ 5.99    │
│ 2   │ BX00   │ 1     │ 5.99    │
│ 3   │ B001   │ 0     │ 19.0    │
│ 4   │ B001   │ 1     │ 11.0    │

I guess I'll put it up on Stack Overflow.
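
The same idea can also be written without Pipe. This is a sketch under a few assumptions: the variable names gdf, picked, result and the count n are illustrative, the seed is arbitrary, and DataFrame(picked) is used here to flatten the selected groups in place of combine(_, :).

```julia
using DataFrames, Random

df = DataFrame(time = [1, 2, 1, 2, 1, 2],
               amt = [19.00, 11.00, 35.50, 32.50, 5.99, 5.99],
               item = ["B001", "B001", "B020", "B020", "BX00", "BX00"])

rng = MersenneTwister(7)   # arbitrary seed, for reproducibility only
gdf = groupby(df, :item)

# Shuffle the group indices and keep the first n of them.
n = 2
picked = gdf[randperm(rng, length(gdf))[1:n]]

# Flatten the selected groups back into a single DataFrame.
result = DataFrame(picked)
```

Indexing a GroupedDataFrame with a vector of integers returns another GroupedDataFrame, so whole groups are kept or dropped together.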

@bkamins
Member

bkamins commented Oct 30, 2020

Adding this and sample is planned but after 0.22 release as it is non-breaking.
