Add pandas comparison #2378

tk3369 · 2020-08-24T03:41:34Z

This PR adds a new section to compare with Python pandas.

A couple of things:

- I have fixed some other typos in the dplyr section as well.
- I have formatted the markdown table so it's easier on the eyes when editing. If desired, I can apply the same formatting to the dplyr and Stata sections as well (I'm using a vscode extension so it's literally "1-click").
- I have kept the same use cases as copied from the dplyr section; however, I'm uncertain if this is the "best" list. Maybe we should run the list by more R/Python/Stata users and get some feedback.

Disclaimer: I'm not really proficient with pandas, so if anyone knows better ways to write the sample code, please feel free to suggest them below.

bkamins · 2020-08-24T06:58:58Z

Thank you for this PR.

> I have fixed some other typos in the dplyr section as well.

Thank you for spotting them.

> I have formatted the markdown table so it's easier on the eyes when editing. If desired, I can apply the same formatting to the dplyr and Stata sections as well (I'm using a vscode extension so it's literally "1-click").

Sure. For consistency, maybe also add information on how to create this example mini-table for dplyr and Stata.

> I have kept the same use cases as copied from the dplyr section; however, I'm uncertain if this is the "best" list. Maybe we should run the list by more R/Python/Stata users and get some feedback.

I would try to finalize this PR, and then ask on #data in Slack if people have suggestions.

nalimilan · 2020-08-24T08:17:42Z

Thanks. While you're at it, can you fix two issues with the previous PR? These are 1) add a link to this page in the sidebar, and 2) check that in the generated HTML docs tables are rendered correctly (by running julia docs/make.jl).

bkamins · 2020-08-24T08:19:26Z

Both are already fixed on master. We only need to update the title to include pandas.

tk3369 · 2020-08-24T19:06:43Z

@bkamins Thanks for the feedback. I have revamped it further with a whole new subsection for indexing and joins.

tk3369 · 2020-08-25T07:29:01Z

The latest commit includes:

- Updated the left nav menu entry to "Comparison with Python/R/Stata"
- Kept only one way to do things in the DataFrames.jl column in the Accessing Data section
- Removed the note about 1-based indexing
- Removed the note about the off-by-1 issue due to the additional id column. I think that might make it more confusing.
- Updated the pandas sample code for calculating the mean: using `df['x'].mean()` rather than `df.mean().x`
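For reference, the two pandas spellings mentioned in the last item can be compared directly. This is a minimal sketch on a small made-up frame (the column values are an assumption chosen for illustration, not the table from the docs); it only shows that both forms give the same number when every column is numeric.

```python
import pandas as pd

# Hypothetical numeric-only frame, used purely for illustration.
df = pd.DataFrame({"x": [6, 5, 4, 3, 2, 1], "y": [4, 5, 6, 7, 8, 9]})

# Form adopted in the commit: select the column, then aggregate it.
mean_direct = df["x"].mean()

# Earlier form: aggregate every column, then pick one entry of the result.
mean_via_frame = df.mean().x

print(mean_direct, mean_via_frame)
```

The first form computes a single column's mean, while the second aggregates all columns before one value is picked, which does more work than needed.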

tk3369 · 2020-08-25T16:48:43Z

Summary of last commit:

- Added examples/notes about handling missing data
- Added note about 1-based indexing
- Fixed indentation problem
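The notes on missing data added in that commit are not reproduced in this thread. As background only, here is a small sketch of the default pandas behavior such notes have to account for: aggregations skip `NaN` unless told otherwise. The series values are an assumption made for illustration.

```python
import math
import pandas as pd

# Hypothetical column with one missing value.
z = pd.Series([3.0, 4.0, 5.0, 6.0, 7.0, float("nan")])

# By default (skipna=True) the missing value is ignored.
mean_skipping = z.mean()

# With skipna=False the missing value propagates into the result.
mean_propagating = z.mean(skipna=False)

# Count of missing entries.
n_missing = int(z.isna().sum())

print(mean_skipping, mean_propagating, n_missing)
```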

P.S. I haven't been focusing on the dplyr/Stata sections. Once the pandas part is good, we can sync those up. I don't have access to Stata, though, so I will need some help at that point.

bkamins · 2020-09-08T15:23:41Z

I will squash the PR to a single commit when merging.

bkamins · 2020-09-08T21:06:59Z

Please let me know when you are done with simplifying this PR and then I will do a review.

tk3369 · 2020-09-09T00:18:02Z

Yes, I'm done now.

bkamins

Looks good except for some minor comments that I think should not be problematic.

bkamins · 2020-09-09T06:35:49Z

@KrainskiL - please run these examples and comment if you see something that confuses you as a user.

KrainskiL · 2020-09-09T21:33:58Z

I've got 3 comments:

1. Initially I was confused that I received different results by running `df.iloc[1, 1]` (pandas) and `df[2, 2]` (DataFrames.jl).

   Of course this is due to the additional id column in Julia's DataFrame, but I was still expecting to get the same number. Maybe id should be moved to the end?

2. In the following snippet, `df.loc[:, 'x']` (pandas) and `df[:, :x]` (DataFrames.jl):

   `df[:, :x]` is returning an Array. I think it's worth mentioning that `df[:, [:x]]` will return a DataFrame, which is often desirable.

3. Maybe mention `begin` in the following part?

   > A special keyword end can be used to indicate the last index.
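The second and third comments have rough counterparts on the pandas side, sketched below under the assumption of a small made-up frame. Selecting with a scalar label yields a `Series`, selecting with a one-element list yields a `DataFrame`, and position `-1` plays the role that `end` plays in Julia (with position `0` as the counterpart of `begin`).

```python
import pandas as pd

# Hypothetical frame, used purely for illustration.
df = pd.DataFrame({"x": [6, 5, 4], "y": [4, 5, 6]}, index=["a", "b", "c"])

# Scalar column label: a one-dimensional Series comes back.
as_series = df.loc[:, "x"]

# List with one label: the result stays a DataFrame.
as_frame = df.loc[:, ["x"]]

# First and last elements by position (Python is 0-based, -1 is the last).
first_x = df["x"].iloc[0]
last_x = df["x"].iloc[-1]

print(type(as_series).__name__, type(as_frame).__name__, first_x, last_x)
```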

tk3369 · 2020-09-12T18:03:18Z

@KrainskiL Thanks for the suggestions.

> `df[:, :x]` is returning an Array. I think it's worth mentioning that `df[:, [:x]]` will return a DataFrame, which is often desirable.

Is it really? In my use cases, I never select a single column and still want a data frame as a result.

bkamins · 2020-09-12T19:27:45Z

I think we can skip comments 2 and 3 by @KrainskiL, but comment 1 should be fixed, as it is indeed confusing (either by moving the `:id` column to the end or by using a column name instead of a column number, which is recommended practice in general anyway).
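Why either fix removes the confusion can be illustrated in pandas alone. The sketch below is an assumption-laden stand-in (the data and the names `id_first` and `id_last` are hypothetical): it models the extra id column as an ordinary first column, then shows that positional access depends on where that column sits while access by column name does not.

```python
import pandas as pd

# Hypothetical data; "id" stands in for the extra identifier column.
id_first = pd.DataFrame(
    {"id": ["a", "b", "c"], "x": [6, 5, 4], "y": [40, 50, 60]}
)

# Same data with the id column moved to the end.
id_last = id_first[["x", "y", "id"]]

# Positional access: row 1, column 1 lands on a different column each time.
by_position_first = id_first.iloc[1, 1]  # column "x"
by_position_last = id_last.iloc[1, 1]    # column "y"

# Access by column name is unaffected by the column order.
by_name_first = id_first["x"].iloc[1]
by_name_last = id_last["x"].iloc[1]

print(by_position_first, by_position_last, by_name_first, by_name_last)
```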

bkamins · 2020-09-12T20:16:56Z

Thank you for working on it.

Let us get the ball rolling and merge this PR.

If we find anything that needs fixing, let us just open a PR (and, as discussed, please open atomic PRs for the "debatable" things).

tk3369 · 2020-09-13T18:30:41Z

@bkamins I'm having second thoughts about the first two forms, which mutate the existing data frame. Pandas returns a copy, so `transform` is the better match. Shall we remove them? We could also mention the mutating forms in the notes. For that matter, we could also add `transform!`.
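The "first two forms" themselves are not quoted in this thread. On the pandas side, the distinction being discussed looks roughly like the sketch below, which uses a made-up frame: `assign` hands back a new frame, whereas assigning to `df[...]` changes the existing one.

```python
import pandas as pd

# Hypothetical frame, used purely for illustration.
df = pd.DataFrame({"z": [3.0, 4.0, 5.0]})

# Non-mutating: the new column exists only in the returned frame.
with_z1 = df.assign(z1=df["z"] + 1)
cols_after_assign = list(df.columns)

# Mutating: the new column is added to df itself.
df["z2"] = df["z"] + 1
cols_after_setitem = list(df.columns)

print(cols_after_assign, list(with_z1.columns), cols_after_setitem)
```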

bkamins · 2020-09-13T19:18:37Z

You mean that `df.assign(z1 = df['z'] +1)` copies `df`? I would be surprised if this is so (but maybe it does). But if it does not make a copy, then we should indeed use `transform!`, not `transform`. Just please open a PR fixing it. Thank you!

tk3369 · 2020-09-13T22:20:55Z

Well, pandas does make a copy. Same behavior for `rename` and `sort_values`. I'd say that's unintuitive, and this is the time to appreciate Julia's convention of using `!`.

>>> df
   grp  x  y    z
a    1  6  4  3.0
b    2  5  5  4.0
c    1  4  6  5.0
d    2  3  7  6.0
e    1  2  8  7.0
f    2  1  9  NaN
>>> df.assign(z1 = df['z'] + 1)
   grp  x  y    z   z1
a    1  6  4  3.0  4.0
b    2  5  5  4.0  5.0
c    1  4  6  5.0  6.0
d    2  3  7  6.0  7.0
e    1  2  8  7.0  8.0
f    2  1  9  NaN  NaN
>>> df
   grp  x  y    z
a    1  6  4  3.0
b    2  5  5  4.0
c    1  4  6  5.0
d    2  3  7  6.0
e    1  2  8  7.0
f    2  1  9  NaN
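The same claim about `rename` and `sort_values` can be checked in a script. This is a minimal sketch that rebuilds the frame printed above; the construction code is an assumption, since the comment shows only the REPL output.

```python
import pandas as pd

# Rebuild the frame from the REPL output above.
df = pd.DataFrame(
    {
        "grp": [1, 2, 1, 2, 1, 2],
        "x": [6, 5, 4, 3, 2, 1],
        "y": [4, 5, 6, 7, 8, 9],
        "z": [3.0, 4.0, 5.0, 6.0, 7.0, float("nan")],
    },
    index=list("abcdef"),
)

# Both calls return new frames by default (inplace=False).
renamed = df.rename(columns={"x": "x_new"})
sorted_by_x = df.sort_values("x")

# The original keeps its column names and its row order.
print(list(df.columns), list(df.index))
print(list(renamed.columns), list(sorted_by_x.index))
```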

bkamins · 2020-09-14T06:25:37Z

Yes, and I have checked that it is indeed a copy (so there is no column aliasing; what you show does not check for this). Can you then please, as usual 😄, add a fix to the examples in a separate PR?
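How the aliasing check was done is not stated here. One possible way, sketched below with a made-up frame, is behavioral: write into a column of the frame returned by `assign` that also exists in the original, then confirm the original value is untouched.

```python
import pandas as pd

# Hypothetical frame, used purely for illustration.
df = pd.DataFrame({"z": [3.0, 4.0, 5.0]}, index=["a", "b", "c"])

out = df.assign(z1=df["z"] + 1)

# Modify the shared-name column "z" in the result only.
out.loc["a", "z"] = 100.0

# If the columns were aliased, this would now read 100.0.
original_value = df.loc["a", "z"]
changed_value = out.loc["a", "z"]

print(original_value, changed_value)
```

A write-then-read check is used instead of comparing memory addresses because it tests the behavior users actually observe.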

Tom Kwong added 3 commits August 23, 2020 20:07

Add pandas comparison

c31ad28

Replace newline chars

c033ce8

Fix typos

b685d8f

bkamins added the doc label Aug 24, 2020

bkamins added this to the 1.0 milestone Aug 24, 2020

bkamins reviewed Aug 24, 2020 (4 times, docs/src/man/comparisons.md)

Reorgnize pandas section, add more examples.

6be519a

Pandas - Add new section for indexing - Split into common vs. group/agg - Add join operations Dplyr and Stata - Fixed typos and reformated table

bkamins reviewed Aug 24, 2020 (3 times, docs/src/man/comparisons.md)

tk3369 and others added 3 commits August 24, 2020 14:11

Update docs/src/man/comparisons.md

a090bd4

Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>

Update docs/src/man/comparisons.md

ac558c3

Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>

Improve formatting and more concise

0992d98

bkamins reviewed Aug 25, 2020 (docs/src/man/comparisons.md)

nalimilan reviewed Aug 25, 2020 (docs/make.jl, docs/src/man/comparisons.md)

Updated with missing examples

91d739f

bkamins reviewed Aug 25, 2020 (2 times, docs/src/man/comparisons.md)

tk3369 and others added 3 commits August 25, 2020 11:29

Use repeat and vcat array syntax

54cdb95

Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>

Remove Example column

506e804

Merge branch 'tk/pandas-comparison' of https://github.com/tk3369/Data…

8687255

…Frames.jl into tk/pandas-comparison

tk3369 mentioned this pull request Sep 8, 2020

Product of multiple aggregation functions and columns #2419

Open

Add df2 for pandas, revised semi/anti-join notes

039d139

bkamins reviewed Sep 9, 2020 (7 times, docs/src/man/comparisons.md)

bkamins approved these changes Sep 9, 2020

tk3369 and others added 3 commits September 12, 2020 10:50

Incorporate suggestions from @bkamins

32b36b3

JuliaData#2378 (comment)

Apply suggestions from code review

1dbf91c

Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>

Minor text update per suggestion

f6edd64

https://github.com/JuliaData/DataFrames.jl/pull/2378/files#r485369925

tk3369 added 2 commits September 12, 2020 11:08

Move id column to the end

8dedaef

JuliaData#2378 (comment)

Add note about begin keyword

d144bda

bkamins merged commit cc20da7 into JuliaData:master Sep 12, 2020

tk3369 mentioned this pull request Sep 12, 2020

Left align pandas comparison tables #2426

Merged

tk3369 mentioned this pull request Sep 16, 2020

Remove mutating examples for adding new columns #2434

Merged
