Add pandas comparison #2378

Merged
merged 49 commits into from
Sep 12, 2020
Conversation

tk3369
Contributor

@tk3369 tk3369 commented Aug 24, 2020

This PR adds a new section to compare with Python pandas.

A couple of things:

  1. I have fixed some other typos in the dplyr section as well.
  2. I have formatted the markdown table so it's easier on the eyes when editing. If desired, I can apply the same formatting to the dplyr and Stata sections as well (I'm using a VS Code extension, so it's literally "1-click").
  3. I have kept the same use cases as copied from the dplyr section; however, I'm uncertain if this is the "best" list. Maybe we should run the list by more R/Python/Stata users and get some feedback.

Disclaimer: I'm not really proficient with pandas so if anyone can suggest better ways to write the sample code, please feel free to suggest below.

@bkamins bkamins added the doc label Aug 24, 2020
@bkamins bkamins added this to the 1.0 milestone Aug 24, 2020
@bkamins
Member

bkamins commented Aug 24, 2020

Thank you for this PR

> I have fixed some other typos in the dplyr section as well.

Thank you for spotting them.

> I have formatted the markdown table so it's easier on eyes when editing. If desired, I can apply formatting to dplyr and stata sections as well (I'm using a vscode extension so it's literally "1-click").

Sure. Also, maybe add information on how to create this example mini-table for dplyr and Stata as well, for consistency.

> I have kept the same use cases as copied from the dplyr section; however, I'm uncertain if this is the "best" list. Maybe we should run the list by more R/Python/Stata users and get some feedback.

I would try to finalize this PR, and then ask on #data in Slack if people have suggestions.

@nalimilan
Member

Thanks. While you're at it, can you fix two issues with the previous PR? These are 1) add a link to this page in the sidebar, and 2) check that in the generated HTML docs tables are rendered correctly (by running julia docs/make.jl).

@bkamins
Member

bkamins commented Aug 24, 2020

Both are already fixed on master. We only need to update the title to include pandas.

Pandas
- Add new section for indexing
- Split into common vs. group/agg
- Add join operations

Dplyr and Stata
- Fixed typos and reformatted table
@tk3369
Contributor Author

tk3369 commented Aug 24, 2020

@bkamins Thanks for the feedback. I have revamped it further with a whole new subsection for indexing and joins.

tk3369 and others added 3 commits August 24, 2020 14:11
Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>
@tk3369
Contributor Author

tk3369 commented Aug 25, 2020

The latest commit includes

  • Update to the left nav menu as Comparison with Python/R/Stata
  • Keeping only one way to do things in the DataFrames.jl column in the Accessing Data section
  • Removed the note about 1-based index
  • Removed the note about the off-by-1 issue due to the additional id column. I think that might make it more confusing.
  • Updated pandas sample code when calculating mean - using df['x'].mean() rather than df.mean().x
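The two spellings from the last bullet can be compared side by side. This is only a sketch on a hypothetical all-numeric frame (the values loosely follow the frame printed later in this thread, not the exact example table from the docs). `df['x'].mean()` reduces just the one column, while `df.mean().x` reduces every column first and then picks `x`, and it may fail or need `numeric_only=True` on frames with non-numeric columns, depending on the pandas version.

```python
import pandas as pd

# Hypothetical frame for illustration; not the exact table from the docs.
df = pd.DataFrame({
    "grp": [1, 2, 1, 2, 1, 2],
    "x": [6, 5, 4, 3, 2, 1],
    "y": [4, 5, 6, 7, 8, 9],
})

# Select the column first, then reduce it.
mean_direct = df["x"].mean()

# Reduce every column to a Series of means, then pick the entry for x.
mean_via_frame = df.mean().x

print(mean_direct, mean_via_frame)
```

Both give the same number here; the first form simply does less work and does not depend on the other columns being numeric.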

@tk3369
Contributor Author

tk3369 commented Aug 25, 2020

Summary of last commit:

  • Added examples/notes about handling missing data
  • Added note about 1-based indexing
  • Fixed indentation problem

P.S. I haven't been focusing on the dplyr/Stata sections. Once the pandas part is good, we can sync those up. I don't have access to Stata, though, so I will need some help at that point.

@bkamins
Member

bkamins commented Sep 8, 2020

I will squash the PR to a single commit when merging.

@bkamins
Member

bkamins commented Sep 8, 2020

Please let me know when you are done with simplifying this PR and then I will do a review.

@tk3369
Contributor Author

tk3369 commented Sep 9, 2020

Yes, I'm done now.

Member

@bkamins bkamins left a comment

Looks good except for some minor comments that I think should not be problematic.

@bkamins
Member

bkamins commented Sep 9, 2020

@KrainskiL - please run these examples and comment if you see something that confuses you as a user.

@KrainskiL

I've got 3 comments:

  1. Initially I was confused that I received different results by running these two:
     pandas: df.iloc[1, 1]    DataFrames.jl: df[2, 2]

Of course this is due to the additional id column in Julia's DataFrame, but I was still expecting to get the same number. Maybe id should be moved to the end?

  2. In the following snippet:
     pandas: df.loc[:, 'x']    DataFrames.jl: df[:, :x]

df[:, :x] is returning an Array. I think it's worth mentioning that df[:,[:x]] will return a DataFrame, which is often desirable.

  3. Maybe mention begin in the following part?

A special keyword end can be used to indicate the last index.
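For readers coming from the pandas side, the distinction raised in the second comment has a close analogue there: a scalar column label yields a one-dimensional `Series`, while a one-element list of labels yields a `DataFrame`. A minimal sketch on a hypothetical frame (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame for illustration only.
df = pd.DataFrame({"id": list("abcdef"), "x": [6, 5, 4, 3, 2, 1]})

# Scalar label: one-dimensional result (a Series).
as_series = df.loc[:, "x"]

# List of labels: two-dimensional result (a one-column DataFrame).
as_frame = df.loc[:, ["x"]]

print(type(as_series).__name__, type(as_frame).__name__, as_frame.shape)
```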

@tk3369
Copy link
Contributor Author

tk3369 commented Sep 12, 2020

@KrainskiL Thanks for the suggestions.

> df[:, :x] is returning an Array. I think it's worth mentioning that df[:,[:x]] will return a DataFrame, which is often desirable.

Is it really? In my use cases, I never select a single column and still want a data frame as a result.

@bkamins
Member

bkamins commented Sep 12, 2020

I think we can skip comments 2 and 3 by @KrainskiL, but comment 1 should be fixed, as it is indeed confusing (either by moving the :id column to the end or by using a column name instead of a column number, which is the recommended practice in general anyway).
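The confusion in comment 1 can be reproduced within pandas alone. The sketch below uses a hypothetical pair of frames, one keeping `id` as the index and one keeping it as a leading column (roughly the situation on the DataFrames.jl side), and shows that positional access shifts by one column while label-based access does not. The variable names are assumptions for illustration.

```python
import pandas as pd

data = {"grp": [1, 2, 1, 2, 1, 2], "x": [6, 5, 4, 3, 2, 1]}

# id kept as the row index, as in the pandas output shown in this thread.
indexed = pd.DataFrame(data, index=list("abcdef"))

# id kept as an ordinary first column, which shifts every column position by one.
with_id_col = indexed.rename_axis("id").reset_index()

# Positional access: same numbers, different columns.
by_pos_indexed = indexed.iloc[1, 1]      # second row, column x
by_pos_with_id = with_id_col.iloc[1, 1]  # second row, column grp

# Label-based access: the column name picks the same data in both layouts.
by_label_indexed = indexed.loc["b", "x"]
by_label_with_id = with_id_col.loc[1, "x"]

print(by_pos_indexed, by_pos_with_id, by_label_indexed, by_label_with_id)
```

Selecting by column name sidesteps the offset regardless of where the `id` column sits, which is the second option suggested above.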

@bkamins bkamins merged commit cc20da7 into JuliaData:master Sep 12, 2020
@bkamins
Member

bkamins commented Sep 12, 2020

Thank you for working on it.

Let us get the ball rolling and merge this PR.

If we find anything to be fixed let us just open a PR (and also - as discussed - please open atomic PRs for the "debatable" things)

@tk3369
Contributor Author

tk3369 commented Sep 13, 2020

@bkamins I'm having second thoughts about the first two forms, which mutate the existing data frame. Pandas returns a copy, so transform is the closer match. Shall we remove them? We could also mention the mutating forms in the notes. For that matter, we could also add transform!.

@bkamins
Member

bkamins commented Sep 13, 2020

You mean that df.assign(z1 = df['z'] + 1) copies df? I would be surprised if it does (but maybe it does). If it does not make a copy, then we should indeed use transform! rather than transform. Just please open a PR fixing it. Thank you!

@tk3369
Contributor Author

tk3369 commented Sep 13, 2020

Well, pandas does make a copy. Same behavior for rename and sort_values. I'd say that's unintuitive, and this is the time to appreciate Julia's convention of using !.

>>> df
   grp  x  y    z
a    1  6  4  3.0
b    2  5  5  4.0
c    1  4  6  5.0
d    2  3  7  6.0
e    1  2  8  7.0
f    2  1  9  NaN
>>> df.assign(z1 = df['z'] + 1)
   grp  x  y    z   z1
a    1  6  4  3.0  4.0
b    2  5  5  4.0  5.0
c    1  4  6  5.0  6.0
d    2  3  7  6.0  7.0
e    1  2  8  7.0  8.0
f    2  1  9  NaN  NaN
>>> df
   grp  x  y    z
a    1  6  4  3.0
b    2  5  5  4.0
c    1  4  6  5.0
d    2  3  7  6.0
e    1  2  8  7.0
f    2  1  9  NaN

@bkamins
Member

bkamins commented Sep 14, 2020

Yes, and I have checked that it is indeed a copy (so there is no column aliasing, which what you show does not check for). Can you then please - as usual 😄 - add a fix to the examples in a separate PR?
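One way such a check could look is sketched below. It is not necessarily the check that was actually run: it rebuilds the frame printed above, calls `assign`, then writes into a pre-existing column of the result and confirms that the original frame does not see the write. Whether the two frames share memory internally before that write can depend on the pandas version and its copy-on-write setting, so the sketch only tests observable behavior.

```python
import numpy as np
import pandas as pd

# The frame printed earlier in this thread.
df = pd.DataFrame(
    {
        "grp": [1, 2, 1, 2, 1, 2],
        "x": [6, 5, 4, 3, 2, 1],
        "y": [4, 5, 6, 7, 8, 9],
        "z": [3.0, 4.0, 5.0, 6.0, 7.0, np.nan],
    },
    index=list("abcdef"),
)

result = df.assign(z1=df["z"] + 1)

# The original frame does not gain the new column.
original_columns = list(df.columns)

# Aliasing check: write into a column both frames started with,
# then look at the same cell in the original.
result.loc["a", "x"] = 100
x_after = df.loc["a", "x"]

print(original_columns, x_after)
```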
