length(::DataFrame) returns number of columns #1200

Closed
omus opened this issue Jul 17, 2017 · 28 comments

@omus (Member) commented Jul 17, 2017

Currently calling length on a DataFrame returns the number of columns. This is strange as length usually returns the number of elements.
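
For illustration, a minimal sketch of the behavior being reported (DataFrames.jl as of this issue; later releases deprecated length(::DataFrame) following this discussion, so exact output varies by version):

using DataFrames
df = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])  # 3 rows, 2 columns
length(df)  # returned 2, the number of columns, rather than a count of rows or elements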

@omus (Member, Author) commented Jul 17, 2017

cc: @ararslan @andyferris

@ararslan (Member)

Yeah this is really weird. We probably shouldn't define length at all.

@kmsquire (Contributor)

There was debate about this when it was first added, mostly between those coming from an R background (to whom, I think, the current definition made sense) and those coming from a Pandas background (where length is the number of rows). So what makes the most sense probably depends on what you've used before.

@ararslan (Member)

Having it be inconsistent between languages is another reason not to define it here, IMO. Then it confuses no one. 🙂

@andyferris (Member)

If we want to think of a dataframe in the relational algebra sense (as a collection of named tuples, i.e. rows), then iterating over rows and having length for the number of rows makes sense to me.

There has been a lot of discussion about this surrounding Jeff's NamedTuple pull request (partly because it is infrastructure for making such iteration fast).
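
For a concrete picture of those row-oriented semantics, here is a sketch using the eachrow iterator that DataFrames.jl later gained (written with hindsight; this API did not exist when the comment was made):

using DataFrames
df = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])
for row in eachrow(df)           # iterate over rows
    println(row.a, " - ", row.b) # each row acts like a named tuple of that row's values
end
length(eachrow(df)) == nrow(df)  # the row iterator's length is the number of rows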

@rofinn (Member) commented Jul 18, 2017

Given that more descriptive methods such as size, nrow and ncol exist (could be better documented though) I don't really see a reason to keep length if there's a debate about what it should return.
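
For reference, the more descriptive accessors mentioned here:

using DataFrames
df = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])
size(df)   # (3, 2) — (number of rows, number of columns)
nrow(df)   # 3
ncol(df)   # 2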

@andyferris (Member)

It goes with iteration, so if you can't iterate a DataFrame then you shouldn't have a length.

@rofinn (Member) commented Jul 18, 2017

I'm not sure length even needs to go with iteration. For example, we can iterate over a Channel which doesn't provide a length method either.
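
A small sketch of that point: a Channel can be iterated even though its number of elements is not known up front, so knowing the length is clearly not a prerequisite for iteration.

ch = Channel() do c               # producer task feeds the channel
    foreach(i -> put!(c, i), 1:3)
end
for x in ch                       # iteration works without the collection advertising a length
    println(x)
end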

@ararslan (Member)

length is an optional part of the iteration protocol, per the documentation. I guess we could have length defined on the EachRow or whatever iterator types we define for rows/columns, though it doesn't really seem useful there.

@nalimilan (Member)

length(df) is consistent with the fact that df[1] returns the first column. We could remove both and require writing df[:, 1].

nrow and ncol should probably be deprecated too, cf. #406.

@rofinn (Member) commented Jul 19, 2017

length(df) is consistent with the fact that df[1] returns the first column. We could remove both and require writing df[:, 1].

Yeah, I recall that confusing me the first time I used DataFrames, because I figured df[1] would give me the first row.

@ararslan (Member)

Okay, so the plan as I understand it:

  1. Deprecate length in favor of nothing
  2. Deprecate linear indexing into a DataFrame in favor of two indices
  3. Deprecate nrow/ncol in favor of size
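
A hedged sketch of what user code would look like under that plan (the replacements shown are the explicit forms discussed in this thread):

using DataFrames
df = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])
size(df, 2)               # instead of length(df): say which dimension you mean
df[:, 1]                  # instead of df[1]: give both row and column indices
size(df, 1), size(df, 2)  # instead of nrow(df), ncol(df)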

@nalimilan (Member) commented Jul 19, 2017

Actually I'm afraid removing the df[:a] syntax would be too annoying. We have even considered supporting df.a once/if getfield can be overloaded. Don't both R and Pandas support it?

@rofinn (Member) commented Jul 20, 2017

Don't both R and Pandas support it?

Yes, but pandas determines whether that is a row or a column based on what you give it.

>>> df = pandas.DataFrame({ 'A' : 1., 'B' : pandas.Series(1,index=list(range(4)),dtype='float32'),})
>>> df
     A    B
0  1.0  1.0
1  1.0  1.0
2  1.0  1.0
3  1.0  1.0
>>> df[:1]
     A    B
0  1.0  1.0
>>> df["A"]
0    1.0
1    1.0
2    1.0
3    1.0
Name: A, dtype: float64

If we restricted column names to Symbols (or automatically converted them), then we could always return a column for a Symbol and a row for an Int?
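
A toy illustration of that dispatch idea (purely hypothetical — ToyTable is not DataFrames.jl code): a Symbol index picks a column, an integer index picks a row.

# Hypothetical sketch only: a toy column store that dispatches on the index type.
struct ToyTable{NT<:NamedTuple}
    cols::NT
end
Base.getindex(t::ToyTable, name::Symbol) = t.cols[name]              # Symbol -> whole column
Base.getindex(t::ToyTable, i::Integer) = map(col -> col[i], t.cols)  # Int -> one row, as a NamedTuple

t = ToyTable((A = [1.0, 1.0, 1.0, 1.0], B = [1.0, 1.0, 1.0, 1.0]))
t[:A]  # -> [1.0, 1.0, 1.0, 1.0]
t[2]   # -> (A = 1.0, B = 1.0)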

@ararslan (Member)

Actually I'm afraid removing the df[:a] syntax would be too annoying.

By "linear indexing," I meant specifically with a number. It's not immediately obvious whatdf[1] is, but df[:a] is perfectly clear.

@nalimilan (Member) commented Jul 20, 2017

Interesting. Honestly, I find Pandas' behavior really confusing: returning either a row or a column depending on the argument type is too clever for my taste. We could stop supporting df[1], since it's indeed less explicit than df[:a], but I'm not sure it would really improve things. At least for now it's consistent with how NamedArray works and how NamedTuple will work, and it reflects the fact that columns are ordered.

OTOH we can deprecate nrow/ncol independently of this issue.

@quinnj (Member) commented Sep 8, 2017

Ok, PR up at #1224. Deprecates length, nrow, and ncol in favor of size. Bit of a pain, but hopefully will be cleaner and simpler going forward.

@Wikunia commented Dec 11, 2017

@nalimilan I love pandas for being that clever 😄 There is some stuff in DataFrames.jl which just seems weird to me.
Actually there are a lot more people working with pandas than with DataFrames. Maybe it's not the worst choice to be compatible for people who have experience with pandas, which I suppose is a lot of people.

@nalimilan (Member)

Actually there are a lot more people working with pandas than with DataFrames. Maybe it's not the worst choice to be compatible for people who have experience with pandas, which I suppose is a lot of people.

The general policy followed by Julia packages is to try to find a consistent design which makes sense for users once they are familiar with the package. We don't generally support features just because they sound "natural" to people used to other software (but of course we prefer being consistent when that doesn't hurt). Also there are lots of people coming from other software (e.g. R/dplyr/data.table), and what they find "natural" is often mutually exclusive.

I think the way forward here is that once field overloading is available in Base (JuliaLang/julia#24960), we deprecate df[:col] in favor of df.col, so that length(df) can be deprecated in favor of size(df, 1) or nrow(df). Then we can discuss whether df[1] should be an error or whether it should return the first row, in which case iterating over a DataFrame should also return rows (as NamedTuple objects).
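
For reference, a sketch of what that accessor side looks like in later DataFrames.jl releases (written with hindsight, not part of the original plan's wording):

using DataFrames
df = DataFrame(col = [10, 20, 30])
df.col                 # field-style column access, playing the role of the old df[:col]
nrow(df), size(df, 1)  # both give the number of rows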

@Wikunia commented Dec 12, 2017

First of all, I agree that overloading will make it easier and that the general policy is reasonable. I'm wondering whether it is necessary to stop supporting df[:col], though; I think it doesn't harm anyone if it keeps working.
nrow and ncol seem nice in my opinion; length and width would also work ;)
df[1] is probably a bit more challenging than df[:col], as there might be two different expected outcomes.

@nickeubank (Contributor)

I think the way forward here is that once field overloading is available in Base (JuliaLang/julia#24960), we deprecate df[:col] in favor of df.col, so that length(df) can be deprecated in favor of size(df, 1) or nrow(df).

Does assigning to a field work in Julia? E.g., can one still do:

df = DataFrame(a = [1, 2, 3], b = ["a", "b", "c"]) 
df.c = 1:3

the way one can now do

df = DataFrame(a = [1, 2, 3], b = ["a", "b", "c"]) 
df[:c] = 1:3

? I know in pandas that created a real gotcha -- you can pull a column with the dot notation, but you can't set one with it. If you try, it creates a new attribute rather than a column, and then you can't find it again...

Also, in a similar vein, note that the dot-field notation causes problems with spaces in column names, which are easier to handle with the current df[Symbol("First Name")]-style notation.

@pdeffebach (Contributor)

Deprecating df[:a] wouldn't be great because then you would have to replace df[x] with getfield(df, x) if x = :a.
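
A small sketch of that concern (on Julia 0.7+ the dot syntax lowers to getproperty rather than getfield, but the inconvenience with a variable column name is the same):

using DataFrames
df = DataFrame(a = [1, 2, 3])
x = :a              # column name held in a variable
getproperty(df, x)  # what df.a lowers to — the explicit call you would need if only dot access existed
df[!, x]            # bracket indexing (here in its later two-argument form) takes the variable directly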

@nalimilan (Member)

On Julia 0.7 you can use df.c = 1:3 on current DataFrames master. But indeed that doesn't completely replace df[col]/df[:, col] for situations where col isn't a literal symbol without spaces. The question is then: is it OK to deprecate df[col] in favor of df[:, col], or is it too annoying for these cases?

@pdeffebach (Contributor)

I feel like I rarely work with the symbols themselves. All of my cleaning is in for loops or functions, so the easier it is to refer to columns with a variable, the better.

@nickeubank (Contributor) commented Jul 12, 2018

IMHO I'm of a similar view as @pdeffebach.

My view is that (a) pulling out one column is common enough we need a compact way to do it, and (b) I don't think the dot-field notation is a good substitute for the square-bracket-column-symbol notation.

The problem, in my view, is that dot-field notation is fine for objects with stable field names (like graph.vertices in a graph object), but column names are inherently unstable in DataFrames, so I don't like the dot-field notation: it encourages non-generalizable code, because you have to hard-code the field names into your scripts. That seems contrary to the Julia style guidelines. And I don't like the idea of having one syntax for one's own scripts and another for generalizable code.

So I think we should keep support for df[x] / df[:colname] / df[[:colname]] etc. I'm fine with the view above that we should stop supporting numeric indexing into the columns this way (e.g. df[2]) and boolean indexing (df[[true, false, false]]), but I think just keeping "pass symbols, get columns" for square brackets is unlikely to cause confusion. If we also had column names work as in pandas, I agree it might be confusing, but since symbols only refer to columns in DataFrames, I think it's pretty clear.

@nalimilan (Member)

If we support df[:colname], we may as well support df[1]. It would be weird to reject integers for this syntax but not for df[:, 1], just because Pandas happens to do something completely weird. Also, the similarity with NamedTuple is appealing.

@nickeubank (Contributor) commented Jul 12, 2018

OK -- I'm totally OK with using square brackets as "indexing into columns". I just meant I have stronger feelings about losing the ability to use symbols than about losing the ability to do numeric indexing into columns. @nalimilan You've sold me on not doing something pandas-like with sometimes-row-indexing. :)

(EDITS: lots of sloppy typos)

@pdeffebach (Contributor)

I've been playing around with a rowwise command that applies a function to each row of a dataframe, returning a vector of length nrow(df), like Stata's egen x = rowmean(v1 v2...).

With the way DataFrames is set up, it's difficult to make this performant, since we would have to collect (maybe not literally with collect) each row, and rows may have heterogeneous types. mapslices, which acts on matrices, is very fast, on the other hand.

This is fine, because row-wise operations, while I think important enough to live in DataFrames, are relatively uncommon, and DataFrame's structure is well-optimized for column-oriented operations, which is the dominant use-case.

I guess my point is that if people expect something that acts on rows to be as easy and fast as mapslices, they are going to be frustrated. So it's better to have an API that differentiates itself more from generic matrix-like functions. In the end, this is just a vote for nrow and ncol instead of size, but the principle can apply more broadly.
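
To make that trade-off concrete, a hedged sketch of the two routes for a row-wise mean, written against later DataFrames.jl releases (rowmean here is just a variable name, not a proposed API):

using DataFrames, Statistics
df = DataFrame(v1 = [1.0, 2.0, 3.0], v2 = [4.0, 5.0, 6.0])

# Row-oriented route: flexible, but each row mixes column types, which is hard to make fast in general.
rowmean = [mean((r.v1, r.v2)) for r in eachrow(df)]

# Matrix route (mapslices-style): fast, but only once the columns share a common element type.
rowmean_via_matrix = vec(mean(Matrix(df), dims = 2))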
