length(::DataFrame) returns number of columns #1200

omus · 2017-07-17T15:52:42Z

Currently calling length on a DataFrame returns the number of columns. This is strange as length usually returns the number of elements.
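For comparison, a minimal sketch of what length and size report for ordinary Base containers. It uses only Base; nothing here is DataFrames-specific.

```julia
# A 3x2 matrix: length counts elements, size reports the dimensions.
A = zeros(3, 2)
n_elements = length(A)   # 6
dims = size(A)           # (3, 2)

# A vector of vectors: length counts the outer elements only.
nested = [[1, 2, 3], [4, 5, 6]]
n_outer = length(nested) # 2
```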

omus · 2017-07-17T15:52:58Z

cc: @ararslan @andyferris

ararslan · 2017-07-17T17:43:37Z

Yeah this is really weird. We probably shouldn't define length at all.

kmsquire · 2017-07-17T22:31:11Z

There was debate about this when it was first added, mostly between those coming from an R background (to whom, I think, the current definition made sense) and those coming from a Pandas background (where length is the number of rows). So what makes the most sense probably depends on what you've used before.

ararslan · 2017-07-17T22:37:05Z

Having it be inconsistent between languages is another reason not to define it here, IMO. Then it confuses no one. 🙂

andyferris · 2017-07-18T00:25:46Z

If we want to think of a dataframe in the relational algebra sense (as a collection of named tuples, i.e. rows), then iterating over rows and having length for the number of rows makes sense to me.

There has been a lot of discussion about this surrounding Jeff's NamedTuple pull request (partly because it is infrastructure for making such iteration fast).
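A minimal sketch of the relational view, assuming Julia 0.7 or later where NamedTuple is in Base. The table here is just a Vector of named tuples, used for illustration; it is not how DataFrames stores data.

```julia
# Each element is one row; the collection as a whole is the table.
rows = [(a = 1, b = "x"), (a = 2, b = "y"), (a = 3, b = "z")]

# Under this view, length is the number of rows.
n_rows = length(rows)

# Iterating yields rows, and fields are accessed by name.
a_values = [row.a for row in rows]
```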

rofinn · 2017-07-18T05:12:22Z

Given that more descriptive methods such as size, nrow and ncol exist (could be better documented though) I don't really see a reason to keep length if there's a debate about what it should return.

andyferris · 2017-07-18T05:18:42Z

It goes with iteration, so if you can't iterate a DataFrame then you shouldn't have a length.

rofinn · 2017-07-18T05:34:02Z

I'm not sure length even needs to go with iteration. For example, we can iterate over a Channel which doesn't provide a length method either.

ararslan · 2017-07-18T05:50:52Z

length is an optional part of the iteration protocol, per the documentation. I guess we could have length defined on the EachRow or whatever iterator types we define for rows/columns, though it doesn't really seem useful there.
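A minimal sketch of an iterator that supports iteration without defining length. It is written against the Julia 1.x iterate protocol rather than the start/next/done protocol current when this thread began, and Countdown is a hypothetical type used only for illustration.

```julia
# Hypothetical iterator type; not part of DataFrames.jl or Base.
struct Countdown
    start::Int
end

# Iteration protocol: return (item, next_state), or nothing when done.
Base.iterate(c::Countdown, state = c.start) =
    state < 1 ? nothing : (state, state - 1)

# Declare that the size is unknown, so collect does not call length.
Base.IteratorSize(::Type{Countdown}) = Base.SizeUnknown()
Base.eltype(::Type{Countdown}) = Int

values = collect(Countdown(3))                  # iterates fine without length
has_length = applicable(length, Countdown(3))   # no length method is defined
```

Channel takes the same approach in Base: it can be iterated but has no length method.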

nalimilan · 2017-07-19T14:04:27Z

length(df) is consistent with the fact that df[1] returns the first column. We could remove both and require writing df[:, 1].

nrow and ncol should probably be deprecated too, cf. #406.

rofinn · 2017-07-19T16:10:29Z

> length(df) is consistent with the fact that df[1] returns the first column. We could remove both and require writing df[:, 1].

Yeah, I recall that confusing me the first time I used dataframes cause I figured df[1] would give me the first row.

ararslan · 2017-07-19T18:48:03Z

Okay, so the plan as I understand it:

Deprecate length in favor of nothing
Deprecate linear indexing into a DataFrame in favor of two indices
Deprecate nrow/ncol in favor of size

nalimilan · 2017-07-19T23:55:47Z

Actually I'm afraid removing the df[:a] syntax would be too annoying. We have even considered supporting df.a once/if getfield can be overloaded. Don't both R and Pandas support it?

rofinn · 2017-07-20T00:29:39Z

> Don't both R and Pandas support it?

Yes, but pandas determines whether that is a row or col based on what you give it.

>>> df = pandas.DataFrame({ 'A' : 1., 'B' : pandas.Series(1,index=list(range(4)),dtype='float32'),})
>>> df
     A    B
0  1.0  1.0
1  1.0  1.0
2  1.0  1.0
3  1.0  1.0
>>> df[:1]
     A    B
0  1.0  1.0
>>> df["A"]
0    1.0
1    1.0
2    1.0
3    1.0
Name: A, dtype: float64

If we restricted column names to Symbols (or automatically converted) then we could always return columns for Symbol and row for Int?
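The Symbol-versus-Int idea can be sketched with dispatch on the index type. ToyFrame is a hypothetical stand-in defined only for this example, and returning a NamedTuple for a row is an assumption, not DataFrames behavior.

```julia
# Hypothetical column-oriented table; not the DataFrames.jl type.
struct ToyFrame
    names::Vector{Symbol}
    columns::Vector{Vector}
end

# A Symbol index selects a column by name.
Base.getindex(df::ToyFrame, name::Symbol) =
    df.columns[findfirst(isequal(name), df.names)]

# An Int index selects a row, returned here as a NamedTuple.
Base.getindex(df::ToyFrame, i::Int) =
    NamedTuple{Tuple(df.names)}(Tuple(col[i] for col in df.columns))

df = ToyFrame([:a, :b], Vector[[1, 2, 3], ["x", "y", "z"]])
column_a = df[:a]
first_row = df[1]
```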

ararslan · 2017-07-20T01:31:00Z

> Actually I'm afraid removing the df[:a] syntax would be too annoying.

By "linear indexing," I meant specifically with a number. It's not immediately obvious what df[1] is, but df[:a] is perfectly clear.

nalimilan · 2017-07-20T07:24:34Z

Interesting. Honestly, I find Pandas' behavior really confusing: returning either a row or a column depending on the argument type is too clever for my taste. We could stop supporting df[1], since it's indeed less explicit than df[:a], but I'm not sure it would really improve things. At least for now it's consistent with how NamedArray works and how NamedTuple will work, and it reflects the fact that columns are ordered.

OTOH we can deprecate nrow/ncol independently of this issue.

quinnj · 2017-09-08T03:29:08Z

Ok, PR up at #1224. Deprecates length, nrow, and ncol in favor of size. Bit of a pain, but hopefully will be cleaner and simpler going forward.
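A rough sketch of what deprecating length in favor of size can look like. This is not the code from the PR: ToyFrame is a hypothetical type, and the warning text is made up for illustration. Only Base.size, Base.length and Base.depwarn are real Base functions.

```julia
# Hypothetical column-oriented table; not the DataFrames.jl type.
struct ToyFrame
    names::Vector{Symbol}
    columns::Vector{Vector}
end

# size reports (rows, columns), as it does for a Matrix.
function Base.size(df::ToyFrame)
    ncols = length(df.columns)
    nrows = ncols == 0 ? 0 : length(df.columns[1])
    return (nrows, ncols)
end
Base.size(df::ToyFrame, dim::Integer) = size(df)[dim]

# The old entry point keeps its old meaning (number of columns) but
# emits a deprecation warning when deprecation warnings are enabled.
function Base.length(df::ToyFrame)
    Base.depwarn("`length(df)` is deprecated, use `size(df, 2)` instead.", :length)
    return size(df, 2)
end

df = ToyFrame([:a, :b], Vector[[1, 2, 3], ["x", "y", "z"]])
n_rows = size(df, 1)   # replaces nrow(df)
n_cols = size(df, 2)   # replaces ncol(df) and length(df)
```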

Wikunia · 2017-12-11T18:59:26Z

@nalimilan I love pandas for being that clever 😄 There is some stuff which just seems weird to me in DataFrames.jl
Actually there are a lot more people working with pandas than with DataFrames. Maybe it's not the worst choice to be compatible for people who have experience with pandas which I suppose are a lot of people.

nalimilan · 2017-12-12T08:58:59Z

> Actually there are a lot more people working with pandas than with DataFrames. Maybe it's not the worst choice to be compatible for people who have experience with pandas which I suppose are a lot of people.

The policy generally followed by Julia packages is to try to find a consistent design which makes sense for users once they are familiar with the package. We don't generally support features just because they sound "natural" to people used to other software (but of course we prefer being consistent when that doesn't hurt). Also there are lots of people coming from other software (e.g. R/dplyr/data.table), and what they find "natural" is often mutually exclusive.

I think the way forward here is that once field overloading is available in Base (JuliaLang/julia#24960), we deprecate df[:col] in favor of df.col, so that length(df) can be deprecated in favor of size(df, 1) or nrow(df). Then we can discuss whether df[1] should be an error or whether it should return the first row, in which case iterating over a DataFrame should also return rows (as NamedTuple objects).

Wikunia · 2017-12-12T16:12:30Z

First of all I agree that overloading will make it easier and the general policy is reasonable. I'm wondering whether it is necessary to not support df[:col] anymore. I think it doesn't harm anyone if it works.
nrow and ncol seem nice in my opinion; length and width would also work ;)
df[1] is probably a bit more challenging than df[:col] as there might be two different outcomes.

nickeubank · 2018-07-12T01:54:11Z

> I think the way forward here is that once field overloading is available in Base (JuliaLang/julia#24960), we deprecate df[:col] in favor of df.col, so that length(df) can be deprecated in favor of size(df, 1) or nrow(df).

Does assigning to a field work in julia? e.g. can one still do:

df = DataFrame(a = [1, 2, 3], b = ["a", "b", "c"]) 
df.c = 1:3

the way one can now do

df = DataFrame(a = [1, 2, 3], b = ["a", "b", "c"]) 
df[:c] = 1:3

? I know in pandas that created a real gotcha -- you can pull a column with the dot-notation, but you couldn't set using it. If you try, it created a new property, but not a column, and then you couldn't find it again...

Also, in a similar vein, note that the dot-field notation causes problems with spaces in column names that are easier to address with the current df[Symbol("First Name")] type notation.

pdeffebach · 2018-07-12T02:20:05Z

Deprecating df[:a] wouldn't be great because then you would have to replace df[x] with getfield(df, x) if x = :a.

nalimilan · 2018-07-12T08:46:44Z

On Julia 0.7 you can use df.c = 1:3 on current DataFrames master. But indeed that doesn't completely replace df[col]/df[:, col] for situations where col isn't a literal symbol without spaces. The question is then: is it OK to deprecate df[col] in favor of df[:, col], or is it too annoying for these cases?
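A minimal sketch of how property overloading makes both reading and assigning df.c possible on Julia 0.7 and later. ToyTable is a hypothetical type, not the DataFrames implementation. It also shows that getproperty and setproperty! accept a Symbol held in a variable, including a name with a space.

```julia
# Hypothetical toy table, not the DataFrames.jl implementation.
struct ToyTable
    columns::Dict{Symbol, Vector}
end

# getfield reaches the real field and avoids recursing into getproperty.
Base.getproperty(t::ToyTable, name::Symbol) = getfield(t, :columns)[name]

function Base.setproperty!(t::ToyTable, name::Symbol, values)
    getfield(t, :columns)[name] = collect(values)
    return values
end

t = ToyTable(Dict{Symbol, Vector}(:a => [1, 2, 3]))
t.c = 1:3                       # stored as a real column, not a stray attribute

# Column names that are not literal identifiers need the function form.
col = Symbol("First Name")
setproperty!(t, col, ["Ann", "Bob", "Cy"])
first_names = getproperty(t, col)
```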

pdeffebach · 2018-07-12T11:49:09Z

I feel like I rarely work with the symbols themselves. All of my cleaning is in for loops or functions. So the easier it is to refer to columns with a variable the better.

nickeubank · 2018-07-12T18:10:09Z

IMHO I'm of a similar view as @pdeffebach.

My view is that (a) pulling out one column is common enough we need a compact way to do it, and (b) I don't think the dot-field notation is a good substitute for the square-bracket-column-symbol notation.

The problem, in my view, is that dot-field notation is fine for objects with stable field names (like graph.vertices in a graph object), but column names are inherently unstable in DataFrames. I don't like the dot-field notation there: it encourages non-generalizable code, since you have to hard-code the field names. That seems contrary to the Julia style guidelines. And I don't like the idea of having one syntax for one's own scripts and another for generalizable code.

So I think we should keep support for df[x] / df[:colname] / df[[:colname]] etc. I'm fine with the view above we should stop supporting numeric indexing into the columns this way (e.g. df[2]) and boolean indexing (df[[true, false, false]]), but I think just keeping "pass symbols, get columns" for square brackets is unlikely to cause confusion. I think if we also had column names (as in pandas) I agree it might be confusing, but as symbols only refer to columns in DataFrames, I think it's pretty clear.

nalimilan · 2018-07-12T19:20:35Z

If we support df[:colname], we may as well support df[1]. It would be weird to reject integers for this syntax but not for df[:, 1], just because Pandas happens to do something completely weird. Also, the similarity with NamedTuple is appealing.

nickeubank · 2018-07-12T19:31:45Z

OK -- I'm totally ok with using square-brackets as "indexing into columns". I just meant I have stronger feelings about losing ability to use symbols than losing ability to do numeric indexing into columns. @nalimilan You've sold me on not doing something pandas-like with sometimes-row-indexing. :)

(EDITS: lots of sloppy typos)

pdeffebach · 2018-07-22T18:13:59Z

I've been playing around with a rowwise command that applies a function to each row of a dataframe, returning a vector of length nrow(df), like Stata's egen x = rowmean(v1 v2...).

With the way dataframes is set up, it's difficult to make this performant, since we will have to collect (maybe not with collect) each row, and rows may have heterogeneous types. mapslices, which acts on matrices, is very fast, on the other hand.

This is fine, because row-wise operations, while I think important enough to live in DataFrames, are relatively uncommon, and DataFrame's structure is well-optimized for column-oriented operations, which is the dominant use-case.

I guess my point is that if people expect something that acts on rows to be as easy and fast as mapslices, they are going to be frustrated. So it's better to have an API that differentiates itself more from generic matrix-like functions. In the end, this is just a vote for nrow and ncol instead of size, but the principle can apply more broadly.
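A small sketch of the contrast described above: a row mean over a homogeneous matrix with mapslices, next to the same computation over separately stored columns. The column layout is a simplified stand-in for a data frame, not the DataFrames internals, and no performance claim is made for either form.

```julia
using Statistics  # standard library, provides mean

# Homogeneous matrix: mapslices applies mean along each row (dims = 2).
A = [1.0 2.0; 3.0 4.0; 5.0 6.0]
matrix_means = vec(mapslices(mean, A; dims = 2))

# Column-oriented storage with differing element types per column:
# each row has to be gathered from every column before applying mean.
columns = Vector[[1.0, 3.0, 5.0], [2, 4, 6]]
nrows = length(columns[1])
row_means = [mean(col[i] for col in columns) for i in 1:nrows]
```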

rofinn added the decision label Sep 7, 2017

quinnj added a commit that referenced this issue Sep 8, 2017

Deprecate length, nrow, and ncol on DataFrames in favor of size. Fixes …

15c3b37

…#1200

quinnj mentioned this issue Sep 8, 2017

Deprecate length, nrow, and ncol on DataFrames in favor of size. Fixe… #1224

Closed

spurll mentioned this issue Sep 11, 2017

isempty checks number of columns, rather than number of rows #1230

Closed

rofinn mentioned this issue Sep 12, 2017

isempty(df) should return true if either dimension == 0. #1231

Merged

ararslan mentioned this issue Dec 11, 2017

sub df by row or column #1313

Closed

rofinn mentioned this issue Feb 1, 2018

Strict column names #1348

Closed

nalimilan mentioned this issue Apr 12, 2018

Add permutecols!(df, p) to allow column reordering #1395

Merged

This was referenced Jul 11, 2018

Scalar indexing by row should return a DataFrameRow #1400

Closed

Feature Request: subset rows when boolean vector passed alone #1445

Closed

nalimilan mentioned this issue Jul 21, 2018

Functions for column- and row-wise processing #956

Closed

nalimilan mentioned this issue Sep 18, 2018

Review row vs. column orientation of API #1514

Closed

6 tasks

nalimilan mentioned this issue Nov 10, 2018

Deprecate length(df::AbstractDataFrame) in favor of size(df, 2) #1591

Merged

nalimilan closed this as completed in #1591 Nov 13, 2018
