Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

isempty checks number of columns, rather than number of rows #1230

Closed
spurll opened this issue Sep 11, 2017 · 12 comments
Closed

isempty checks number of columns, rather than number of rows #1230

spurll opened this issue Sep 11, 2017 · 12 comments

Comments

@spurll
Copy link
Contributor

spurll commented Sep 11, 2017

Related to #1200.

Compare:

julia> arr = Array{Any}((0,2))
0×2 Array{Any,2}

julia> isempty(arr)
true

with

julia> df = DataFrame(col1=[], col2=[])
0×2 DataFrames.DataFrame


julia> isempty(df)
false

It seems clear to me that a DataFrame with zero rows should be considered empty.

FWIW, @quinnj's PR from a few days ago #1224 fixes this as well, but this particular issue seems less controversial than how we define length.

@quinnj
Copy link
Member

quinnj commented Sep 11, 2017

I think it's got to be a wholesale switch from viewing DataFrames as a columnar datastore to more of a "bag of tuples" definition (without needing to change the underlying actual representation obviously). The current view I think grew out of viewing a DataFrame as a "data-smart" Matrix (Array{T, 2}), which is column-oriented.

@rofinn
Copy link
Member

rofinn commented Sep 11, 2017

I'd argue that the current behaviour is still broken even from the "data-smart" Matrix perspective though.

@nalimilan
Copy link
Member

Yes, what justifies the current behavior is rather the definition of DataFrame as a vector of columns (which also justifies the behavior of df[i] and of length). That's clearly not very natural, the only issue with getting rid of this representation is that df[i] is quite convenient compared to df[:, i]. Maybe we could keep it even if we fix isempty and length.

@rofinn
Copy link
Member

rofinn commented Sep 11, 2017

@nalimilan I don't want to derail this issue, but when (or how often) do you want to get a column based in its integer index? I don't think I've ever wanted to do that, but that could just be my use cases.

@quinnj
Copy link
Member

quinnj commented Sep 11, 2017

FWIW, I do column integer-indexing all the time, but that's because in my workflows, I code towards integer indexing instead of symbol indexing; in my mind it's faster because I can avoid the extra indirection lookup of symbol=>integer, but that extra cost is probably negligible in production. Anyway, I'd re-iterate again though that I think we need to commit to either a column-oriented or row-oriented representation, regardless of the internal implementation. Currently things are (mostly) consistent for a column-orientation, but there's obviously some desire to switch that. For example, if we switch to a row-orientation, I would definitely expect df[1] to give me the first row instead of the first column.

@ararslan
Copy link
Member

Either way, isn't it clearer to write df[:,i] and df[i,:] so it's immediately obvious what you're asking for?

@nalimilan
Copy link
Member

@nalimilan I don't want to derail this issue, but when (or how often) do you want to get a column based in its integer index? I don't think I've ever wanted to do that, but that could just be my use cases.

@rofinn i wasn't necessarily an integer index in my example, it could have been a symbol too. I think the problem is the same.

@ararslan I agree df[:, i] is clearer than df[i], but it's less convenient to type, which is annoying since that's sometime you need to type all the time. Maybe with things like Query it shouldn't be as common as in R, though. If we get field overloading, we could use df.i instead (which I think is the reason why column names are required to be valid identifiers).

@nalimilan
Copy link
Member

Anyway, I'd re-iterate again though that I think we need to commit to either a column-oriented or row-oriented representation, regardless of the internal implementation. Currently things are (mostly) consistent for a column-orientation, but there's obviously some desire to switch that. For example, if we switch to a row-orientation, I would definitely expect df[1] to give me the first row instead of the first column.

Seeing how people disagree on what's the most natural orientation, I'd rather make DataFrame orientation-agnostic and require people to be explicit about what they want, e.g. using for r in eachrow(df) / for r in eachcol(df) or df[i, :]/df[i, :].

@ararslan
Copy link
Member

Discussion of notation aside, I think a 0-row DataFrame should be isempty regardless of whether DataFramess are column- or row-oriented. It seems a natural definition of emptiness, even if it doesn't correspond directly to length(df) == 0 (though when that's true we'd also necessarily have isempty).

@spurll
Copy link
Contributor Author

spurll commented Sep 12, 2017

I could put together a separate PR to address this in the morning, if there's interest.

@rofinn
Copy link
Member

rofinn commented Sep 12, 2017

Oops, I missed your comment @spurll.

@spurll
Copy link
Contributor Author

spurll commented Sep 12, 2017

Hey, I would have done it, but I'm chairing a board meeting right now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants