Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unifying the ecosystem around observations-as-row convention #114

Open
mkborregaard opened this issue Feb 3, 2020 · 9 comments
Open

unifying the ecosystem around observations-as-row convention #114

mkborregaard opened this issue Feb 3, 2020 · 9 comments

Comments

@mkborregaard
Copy link

mkborregaard commented Feb 3, 2020

In statistical analysis, e.g. with DataFrames.jl, it is a convention that columns equal variables, rows equal observations. The same is true for the entire R analytical ecosystem (statistics + machine learning) + the entire Python analytical ecosystem (stats + machine learning) + matlab. I believe it is also the case in most cases in the rest of the Julia ecosystem. For DataFrames this is the only obvious choice, as the same observation (row) can contain multiple types; and given that different columns can have different types, observations-are-rows is the only feasible way. Not true for Matrixes, which are the same type in both directions, so in principle columns can be interpreted as observations for Matrices.

In some parts of the Julia ecosystem that are built only for Matrix{<:Number}, that is actually done - the convention is flipped. This is true for the kmeans algorithm in Clustering (JuliaStats/Clustering.jl#79), and for NearestNeighbours. It also used to be the case for standardize in StatsBase and for Distances, but a deprecation has just been added requiring an explicit dims argument (e.g. https://github.com/JuliaStats/StatsBase.jl/pull/490/files), which is the first step towards remedying this. And, it is also the case here in MultivariateStats - e.g. you'll need to transpose the data if you get them from a DataFrame, in order to do a PCA on them.

On the other hand, the stdlib Statistics (e.g. functions cov and cor), StatsModels, GLM, MixedModels, Plots, the MLJ machine learning library, and Turing all use the more widely accepted standard of observations-as-rows.

I think we should standardize on using the same convention across the ecosystem, and preferably the same standard as everyone else. The main arguments against has been 1) that Machine-learning often uses columns-as-observations, and 2) that because Julia is column-major, analyses of all variables for an observation should be faster.

Ad 1), I'd like to question that assertion - most machine-learning frameworks I know of uses rows-are-observations (or indeed DataFrames). A notable exception appears to be Flux, which looks like it is columns-are-observations, AFAICS.

Ad 2) I have some other concerns:

  1. Column-major is an implementation detail, that shouldn't define the API. Especially as the fastest implementation will differ depending on what type of analysis you're using, but a mixture of rows or columns as observations is a really dangerous footgun for introducing bugs.
  2. Often rows-as-observations is likely to be fastest, as many analyses focus on each variable individually.
  3. If most users get their data from DataFrames anyway, like in the very example used for PCA in the docs for MultivariateStats, the user is likely to transpose, and the algorithm will work on a lazy adjoint - losing the entire value of column-major memory layout. So most users will only experience the inconvenience but not harvest the performance benefits. Instead, with observations-as-rows, algorithms can be optimised for that; and in cases where they can't be, superusers that want max performance can make their object with observations-as-columns, then lazy transpose before passing in; if I'm not much mistaken these users would then still reap the performance gains.

I believe an added advantage is to improve the JuliaStats ecosystem combatability with Tables. The approach I'd suggest would be to require a dims keyword across all functions for at least 1 julia minor version, then flip the default, to make all code calling these functions deprecate noisily.

cc @nalimilan @andreasnoack @ararslan @tsela

@mkborregaard
Copy link
Author

In the MultivariateStats case in particular one could also add that there are types of factor analysis that allows multivariate statistics on mixed data types, such as numeric, logical and nominal data (via e.g. Gower's distance). Such methods would seem to me to require DataFrames (or similar Table types), and thus observations-as-rows.

@nickrobinson251
Copy link

nickrobinson251 commented Feb 3, 2020

I've no strong opinion here, but...

@Tokazama
Copy link
Contributor

Tokazama commented Feb 4, 2020

I would prefer a trait system that uses something like NamedDims to establish the observation dimension.

@mkborregaard
Copy link
Author

I don't think that would solve the problem, though. We need all of these methods to be able to accept simple Matrixes. There's nothing wrong with packages implementing new clever types with traits-based approaches to allow access in any direction, but even the simplest traits-based approach requires the trait to be associated with a given type (so Matrix{<:Number} needs to have a uniform behaviour).
BTW I just checked and passing in a transpose doesn't really work, as access is slow on Adjoint regardless of access pattern.

@Tokazama
Copy link
Contributor

Tokazama commented Feb 4, 2020

I think the way to go is using your suggested dims method. There's no reason we can't have packages like NamedDims.jl specialize on the methods provided that have the dims argument available.

What needs to happen so that we can start implementing this?

@nickrobinson251
Copy link

Re accepting a Matrix, this is of course true. ObservationDims sets _default_obsdim(::AbstractArray) = 1 (i.e. rows, agreeing with the proposal here.

I'm just suggesting we be greedy: rows as default, but still optional, without resorting to adding a keyword everywhere

@mkborregaard
Copy link
Author

Adding a keyword seems to be the only tenable deprecation strategy for functions where the behaviour is changed, though.

@Tokazama
Copy link
Contributor

Tokazama commented Feb 4, 2020

@nickrobinson251 , by default do you mean something like:

function pca(X; dims=1)
    ...
end

@ajaynshah
Copy link

I feel we should all standardise on $T \times K$ datasets in Julia+Statistics.

But failing that, one pathway is to have something like https://github.com/xKDR/CRRao.jl as an API which is what end-users see, which is the collection of shims that overcome the idiosyncrasies of diverse packages.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants