unifying the ecosystem around observations-as-row convention #114
Comments
In the MultivariateStats case in particular, one could also add that there are types of factor analysis that allow multivariate statistics on mixed data types, such as numeric, logical and nominal data (via e.g. Gower's distance). Such methods would seem to me to require DataFrames (or similar Table types), and thus observations-as-rows.
I've no strong opinion here, but...
I would prefer a trait system that uses something like NamedDims to establish the observation dimension.
I don't think that would solve the problem, though. We need all of these methods to be able to accept simple
I think the way to go is using your suggested dims method. There's no reason we can't have packages like What needs to happen so that we can start implementing this?
Re accepting a I'm just suggesting we be greedy: rows as default, but still optional, without resorting to adding a keyword everywhere
Adding a keyword seems to be the only tenable deprecation strategy for functions where the behaviour is changed, though.
@nickrobinson251, by default do you mean something like:
function pca(X; dims=1)
    ...
end
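To make the question concrete, here is a minimal sketch of what a `dims` keyword defaulting to rows could look like. It assumes `dims` names the dimension along which observations are laid out (1 = rows, 2 = columns); `pca_sketch` is a hypothetical helper built only on the `Statistics` and `LinearAlgebra` stdlibs, not the MultivariateStats API.

```julia
using Statistics, LinearAlgebra

# Hypothetical helper, not part of MultivariateStats.
# `dims` is the dimension along which observations are laid out:
# 1 = observations-as-rows (default), 2 = observations-as-columns.
function pca_sketch(X::AbstractMatrix{<:Real}; dims::Integer=1)
    dims in (1, 2) || throw(ArgumentError("dims must be 1 or 2"))
    # Normalise to observations-as-rows internally.
    Xr = dims == 1 ? X : permutedims(X)
    # Centre each variable (column) on its mean.
    Xc = Xr .- mean(Xr; dims=1)
    F = svd(Xc)
    n = size(Xr, 1)
    # Columns of F.V are the principal axes; squared singular values
    # divided by (n - 1) are the variances along them.
    return (components = F.V, variances = F.S .^ 2 ./ (n - 1))
end

X = randn(100, 3)                          # 100 observations, 3 variables
a = pca_sketch(X)                          # rows are observations
b = pca_sketch(permutedims(X); dims=2)     # same data, columns are observations
```

Both calls describe the same data, so `a` and `b` should agree; only the caller's layout differs.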
I feel we should all standardise on But failing that, one pathway is to have something like https://github.com/xKDR/CRRao.jl as an API which is what end-users see, which is the collection of shims that overcome the idiosyncrasies of diverse packages.
In statistical analysis, e.g. with DataFrames.jl, the convention is that columns are variables and rows are observations. The same is true for the entire R analytical ecosystem (statistics + machine learning), the entire Python analytical ecosystem (stats + machine learning), and MATLAB. I believe it is also the case in most of the rest of the Julia ecosystem. For DataFrames this is the only obvious choice: different columns can have different types, so a single observation (row) can contain multiple types, and observations-as-rows is the only feasible layout. That is not true for `Matrix`es, which have the same element type in both directions, so in principle columns can be interpreted as observations for matrices.

In some parts of the Julia ecosystem that are built only for `Matrix{<:Number}`, that is actually done: the convention is flipped. This is true for the `kmeans` algorithm in Clustering (JuliaStats/Clustering.jl#79), and for NearestNeighbors. It also used to be the case for `standardize` in StatsBase and for Distances, but a deprecation has just been added requiring an explicit `dims` argument (e.g. https://github.com/JuliaStats/StatsBase.jl/pull/490/files), which is the first step towards remedying this. It is also the case here in MultivariateStats: e.g. you need to transpose the data if you get them from a DataFrame in order to do a PCA on them.

On the other hand, the stdlib Statistics (e.g. functions `cov` and `cor`), StatsModels, GLM, MixedModels, Plots, the MLJ machine learning library, and Turing all use the more widely accepted standard of observations-as-rows.

I think we should standardize on using the same convention across the ecosystem, and preferably the same standard as everyone else. The main arguments against have been 1) that machine learning often uses columns-as-observations, and 2) that because Julia is column-major, analyses of all variables for an observation should be faster.

Ad 1), I'd like to question that assertion: most machine-learning frameworks I know of use rows-are-observations (or indeed DataFrames). A notable exception appears to be Flux, which looks like it is columns-are-observations, AFAICS.

Ad 2), I have some other concerns:

I believe an added advantage is improved compatibility of the JuliaStats ecosystem with Tables. The approach I'd suggest would be to require a `dims` keyword across all functions for at least 1 Julia minor version, then flip the default, so that all code calling these functions deprecates noisily.

cc @nalimilan @andreasnoack @ararslan @tsela
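The transition phase of that proposal could be sketched roughly as follows. This is an illustration under stated assumptions, not code from StatsBase or any other package: `standardize_sketch` is a hypothetical name, `dims` is taken to mean the dimension along which observations run, and the old default is assumed to be observations-as-columns.

```julia
using Statistics

# Hypothetical function illustrating the transition phase only.
# `dims` is the dimension along which observations run
# (1 = rows, 2 = columns); `nothing` means the caller did not say.
function standardize_sketch(X::AbstractMatrix{<:Real};
                            dims::Union{Nothing,Integer}=nothing)
    if dims === nothing
        # Keep the assumed old behaviour (observations in columns), but warn.
        Base.depwarn("calling `standardize_sketch` without `dims` is deprecated; " *
                     "pass `dims=1` for observations-as-rows or `dims=2` " *
                     "for observations-as-columns",
                     :standardize_sketch)
        dims = 2
    end
    # Reduce over the observation dimension to get per-variable statistics.
    μ = mean(X; dims=dims)
    σ = std(X; dims=dims)
    return (X .- μ) ./ σ
end

X = randn(50, 4)                      # 50 observations, 4 variables
Z = standardize_sketch(X; dims=1)     # explicit: rows are observations
```

Once every caller passes `dims` explicitly, a later release can change the fallback from `2` to `1` without silently altering results. Note that `Base.depwarn` prints only when deprecation warnings are enabled (for example with `--depwarn=yes`), so a package wanting louder feedback might use `@warn` during the transition instead.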