Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Addition of a function to compute the k-th order central moment of an array #153

Merged
merged 3 commits into from
Mar 14, 2020

Conversation

jvdp1
Copy link
Member

@jvdp1 jvdp1 commented Mar 1, 2020

Related to #113

In this draft pull request I propose the implementation of a function for the k-th order central moment of an array.

Other languages

Matlab moment
SciPy moment
Julia moment
R moment
Octave moment

Note: The functions of R and Octave can also return absolute central moment, raw moment, and absolute raw moment.

Proposed API and implementation

moment - central moment of array elements

Description

Returns the k-th order central moment of all the elements of array, or of the elements of array along dimension dim if provided, and if the corresponding element in mask is true.

The k-th order central moment is defined as :

 moment(array) = 1/n sum_i (array(i) - mean(array))^k

where n is the number of elements.

Syntax

result = moment(array, order [, mask])

result = moment(array, order, dim [, mask])

Arguments

array: Shall be an array of type integer, real, or complex.

order: Shall be an scalar of type integer.

dim: Shall be a scalar of type integer with a value in the range from 1 to n, where n is the rank of array.

mask (optional): Shall be of type logical and either by a scalar or an array of the same shape as array.

Return value

If array is of type real or complex, the result is of the same type as array.
If array is of type integer, the result is of type real(dp).

If dim is absent, a scalar with the k-th central moment of all elements in array is returned. Otherwise, an array of rank n-1, where n equals the rank of array, and a shape similar t
o that of array with dimension dim dropped is returned.

If mask is specified, the result is the k-th central moment of all elements of array corresponding to true elements of mask. If every element of mask is false, the result is IE
EE NaN.

Example

program demo_moment
    use stdlib_experimental_stats, only: moment
    implicit none
    real :: x(1:6) = [ 1., 2., 3., 4., 5., 6. ]
    print *, moment(x, 2)                            !returns 2.9167
    print *, moment( reshape(x, [ 2, 3 ] ), 2)       !returns 2.9167
    print *, moment( reshape(x, [ 2, 3 ] ), 2, 1)    !returns [0.25, 0.25, 0.25]
    print *, moment( reshape(x, [ 2, 3 ] ), 2, 1,&
                     reshape(x, [ 2, 3 ] ) > 3.)     !returns [NaN, 0., 0.25]
end program demo_moment

Questions for discussion

  1. Agreement on API?
  2. Agreement on implementation (e.g. for complex numbers)?
  3. Should we support (absolute)(raw/central) moments in this function? E.g. through optional logicals absolute and raw? (see example in this branch )

commit c9f73b5
Author: Vandenplas, Jeremie <jeremie.vandenplas@gmail.com>
Date:   Sun Mar 1 20:20:58 2020 +0100

    moment_dev

commit df4fedb
Author: Vandenplas, Jeremie <jeremie.vandenplas@gmail.com>
Date:   Sat Feb 29 22:26:53 2020 +0100

    moment_dev: addition of TOCs

commit 20a76b8
Author: Vandenplas, Jeremie <jeremie.vandenplas@gmail.com>
Date:   Sat Feb 29 22:19:03 2020 +0100

    moment_dev: addition of spec

commit 7c33e00
Author: Vandenplas, Jeremie <jeremie.vandenplas@gmail.com>
Date:   Sat Feb 29 21:07:24 2020 +0100

    moment_dev: addition of tests for complex

commit 6c2630b
Author: Vandenplas, Jeremie <jeremie.vandenplas@gmail.com>
Date:   Fri Feb 28 23:33:00 2020 +0100

    moment_dev: test progress

commit 1743f87
Author: Vandenplas, Jeremie <jeremie.vandenplas@gmail.com>
Date:   Fri Feb 28 22:25:43 2020 +0100

    moment_dev: correction of moment for complex

commit dfbdd97
Merge: 195c62d 5f35833
Author: Vandenplas, Jeremie <jeremie.vandenplas@gmail.com>
Date:   Fri Feb 28 20:12:35 2020 +0100

    Merge remote-tracking branch 'upstream/master' into moment_dev

commit 195c62d
Author: Vandenplas, Jeremie <jeremie.vandenplas@gmail.com>
Date:   Sun Feb 23 20:26:09 2020 +0100

    moment_dev: removed some explicit statements

commit 7c5233f
Author: Vandenplas, Jeremie <jeremie.vandenplas@gmail.com>
Date:   Sat Feb 22 16:51:06 2020 +0100

    moment_dev: addtition of test

commit 5ff0968
Author: Vandenplas, Jeremie <jeremie.vandenplas@gmail.com>
Date:   Sat Feb 22 16:09:54 2020 +0100

    moment_dev: addition of a function for central moment
Copy link
Member

@certik certik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, along the lines of previous work in stdlib.

I don't have strong opinions on the questions you asked at the end. Overall +1.

@jvdp1
Copy link
Member Author

jvdp1 commented Mar 9, 2020

I just implemented (absolute) central/raw moments in this branch. Personally I only use central moments, which is the default returned value for most packages I checked (that sometimes supports only the central moment).

Any thoughts @ivan-pi @aradi @fiolj @leonfoks?

@jvdp1 jvdp1 marked this pull request as ready for review March 10, 2020 07:11
@ivan-pi
Copy link
Member

ivan-pi commented Mar 10, 2020

I noticed the Scipy version uses exponentiation by squares for efficiency. Do you think in Fortran we can simply trust the compiler to do the right thing?

Second thing I am wondering is what is the sensitivity of this function to underflow/overflow? The reason I am asking is because I recall reading a paper by Hanson and Hopkin concerning some subtleties of the intrinsic Fortran norm2.

Concerning the API, the SciPy version allows to pass a list of moments orders, e.g.

real :: x(1:5) = [ 1., 2., 3., 4., 5. ]
print *, moment(x, [1, 2]) ! prints 0 2

Since I have not used this function before, I don't know whether this is particularly useful or not? It sure would complicate the allocation of a LHS variable...

@ivan-pi
Copy link
Member

ivan-pi commented Mar 10, 2020

  1. Should we support (absolute)(raw/central) moments in this function? E.g. through optional logicals absolute and raw? (see example in this branch )

Concerning this point, I find the Octave solution of passing a character ("c", "a", "ac", "r", "ar") to an optional type dummy variable easier than two logicals:

real :: x(1:5) = [ 1., 2., 3., 4., 5. ]
print *, moment(x, 2, type='ar')
print *, moment(x, 2, absolute=.true.,raw=.true.)

@jvdp1
Copy link
Member Author

jvdp1 commented Mar 10, 2020

Thank you @ivan-pi for the comments.

I noticed the Scipy version uses exponentiation by squares for efficiency. Do you think in Fortran we can simply trust the compiler to do the right thing?

I hope so. At least I always trust it, while never check it. If it is not the case, it could a nice procedure for stdlib, e.g., power(x, order) that would perform something similar to Scipy.

Second thing I am wondering is what is the sensitivity of this function to underflow/overflow? The reason I am asking is because I recall reading a paper by Hanson and Hopkin concerning some subtleties of the intrinsic Fortran norm2.

Good point. The function mainly relies on intrinsic functions (e.g., sum).

I think that this point is applicable to all procedures in stdlib (e.g., mean, var, trapz,....). A possible solution (already proposed in another post; I don't remember which one) would be to have a standard implementation of the function (e.g., moment), and followed by other optimized/specific implementations of the same function (e.g., moment_largearray????)

Concerning the API, the SciPy version allows to pass a list of moments orders, e.g.

real :: x(1:5) = [ 1., 2., 3., 4., 5. ]
print *, moment(x, [1, 2]) ! prints 0 2

Since I have not used this function before, I don't know whether this is particularly useful or not? It sure would complicate the allocation of a LHS variable...

I am not in favor of such a behavior (the user can call several times the same function, with different order). The main issue would be the dimension of the result (what would happen if only one value is given? Is the result a scalar (res) or a vector (res(1))? However, I may miss something.

@jvdp1
Copy link
Member Author

jvdp1 commented Mar 10, 2020

  1. Should we support (absolute)(raw/central) moments in this function? E.g. through optional logicals absolute and raw? (see example in this branch )

Concerning this point, I find the Octave solution of passing a character ("c", "a", "ac", "r", "ar") to an optional type dummy variable easier than two logicals:

real :: x(1:5) = [ 1., 2., 3., 4., 5. ]
print *, moment(x, 2, type='ar')
print *, moment(x, 2, absolute=.true.,raw=.true.)

Nice suggestion. I agree that using characters may simplify the API.

Here is my reasoning to propose logicals and not characters:

  • with logicals (absolute and raw), only 4 combinations are possible. With characters (e.g., a, c, and r as in Octave), many combinations are possible (there are even 5 combinations in Octave for 4 combinations allowed). This implies that the function will have 1) to check if the provided combination is allowed (e.g., ar and ra would be allowed and are the same, while cr or arc are not allowed) (so, 5 cases with characters, instead of 4 with logicals), and 2) to report an error message that the combination is incorrect. The latter may limit the implementation of pure functions. Using characters and its additional checks may also impact performance.

  • it follows the API of variance that uses a logical to allow the Bessel 's correction of not.

Again, I also find that the API with characters is nicer, but have some (unfounded?) concerns about performance.

@ivan-pi
Copy link
Member

ivan-pi commented Mar 10, 2020

Thank you for explaining these points. As long as the API does not change, we can always modify for performance/accuracy later, or as you suggest add specialized versions.

I agree with your concern about the dimension of the result. MATLAB does not have it either, so I suppose there is no need really. The current behavior is also fairly easy to remember, for the case of computing the moment along a single axis, the user simply needs to remember, the target array is one rank smaller than the original array:

real :: array(:,:,:,:)
real, allocatable :: mom2(:,:,:)
mom2 = moment(array,order=2,dim=1)

For array slices, or chained calls to moments it becomes more tricky (I don't know whether there are such use cases).

The purpose of this function is primarily statistics, right? I have used central moments before for image analysis purposes (e.g. to locate the centroid or area of an object), but I think a different function would be necessary for this.

@ivan-pi
Copy link
Member

ivan-pi commented Mar 10, 2020

Nice suggestion. I agree that using characters may simplify the API.

Here is my reasoning to propose logicals and not characters:

  • with logicals (absolute and raw), only 4 combinations are possible. With characters (e.g., a, c, and r as in Octave), many combinations are possible (there are even 5 combinations in Octave for 4 combinations allowed). This implies that the function will have 1) to check if the provided combination is allowed (e.g., ar and ra would be allowed and are the same, while cr or arc are not allowed) (so, 5 cases with characters, instead of 4 with logicals), and 2) to report an error message that the combination is incorrect. The latter may limit the implementation of pure functions. Using characters and its additional checks may also impact performance.

  • it follows the API of variance that uses a logical to allow the Bessel 's correction of not.

Again, I also find that the API with characters is nicer, but have some (unfounded?) concerns about performance.

I think your concerns about performance are unjustified. If you check the routines in other languages they performs checks for NaN, dimensions, etc., but they are still more popular than Fortran (simply because they provide these functions). LAPACK also uses characters in the API to specify the form of the system of equations (transpose, no transpose...), but they return the info argument to inform about incorrect usage. Plus, the routines already stop if the dimension exceeds the rank of the array, meaning they can not be pure (at least until Fortran introduces some assert facility that could be used in pure routines).

I do agree with your concern about the multiple character combinations. I suppose renaming the absolute dummy variable to the shorter abs is not possible, because it would clash with the intrinsic function? (Is it possible to rename intrinsic functions?) Introducing two new functions for rawmoment and absmoment is less appealing than the current proposition.

@jvdp1
Copy link
Member Author

jvdp1 commented Mar 11, 2020

The purpose of this function is primarily statistics, right? I have used central moments before for image analysis purposes (e.g. to locate the centroid or area of an object), but I think a different function would be necessary for this.

Indeed, it is primarily statistics. I think the function you mentioned is an extension to 2D.

I think your concerns about performance are unjustified. If you check the routines in other languages they performs checks for NaN, dimensions, etc., but they are still more popular than Fortran (simply because they provide these functions).

You convinced me. I will start an implementation with characters, instead of logicals. Then we can further discuss pros and cons based on the code.

However, it could be easier to first merge this PR (if there is a consensus about it), and then submit another PR with the different moments.

LAPACK also uses characters in the API to specify the form of the system of equations (transpose, no transpose...), but they return the info argument to inform about incorrect usage. Plus, the routines already stop if the dimension exceeds the rank of the array, meaning they can not be pure (at least until Fortran introduces some assert facility that could be used in pure routines).

#95 could be usefull in these cases.
On the other hand, we could also implement an optional argument info in the different functions (mean, var,...). However, such an argument would be useless for some implementations, and the API of these functions would not be the smae/similar to the API of intrinsic functions (like sum).

Introducing two new functions for rawmoment and absmoment is less appealing than the current proposition.

I agree.

@ivan-pi
Copy link
Member

ivan-pi commented Mar 11, 2020

After taking a closer look at the source code, the error message for "wrong dimension" currently refers to the mean function.


However, it could be easier to first merge this PR (if there is a consensus about it), and then submit another PR with the different moments.

As the raw and absolute do not break backward compatibility of API I think this might be easier too. It would be nice to get an opinion from the remaining developers concerning their preference between logicals vs characters for option passing.

@jvdp1
Copy link
Member Author

jvdp1 commented Mar 12, 2020

After taking a closer look at the source code, the error message for "wrong dimension" currently refers to the mean function.

Thank you for reviewing the code. It is now corrected as suggested.

It would be nice to get an opinion from the remaining developers concerning their preference between logicals vs characters for option passing.

I agree! It needs more discussions.

@fiolj
Copy link
Contributor

fiolj commented Mar 12, 2020

Sorry to barging in late to the discussion. I do not have strong feelings about the function since I do not usually use it.

I do not want to disrupt the implementation when it already has a lot of work in it but there are two points that are not clear.

First, probably the simplest is that I do not understand what happens when a scalar mask is provided. I though that perhaps for functions as sum it is used to avoid an extra if statement.

Secondly, when started to ponder about the interface with logical vs character flags I went to look to other implementations and started to think on the usefulness of the options.

Because is not my area of expertise, I only was able to think of

Some Applications

  • Statistics (centered)
  • System of particles
    • Center of mass: need weights (mass)
    • Kinetic energy: need weights (sqrt(mass))
    • Internal kinetic energy: need weigths (sqrt(mass)) and centered on v_cm (different from mean)
    • Moments of inertia (also centered in places other than the mean)

For these reasons I would favor (slightly) an API more similar to Julia in that the center be given as a float point number (or rather a number of the same type than the array).

result = moment(array, order [, center] [, mask])
result = moment(array, order, dim [, center] [, mask])

Additionally, in the same spirit we could consider adding an array of weights

Thus, we only would have a remaining flag for absolute, which (because of my ignorance) I do not know how used/needed is.

Because of the advanced state of the current pull request, I'd be perfectly happy to leave it as it is now, and if there is any interest keep the discussion for a future next iteration.

@jvdp1
Copy link
Member Author

jvdp1 commented Mar 12, 2020

Sorry to barging in late to the discussion. I do not have strong feelings about the function since I do not usually use it.

I do not want to disrupt the implementation when it already has a lot of work in it but there are two points that are not clear.

Thank you for your comments @fiolj . There is no problem! I mainly implemented it to facilitate discussion.

First, probably the simplest is that I do not understand what happens when a scalar mask is provided. I though that perhaps for functions as sum it is used to avoid an extra if statement.

Personally I do not understand the usefullness of a scalar logical, because it will always return NaN when it is .false.. I probably miss something there. So it was introduced in mean, var and moment to have the same API as the intrinsic function sum, which accept a scalar logical, and to be consistent with the standard.

Regarding the implementation, it could be indeed passed to the intrinsic function sum, instead of having the additional condition. I implemented it that way to avoid unnecessary computations when mask = .false..

Secondly, when started to ponder about the interface with logical vs character flags I went to look to other implementations and started to think on the usefulness of the options.
...
For these reasons I would favor (slightly) an API more similar to Julia in that the center be given as a float point number (or rather a number of the same type than the array).

This is a nice idea. center should be a scalar when dim is not provided, and an array of rank-1 when dim is provided (Julia does not allow to specify a dim). Not providing center would lead to the raw moment, right? Using the function mean, something like that could be written:

result = moment(x, 2) ! raw moment
result = moment(x, 2, center = mean(x))  ! central moment
result = moment(x, 2, dim = 3, center = mean(x, 3, mask=(x >3)), mask = (x >3)) !central moment

Your proposed API adds some flexibility to the function moment and might interest a broader audience, So I would be in favor of such an API.
For completness, I suggest to add a logical flag (with default being .false.) for absolute moments.
So the API could be:

result = moment(array, order [, center] [,absolute] [, mask])
result = moment(array, order, dim [, center] [,absolute] [, mask])

What are your opinions about this API? @fiolj @ivan-pi @certik @milancurcic

Re: weights, the aim was to implement a function for the moment in terms of mathematics/statistics. However, if the addition of an argument can extend its used in physics, I would agree.

n = size(x, kind = int64)
mean = sum(x) / n

res = sum((x - mean)**order) / n
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not

res = mean((x - mean(x))**order)

here? Is there a concern about the overhead from calling mean()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was indeed to avoid the overhead of calling mean (especially when mean is an array).

Copy link
Member

@milancurcic milancurcic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review. This is excellent, thank you. I especially appreciate the hyperlinked TOC in the spec file.

I only had one question about the implementation. Otherwise, good to go.

@jvdp1
Copy link
Member Author

jvdp1 commented Mar 13, 2020

Sorry for the late review. This is excellent, thank you. I especially appreciate the hyperlinked TOC in the spec file.

I only had one question about the implementation. Otherwise, good to go.

Thank you @milancurcic for the review. Any thoughs about a different API:

result = moment(array, order [, center] [,absolute] [, mask])
result = moment(array, order, dim [, center] [,absolute] [, mask])

where center would replace the computation of the mean, and absolute would be an optional logical for computing absolute moments?

@fiolj
Copy link
Contributor

fiolj commented Mar 13, 2020

I've started some simple timing tests with gfortran on the overhead of using mean(), I do not have conclusive results yet, but noticed a few (minor) points in the code. Some of these will change with the new API implementation, but I mention them anyway.

n is defined real, so we could do here the same that mean() does and convert explicitly the size to real:

@@ -53,7 +53,7 @@ contains
           return
         end if
 
-        n = size(x, kind = int64)
+        n = real(size(x, kind = int64), dp)
         mean = sum(real(x, dp)) / n

         res = sum((real(x, dp) - mean)**order) / n

Also, we could put all calculations under the right case (although it perhaps may become quite irrelevant with the new API, unless we decide to use "center moment" as a default)

@@ -82,11 +82,13 @@ contains
           return
         end if
 
-        n = size(x, dim)
-        mean = sum(x, dim) / n
-
-        res = 0
         select case(dim)
+
+          n = real(size(x, kind = int64), ${k1}$)
+          mean = sum(x, dim) / n
+
+          res = 0
+
           #:for fi in range(1, rank+1)
           case(${fi}$)
             do i = 1, size(x, dim)

@jvdp1
Copy link
Member Author

jvdp1 commented Mar 13, 2020

  n = size(x, dim)
 -        mean = sum(x, dim) / n
 -
 -        res = 0
          select case(dim)
 +
 +          n = real(size(x, kind = int64), ${k1}$)
 +          mean = sum(x, dim) / n
 +

I've started some simple timing tests with gfortran on the overhead of using mean(), I do not have conclusive results yet, but noticed a few (minor) points in the code. Some of these will change with the new API implementation, but I mention them anyway.

n is defined real, so we could do here the same that mean() does and convert explicitly the size to real:

@@ -53,7 +53,7 @@ contains
           return
         end if
 
-        n = size(x, kind = int64)
+        n = real(size(x, kind = int64), dp)
         mean = sum(real(x, dp)) / n

         res = sum((real(x, dp) - mean)**order) / n

Funny. I implemented it in the way you proposed for var (see comment ), but @milancurcic suggested to avoid explicit type casts, mainly to improve readability. The compiler should do the job.

I aimed to review the code of mean() to remove such stuff, and harmonize it. But I will wait.

@milancurcic
Copy link
Member

The API with optional absolute and center arguments seems reasonable to me. I am not the target audience for these though; I only ever calculate moments where center == mean(x). So I can't speak for the usefulness of it.

I suggest to:

  1. Merge the basic implementation (this PR) into master;
  2. Open a PR for optional center and absolute (perhaps even separate PRs if they're independent);

Small and specific PRs will allow for a focused discussion.

@certik
Copy link
Member

certik commented Mar 14, 2020

I agree with @milancurcic about how to move forward. I am not a target audience either for this, so I do not have a strong opinion either.

@jvdp1
Copy link
Member Author

jvdp1 commented Mar 14, 2020

@milancurcic @certik Thank you for your advices. I will merge this PR.
I will later submit other PRs for optional center and absolute as discussed in this PR with @fiolj and @ivan-pi .

@fiolj I would be interested by your results regarding the overhead of calling mean() inside another function of stdlib, especially when the result is a large array (see #114 for a discussion on that topic). These results might help us on how to implement stat functions skewness() and kurtosis() because both need multiple computations of mean.

@jvdp1 jvdp1 merged commit 42af7d5 into fortran-lang:master Mar 14, 2020
@jvdp1 jvdp1 deleted the central_moment branch March 14, 2020 19:52
@fiolj
Copy link
Contributor

fiolj commented Mar 19, 2020

@fiolj I would be interested by your results regarding the overhead of calling mean() inside another function of stdlib, especially when the result is a large array (see #114 for a discussion on that topic). These results might help us on how to implement stat functions skewness() and kurtosis() because both need multiple computations of mean.

A (naive) comparison of times for moment is in:

https://gist.github.com/fiolj/7ebb1611e200d1fcd248b5866f851b76

Strangely enough, I am not finding differences between using mean() and the intrinsic sum().
I've just copied the implementation of mean and moment for a few (1D, 2D, 5D, real(dp)) cases and timed them with cpu_time().
Please, somebody check that these are valid tests of speed, I may be missing something.

@jvdp1
Copy link
Member Author

jvdp1 commented Mar 21, 2020

Thank you @fiolj for these time comparisons.
I added a loop to repeat 100 times the computations in the last part. No difference either.
There are maybe a bit naive and we should probably use some tools to get better numbers. But at least it seems to show that the overhead of calling the functionmean() is very small. Good to know.
Thank you for the tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants