Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chatterjee Correlation Coefficient #770

Merged
merged 23 commits into from
May 25, 2022
Merged
Show file tree
Hide file tree
Changes from 19 commits
Commits
Show all changes
23 commits
Select commit Hold shift + click to select a range
7d8452e
Implement rank vector
mborland Feb 25, 2022
cbee638
Add documentation. Admittedly terrible.
NAThompson Mar 2, 2022
3761782
Add unit tests.
NAThompson Mar 2, 2022
cc5a24a
Cleanup method of detecting if execution policies are valid or not
mborland Mar 2, 2022
91326b1
Implement and test chatterjee correlation
mborland Mar 2, 2022
185e9c8
Add spot checks and special handling for constant Y
mborland Mar 3, 2022
b2539a5
Add performance file
mborland Mar 3, 2022
779589c
Add execution policy support to rank
mborland Mar 3, 2022
79e2b59
Remove duplicates from v when generating the order vector
mborland Mar 4, 2022
5a80e5d
Fix macro error for use of <execution>
mborland Mar 4, 2022
d0e51cc
Use explicit types instead of auto to avoid warnings
mborland Mar 4, 2022
30fd8a0
Add execution policy testing to rank
mborland Mar 4, 2022
9aa8e53
Add threaded implementation
mborland Apr 19, 2022
0ea9a48
Added threaded testing
mborland Apr 20, 2022
260541b
Fix formatting and ASCII issues in test
mborland Apr 20, 2022
3d63c38
Fix more ASCII issues
mborland May 11, 2022
d43bbba
refactoring
mborland May 22, 2022
ff15648
Fix threaded impl
mborland May 22, 2022
7219017
Remove non-ASCII apostrophe
mborland May 22, 2022
65d4ec5
Doc fixes and add test comparing generally to paper values
mborland May 23, 2022
e992d6c
Merge remote-tracking branch 'boostorg/develop' into chaterjee
mborland May 24, 2022
e55fb3a
Significantly tighten tolerance around expected values from paper
mborland May 24, 2022
ea41f94
Change tolerance for sin comparison
mborland May 24, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
74 changes: 74 additions & 0 deletions doc/statistics/chatterjee_correlation.qbk
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
[/
Copyright 2022 Matt Borland

Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE_1_0.txt or copy at
http://www.boost.org/LICENSE_1_0.txt).
]

[section:chatterjee_correlation Chatterjee Correlation]

[heading Synopsis]

``
#include <boost/math/statistics/chatterjee_correlation.hpp>

namespace boost::math::statistics {

C++17:
template <typename ExecutionPolicy, typename Container>
auto chatterjee_correlation(ExecutionPolicy&& exec, const Container& u, const Container& v);

C++11:
template <typename Container>
auto chatterjee_correlation(const Container& u, const Container& v);
}
``

[heading Description]

The classical correlation coefficients like the Pearson's correlation are useful primarily for distinguishing when one dataset depends linearly on another.
However, Pearson's correlation coefficient has a known weakness in that when the dependent variable has an obvious functional relationship with the independent variable, the value of the correlation coefficient can take on any value.
As Chatterjee says:

> Ideally, one would like a coefficient that approaches
its maximum value if and only if one variable looks more and more like a
noiseless function of the other, just as Pearson correlation is close to its maximum value if and only if one variable is close to being a noiseless linear function of the other.

This is the problem Chatterjee's coefficient solves.
Let X and Y be random variables, where Y is not constant, and let (X_i, Y_i) be samples from this distribution.
Rearrange these samples so that X_(0) < X_{(1)} < ... X_{(n-1)} and create (X_{(i)}, Y_{(i)}).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mborland : Does this render properly? I wonder if we need to use (say) this to get the docs correct?

The Chatterjee correlation is then given by

[$../equations/chatterjee_correlation.svg]

In the limit of an infinite amount of i.i.d data, the statistic lies in [0, 1].
However, if the data is not infinite, the statistic may be negative.
If X and Y are independent, the value is zero, and if Y is a measurable function of X, then the statistic is unity.
The complexity is O(n log n).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to get O(n log n) nicely rendered?


An example is given below:

std::vector<double> X{1,2,3,4,5};
std::vector<double> Y{1,2,3,4,5};
using boost::math::statistics::chatterjee_correlation;
double coeff = chatterjee_correlation(X, Y);

The implementation follows [@https://arxiv.org/pdf/1909.10140.pdf Chatterjee's paper].

/Nota bene:/ If the input is an integer type the output will be a double precision type.

[heading Invariants]

The function expects at least two samples, a non-constant vector Y, and the same number of X's as Y's.
If Y is constant, the result is a quiet NaN.
The data set must be sorted by x values.
mborland marked this conversation as resolved.
Show resolved Hide resolved
If there are ties in the values of X, then the statistic is random due to the random breaking of ties.
Of course, there is not random numbers used internally, but the result is not guaranteed to be identical on different systems.
mborland marked this conversation as resolved.
Show resolved Hide resolved

[heading References]

* Chatterjee, Sourav. "A new coefficient of correlation." Journal of the American Statistical Association 116.536 (2021): 2009-2022.

[endsect]
[/section:chatterjee_correlation Chatterjee Correlation]
15 changes: 6 additions & 9 deletions include/boost/math/statistics/bivariate_statistics.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -18,13 +18,10 @@
#include <boost/math/tools/assert.hpp>
#include <boost/math/tools/config.hpp>

// Support compilers with P0024R2 implemented without linking TBB
// https://en.cppreference.com/w/cpp/compiler_support
#if !defined(BOOST_NO_CXX17_HDR_EXECUTION) && defined(BOOST_HAS_THREADS)
#ifdef BOOST_MATH_EXEC_COMPATIBLE
#include <execution>
#include <future>
#include <thread>
#define EXEC_COMPATIBLE
#endif

namespace boost{ namespace math{ namespace statistics { namespace detail {
Expand Down Expand Up @@ -60,7 +57,7 @@ ReturnType means_and_covariance_seq_impl(ForwardIterator u_begin, ForwardIterato
return std::make_tuple(mu_u, mu_v, cov/i, Real(i));
}

#ifdef EXEC_COMPATIBLE
#ifdef BOOST_MATH_EXEC_COMPATIBLE

// Numerically stable parallel computation of (co-)variance
// https://dl.acm.org/doi/10.1145/3221269.3223036
Expand Down Expand Up @@ -154,7 +151,7 @@ ReturnType means_and_covariance_parallel_impl(ForwardIterator u_begin, ForwardIt
return std::make_tuple(mu_u_a, mu_v_a, cov_a, n_a);
}

#endif // EXEC_COMPATIBLE
#endif // BOOST_MATH_EXEC_COMPATIBLE

template<typename ReturnType, typename ForwardIterator>
ReturnType correlation_coefficient_seq_impl(ForwardIterator u_begin, ForwardIterator u_end, ForwardIterator v_begin, ForwardIterator v_end)
Expand Down Expand Up @@ -204,7 +201,7 @@ ReturnType correlation_coefficient_seq_impl(ForwardIterator u_begin, ForwardIter
return std::make_tuple(mu_u, Qu, mu_v, Qv, cov, rho, Real(i));
}

#ifdef EXEC_COMPATIBLE
#ifdef BOOST_MATH_EXEC_COMPATIBLE

// Numerically stable parallel computation of (co-)variance:
// https://dl.acm.org/doi/10.1145/3221269.3223036
Expand Down Expand Up @@ -324,11 +321,11 @@ ReturnType correlation_coefficient_parallel_impl(ForwardIterator u_begin, Forwar
return std::make_tuple(mu_u_a, Qu_a, mu_v_a, Qv_a, cov_a, rho, n_a);
}

#endif // EXEC_COMPATIBLE
#endif // BOOST_MATH_EXEC_COMPATIBLE

} // namespace detail

#ifdef EXEC_COMPATIBLE
#ifdef BOOST_MATH_EXEC_COMPATIBLE

template<typename ExecutionPolicy, typename Container, typename Real = typename Container::value_type>
inline auto means_and_covariance(ExecutionPolicy&& exec, Container const & u, Container const & v)
Expand Down
159 changes: 159 additions & 0 deletions include/boost/math/statistics/chatterjee_correlation.hpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,159 @@
// (C) Copyright Matt Borland 2022.
// Use, modification and distribution are subject to the
// Boost Software License, Version 1.0. (See accompanying file
// LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)

#ifndef BOOST_MATH_STATISTICS_CHATTERJEE_CORRELATION_HPP
#define BOOST_MATH_STATISTICS_CHATTERJEE_CORRELATION_HPP

#include <cstdint>
#include <cmath>
#include <algorithm>
#include <iterator>
#include <vector>
#include <limits>
#include <utility>
#include <type_traits>
#include <boost/math/tools/assert.hpp>
#include <boost/math/tools/config.hpp>
#include <boost/math/statistics/detail/rank.hpp>

#ifdef BOOST_MATH_EXEC_COMPATIBLE
#include <execution>
#include <future>
#include <thread>
#endif

namespace boost { namespace math { namespace statistics {

namespace detail {

template <typename BDIter>
std::size_t chatterjee_transform(BDIter begin, BDIter end)
{
std::size_t sum = 0;

while(++begin != end)
{
if(*begin > *std::prev(begin))
{
sum += *begin - *std::prev(begin);
}
else
{
sum += *std::prev(begin) - *begin;
}
}

return sum;
}

template <typename ReturnType, typename ForwardIterator>
ReturnType chatterjee_correlation_seq_impl(ForwardIterator u_begin, ForwardIterator u_end, ForwardIterator v_begin, ForwardIterator v_end)
{
using std::abs;

BOOST_MATH_ASSERT_MSG(std::is_sorted(u_begin, u_end), "The x values must be sorted in order to use this functionality");

const std::vector<std::size_t> rank_vector = rank(v_begin, v_end);

std::size_t sum = chatterjee_transform(rank_vector.begin(), rank_vector.end());

ReturnType result = static_cast<ReturnType>(1) - (static_cast<ReturnType>(3 * sum) / static_cast<ReturnType>(rank_vector.size() * rank_vector.size() - 1));

// If the result is 1 then Y is constant and all the elements must be ties
if (abs(result - static_cast<ReturnType>(1)) < std::numeric_limits<ReturnType>::epsilon())
{
return std::numeric_limits<ReturnType>::quiet_NaN();
}

return result;
}

} // Namespace detail

template <typename Container, typename Real = typename Container::value_type,
typename ReturnType = typename std::conditional<std::is_integral<Real>::value, double, Real>::type>
inline ReturnType chatterjee_correlation(const Container& u, const Container& v)
{
return detail::chatterjee_correlation_seq_impl<ReturnType>(std::begin(u), std::end(u), std::begin(v), std::end(v));
}

}}} // Namespace boost::math::statistics

#ifdef BOOST_MATH_EXEC_COMPATIBLE

namespace boost::math::statistics {

namespace detail {

template <typename ReturnType, typename ExecutionPolicy, typename ForwardIterator>
ReturnType chatterjee_correlation_par_impl(ExecutionPolicy&& exec, ForwardIterator u_begin, ForwardIterator u_end,
ForwardIterator v_begin, ForwardIterator v_end)
{
using std::abs;
BOOST_MATH_ASSERT_MSG(std::is_sorted(std::forward<ExecutionPolicy>(exec), u_begin, u_end), "The x values must be sorted in order to use this functionality");

auto rank_vector = rank(std::forward<ExecutionPolicy>(exec), v_begin, v_end);

const auto num_threads = std::thread::hardware_concurrency() == 0 ? 2u : std::thread::hardware_concurrency();
std::vector<std::future<std::size_t>> future_manager;
const auto elements_per_thread = std::ceil(static_cast<double>(rank_vector.size()) / num_threads);

auto it = rank_vector.begin();
auto end = rank_vector.end();
for(std::size_t i {}; i < num_threads - 1; ++i)
{
future_manager.emplace_back(std::async(std::launch::async | std::launch::deferred, [it, elements_per_thread]() -> std::size_t
{
return chatterjee_transform(it, std::next(it, elements_per_thread));
}));
it = std::next(it, elements_per_thread - 1);
}

future_manager.emplace_back(std::async(std::launch::async | std::launch::deferred, [it, end]() -> std::size_t
{
return chatterjee_transform(it, end);
}));

std::size_t sum {};
for(std::size_t i {}; i < future_manager.size(); ++i)
{
sum += future_manager[i].get();
}

ReturnType result = static_cast<ReturnType>(1) - (static_cast<ReturnType>(3 * sum) / static_cast<ReturnType>(rank_vector.size() * rank_vector.size() - 1));

// If the result is 1 then Y is constant and all the elements must be ties
if (abs(result - static_cast<ReturnType>(1)) < std::numeric_limits<ReturnType>::epsilon())
{
return std::numeric_limits<ReturnType>::quiet_NaN();
}

return result;
}

} // Namespace detail

template <typename ExecutionPolicy, typename Container, typename Real = typename Container::value_type,
typename ReturnType = std::conditional_t<std::is_integral_v<Real>, double, Real>>
inline ReturnType chatterjee_correlation(ExecutionPolicy&& exec, const Container& u, const Container& v)
{
if constexpr (std::is_same_v<std::remove_reference_t<decltype(exec)>, decltype(std::execution::seq)>)
{
return detail::chatterjee_correlation_seq_impl<ReturnType>(std::cbegin(u), std::cend(u),
std::cbegin(v), std::cend(v));
}
else
{
return detail::chatterjee_correlation_par_impl<ReturnType>(std::forward<ExecutionPolicy>(exec),
std::cbegin(u), std::cend(u),
std::cbegin(v), std::cend(v));
}
}

} // Namespace boost::math::statistics

#endif

#endif // BOOST_MATH_STATISTICS_CHATTERJEE_CORRELATION_HPP
Loading