Implementation of Regression Splines from scratch to predict the cosmic microwave background (CMB) angular power spectrum. First homework for the "Statistical Learning" course at La Sapienza University of Rome (Kaggle competition link)
We are looking at a snapshot of our universe in its infancy, roughly 379,000 years after the Big Bang, a blink compared to the estimated age of the universe.
The map below was taken by the Wilkinson Microwave Anisotropy Probe (WMAP) and shows differences across the sky in the temperature of the cosmic microwave background (CMB), the radiant heat remaining from the Big Bang. The average temperature is 2.73 degrees above absolute zero, but the temperature is not constant across the sky, and the fluctuations in the temperature map provide information about the early universe. Indeed, as the universe expanded, there was a tug of war between the force of expansion and the contraction due to gravity. This caused acoustic waves in the hot gas, which is why there are temperature fluctuations.

The strength of the temperature fluctuations $f(x)$ at each frequency (or multipole) $x$ is called the power spectrum, and this power spectrum can be used by cosmologists to answer cosmological questions. For example, the relative abundance of different constituents of the universe (such as baryons and dark matter) corresponds to peaks in the power spectrum. The temperature map can therefore be reduced to a scatterplot of power versus frequency:
In a nutshell:
- The training set consists of 675 CMB angular power spectrum observations estimated from the latest WMAP data release.
- The goal is to predict the angular power spectrum at 224 additional frequencies.
- RMSE is the adopted metric.
Any d-th-order spline $f(\cdot)$ with knots $\xi_1 < \cdots < \xi_q$ satisfies:

- $f(\cdot)$ is a polynomial of degree $d$ on each of the intervals $(-\infty, \xi_1], [\xi_1, \xi_2], [\xi_2, \xi_3], \ldots, [\xi_q, +\infty)$;
- its $j$-th derivative $f^{(j)}(\cdot)$ is continuous at $\xi_1, \ldots, \xi_q$ for each $j \in \{0, 1, \ldots, d-1\}$.
Given a set of knots $\{\xi_1, \ldots, \xi_q\}$:

- Start from the truncated power functions $G_{d,q} = \{ g_1(x), \ldots, g_{d+1}(x), g_{(d+1)+1}(x), \ldots, g_{(d+1)+q}(x) \}$, defined as $g_1(x) = 1,\ g_2(x) = x,\ \ldots,\ g_{d+1}(x) = x^d$ and $g_{(d+1)+j}(x) = (x - \xi_j)_{+}^{d}$ for $j = 1, \ldots, q$, where $(x)_{+} = \max(0, x)$.
- Then, if $f(\cdot)$ is a d-th-order spline with knots $\{\xi_1, \ldots, \xi_q\}$, you can show it can be written as a linear combination over $G_{d,q}$: $f(x) = \sum_{j=1}^{d+1+q} \beta_j g_j(x)$, for some set of coefficients $\beta = [\beta_1, \ldots, \beta_{d+1}, \beta_{(d+1)+1}, \ldots, \beta_{(d+1)+q}]^{T}$.
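The truncated power basis above translates directly into a design matrix, after which the spline coefficients reduce to ordinary least squares. A minimal sketch (the knot values and toy data here are illustrative, not the WMAP setup):

```python
import numpy as np

def truncated_power_basis(x, knots, d=3):
    """Design matrix G_{d,q}: columns 1, x, ..., x^d, then (x - xi_j)_+^d per knot."""
    x = np.asarray(x, dtype=float)
    poly = np.column_stack([x**k for k in range(d + 1)])            # (n, d+1)
    trunc = np.column_stack([np.maximum(x - xi, 0.0)**d for xi in knots])  # (n, q)
    return np.hstack([poly, trunc])                                 # (n, d+1+q)

# Fit beta = argmin ||y - G beta||^2 by least squares on toy data
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x)
knots = np.linspace(0, 1, 7)[1:-1]        # q = 5 equispaced interior knots
G = truncated_power_basis(x, knots, d=3)  # shape (50, 4 + 5)
beta, *_ = np.linalg.lstsq(G, y, rcond=None)
y_hat = G @ beta
```

Numerically, the truncated power basis can be ill-conditioned for large degrees or many knots; a B-spline basis spans the same function space with better conditioning.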
With the knots placed at q equispaced locations, we tune the hyperparameters (number of knots, maximum degree of the truncated power functions, etc.) with different cross-validation techniques: Grid Search CV, vanilla CV, and the Nested CV from the Bates et al. article. We used Repeated CV to select the best degree and number of knots.
Degree | Knots |
---|---|
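The repeated-CV selection of degree and knot count can be sketched as a small grid search; the candidate grids and toy data below are assumptions for illustration, not the values used in the actual homework:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

def design(x, q, d):
    """Truncated power basis with q equispaced interior knots."""
    knots = np.linspace(x.min(), x.max(), q + 2)[1:-1]
    cols = [x**k for k in range(d + 1)]
    cols += [np.maximum(x - xi, 0.0)**d for xi in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 200)

# Repeated K-fold CV: average RMSE over repeats for each (degree, knots) pair
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
best = None
for d in (1, 2, 3):
    for q in (5, 10, 20):
        rmse = -cross_val_score(
            LinearRegression(fit_intercept=False),  # basis already has a constant
            design(x, q, d), y, cv=cv,
            scoring="neg_root_mean_squared_error",
        ).mean()
        if best is None or rmse < best[0]:
            best = (rmse, d, q)
```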
Then we implemented Elastic Net regularization and tuned the related hyperparameters, again via cross-validation.
Shrinkage type | Shrinkage weight |
---|---|
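A possible way to tune both the shrinkage weight and the l1/l2 mix is scikit-learn's `ElasticNetCV`; the knot placement and toy data below are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 300))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 300)

# Truncated power basis without the constant column
# (ElasticNet fits its own intercept by default)
d, knots = 3, np.linspace(0, 1, 12)[1:-1]
G = np.column_stack([x**k for k in range(1, d + 1)]
                    + [np.maximum(x - xi, 0.0)**d for xi in knots])

# ElasticNetCV selects both alpha (shrinkage weight) and l1_ratio
# (shrinkage type: 1.0 = pure lasso, -> 0 = ridge-like) by internal CV.
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5, max_iter=100_000)
model.fit(StandardScaler().fit_transform(G), y)
```

Standardizing the columns matters here: the truncated power features live on very different scales, and the penalty is not scale-invariant.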
Finally, we can fit the obtained splines to our WMAP data.
Due to the heteroscedasticity of our training data, our predictions may be affected by the large noise in the final part of the spectrum. To mitigate this, we use the Box-Cox transformation.
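In SciPy the transformation is a one-liner: `scipy.stats.boxcox` picks the exponent by maximum likelihood, and `scipy.special.inv_boxcox` maps predictions back to the original scale. A sketch on toy heteroscedastic data (the data-generating process is an assumption for illustration):

```python
import numpy as np
from scipy import stats, special

rng = np.random.default_rng(2)
x = np.linspace(0.1, 10, 200)
# toy target with variance growing with the mean; Box-Cox requires y > 0
y = np.exp(0.3 * x) * np.exp(rng.normal(0, 0.1, x.size))

y_t, lam = stats.boxcox(y)   # lambda chosen by maximizing the log-likelihood
# ... fit the splines on (x, y_t) instead of (x, y) ...
y_back = special.inv_boxcox(y_t, lam)   # invert predictions to original scale
```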
Then we refit the splines and obtain our final results!