This repository has been archived by the owner on Sep 11, 2023. It is now read-only.

Estimation Data types

Frank Noe edited this page Oct 31, 2015 · 1 revision

Data types in early processing (e.g. TICA)

Some data processing steps are currently inefficient in memory usage, CPU usage, or both. Two changes would help:

  • Allow sparse matrices as input
  • Allow different data types, e.g. boolean arrays or bit arrays for contact maps
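To give a rough sense of the memory at stake, the following sketch compares the footprint of a contact map stored as float64, as a boolean array, and as a packed bit array. The array shape and contact density are made-up illustration values, not numbers taken from any real data set.

```python
import numpy as np

# Hypothetical contact-map trajectory: 1000 frames x 1000 contacts,
# with roughly 10% of the contacts formed (illustration values only).
rng = np.random.default_rng(0)
contacts_bool = rng.random((1000, 1000)) < 0.1   # dtype bool, 1 byte per entry

# The same data as a float array of 0.0 and 1.0 (8 bytes per entry).
contacts_float = contacts_bool.astype(np.float64)

# Packed bit array: 8 entries per byte along the contact axis.
contacts_bits = np.packbits(contacts_bool, axis=1)

# Unpacking recovers the original boolean array, since 1000 is a multiple of 8.
restored = np.unpackbits(contacts_bits, axis=1).astype(bool)

for name, arr in [("float64", contacts_float),
                  ("bool", contacts_bool),
                  ("bits", contacts_bits)]:
    print(f"{name:8s} {arr.nbytes} bytes")
```

The packed form is 64 times smaller than float64 here, but it cannot be fed to generic linear algebra routines without unpacking, which is why it needs either a specialized estimator or a fallback conversion.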

The question is how to keep the data processing pipeline general while supporting such specialized types.

  • Build specialized low-level estimators for specific data types, e.g. covariance estimators for integer and sparse boolean data. (A simple one-pass algorithm is robust for integer data, and a C implementation can deal with 1's and 0's efficiently.)
  • A high-level estimator (e.g. TICA) encapsulates multiple input types, e.g. float and int.
  • A fallback implementation is used when no specialized low-level algorithm is implemented. For example, a boolean array can be cast to a float array containing 0.0 and 1.0, and a sparse data chunk can be copied into a dense data chunk.
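The three points above can be sketched as a small dispatch scheme. This is only an illustration under stated assumptions: the names `covariance_from_chunks`, `_moments_integer`, `_moments_float` and `_SPECIALIZED` are hypothetical and not part of any existing API, the specialized path is plain NumPy rather than C, and sparse input is assumed to expose a `toarray()` method as SciPy sparse matrices do.

```python
import numpy as np

def _moments_integer(chunk):
    """Specialized path (hypothetical): exact integer sums for bool/int data."""
    x = chunk.astype(np.int64, copy=False)
    return x.sum(axis=0), x.T @ x

def _moments_float(chunk):
    """Generic path: accumulate in float64."""
    x = np.asarray(chunk, dtype=np.float64)
    return x.sum(axis=0), x.T @ x

# Registry of specialized low-level estimators, keyed by NumPy dtype kind:
# 'b' = boolean, 'i' = signed integer, 'u' = unsigned integer.
_SPECIALIZED = {"b": _moments_integer, "i": _moments_integer, "u": _moments_integer}

def covariance_from_chunks(chunks):
    """High-level estimator (hypothetical): one pass over chunks of any type."""
    n, total_sum, total_xx = 0, 0, 0
    for chunk in chunks:
        if hasattr(chunk, "toarray"):
            # Fallback: copy a sparse chunk into a dense chunk.
            chunk = chunk.toarray()
        chunk = np.asarray(chunk)
        # Use the specialized estimator if one exists, else fall back to float.
        moments = _SPECIALIZED.get(chunk.dtype.kind, _moments_float)
        s, xx = moments(chunk)
        n += chunk.shape[0]
        total_sum = total_sum + s
        total_xx = total_xx + xx
    mean = total_sum / n
    # One-pass formula: <x x^T> - <x><x>^T (normalized by n, not n - 1).
    return total_xx / n - np.outer(mean, mean)

# Usage: boolean chunks take the integer path, float chunks the fallback.
rng = np.random.default_rng(0)
data = rng.random((200, 4)) < 0.3
cov_bool = covariance_from_chunks([data[:120], data[120:]])
cov_float = covariance_from_chunks([data.astype(np.float32)])
```

Because the integer path accumulates exact sums, the subtraction in the one-pass formula does not suffer from the rounding errors that can make it unreliable for general float data; that is the sense in which the simple algorithm is robust for integer input.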

Data types in late Estimation

  • Clustering output is integer (discrete state indices), and MSM/HMSM input is integer as well
  • How can they be included in a data processing pipeline?
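As one possible way to think about this question, the sketch below passes integer data between two stages: a clustering-like assignment step that maps float coordinates to integer state indices, and a counting step that consumes those indices as it would for MSM estimation. The function names `assign_to_centers` and `count_matrix` are hypothetical, and the centers and data are made-up illustration values.

```python
import numpy as np

def assign_to_centers(coords, centers):
    """Map float coordinates (n, d) to integer indices of the nearest center."""
    sq_dists = ((coords[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return sq_dists.argmin(axis=1)          # integer array of length n

def count_matrix(dtraj, n_states, lag=1):
    """Count transitions i -> j at the given lag in an integer trajectory."""
    counts = np.zeros((n_states, n_states), dtype=np.int64)
    # np.add.at handles repeated index pairs correctly (unbuffered addition).
    np.add.at(counts, (dtraj[:-lag], dtraj[lag:]), 1)
    return counts

# Float input enters the first stage, integers flow into the second.
centers = np.array([[0.0], [1.0], [2.0]])
coords = np.array([[0.1], [0.9], [1.2], [-0.2], [2.3]])
dtraj = assign_to_centers(coords, centers)   # discrete trajectory 0, 1, 1, 0, 2
counts = count_matrix(dtraj, n_states=len(centers))
```

The point of the sketch is only that the data type changes from float to integer at the clustering stage, so a pipeline that assumes float chunks throughout would need either a typed interface between stages or a fallback cast.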