# Estimator Data design
### Types of data readers

We can consider different types of Readers for mass data:
1. Streamed: data are read fully from beginning to end.
2. Ordered access: data are read from beginning to end, but with a list of which trajectories / frames should be read. This list is ordered, in order to avoid unnecessary seeking operations and multiple open/close operations on the same file.
3. Random access: trajectories and frames can be requested in arbitrary order.
Currently we have implemented 1., and all algorithms (TICA etc.) operate on streamed data. This is inconvenient for algorithms that require many passes through the data (k-means) and for algorithms that only work with a fraction of the data at a time (mini-batch k-means). We currently only have workarounds for these cases: 2. can be emulated by generating an ordered list and skipping/streaming through files in order to collect only the requested coordinates into memory; 3. can only be realized if the data fit into memory (as done in k-means). In order to implement 2. and 3. efficiently, we need file types / storage strategies that support some sort of indexing. For example, xtc files are especially bad for random access, but we could create an index when opening a file for the first time and then save an index file for future access (see the sketch below).
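As a hedged sketch of the indexing idea (a toy frame format with 4-byte length headers stands in for xtc; this is not actual xtc parsing, and all names are assumptions):

```python
import os
import pickle
import struct

class IndexedFrameReader:
    """Toy reader for a file of variable-size frames, each preceded by a 4-byte
    length header (a stand-in for formats like xtc where frame sizes vary).
    On first open, a byte-offset index is built and saved next to the file, so
    later ordered/random access can seek directly to the requested frames."""

    def __init__(self, path):
        self.path = path
        idx = path + '.idx'
        if os.path.exists(idx):
            with open(idx, 'rb') as f:
                self.offsets = pickle.load(f)       # reuse the previously built index
        else:
            self.offsets = self._scan_offsets()     # one expensive linear pass
            with open(idx, 'wb') as f:
                pickle.dump(self.offsets, f)        # save for future access

    def _scan_offsets(self):
        offsets = []
        with open(self.path, 'rb') as f:
            while True:
                header = f.read(4)
                if not header:
                    break
                offsets.append(f.tell() - 4)
                (size,) = struct.unpack('<I', header)
                f.seek(size, os.SEEK_CUR)           # skip the frame payload
        return offsets

    def read_frames(self, frame_indices):
        """Ordered access: requested frames are sorted to avoid needless seeking."""
        with open(self.path, 'rb') as f:
            for i in sorted(frame_indices):
                f.seek(self.offsets[i])
                (size,) = struct.unpack('<I', f.read(4))
                yield f.read(size)                  # raw frame bytes
```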
### Types of estimators

Correspondingly, we have different types of estimators, in terms of how they need their data:
- Single dataset in memory; one-shot estimation: `estimate(X)`
- Large dataset that needs to be fed in batches: `add_data(X, **params)`, concluded by executing the estimation.
- Multi-pass estimation: data needs to be fed multiple times to complete the estimation (e.g. two-pass covariance estimation: i) compute mean, ii) compute covariance from mean-free data).
- Estimator is updated many times with different data sets (e.g. mini-batch k-means, stochastic approximation).
Ideally, we would have a match between Reader and Estimator.
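To make the first two conventions concrete, a toy comparison using only a mean (illustration only, no actual Estimator classes involved):

```python
import numpy as np

X = np.random.randn(100000, 3)

# one-shot estimation: the whole dataset fits into memory
mean_oneshot = X.mean(axis=0)

# batch-wise estimation: the same quantity accumulated chunk by chunk
# (add_data-style updates), concluded after the last chunk
s, n = 0.0, 0
for chunk in np.array_split(X, 100):
    s, n = s + chunk.sum(axis=0), n + len(chunk)
mean_batched = s / n

assert np.allclose(mean_oneshot, mean_batched)
```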
### Push versus pull
1) Push data into estimators

When streaming data, we can push the data into the estimator. In this case the streamed reader has a reference to the data and to the attached estimators.

Advantages:
- We can attach multiple estimators and feed every chunk of data read to all of them. This way, estimators that can run in parallel (e.g. computation of the mean and the sparse features in a PCA/TICA estimation) can be fed together.

Problems:
- Data exchange needs to be triggered by the streamed reader. The streamed reader needs to ask the Estimators if they are finished, or if they need e.g. more passes.
- This approach only works for streaming, but not with random / ordered access.
```
Dependencies:                            Data flow (usually in chunks):

Data <--- Streamer ---> Estimator 1      Data ===> Streamer ===> Estimator 1
                   |                                        |
                   +--> Estimator 2                         +===> Estimator 2
```
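A minimal sketch of the push scheme (toy classes, not the actual PyEMMA Streamer/Transformer; method names are assumptions):

```python
import numpy as np

class MomentEstimator:
    """Toy attached estimator: accumulates the p-th raw moment in a single pass."""
    def __init__(self, power):
        self.power, self._s, self._n, self.done = power, 0.0, 0, False

    def add_chunk(self, X):
        self._s, self._n = self._s + (X ** self.power).sum(axis=0), self._n + len(X)

    def end_of_pass(self):
        self.moment, self.done = self._s / self._n, True   # one pass suffices here


class Streamer:
    """Toy push-based streamer: pushes every chunk to all attached estimators and
    repeats full passes over the data until every estimator declares it is done."""
    def __init__(self, data, chunksize=1000):
        self.data, self.chunksize, self.estimators = data, chunksize, []

    def attach(self, estimator):
        self.estimators.append(estimator)

    def run(self):
        while not all(e.done for e in self.estimators):
            for start in range(0, len(self.data), self.chunksize):
                chunk = self.data[start:start + self.chunksize]
                for e in self.estimators:       # the same chunk is pushed to everyone
                    e.add_chunk(chunk)
            for e in self.estimators:
                e.end_of_pass()


streamer = Streamer(np.random.randn(5000, 3))
streamer.attach(MomentEstimator(1))             # mean
streamer.attach(MomentEstimator(2))             # second moment, fed in parallel
streamer.run()
```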
2) Observed Data Stream

We currently have the following variant of principle 1:
```
Dependencies:               Data flow (usually in chunks):

Data <--- Streamer          Data ===> Streamer
              |                           |
              V                           V
          Estimator                   Estimator
```
Our Streamer class is the `Transformer` class, which currently does both transformation and data stream processing (this should be separated). Stream-based Estimators such as TICA are subclasses of the `Transformer`. They observe the data stream that is being pushed through their superclass and take the data in the way they need it. The data will be repeatedly streamed until the Estimator declares that it is finished.
This approach is limited when data is requested in ordered access and cannot be used for random access. We have implemented ordered access with a workaround, but it may be possible to realize it in a cleaner way.
3) Estimator pulls the data

Only the estimator knows how it wants to consume the data. Based on this idea, we can design a pull scheme.

Advantages:
- Any access type (streamed, ordered, random) can be realized.

Problems:
- Multiple parallel estimators (e.g. mean and sparsification in TICA) cannot be realized. If we want to combine parallel estimation tasks, we have to combine multiple estimators into one, effectively creating a single data consumer.
```
Dependencies:                          Data flow (usually in chunks):

Data <--- Reader <--- Estimator 1      Data ===> Reader ===> Estimator 1
     read        pull                        X           X
```
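A minimal sketch of the pull scheme (toy ArrayReader and estimator; names are assumptions). Note that multi-pass estimation becomes trivial here, because the estimator simply pulls the reader as often as it needs:

```python
import numpy as np

class ArrayReader:
    """Toy reader: serves chunks of an in-memory array; a file-backed reader
    would expose the same iteration interface."""
    def __init__(self, data, chunksize=1000):
        self.data, self.chunksize = data, chunksize

    def iter_chunks(self):
        for start in range(0, len(self.data), self.chunksize):
            yield self.data[start:start + self.chunksize]


class TwoPassVariance:
    """The estimator decides how to consume the data: here it pulls two full
    passes (mean first, then variance of the mean-free data)."""
    def estimate(self, reader):
        s, n = 0.0, 0
        for chunk in reader.iter_chunks():                 # pass 1: mean
            s, n = s + chunk.sum(axis=0), n + len(chunk)
        self.mean = s / n
        v = 0.0
        for chunk in reader.iter_chunks():                 # pass 2: variance
            v = v + ((chunk - self.mean) ** 2).sum(axis=0)
        self.var = v / n
        return self


est = TwoPassVariance().estimate(ArrayReader(np.random.randn(5000, 3)))
```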
4) Pull-push stage

Consider the following variant: an algorithm-specific high-level estimator stage pulls the data from a Reader and feeds it into sub-estimators.
```
Dependencies:                      Data flow (usually in chunks):

Data <--- Reader <--- HL-Est       Data ===> Reader ===> HL-Est
                      |    |            pull        pull |feed|
                      |    |                             V    V
                  Est. 1  Est. 2                     Est. 1  Est. 2
```
Advantages:
- Any access type (streamed, ordered, random) can be realized when corresponding Readers are available
- Multiple parallel sub-estimators can be fed. Sub-estimators are grouped and organized by the high-level Estimator (see the sketch below).
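A minimal sketch of such a pull-push stage (toy classes; any iterable of chunks stands in for the Reader; method names are assumptions):

```python
import numpy as np

class ChunkMean:
    """Toy sub-estimator fed by the high-level stage."""
    def __init__(self):
        self._s, self._n = 0.0, 0

    def add_chunk(self, X):
        self._s, self._n = self._s + X.sum(axis=0), self._n + len(X)

    def conclude(self):
        self.result = self._s / self._n


class HighLevelEstimator:
    """Toy pull-push stage: pulls chunks from a reader (here, any iterable of
    chunks) and feeds each chunk to all of its sub-estimators."""
    def __init__(self, sub_estimators):
        self.subs = sub_estimators

    def estimate(self, reader):
        for chunk in reader:               # pull from the reader
            for sub in self.subs:          # feed every sub-estimator in parallel
                sub.add_chunk(chunk)
        for sub in self.subs:
            sub.conclude()
        return self


chunks = np.array_split(np.random.randn(5000, 3), 10)
hl = HighLevelEstimator([ChunkMean(), ChunkMean()]).estimate(chunks)
```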
To realize this design, we need to:
- Define Reader interface
- Define High-level Estimator interface
- Define Low-level Estimator interfaces
The current Estimator interface follows the sklearn concept, and is defined by:
```python
class Estimator:

    def __init__(self, **params):
        """
        All estimation parameters must be defined in this argument list.
        """

    def set_params(self, **params):
        """ Sets the estimation parameters as given. """

    def get_params(self):
        """ Gets the estimation parameters dictionary. """

    def estimate(self, X, **params):
        """
        Runs estimation with data set X. Estimation parameters set in __init__ can be
        overridden here. This function will set the parameters accordingly, and then
        call _estimate(X). For compatibility with sklearn, Estimator also has a fit(X)
        function that will call estimate(X).
        """

    def _estimate(self, X):
        """ Conducts the estimation. Override this method to implement an estimator. """
```
If X can be a reference to a Reader object, we can probably keep this design. Differences in handling the data (streaming vs. random access) would then be handled inside `_estimate`.
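A minimal sketch of that idea (toy `MeanEstimator`, treating any iterable of chunks as a stand-in for a Reader; not the actual PyEMMA classes):

```python
import numpy as np

class MeanEstimator:
    """Toy estimator following the interface above; _estimate decides whether X
    is an in-memory array or a Reader-like object yielding chunks."""
    def __init__(self, **params):
        self.params = params

    def estimate(self, X, **params):
        self.params.update(params)       # estimation parameters can be overridden
        return self._estimate(X)

    fit = estimate                       # sklearn-compatible alias

    def _estimate(self, X):
        if isinstance(X, np.ndarray):    # single dataset in memory
            self.mean = X.mean(axis=0)
        else:                            # Reader-like object: stream the chunks
            s, n = 0.0, 0
            for chunk in X:
                s, n = s + chunk.sum(axis=0), n + len(chunk)
            self.mean = s / n
        return self


est = MeanEstimator().estimate(np.random.randn(1000, 2))                          # in-memory
est = MeanEstimator().estimate(iter(np.array_split(np.random.randn(1000, 2), 10)))  # reader-like
```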
1) Single-shot estimators (e.g. MSMs): see the estimation scheme above. We already have that; here the low-level estimator equals the high-level estimator, so we don't need two levels.

2) Updated estimators (e.g. PCA, TICA)
```python
class CovarianceEstimator:
    """
    Attributes
    ----------
    mean : mean vector
    cov : covariance matrix
    covtau : time-lagged covariance matrix
    """

    def __init__(self, lag=0):
        """
        lag : int, default=0
            if lag > 0, will compute time-lagged covariances with the given lag time.
        """

    def add_data(self, X, Y=None, weights=1, itraj=0, **params):
        """ Adds chunk X (T x n) to the running estimate of mean and covariances.

        Parameters
        ----------
        X : ndarray (T, n)
            data chunk
        Y : ndarray (T, n)
            time-lagged data chunk. If given, time-lagged covariances will be computed.
        """
```
What's the advantage of having a low-level estimator here? We could still do this as a normal estimator with estimate(X), when X is a streamable reader.
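For illustration, a minimal running estimate behind such an `add_data` could look like this (toy code; weights, `itraj` bookkeeping and bias corrections are omitted, and all names are assumptions):

```python
import numpy as np

class RunningCovar:
    """Toy running estimator of mean, covariance and time-lagged covariance,
    updated chunk by chunk via add_data."""
    def __init__(self):
        self._sx = self._sxx = self._sxy = 0.0
        self._n = 0

    def add_data(self, X, Y=None):
        self._sx = self._sx + X.sum(axis=0)
        self._sxx = self._sxx + X.T @ X
        if Y is not None:                        # time-lagged chunk, if provided
            self._sxy = self._sxy + X.T @ Y
        self._n += len(X)

    def finalize(self):
        self.mean = self._sx / self._n
        mm = np.outer(self.mean, self.mean)
        self.cov = self._sxx / self._n - mm      # instantaneous covariance (biased)
        if not np.isscalar(self._sxy):           # only if time-lagged data was added
            self.covtau = self._sxy / self._n - mm   # simplified: X and Y share the mean
        return self


est = RunningCovar()
for chunk in np.array_split(np.random.randn(5000, 3), 10):
    est.add_data(chunk)
est.finalize()
```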
1) Two-pass TICA
```
Data ---> mean ---> +--------+
  |                 |        |
  |                 | Covars | ---> Eigenvalue solver ---> Transformer
  |                 |        |
  +---------------> +--------+
```
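A hedged sketch of the two-pass scheme as a whole (toy code, assuming a callable that yields data chunks; lagged pairs across chunk boundaries are ignored for simplicity):

```python
import numpy as np
import scipy.linalg

def two_pass_tica(chunks_fn, lag, dim):
    """Toy two-pass TICA: pass 1 computes the mean, pass 2 the instantaneous and
    time-lagged covariances of the mean-free data; a generalized eigenvalue
    problem then yields the transformation."""
    # pass 1: mean
    s, n = 0.0, 0
    for X in chunks_fn():
        s, n = s + X.sum(axis=0), n + len(X)
    mean = s / n
    # pass 2: covariances from mean-free data (lagged pairs only within chunks)
    C0 = Ctau = 0.0
    m = 0
    for X in chunks_fn():
        X = X - mean
        C0 = C0 + X[:-lag].T @ X[:-lag]
        Ctau = Ctau + X[:-lag].T @ X[lag:]
        m += len(X) - lag
    C0, Ctau = C0 / m, 0.5 * (Ctau + Ctau.T) / m      # symmetrize Ctau
    # solve Ctau v = lambda C0 v; largest eigenvalues = slowest processes
    evals, evecs = scipy.linalg.eigh(Ctau, C0)
    order = np.argsort(evals)[::-1]
    return mean, evecs[:, order[:dim]]                # projection = the "Transformer"


data = np.random.randn(10000, 5)
mean, W = two_pass_tica(lambda: iter(np.array_split(data, 20)), lag=3, dim=2)
Y = (data - mean) @ W                                 # transformed coordinates
```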
2) Sparse feature TICA
```
Data ---> mean ---------> +--------+
  |                       |        |
  +---> sparsify -------> | Covars | ---> Eigenvalue solver ---> Transformer
  |                       |        |
  +---------------------> +--------+
```
3) One-pass TICA
```
          mean
Data ---> covars ---> Eigenvalue solver ---> Transformer
```
4) mini-batch k-means
```
        N updates
Data ------------> centers ---> Discretizer
                                assignments
```
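A minimal sketch of the mini-batch k-means update loop (toy code following the standard per-center learning-rate update; parameter names are assumptions):

```python
import numpy as np

def minibatch_kmeans(reader_chunks, k, n_updates=50, rng=np.random.default_rng(0)):
    """Toy mini-batch k-means: each update pulls one batch from the reader and
    moves the assigned centers towards the batch points."""
    centers = None
    counts = np.zeros(k)
    for i, batch in enumerate(reader_chunks):
        if centers is None:
            # initialize centers from the first batch
            centers = batch[rng.choice(len(batch), k, replace=False)].copy()
        if i >= n_updates:
            break
        # assign each batch point to its nearest center
        d = np.linalg.norm(batch[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j, x in zip(assign, batch):
            counts[j] += 1
            centers[j] += (x - centers[j]) / counts[j]   # per-center learning rate
    return centers


chunks = iter(np.array_split(np.random.randn(20000, 2), 100))
centers = minibatch_kmeans(chunks, k=5)
```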