Lakeland Model: Implement a version 0 clustering model called `KMeansModel` for Lakeland data #251

LswaN58 · 2024-12-02T23:19:57Z

We currently have a PopulationModel base class that inherits from Generator via the Model class.
Our goal is to write a KMeansModel subclass of PopulationModel, which will be run once the other infrastructure around it is implemented. It should go in the corresponding Python file at src/ogd/games/LAKELAND/models/KMeansModel.py.

The following functions need to be implemented:

_featureFilter : just like a normal 2nd-order feature extractor. Specify a list of the names of features we want for training
_updateFromFeature : again, similar to a 2nd-order feature extractor. At the moment, this approach to implementation sucks and is dumb and will be replaced, but this function will need to effectively turn a bunch of individual FeatureData objects back into a table or whatever we want to train from. Further explanation in another comment.
_train : This function will run once, after the _updateFromFeature function has been called once on each FeatureData object for the whole population. Whatever variable (say, self._data) has been constructed with _updateFromFeature can now be used as input for the training of the k-means model. My assumption is that the implementation of _train here is calling all the code we currently have for the filtering, PCA, and k-means training steps. We can put any classes you've written for these steps into the same LAKELAND/models folder as this class, and import from there. We'll find a better home for those files later.
_apply : This function takes in a list of FeatureData objects and converts them into the array that the scikit-learn model's predict function expects (a 2-D array with one row per sample, so a single row here). Then it returns a new FeatureData object with the predicted value. Again, more details on the FeatureData class in the comments below.
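To make the four pieces above concrete, here is a minimal sketch of how they might fit together. It is not the actual ogd-core implementation: the FeatureData dataclass is a simplified stand-in, the feature names are placeholders, the method signatures are guesses (the real base class may require extra parameters), and the class does not inherit from PopulationModel so that the snippet runs on its own. The update method is written as `_updateFromFeatureData`, the name used in the function header later in this thread. PCA followed by k-means is wired up with a scikit-learn pipeline; the filtering step is left out.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Dict, List

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline


@dataclass
class FeatureData:
    """Simplified stand-in for ogd.common.models.FeatureData (not the real class)."""
    FeatureNames: List[str]
    FeatureValues: List[Any]
    SessionID: str


class KMeansModel:  # the real class would subclass PopulationModel
    FEATURES = ["FeatureA", "FeatureB", "FeatureC"]  # placeholder feature names

    def __init__(self, n_clusters: int = 2):
        # session ID -> {feature name -> value}, filled in one cell at a time
        self._data: Dict[str, Dict[str, float]] = defaultdict(dict)
        self._pipeline = make_pipeline(
            PCA(n_components=2),
            KMeans(n_clusters=n_clusters, n_init=10, random_state=0),
        )

    def _featureFilter(self) -> List[str]:
        # Names of the features we want for training.
        return self.FEATURES

    def _updateFromFeatureData(self, feature: FeatureData) -> None:
        # Called once per FeatureData object for the whole population.
        self._data[feature.SessionID][feature.FeatureNames[0]] = feature.FeatureValues[0]

    def _train(self) -> None:
        # Called once, after every feature has been collected.
        X = np.array(
            [[row[name] for name in self.FEATURES] for row in self._data.values()],
            dtype=float,
        )
        self._pipeline.fit(X)

    def _apply(self, features: List[FeatureData]) -> FeatureData:
        # Build a single-row 2-D array in the same column order used for training.
        by_name = {f.FeatureNames[0]: f.FeatureValues[0] for f in features}
        x = np.array([[by_name[name] for name in self.FEATURES]], dtype=float)
        label = int(self._pipeline.predict(x)[0])
        return FeatureData(["KMeansCluster"], [label], features[0].SessionID)


def as_features(session_id: str, values: List[float]) -> List[FeatureData]:
    """Helper for this example only: wrap one session's values as FeatureData objects."""
    return [FeatureData([n], [v], session_id) for n, v in zip(KMeansModel.FEATURES, values)]


# Usage: two well-separated groups of made-up sessions.
rows = {
    "s1": [0.0, 0.0, 0.0], "s2": [0.1, 0.2, 0.0], "s3": [0.2, 0.0, 0.1],
    "s4": [10.0, 10.0, 10.0], "s5": [10.1, 9.9, 10.0], "s6": [9.8, 10.0, 10.2],
}
model = KMeansModel(n_clusters=2)
for session_id, values in rows.items():
    for feature in as_features(session_id, values):
        model._updateFromFeatureData(feature)
model._train()

low = model._apply(as_features("new1", [0.1, 0.1, 0.1]))
high = model._apply(as_features("new2", [10.0, 10.0, 10.1]))
print(low.FeatureValues[0], high.FeatureValues[0])  # two different cluster labels
```

Keeping the column order fixed in one place (FEATURES here) matters, since _train and _apply both have to hand the model columns in the same order.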


LswaN58 · 2024-12-02T23:39:29Z

Some additional information @vijayrampatel

`FeatureData` class

First, the FeatureData class.

This is just a dumb struct to store information on the output from a feature, and is imported from ogd.common.models.FeatureData. You can see the class in GH here. The important parts are Name, FeatureType, FeatureNames, and FeatureValues. The implementation is a little messy and redundant and should get fixed at some point, but the upshot is that FeatureNames and FeatureValues are parallel arrays containing the column names and values as they would appear in an export file.

Because the features you've written for Lakeland have not involved subfeatures (at least as far as I remember), you should be able to safely assume that FeatureNames and FeatureValues always have length 1. Further, FeatureNames's one element should be the same as Name and FeatureType, though we don't really want to use Name or FeatureType directly.
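As a rough illustration of the parallel-array layout and the length-1 assumption, the sketch below uses a simplified stand-in for FeatureData (not the real class from ogd-common) and a hypothetical helper, `single_column`, that fails loudly if a feature ever does carry subfeatures. The feature name and value are made up.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass
class FeatureData:
    """Simplified stand-in for ogd.common.models.FeatureData (not the real class)."""
    Name: str
    FeatureType: str
    FeatureNames: List[str]   # column names, as they would appear in an export file
    FeatureValues: List[Any]  # values, parallel to FeatureNames


def single_column(feature: FeatureData) -> Tuple[str, Any]:
    """Hypothetical helper: return the one (column name, value) pair.

    Raises ValueError if the length-1 assumption does not hold.
    """
    if len(feature.FeatureNames) != 1 or len(feature.FeatureValues) != 1:
        raise ValueError(f"{feature.Name} has subfeatures; expected exactly one column")
    return feature.FeatureNames[0], feature.FeatureValues[0]


# Made-up example: a feature with no subfeatures.
example = FeatureData(
    Name="ExampleFeature",
    FeatureType="ExampleFeature",
    FeatureNames=["ExampleFeature"],
    FeatureValues=[12],
)
name, value = single_column(example)
print(name, value)
```

A check like this is cheap insurance in case a later Lakeland feature does add subfeatures.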

Also noteworthy, we have an update to ogd-common coming very soon which will add SessionID and PlayerID elements to the FeatureData class. While we wait for that update, we'll just use SessionID as if it were an actual property of the class, and deal with the existence of the red error underline until the new release comes out.

`_updateFromFeature` function

Second, implementing _updateFromFeature. The function header is:

def _updateFromFeatureData(self, feature:FeatureData):

As mentioned, the current architecture of things doesn't map nicely to the array-based world of training models. We'll work to streamline this process and remove the need for this step, but for now we're stuck with it. This function should basically operate to place the values of FeatureData objects into a 2-D array or dataframe or whatever is most convenient for working with scikit model training. Each time it runs, it'll receive a new FeatureData object.
Using the feature.SessionID, feature.FeatureNames[0], and feature.FeatureValues[0] properties/values, we should be able to fill in an array bit-by-bit. For example:

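def _updateFromFeatureData(self, feature:FeatureData):
    # Runs as written if self._data is a dict of dicts, e.g. collections.defaultdict(dict)
    # (an assumption; a pandas DataFrame would need self._data.loc[_row, _col] = _val instead).
    _row = feature.SessionID        # one row per session
    _col = feature.FeatureNames[0]  # column name as it appears in the export file
    _val = feature.FeatureValues[0]
    self._data[_row][_col] = _val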

I don't recall the numpy/pandas syntax exactly for setting value at given row/column, and not sure whether you'll need session IDs as rows or as columns for the training step. Also, feel free to compress all into a one-liner. But broadly it should look something like this.
Then the end result should be a filled-in table/array/dataframe with all our feature data, with the same values that currently would go to the output tsv file.
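One way to get to that filled-in table, sketched under a few assumptions: features are collected into a dict of dicts keyed by session ID and then flattened into rows in a fixed column order. `TableBuilder`, its `to_rows` method, and the feature names are hypothetical and only stand in for the relevant part of KMeansModel; FeatureData is again a simplified stand-in, including the SessionID property that has not been released yet.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class FeatureData:
    """Simplified stand-in for ogd.common.models.FeatureData (not the real class)."""
    FeatureNames: List[str]
    FeatureValues: List[Any]
    SessionID: str


class TableBuilder:
    """Hypothetical stand-in for the data-collection part of KMeansModel."""

    def __init__(self, columns: List[str]):
        self._columns = list(columns)
        # session ID -> {column name -> value}
        self._data: Dict[str, Dict[str, Any]] = defaultdict(dict)

    def _updateFromFeatureData(self, feature: FeatureData) -> None:
        self._data[feature.SessionID][feature.FeatureNames[0]] = feature.FeatureValues[0]

    def to_rows(self) -> Tuple[List[str], List[List[Any]]]:
        """Return (session IDs, table), one row per session; missing cells are None."""
        session_ids = sorted(self._data)
        table = [[self._data[sid].get(col) for col in self._columns] for sid in session_ids]
        return session_ids, table


# Features can arrive in any order; made-up names and values.
builder = TableBuilder(["FeatureA", "FeatureB"])
builder._updateFromFeatureData(FeatureData(["FeatureB"], [2.5], "session-1"))
builder._updateFromFeatureData(FeatureData(["FeatureA"], [3], "session-2"))
builder._updateFromFeatureData(FeatureData(["FeatureA"], [1], "session-1"))

session_ids, table = builder.to_rows()
print(session_ids)
print(table)
```

If a DataFrame is more convenient for the training step, `pandas.DataFrame.from_dict(builder._data, orient="index")` should build one from the same nested dict, with session IDs as the index. Building the table once at the end also avoids growing a DataFrame cell by cell.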

LswaN58 added the "new feature" (New feature or request) and "model" (Create or improve a Model class) labels Dec 2, 2024

LswaN58 added this to the ogd-core 1.2.0 "modeling" milestone Dec 2, 2024

LswaN58 assigned LswaN58 and vijayrampatel Dec 2, 2024
