Lakeland Model: Implement a version 0 clustering model called KMeansModel for Lakeland data #251

Open
LswaN58 opened this issue Dec 2, 2024 · 1 comment
Assignees
Labels
model Create or improve a Model class new feature New feature or request

Comments


LswaN58 commented Dec 2, 2024

We currently have a PopulationModel base class that inherits from Generator via the Model class.
Our goal is to write a KMeansModel subclass of PopulationModel, which will be run once all the other infrastructure around it is implemented. This should go in the corresponding Python file at src/ogd/games/LAKELAND/models/KMeansModel.py.

The following functions need to be implemented:

  • _featureFilter : just like a normal 2nd-order feature extractor. Specify a list of the names of the features we want for training.
  • _updateFromFeatureData : again, similar to a 2nd-order feature extractor. The current approach to implementation is clumsy and will be replaced, but for now this function needs to turn a bunch of individual FeatureData objects back into a table (or whatever structure we want to train from). Further explanation in another comment.
  • _train : This function will run once, after _updateFromFeatureData has been called once on each FeatureData object for the whole population. Whatever variable (say, self._data) has been built up by _updateFromFeatureData can now be used as input for training the k-means model. My assumption is that the implementation of _train here calls all the code we currently have for the filtering, PCA, and k-means training steps. We can put any classes you've written for these steps into the same LAKELAND/models folder as this class, and import from there. We'll find a better home for those files later.
  • _apply : This function takes in a list of FeatureData objects and converts them into a 1-D array (or whatever shape the scikit-learn model's predict function needs; scikit-learn estimators generally expect a 2-D array with one row per sample). Then it returns a new FeatureData object with the predicted value. Again, more details on the FeatureData class in the comments below.
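To make the _train / _apply split concrete, here is a rough sketch under several assumptions. FeatureDataStub is a hypothetical stand-in for the real FeatureData class, the feature names and the "KMeansCluster" output name are made up, and the scaler + PCA + k-means pipeline is only a guess at what the existing training code does. The real class would subclass PopulationModel instead of standing alone.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


@dataclass
class FeatureDataStub:
    """Hypothetical stand-in for ogd's FeatureData; only the fields used here."""
    SessionID: str
    FeatureNames: List[str]
    FeatureValues: List[Any]


class KMeansModelSketch:
    """Illustrative only; not the actual PopulationModel interface."""

    # Hypothetical feature names; the real list would come from _featureFilter.
    FEATURES = ["CropsPlanted", "MoneyEarned", "DeathCount"]

    def __init__(self, n_clusters: int = 2):
        # session ID -> {feature name -> value}; filled by _updateFromFeatureData (not shown).
        self._data: Dict[str, Dict[str, float]] = {}
        self._pipeline = make_pipeline(
            StandardScaler(),
            PCA(n_components=2),
            KMeans(n_clusters=n_clusters, n_init=10, random_state=0),
        )

    def _train(self) -> None:
        # One row per session, columns in a fixed order.
        X = np.array(
            [[row[name] for name in self.FEATURES] for row in self._data.values()],
            dtype=float,
        )
        self._pipeline.fit(X)

    def _apply(self, features: List[FeatureDataStub]) -> FeatureDataStub:
        by_name = {f.FeatureNames[0]: f.FeatureValues[0] for f in features}
        # predict() wants shape (n_samples, n_features), so reshape to a single row.
        x = np.array([by_name[name] for name in self.FEATURES], dtype=float).reshape(1, -1)
        cluster = int(self._pipeline.predict(x)[0])
        return FeatureDataStub(features[0].SessionID, ["KMeansCluster"], [cluster])
```

Wrapping the steps in one pipeline means _apply automatically reuses the scaling and PCA projection that were fitted in _train, so the two can't drift apart.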
@LswaN58 LswaN58 added new feature New feature or request model Create or improve a Model class labels Dec 2, 2024
@LswaN58 LswaN58 added this to the ogd-core 1.2.0 "modeling" milestone Dec 2, 2024

LswaN58 commented Dec 2, 2024

Some additional information @vijayrampatel

FeatureData class

First, the FeatureData class.

This is just a dumb struct that stores the output of a feature, and is imported from ogd.common.models.FeatureData. You can see the class in GH here. The important parts are Name, FeatureType, FeatureNames, and FeatureValues. The implementation is a little messy and redundant and should get fixed at some point, but the upshot is that FeatureNames and FeatureValues are parallel arrays containing the column names and values as they would appear in an export file.

Because the features you've written for Lakeland have not involved subfeatures (at least as far as I remember), you should be able to safely assume that FeatureNames and FeatureValues always have length 1. Further, FeatureNames's one element should be the same as Name and FeatureType, though we don't really want to use Name or FeatureType directly.
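A small sketch of how that length-1 assumption could be checked rather than silently trusted. FeatureDataStub is a hypothetical stand-in with only the fields discussed above (the real class lives in ogd.common.models.FeatureData and has more), and single_column is a made-up helper name.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass
class FeatureDataStub:
    """Hypothetical stand-in for FeatureData with only the fields discussed here."""
    Name: str
    FeatureNames: List[str]
    FeatureValues: List[Any]


def single_column(feature: FeatureDataStub) -> Tuple[str, Any]:
    """Return the (column name, value) pair of a feature with no subfeatures."""
    if len(feature.FeatureNames) != 1 or len(feature.FeatureValues) != 1:
        # Fail loudly if a feature with subfeatures ever shows up.
        raise ValueError(
            f"{feature.Name}: expected exactly 1 column, "
            f"got {len(feature.FeatureNames)} names and {len(feature.FeatureValues)} values"
        )
    # Use FeatureNames[0] rather than Name or FeatureType, per the note above.
    return feature.FeatureNames[0], feature.FeatureValues[0]
```

If a Lakeland feature with subfeatures is added later, this raises immediately instead of quietly dropping the extra columns.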

Also noteworthy, we have an update to ogd-common coming very soon which will add SessionID and PlayerID elements to the FeatureData class. While we wait for that update, we'll just use SessionID as if it were an actual property of the class, and deal with the existence of the red error underline until the new release comes out.

_updateFromFeatureData function

Second, implementing _updateFromFeatureData. The function header is:

def _updateFromFeatureData(self, feature:FeatureData):

As mentioned, the current architecture of things doesn't map nicely to the array-based world of training models. We'll work to streamline this process and remove the need for this step, but for now we're stuck with it. This function should place the values of FeatureData objects into a 2-D array, dataframe, or whatever is most convenient for scikit-learn model training. Each time it runs, it'll receive a new FeatureData object.
Using the feature.SessionID, feature.FeatureNames[0], and feature.FeatureValues[0] properties/values, we should be able to fill in an array bit-by-bit. For example:

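def _updateFromFeatureData(self, feature:FeatureData):
    # Assumes self._data is a pandas DataFrame with session IDs as the row index.
    _row = feature.SessionID
    _col = feature.FeatureNames[0]
    _val = feature.FeatureValues[0]
    # .loc sets one cell by label, adding the row/column if it doesn't exist yet.
    self._data.loc[_row, _col] = _val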

I don't recall the exact numpy/pandas syntax for setting the value at a given row/column, and I'm not sure whether you'll need session IDs as rows or as columns for the training step. Also, feel free to compress it all into a one-liner. But broadly it should look something like this.
Then the end result should be a filled-in table/array/dataframe with all our feature data, with the same values that currently would go to the output tsv file.
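On the rows-vs-columns question: scikit-learn estimators expect input of shape (n_samples, n_features), so session IDs as rows and feature names as columns is the layout the training step would want. Below is a dependency-free sketch of the fill-in-bit-by-bit idea; FeatureTable and its method names are hypothetical, not part of ogd.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


class FeatureTable:
    """Hypothetical accumulator: one row per session, one column per feature."""

    def __init__(self, columns: List[str]):
        self._columns = list(columns)  # fixed column order for training
        self._rows: Dict[str, Dict[str, Any]] = defaultdict(dict)

    def update(self, session_id: str, name: str, value: Any) -> None:
        # Mirrors one _updateFromFeatureData call: fill in a single cell.
        if name in self._columns:
            self._rows[session_id][name] = value

    def to_matrix(self) -> Tuple[List[str], List[List[Any]]]:
        """Return (session IDs, rows) with sessions as rows, features as columns."""
        session_ids = sorted(self._rows)
        matrix = [
            # Cells that were never filled come back as None.
            [self._rows[sid].get(col) for col in self._columns]
            for sid in session_ids
        ]
        return session_ids, matrix
```

The matrix can then be handed to numpy or pandas for the filtering, PCA, and k-means steps. Missing cells show up as None, so any session with incomplete features is easy to spot and drop before training.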
