Documentation: Proper DataSource format and usage for K-Means Clustering #72
I think I'm conflating a couple of concepts here.
If it helps, essentially what I'm trying to do is associate named features with coordinates so that they can be classified/clustered using K-Means. There's probably a way to do this already, but I'm missing it. I'd then like to use the trained output clusters as an input for classification.
Do you have ground truth cluster ids in your csv file? If not, then you can use any response processor you like; just make sure you pass it a ClusteringFactory so the outputs come out as ClusterIDs. If you do have ground truth cluster ids (and those ids are integers), then you should use a FieldResponseProcessor pointed at that column.

The overall architecture of Tribuo's output system is discussed here - https://tribuo.org/learn/4.0/docs/architecture.html#structure, but there's nothing specific on loading in clustering data. Once you've got the DataSource you can wrap it in a Dataset and train on it as usual.
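For example, something like this should work once the data source is producing ClusterIDs (a rough sketch on my part; the KMeansTrainer arguments here (number of centroids, iterations, distance, thread count, RNG seed) are placeholder values, and csvSource stands in for whatever columnar CSVDataSource you build):

```java
import org.tribuo.MutableDataset;
import org.tribuo.clustering.ClusterID;
import org.tribuo.clustering.kmeans.KMeansModel;
import org.tribuo.clustering.kmeans.KMeansTrainer;

// Wrap the DataSource in a Dataset and train K-Means on it.
var dataset = new MutableDataset<ClusterID>(csvSource);
var trainer = new KMeansTrainer(10, 25, KMeansTrainer.Distance.EUCLIDEAN, 4, 1L);
KMeansModel model = trainer.train(dataset);
```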
Thank you so much. This is extremely helpful. I think I'm mostly following. Again, please forgive me, as I'd self-describe as an experienced programmer who is new to machine learning. I don't have ground truth cluster ids. What I have is essentially tabular data where each row has an id, a name, a name weight (quality of name), and a list of feature ids. Essentially, this:
I'm wondering if there will be issues with duplicate features? E.g. do I need to coerce duplicates into a single feature with a count myself? My goal is to cluster this data based on the features, weighted by how many times each feature appears. (Not sure how to add multi-dimensional data?) So I've currently ended up with this:
And as you mentioned, I need to convert the string feature values into integers so they can be used as numeric feature values. However, I'm still a bit unclear on how to tokenize and read those values via the columnar infrastructure.
So I think I'm missing how to create a FieldProcessor that reads features and passes them through as integers from the CSV.
Also, just as a thought/aside about the API: I assume based on your response that there is no built-in way to map string feature names to integer ids. But I'm guessing there may be issues with carrying that mapping through the pipeline which make it less simple? Either way, it shouldn't be too hard to do in my code if I can figure out how to read the feature IDs as ints.
Also a quick note: I've tried using the existing processors, and also tried extending them myself, but haven't found the right combination yet.
To deal with extracted duplicate features you should use a feature processor (the final Set argument in the RowProcessor constructor), which post-processes the full list of features generated for each row.
Is it possible to paste a line or two of the data? I think I'm missing something about how the data is set up. Tribuo's feature space is both named and implicitly sparse, so I think it should map pretty easily to your task, but it looks like the row processor is mangling your inputs in some way.

In general it's not necessary to map feature names into id numbers, as Tribuo will do that for you, because everything at the user level is named, not numbered. The only things that are actually numbered are cluster ids, and even then we could consider making them named to fit better with the rest of Tribuo (though that would be a breaking change and so would have to wait till the next major release).
That will emit multiple features where each one has the same name (as it comes from the field name) and different values based on the value that was extracted. I think you probably want to emit a feature whose name is the card id and whose value is the number of copies.
Interesting. Wouldn't adding the feature IDs actually create potentially duplicate features? E.g. 1+3 = 4 and 2+2 = 4.

Absolutely! As you can see, each row has any number of features and there's no specific order.
Looking at the APIs you mentioned.
It's cool you're using Tribuo on MtG data. Assuming those numbers are card ids then it might be simpler for you to process this in a different way, but given the format the data is in you can do this:

```java
var fieldProcessors = new HashMap<String, FieldProcessor>();
fieldProcessors.put("FEATURE_IDS",
    new TextFieldProcessor("cards", new TokenPipeline(new SplitPatternTokenizer(","), 1, true)));
fieldProcessors.put("TYPE", new IdentityProcessor("format"));
var responseProcessor = new FieldResponseProcessor<>("blank", null, new ClusteringFactory());
var metadataExtractors = new ArrayList<FieldExtractor<?>>();
metadataExtractors.add(new IdentityExtractor("ID"));
metadataExtractors.add(new IdentityExtractor("NAME"));
var rowProcessor = new RowProcessor<ClusterID>(metadataExtractors, null, responseProcessor, fieldProcessors, Collections.emptySet());
var csvSource = new CSVDataSource<ClusterID>(csvInputPath, rowProcessor, false);
return new MutableDataset<ClusterID>(csvSource);
```

Basically all I did was switch the processor on the FEATURE_IDS column to a TextFieldProcessor with a TokenPipeline that splits on commas.

This is what the first example looks like:
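If you want to sanity check what got loaded, a quick way to dump the first example (just a sketch using the standard Dataset and Example accessors) is:

```java
// assumes: import org.tribuo.Example; import org.tribuo.Feature;
var dataset = new MutableDataset<ClusterID>(csvSource);
System.out.println("Examples: " + dataset.size() + ", features: " + dataset.getFeatureMap().size());
Example<ClusterID> first = dataset.getExample(0);
for (Feature f : first) {
    System.out.println(f.getName() + " = " + f.getValue());
}
```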
You know Magic! (I guess I shouldn't be surprised.) :) That's awesome. I'm now super excited. Do you actively play, or are you just familiar? So... yes! I'm working on a new Metagame analysis for www.topdecked.com. I used to do R&D for Red Hat but this is my chosen career as of a year or so ago. (Got burned out.)

I think I see what you did there. It looks like I was close-ish. I think the main difference is the way the FEATURE_IDS column is processed; I was trying to assign this to a field and that's where it was blowing up. Let me try this.
Out of curiosity, what's the purpose of the response processor in a clustering setup? Isn't that more for classification, when you're either gathering possible outputs or evaluating for a result output via an already trained dataset? I guess this is potentially just a downstream effect of the shared nature of this (actually very nice) generic dataset pipeline?
Hot dog!
I play intermittently. I used to play a lot more back in the UK, but when I moved to the US it became much less frequent (plus my daughter isn't old enough to play it yet). Obviously this year I've been stuck playing Arena, though I did get to play a mystery booster draft at PAX East in February (I did pretty badly).
It's partially an issue with a shared input pipeline, but sometimes there is a ground truth clustering that you want to measure the performance of the system against (e.g. when developing new clustering algorithms, or trying a clustering approach to some other supervised learning task), and so it is useful. We should probably make it simpler to turn off in this kind of use case though, as both the clustering and anomaly detection tasks are likely to have this issue. The way I used to turn it off is pretty esoteric and requires knowledge of the internal codepaths, which aren't too well documented (but at least it's open source so you can read it).
That's cool. I miss having co-workers. That's one downside of going solo on a project like this. Yeah, Arena just isn't the same as sitting down with some friends or going to a card shop. But it has its merits. I prefer paper myself as I've been playing since '97. I haven't played Arena in months, since I started kicking it into gear trying to get this project done and into the wild. Happy to give you (or anyone else you want) a beta account if you're at all interested in seeing this in action (but I digress.)
A simpler way to switch that off would definitely be welcome.
Out of curiosity, do you guys have any packages in this library that will do "automatic" determination of K based on some kind of metric? E.g. the elbow method, silhouette scores, etc.?

EDIT: I see the centroids already have a number of metrics/analysis methods provided. I'll check those out; it shouldn't be too hard to implement something like the sketch below myself.
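A rough elbow-style sweep could look something like this (purely a sketch on my end; it assumes the KMeansModel.getCentroidVectors() and DenseVector.createDenseVector(...) calls behave the way I think they do, and the K range is arbitrary):

```java
// Sweep K and compute the within-cluster sum of squared distances for an elbow plot.
// Assumes: org.tribuo.Example, org.tribuo.clustering.ClusterID,
// org.tribuo.clustering.kmeans.*, org.tribuo.math.la.DenseVector are imported,
// and `dataset` is the MutableDataset<ClusterID> from above.
for (int k = 2; k <= 100; k += 2) {
    var trainer = new KMeansTrainer(k, 25, KMeansTrainer.Distance.EUCLIDEAN, 4, 1L);
    KMeansModel model = trainer.train(dataset);
    DenseVector[] centroids = model.getCentroidVectors();
    double wcss = 0.0;
    for (Example<ClusterID> example : dataset) {
        int clusterId = model.predict(example).getOutput().getID();
        DenseVector point = DenseVector.createDenseVector(example, model.getFeatureIDMap(), false);
        double dist = point.subtract(centroids[clusterId]).twoNorm();
        wcss += dist * dist;
    }
    System.out.println(k + "," + wcss);
}
```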
Yeah I think we might want to modify the constructors to make it simpler, but ensuring that it interacts properly with the configuration system will require a bit of thought.
We don't have any kind of hyperparameter optimization built in yet, but it's something we're interested in. Making it work across all the things Tribuo supports might be tricky, so it requires some thought. The metrics are mostly about measuring against ground truth clusterings, so they're not applicable for your use case; we need to add some more which measure qualities of the clusters themselves.
Gotcha. This is where the fun begins :) Thank you for all of your help. A simple metric based on whether each feature/card has a hypothetically valid quantity in a given cluster:
Not a very good cluster ;) Time for tuning!

Related to optimization, it would be awesome if you could provide your own evaluation classes/functions to the evaluator.
Just spitballing here :) All of this can be achieved now, of course, by writing a little extra code (something like the sketch below), and this is an awesomely powerful project already.
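Here's roughly how I'm aggregating per-cluster stats at the moment (a sketch; the actual "valid quantity" check is my own logic, not anything from Tribuo):

```java
// Group the training examples by their predicted cluster so custom metrics
// can be computed per cluster. Assumes java.util.* and the Tribuo types above are imported.
Map<Integer, List<Example<ClusterID>>> byCluster = new HashMap<>();
for (Example<ClusterID> example : dataset) {
    int clusterId = model.predict(example).getOutput().getID();
    byCluster.computeIfAbsent(clusterId, key -> new ArrayList<>()).add(example);
}
for (var entry : byCluster.entrySet()) {
    System.out.println("Cluster " + entry.getKey() + ": " + entry.getValue().size() + " decks");
    // ...compute whatever per-cluster quality metric makes sense here...
}
```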
We thought about user-defined metrics when building the evaluation system, and decided against it for the first public release. The underlying design should allow user metrics to drop in when we enable them, but yes, at the moment you'll need to write additional code to aggregate your own metrics into your own evaluation.
@Craigacp Sorry to bug you again. One more question about all this. I'm having a bit of confusion interpreting the centroid results, now that the features are converted to integers and that's working. Where do I find each feature (by ID?) in the resultant clusters? I thought I was correctly assuming that the centroid vectors are indexed by feature ID, and that each average feature value is accessible via the element at that index.
If this doesn't return the feature value/score, where can I get that? Also, it seems like asking the model for its top features just returns an empty map. I feel like I'm missing something basic in the docs. Is this all explained somewhere I could look at without bothering you here? I've found this page: https://tribuo.org/learn/4.0/docs/packageoverview.html, the JavaDocs, and the tutorials.
I suspect it's a gap in the docs. Unfortunately you've hit one of the places where Tribuo exposes its integer ids to the world. We should patch that to give a more user-friendly view on it. The centroid vectors are indexed by Tribuo's internal feature ids (the ones in the model's feature map), so each element corresponds to a single feature. The value of the centroid at a specific index is the point in that dimension (e.g. the quantity of that card).
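Something like this should get you from a centroid back to feature (card) names (a rough sketch, assuming the getCentroidVectors() and getFeatureIDMap() accessors):

```java
// Print the non-zero dimensions of the first centroid along with their feature names.
// Assumes: org.tribuo.ImmutableFeatureMap, org.tribuo.math.la.DenseVector,
// org.tribuo.clustering.kmeans.KMeansModel are imported.
KMeansModel model = trainer.train(dataset);
ImmutableFeatureMap featureMap = model.getFeatureIDMap();
DenseVector[] centroids = model.getCentroidVectors();
DenseVector first = centroids[0];
for (int i = 0; i < first.size(); i++) {
    double value = first.get(i);
    if (value != 0.0) {
        System.out.println(featureMap.get(i).getName() + " = " + value);
    }
}
```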
There is no notion of top features for a K-Means model, which is why it gives the empty map. As the task isn't supervised it's hard to say what feature contributes most to a particular clustering. I guess we could compute the features which separate the data the most, but that would only work for the training data, not any future points you passed in to determine the clustering.
@Craigacp Okay, no problem. Thanks, it seems I'm on the right track. I appreciate the feedback!
I think integer IDs are fine as long as the docs or javadocs say what they are. Of course I'd never complain if they were more strongly typed, as you suggest :)
You mean the index of each element in the centroid vector?

So, the value at a given index is the average quantity of that card within the cluster?

That makes sense. I was expecting something like frequency of feature occurrence across all clusters, which I can certainly calculate myself.
Tribuo's whole thing is that you should never need to know those ids (apart from when importing an external model, and it's unavoidable there), so we should definitely have a method that returns something better. Probably a view of the centroids keyed on feature names rather than indices.
Yep.
Yeah, though you'll need to be a little careful with the sparsity, as it's implicitly zero copies of a card which doesn't appear, so your stats might be a little skewed.

BTW there is a K-Means++ initialisation in the main branch; it'll be in the next feature release (i.e. 4.1.0), but at the moment we don't have a timeline for when that will be. The code base in main should always be working though, so if you want to try it out feel free.
@Craigacp Thanks! I'd definitely be interested in trying that. Looking for it now. Thank god for Maven. Built and installed locally in ~1 min, no build issues whatsoever. Do you, perchance, deploy nightly snapshots anywhere?
We don't deploy nightlies and are unlikely to, because our release processes aren't set up to move that fast.
We're trying to make sure that Tribuo is always straightforward to build, so hopefully people can just build main themselves if they want the latest bits.
@Craigacp Makes sense. I think you've achieved that goal :) Unless you're using an organizational-level staging repository with OSSRH at Sonatype (or privately hosted repos), I'll grant it's rather cumbersome to push staging artifacts and have to log in to do the manual close/release process.

I've been continuing to experiment with the KMeans++ algorithm, and it seems to be working a little better than KMeans. The centroids seem very slightly more accurate, but as you mentioned, I think I'm having an issue with sparsity in the data. There are too many features/cards and too few in each deck, and I don't think the centroids are far enough apart to be reliably distinguishable. There's a lot of noise in the clusters: there will be several features with reasonable values that make sense, then a bunch of others that I know to be incorrect. I noticed there are some 'sparsify()' methods in the resultant cluster objects, but I'm assuming that only cleans up the clusters after they've been selected, and there's no built-in way to do trimming/weighting of sparse data during training?

Also, please stop me if this is too many questions, or if there's a different medium I should be using. I do appreciate your help, but I don't want to be a pain.
Update: looks like things are working better than I thought. It turns out I was still confused about the feature ids. The feature ID Tribuo assigns isn't the same number as the actual card id (the feature name), since Tribuo renumbers the features internally. After implementing a custom lookup from Tribuo's feature index back to the card id, the centroids make sense. E.g.:

Once I sorted this out, it actually looks like the clusters are working quite accurately, and it appears that while sparseness may still be having an effect, it's not nearly as pronounced as I initially thought. Now I'm getting results that are much more in line with previous K-Means algorithms I've tried, with seemingly fewer outliers, in a fraction of the time, and with support for card quantities as feature values.

Note, I have implemented pruning of values below a quantity of 1 (or near 1), since it does not make sense to have less than one copy of a card, and cards that appear with less than 1 copy imply that the feature is an outlier in the cluster. A dataset of 500 randomly selected decks from the Legacy format resulted in an initial optimal K of 56, though I still need to do more tweaking to my accuracy/evaluation/quality metrics.

Example output clusters:
So I think things are on a good track.
Well the notion of distance when some elements are not present is hard to define, so we implicitly set missing elements to zero. Otherwise you can end up with degenerate cases where two decks don't have any card overlap, and the distance between them is undefined.
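To make that concrete with made-up numbers: if deck A contains only 4 copies of card 123 and deck B only 4 copies of card 456, treating the missing entries as zero gives a Euclidean distance of sqrt(4^2 + 4^2) ≈ 5.66, whereas leaving the missing dimensions undefined would give no usable distance at all.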
At the moment we've not got a chat or discussion platform set up, so GitHub issues or the mailing list are it, and the issue is fine. I'm glad you've managed to get it working.

Yes, Tribuo's feature ids are completely disconnected from the id numbers that you pass in, as Tribuo treats the id numbers you pass in as strings and renumbers the features itself (based on the lexicographic ordering of the feature strings). This is why we should add a method to the model which exposes the centroids in terms of feature names.
We've also merged in an empty response processor implementation for use when loading clustering, anomaly detection, or other datasets where you don't expect there to be a ground truth output. I'm going to close this issue now as I think we've patched the usability issues you hit. Open a fresh one if you hit others, or re-open this if you think it's not quite covered by PRs #99 and #98.
Is your feature request related to a problem? Please describe.
Still a newbie to this library, so thanks for bearing with me.
Right now, the documentation shows how to run K-Means clustering on an auto-generated data set of Gaussian clusters. This is great, as it shows K-Means is possible, but (unless I'm missing something) it does not show the steps to input real data. (It mentions "You can also use any of the standard data loaders to pull in clustering data.", but I don't see where that's documented.)

I've figured out how to load a CSV file of features and metadata (thanks to your new Columnar tutorial), but I can't seem to infer how to connect this data with KMeansTrainer, or if that's even the right approach.

Describe the solution you'd like
A clear and concise description/example of how to load real-world (non-autogenerated) data into the K-Means algorithm.
Describe alternatives you've considered
Looking through the JavaDocs, but having trouble knowing what to focus on.
Additional context