Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DRAFT] Issue 221/plot refactoring #224

Draft
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

ndphillips
Copy link
Owner

@ndphillips ndphillips commented May 24, 2024

Addressing #221

Function status

  • plot_level_bar()
  • plot_icon_array()
  • plot_tree()
  • plot_confusion_matrix()
  • plot_roc

Checked means

  • Function created
  • Implemented in plot.FFTrees()
  • Legacy code removed

@ndphillips ndphillips marked this pull request as draft May 24, 2024 11:19
@ndphillips
Copy link
Owner Author

ndphillips commented May 24, 2024

Phew I knew this would be a long slog of a PR but I didn't realize it would be so much! There are a lot of unexpected processes going on in this function that is making it hard to cleanly separate the plot into individual plotting functions.

That said, I've been able to create some initial functions for all of the plotting elements listed above and integrate them into plot.FFTrees().

The next big step is to make the individual functions more user friendly and with a consistent API between them. There are lots of decisions to be made about what the minimal arguments should be to each. For example, it's not clear to me which functions should require FFTrees objects as arguments and which should take data frames.

Currently, here's where I'm at with plot_fft(). In my first version, the main argument is a dataframe of level.stats (which I pull out of an FFTrees object using a new pluck_level_stats() helper function:

# Create an FFTrees object from the heartdisease data:
heart_fft <- FFTrees(formula = diagnosis ~.,
                     data = heart.train,
                     data.test = heart.test,
                     decision.labels = c("Healthy", "Disease"))

heart_fft |>
  pluck_level_stats(data = "test", tree = 1) |>
  plot_fft()
image

Will keep working on this when I find the time

@hneth
Copy link
Collaborator

hneth commented May 26, 2024

Yes, it's quite a challenge to break up the older plotting function into functional parts. But it's great that you're wondering about clean input arguments, rather than just dropping the entire FFTrees object into every function!

A tricky aspect of the complex plot is that some parts only require the FFT definition, whereas others require aspects of the current data or the stats from a particular evaluation. If the plotting functions only get the input parts they actually need, this requires helper functions that extract or translate parts of the FFTrees object into simpler data structures. I like the approach you're taking with pluck_level_stats(), and think it is similar to what I tried to do with read_fft_df() (which selects a single FFT from a set of FFT definitions and converts it into a tidy data format (in which each row represents a node). Having such helper functions allows to access and manipulate aspects of FFTs that otherwise remain encapsulated in the FFTrees object.

The plot_fft() function looks great, of course, but I first was surprised that it requires level_stats as an input, rather than just an FFT definition like the tidy data frame obtained from read_fft_df(). But then I realized that we often want to enrich the basic FFT representation with stats from its evaluation (like the icon arrays at exits). In a completely modular setup, plotting the tree itself could be entirely separated from its evaluation. This would allow for even simpler tree plotting functions, but would require additional functions for adding aspects of tree performance to the plot. Hence, using level_stats as input may be a good intermediate level of abstraction?

On a similar note, we should add some option for representing the handling NA values in the data to the plot. (For instance, we could note their number for each predictor at its node, and somehow mark or separate the icons that were classified from NA cases on the final node.) But when level_stats preserve their counts, this should be implementable from your current approach?

@ndphillips
Copy link
Owner Author

Related to this PR, I just drafted a design doc describing how we could refine the structure of FFTrees internal objects to improve consistency and facilitate better object <> function mappings https://github.com/ndphillips/FFTrees/wiki/%5BDraft%5D-FFTrees-Object-Design

@hneth
Copy link
Collaborator

hneth commented May 30, 2024

Thanks for a great analysis and overview (at #226)! But I'm also wondering whether we really want to risk a major re-write at this point? Basically, I see that there is a general trade off between objects, functions, and related data structures. It's certainly true that the current FFTrees object is large, partially inconsistent, and intransparent, but it also contains all the information needed in convenient list structures. Extracting and modifying parts of an object is now possible through dedicated helper functions (like the ones presently used to extract and edit FFT-definitions). Moving towards more of an OOP approach would have long-term benefits, of course, but would also require a complete re-write of the object's interactions (in creating and repeatedly evaluating FFTs on data). Although I applaud your enthusiasm, I'm also skeptical for 2 reasons:

  1. After the last major overhaul, many parts of the original version (e.g. applying new FFT definitions to data) were broken and had to be restored. I fear that revising the object structure would take tremendous efforts and have similar repercussions.

  2. Implementing a new structure of objects would still require managing and changing those objects. For instance, I've recently been experimenting with additional exits — e.g., for "undecided" cases — which would then require modifying the node, outcome, and cost aspects of objects and their corresponding interactions.

Overall, I see the deficits of our current setup, but as I'm also happy with many parts of FFTrees v2.0.0, I'm reluctant to jump into a major overhaul. Despite some enticing benefits, I feel that there are plenty of features that can still be added and improved within the current framework. And as I fear that restoring all present functionality would take months, rather than weeks, I'm wondering: Do we really want to fundamentally change a working system?

@ndphillips
Copy link
Owner Author

ndphillips commented May 30, 2024

Thanks for the thoughts Hans, I hear your concerns and definitely appreciate them:

Some fast initial responses to your 2 main concerns:

  1. Losing functionality. We definitely don't want to lose important functionality! Can you confirm if all of our critical functionality is captured within tests? If No, then perhaps a first step should be to update our tests!

  2. Managing and changing objects. It seems like you're concerned that this proposed re-write would prevent or hinder future development. Is that right? If so, I want to make sure we've captured those concerns in the design and reduced that risk. My strong hope is that this redesign will make future development easier, not harder. Can you capture some specific examples of functionality here or in https://github.com/ndphillips/FFTrees/wiki/%5B80%25%5D-FFTrees-Object-Design that you are concerned would be affected?

@hneth
Copy link
Collaborator

hneth commented May 30, 2024

Just a quick reply to your 2 questions:

  1. No, the most recent additions of functionality (i.e., A: extracting and editing tree descriptions and applying them to data, and B: creating and evaluating FFTs for datasets with NA values) are not covered by tests yet. Parts of A are used in examples and vignettes, I believe, but B is still too recent to be properly evaluated and documented.

  2. I actually think that well-designed objects would facilitate future changes, rather than make them more difficult. But I am afraid of the repercussions of overhauling the entire package, given its complexity and our scarcity of resources (mostly limited time).

The objects you envisage on https://github.com/ndphillips/FFTrees/wiki/%5B80%25%5D-FFTrees-Object-Design appear quite comprehensive, of course. Some additional slots for recording NA values (when they occur at nodes and eventually result in decisions) could easily be added. However, many of the listed objects could be defined and stored more sparsely. For instance, whenever we have the 4 outcomes of the 2x2 matrix (number of hi, mi, fa, cr) all accuracy-related measures (simple ones like n_decide, n_correct, n_error, all variants of acc, and the more specific measures of sens, spec, ppv, npv, dprime, etc.) can easily be derived from them. Hence, I don't think we always need to generate and store all of them in the objects, but rather provide functions for obtaining them when needed.

@ndphillips
Copy link
Owner Author

ndphillips commented May 31, 2024

Thanks Hans!

  1. I created an issue at Test needed for extracting and editing tree descriptions and applying to data #227 capturing the extracting and editing tree descriptions and applying them to data functionality that's missing tests. I believe it's super important that we have tests capturing functionality we don't want to break!

  2. I hear you that the changes like the ones I propose in https://github.com/ndphillips/FFTrees/wiki/%5B80%25%5D-FFTrees-Object-Design seem daunting and risky. However, I'm confident that as long as our package tests capture our core functionality, and nothing gets merged until it's reviewed, I'm up to the task!

  3. Thanks for the ideas regarding ways to improve the sparseness of the proposal. I've added a note in https://github.com/ndphillips/FFTrees/wiki/%5B80%25%5D-FFTrees-Object-Design#notes

Let's continue this discussion at #226

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants