@hfrick
Last active May 8, 2024 09:14

More thoughts on data sets for post-processing

@simonpcouch and I have been brainstorming about how and where to specify and construct the dataset used to estimate a post-processor, currently dubbed the "potato set".

What to specify

  • The proportion of the data used for estimation (preprocessor, model, post-processor) that should be held back specifically for estimating the post-processor.
  • The method for splitting off that data. This may need to be a time-based or grouped split rather than a random one. If we are in the context of resampling a workflow, it should most likely be the same method used to make the resamples.
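To make the two things to specify concrete, here is a minimal, language-agnostic sketch in Python (the function name `potato_split` and its signature are hypothetical, not tidymodels API): a proportion says how much to hold back, and a method says how rows are assigned, e.g. at random versus by whole groups to mirror a grouped resampling scheme.

```python
import random

def potato_split(indices, prop=0.25, method="random", groups=None, seed=1):
    """Hypothetical sketch: hold back `prop` of the estimation rows
    as a potato set, using either a random or a grouped split."""
    rng = random.Random(seed)
    if method == "random":
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_potato = int(len(shuffled) * prop)
        # rest -> preprocessor + model, held-back slice -> potato set
        return shuffled[n_potato:], shuffled[:n_potato]
    if method == "grouped":
        # keep whole groups together, mirroring a grouped resampling scheme
        unique_groups = sorted(set(groups))
        rng.shuffle(unique_groups)
        n_g = max(1, int(len(unique_groups) * prop))
        potato_groups = set(unique_groups[:n_g])
        analysis = [i for i, g in zip(indices, groups) if g not in potato_groups]
        potato = [i for i, g in zip(indices, groups) if g in potato_groups]
        return analysis, potato
    raise ValueError(f"unknown method: {method}")
```

The point of the sketch is only that "proportion" and "method" are the two degrees of freedom; the grouped branch shows why the method matters, since a random split of grouped data would leak group information across the two sets.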

Where to specify

We suggest specifying those things as part of a workflow. We also considered specifying that in rsample, tailor, or tune.

  • A tailor can be fit on its own but does not need a notion of data splitting - it just needs the right data.
  • A workflow needs to fit the preprocessor and the model on one set and the post-processor on another, and hence needs the notion of a data split. The proportion and the method are only needed if there is a tailor, so they can be arguments to add_tailor().
  • The tuning functions already break up the fitting of the preprocessor and the model (to parallelize efficiently while keeping the number of preprocessor fits low) and currently also fit the tailor separately. The tuning functions could retrieve the splitting-related arguments from the workflow specification; the method can also be retrieved from the class of the resamples.
  • rsample would also be a possible place to specify all aspects of resampling, including the "inner potato split". However, workflowsets and stacking require the same set of resamples for all their members, i.e., it would then be impossible to fit a stack (or workflow set) that mixes members with and without post-processing, since those require 3-way and binary splits, respectively, for each resample.

Where to make the potato set

  • If used while fitting a workflow directly, the workflow will have to make the potato set.
    • Additional note on workflows: We need to change fit.workflow() to fit the stages for the preprocessor and the model on the (inner) analysis set and the stage for the post-processor on a different set, the potato set.
  • If used in tune, the tuning function currently makes the potato set and will continue to do so, see comment on tune in "Where to specify".

How to make the potato set

We suggest adding a method in rsample for each type of rsplit which makes a suitable 3-way split into an "analysis set for preproc+model", a "potato set", and an "assessment set".

  • rsplits store the entire data and the indices of the analysis set, in_id, and have a complement() method to construct the assessment set.
  • The new method would be similar in spirit to the complement() method, in the sense that it would allow us to make that 3-way split appropriately, when needed.
  • By "appropriately" we mean: based on the method used to make the rsplit. For example, if the rsplit is a bootstrap, the method should take into account that the analysis set may contain the same row multiple times. A random potato split risks scattering those duplicates so that some land in the portion used to fit the preprocessor+model and some land in the potato set; the two datasets resulting from the potato split would then share information, and the post-processor would not be properly resampled. Making it a method of the rsplit subclass gives access to the entire dataset used for estimation as well as the in_id information, so we can identify and account for the duplicate rows.
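The bootstrap case above can be illustrated with a small language-agnostic sketch (the helper name `bootstrap_potato_split` is hypothetical, not rsample's actual method): instead of splitting the possibly-duplicated bootstrap draws row by row, we split on the unique original row ids in `in_id`, so that every copy of an original row lands on the same side of the split.

```python
import random

def bootstrap_potato_split(in_id, prop=0.25, seed=1):
    """Hypothetical sketch: split a bootstrap analysis set so that all
    duplicates of an original row stay on one side of the potato split."""
    rng = random.Random(seed)
    unique_rows = sorted(set(in_id))   # original rows present in the analysis set
    rng.shuffle(unique_rows)
    n_potato = int(len(unique_rows) * prop)
    potato_rows = set(unique_rows[:n_potato])
    # assign every (possibly duplicated) bootstrap draw to exactly one side
    model_ids = [i for i in in_id if i not in potato_rows]
    potato_ids = [i for i in in_id if i in potato_rows]
    return model_ids, potato_ids
```

Because membership is decided per original row rather than per draw, no row can contribute information to both the preprocessor+model fit and the post-processor fit, which is exactly the leakage the note above warns about.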

hfrick commented May 8, 2024

Agree! That wasn't worded precisely enough when calling it a 3-way split. I only meant to indicate that we end up with 3 datasets; the splitting itself is meant to be two sequential binary splits, as previously discussed 👍
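The "two sequential binary splits" reading can be sketched like so (again a hypothetical, language-agnostic illustration, not tidymodels code): a first binary split separates assessment from analysis, then a second binary split of the analysis set carves out the potato set, leaving three datasets in total.

```python
import random

def three_way_via_two_binary_splits(n, assess_prop=0.25, potato_prop=0.25, seed=1):
    """Hypothetical sketch: two sequential binary splits that yield
    (preproc+model set, potato set, assessment set)."""
    rng = random.Random(seed)
    rows = list(range(n))
    rng.shuffle(rows)
    # first binary split: assessment vs analysis
    n_assess = int(n * assess_prop)
    assessment, analysis = rows[:n_assess], rows[n_assess:]
    # second binary split, within the analysis set: potato vs preproc+model
    n_potato = int(len(analysis) * potato_prop)
    potato, model = analysis[:n_potato], analysis[n_potato:]
    return model, potato, assessment
```

Note that the potato proportion applies to the analysis set from the first split, not to the full data, which is why sequencing the two binary splits matters.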
