@simonpcouch and I have been brainstorming how and where to specify and make the dataset used to estimate a post-processor, currently dubbed the "potato set". Two things need to be specified:
- The proportion of the data used for estimation (preproc, model, post) that should be held back specifically for estimating the post-processor.
- The method for how to split that data for estimation. This may need to be a time-based or grouped split rather than a random split. If we are in the context of resampling a workflow, it should most likely be the same method as used to make the resamples.
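To make these two ingredients concrete, here is a minimal base R sketch (not rsample code) of how a proportion and a method could jointly determine which rows are held back. The helper `sketch_potato_indices()` and its arguments are hypothetical, and the time-based branch assumes the rows are already sorted by time.

```r
# Hypothetical helper, base R only: returns the row positions held back
# as the potato set.
#   n      number of rows available for estimation
#   prop   proportion held back for the post-processor
#   method "random", or "time" (rows assumed to be sorted by time)
sketch_potato_indices <- function(n, prop, method = c("random", "time")) {
  method <- match.arg(method)
  n_potato <- max(1, floor(n * prop))
  if (method == "random") {
    # random split: any rows may be held back
    sort(sample.int(n, n_potato))
  } else {
    # time-based split: hold back the most recent rows
    seq.int(n - n_potato + 1, n)
  }
}

set.seed(1)
random_potato <- sketch_potato_indices(100, prop = 0.2, method = "random")
time_potato <- sketch_potato_indices(100, prop = 0.2, method = "time")
```

A grouped split would follow the same pattern, sampling whole groups instead of individual rows.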
We suggest specifying those things as part of a workflow. We also considered specifying that in rsample, tailor, or tune.
- A tailor can be fit on its own but does not need a notion of data splitting - it just needs the right data.
- A workflow needs to fit the preprocessor and the model on one set and the post-processor on another, hence needs the notion of a data split. The proportion and the method are only needed if there is a tailor, hence they can be arguments to `add_tailor()`.
- The tuning functions already break up the fitting of the preprocessor and the model (to parallelize efficiently while keeping the number of preprocessor fits low) and currently also fit the tailor separately. The tuning functions could retrieve the splitting-related arguments from the workflow specification; the method can also be retrieved from the class of the resamples.
- rsample would also be a possible place to specify all aspects of resampling, including the "inner potato split". However, workflowsets and stacking require the same set of resamples for all their members, i.e., it would be impossible to fit a stack (or workflowset) which includes members with and without post-processing, with their corresponding 3-way and binary splits for each resample.
- If used while fitting a workflow directly, the workflow will have to make the potato set.
  - Additional note on workflows: We need to change `fit.workflow()` to fit the stages for the preprocessor and the model with the (inner) analysis set and the stage for the post-processor with a different set, the potato set.
- If used in tune, the tuning function currently makes the potato set and will continue to do so, see comment on tune in "Where to specify".
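The staged fitting described above can be sketched in base R. This is only an illustration under stated assumptions, not the workflows or tune implementation: centering and scaling stands in for the preprocessor, `lm()` for the model, and a linear recalibration for the tailor. All object names are made up for the example.

```r
set.seed(123)
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- 2 * dat$x + rnorm(n)

# First binary split: analysis vs. assessment
in_id <- sample.int(n, 150)
analysis <- dat[in_id, ]
assessment <- dat[-in_id, ]

# Second binary split of the analysis set: inner analysis vs. potato set
potato_pos <- sample.int(nrow(analysis), 30)
potato <- analysis[potato_pos, ]
inner_analysis <- analysis[-potato_pos, ]

# Stage 1: preprocessor and model see the inner analysis set only
x_center <- mean(inner_analysis$x)
x_scale <- sd(inner_analysis$x)
prep <- function(d) {
  d$x <- (d$x - x_center) / x_scale
  d
}
model <- lm(y ~ x, data = prep(inner_analysis))

# Stage 2: the post-processor is estimated on predictions for the potato set
potato_pred <- data.frame(y = potato$y, pred = predict(model, prep(potato)))
post <- lm(y ~ pred, data = potato_pred)

# Prediction on the assessment set runs through all three stages
assess_pred <- predict(model, prep(assessment))
final_pred <- predict(post, data.frame(pred = assess_pred))
```

The point of the sketch is that no row of the potato set contributes to the preprocessor or the model, so the post-processor is estimated on predictions for data the model has not seen.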
We suggest adding a method to each type of `rsplit` to rsample which makes a suitable 3-way split into "analysis for preproc+model", "potato set", and "assessment set".

- `rsplit`s store the entire data and the indices of the analysis set, `in_id`, and have a `complement()` method to construct the assessment set.
- The new method would be similar in spirit to the `complement()` method in the sense that it would allow us to make that 3-way split appropriately, when needed.
- What we mean by "appropriately" is that it is based on the method used to make the `rsplit`. For example, if the `rsplit` is a bootstrap, the method should take into account that the analysis set may contain the same row multiple times. A potato split at random risks splitting up the multiple rows so that some land in the portion used to fit preprocessor+model and some in the potato set, i.e., the two datasets resulting from the potato split would share information and the post-processor is not properly resampled. Making it a method of the `rsplit` subclass allows access to the entire dataset used for estimation as well as the `in_id` information, so we can identify and account for the duplicate rows.
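For the bootstrap case, a minimal base R sketch of the idea is to split the unique row indices rather than the (possibly duplicated) entries of `in_id`. This is not the proposed rsample method, only an illustration; `prop_potato` and the other object names are hypothetical, and `in_id` is simulated here rather than taken from a real `rsplit`.

```r
set.seed(42)
n <- 50

# A bootstrap analysis set: row indices sampled with replacement
in_id <- sample.int(n, n, replace = TRUE)

# Split the *unique* rows so that all copies of a row land on the same side
unique_rows <- unique(in_id)
prop_potato <- 0.25  # hypothetical proportion held back for the post-processor
n_potato_rows <- max(1, floor(length(unique_rows) * prop_potato))
potato_rows <- unique_rows[sample.int(length(unique_rows), n_potato_rows)]

# Map the row-level decision back to the bootstrap sample, keeping duplicates
potato_id <- in_id[in_id %in% potato_rows]
inner_id <- in_id[!(in_id %in% potato_rows)]

# The assessment set is the usual complement: rows never drawn
assessment_id <- setdiff(seq_len(n), in_id)
```

Because the decision is made per unique row, `potato_id` and `inner_id` share no rows, which a random split of the raw `in_id` vector would not guarantee.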
Agree! Calling it a 3-way split wasn't precise enough. I only meant to indicate that we end up with 3 datasets; the splitting itself is meant to be two sequential binary splits, as previously discussed 👍