plpxsk/new-workflow.Rmd

## new-workflow.Rmd
---
title: "A new analysis workflow"
output: github_document
---

# Organize your data processing program with MECE pieces

*MECE = mutually exclusive, collectively exhaustive, a term from McKinsey.*

## Summary

When processing an input dataset, avoid creating a chain of full copies with
names like data1, data2, data3, which is hard to maintain. Instead, create
mutually exclusive pieces, and then merge them together at the end.


# Details

You often read in a dataset and then need to manipulate it.

The approach you are usually taught in school looks roughly like this:

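```{r}
library(dplyr)  # provides %>% and mutate()

asl <- read.csv("asl.csv")

# each step makes another numbered copy of the full dataset
asl1 <- asl %>% mutate(newvar = oldvar / 12)

asl2 <- asl1 %>% mutate(usubjid = pt)

asl_final <- asl2
```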

Problems with this approach:

  * it is hard to keep track of what each numbered copy contains
  * if you add, remove, or reorder a step, you have to renumber the later
    copies and update every reference to them

# A better approach

At each step, create a data frame that contains only the key (here `usubjid`)
and the new columns that step produces, so that the pieces are mutually
exclusive. At the end, merge all the pieces back onto the input dataset. Use
informative names for these pieces.

Advantages:

  * no need to constantly reorder and rename pieces that end in numbers
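
Before the real example, here is a minimal runnable sketch of the idea on a toy dataset. The data values and the names `age_months`, `asl_age`, and `asl_os_event` are made up for illustration and are not part of the original analysis.

```r
library(dplyr)

# Toy stand-in for the input dataset (all values are invented)
asl <- data.frame(
  usubjid    = c("S01", "S02", "S03"),
  age_months = c(240, 366, 612),
  oscnsr     = c(0, 1, 0)
)

# Piece 1: the key plus one derived column
asl_age <- asl %>%
  select(usubjid, age_months) %>%
  mutate(age_years = age_months / 12) %>%
  select(-age_months)   # drop the input so only the new column remains

# Piece 2: the key plus a different derived column
asl_os_event <- asl %>%
  select(usubjid, oscnsr) %>%
  mutate(os_event = 1 - oscnsr) %>%
  select(-oscnsr)

# Final step: join the pieces back onto the input by the shared key
asl_edited <- asl %>%
  left_join(asl_age, by = "usubjid") %>%
  left_join(asl_os_event, by = "usubjid")
```

Because each piece carries only `usubjid` and its own new columns, a piece can be added, removed, or reordered without touching the others.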


Here is a real example:


```{r}
library(dplyr)

asl <- get_csv("data/clinical/asl.csv")

## STEP 1: process the input dataset using MECE pieces

## one piece:
asl_study_flags <- asl %>%
    select(usubjid, studyid) %>%
    mutate(...) %>%
    select(-studyid)

## another piece:
asl_new_censor_vars <- asl %>%
    select(usubjid, oscnsr, pfscnsr) %>%
    mutate(...) %>%
    ## drop the inputs so the piece holds only the key and the new columns
    select(-oscnsr, -pfscnsr)

## another piece:
asl_biomarker_flags <- asl %>%
    select(usubjid) %>%
    left_join(...) %>%
    mutate(...)

## STEP 2: at the end, join the mutually exclusive pieces by the shared key
asl_edited <- asl %>%
    left_join(asl_study_flags, by = "usubjid") %>%
    left_join(asl_biomarker_flags, by = "usubjid") %>%
    left_join(asl_new_censor_vars, by = "usubjid")
```
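
The final join only works cleanly if the pieces really are mutually exclusive. One way to check this is sketched below in base R. The helpers `check_exclusive()` and `join_pieces()` are hypothetical names introduced here for illustration, not part of the original workflow or of any package, and the toy data is invented.

```r
# Hypothetical helper: stop if any column other than the key appears in
# more than one of the supplied data frames.
check_exclusive <- function(pieces, key = "usubjid") {
  new_cols <- unlist(lapply(pieces, function(p) setdiff(names(p), key)))
  dupes <- unique(new_cols[duplicated(new_cols)])
  if (length(dupes) > 0) {
    stop("Columns defined in more than one piece: ",
         paste(dupes, collapse = ", "))
  }
  invisible(TRUE)
}

# Hypothetical helper: check the input and all pieces, then left-join every
# piece onto the input. Note that merge() sorts rows by the key by default.
join_pieces <- function(base, pieces, key = "usubjid") {
  check_exclusive(c(list(base), pieces), key)
  Reduce(function(acc, p) merge(acc, p, by = key, all.x = TRUE), pieces, base)
}

# Toy usage with invented data
base <- data.frame(usubjid = c("S01", "S02"), z = c(TRUE, FALSE))
pieces <- list(
  flags  = data.frame(usubjid = c("S01", "S02"), x = 1:2),
  censor = data.frame(usubjid = c("S01", "S02"), y = c("p", "q"))
)
out <- join_pieces(base, pieces)
```

Keeping the pieces in a list means adding a new piece is a one-line change, and the check fails loudly if two pieces ever define the same column.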