---
title: "OTEs and Effectiveness"
author: "Pablo Gomez (PSY)"
date: "February 8, 2016"
output: pdf_document
---
The first step is to measure whatever can be easily measured. This is OK as far as it goes. The second step is to disregard that which can't be easily measured or to give it an arbitrary quantitative value. This is artificial and misleading. The third step is to presume that what can't be measured easily really isn't important. This is blindness. The fourth step is to say that what can't be easily measured really doesn't exist. This is suicide.
—Daniel Yankelovich
# Super brief intro
I ran these simulations essentially because I wanted to understand how big of an issue it is to assess faculty's teaching competencies based on the OTEs alone. Some of the ideas are taken from this paper:
http://www.stat.berkeley.edu/~stark/Preprints/evaluations14.pdf
Sadly, my field has a tradition of falling into **quantitative fallacies**. We tend to have a procedural disposition towards data analysis that favors mindless calculations, simplistic interpretations, and a willingness to believe that just because there is a quantitative result, we understand the underlying processes of interest.
For years, I have seen the uncritical acceptance of the quantitative fallacy during tenure and promotion cases, and I figured that a critical examination of the interpretability of OTEs was in order.
Even a brief examination of the literature on OTEs shows that the question of the relationship between the evaluations and teaching effectiveness has been extensively studied, and that at best, there is a moderate relationship between evaluations and teaching effectiveness.
The goal of the present simulations is to examine the interpretability of OTEs. For these simulations, the assumed correlation between evaluations and effectiveness is $r=.4$ (the highest correlation reported in the literature to my knowledge). Given this relationship, how can we interpret deviations from the mean? What can we learn about differences among faculty members?
The code is provided so that the curious reader can reproduce the simulation and modify the assumptions.
Needed packages.
```{r, echo=FALSE}
library(hexbin)
library(truncnorm)
```
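If these packages are not already installed, they can be obtained from CRAN (this chunk is not in the original gist and is not evaluated):
```{r, eval=FALSE}
# One-time setup: install the packages used below
install.packages(c("hexbin", "truncnorm"))
```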
First, generate the simulated data: draw the ```eval``` distribution, then the $r=.4$ correlated effectiveness, and finally standardize effectiveness into ```z.effect```.
```{r}
# Simulated evaluation scores: truncated normal on the 1-5 scale
eval <- rtruncnorm(500000, a = 1, b = 5, mean = 4.4, sd = 0.7)

# Return a noisy linear function of x: y = r*x + e, with e ~ N(0, sqrt(1 - r^2))
correlatedValue <- function(x, r){
  r2 <- r^2
  ve <- 1 - r2
  SD <- sqrt(ve)
  e  <- rnorm(length(x), mean = 0, sd = SD)
  y  <- r * x + e
  return(y)
}

effectiveness <- correlatedValue(eval, .4)

# Standardize effectiveness into z-scores
z.effect <- scale(effectiveness)
```
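As a quick sanity check (not part of the original gist), we can confirm that the simulated ```eval``` scores stay on the 1-5 scale:
```{r}
# Distribution of the simulated eval scores; the range should lie within 1 and 5
summary(eval)
```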
Plot the effectiveness z-score against the eval score.
```{r}
bin<-hexbin(eval,z.effect,xbins=60)
plot(bin, xlab="Eval", ylab="z.Effectiveness")
```
Pick four types of courses based on their effectiveness:
1. Courses that are below average in their effectiveness (regardless of **eval**)
2. Courses that are 1 SD below average --or "problematic courses"-- in their effectiveness (regardless of **eval**)
3. Courses that are 2 SD below average --or "bad courses"-- in their effectiveness (regardless of **eval**)
4. Courses that are above average in their effectiveness (regardless of **eval**)
These would be their distributions of ```eval``` scores:
```{r}
# Eval scores conditional on each effectiveness group
evalbybelowavg <- eval[z.effect < 0]
evalbyproblem  <- eval[z.effect < -1]
evalbybad      <- eval[z.effect < -2]
evalbyaboveavg <- eval[z.effect > 0]
# Density of eval for each group
plot(density(evalbyaboveavg), lwd = 3, col = "black", xlim = c(1, 5))
lines(density(evalbybelowavg), lwd = 3, col = "gray")
lines(density(evalbybad), lwd = 3, col = "orange")
lines(density(evalbyproblem), lwd = 3, col = "red")
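# Optional addition (not in the original chunk): a legend mapping colors to groups
legend("topleft",
       legend = c("Above average", "Below average", "1 SD below (problematic)", "2 SD below (bad)"),
       col = c("black", "gray", "red", "orange"), lwd = 3)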
```
Now we can evaluate the likelihood of a given ```eval``` score coming from one of the low-effectiveness groups, relative to that same score coming from the **above average** group. This ratio is akin to a Bayes factor (BF from now on).
```{r}
# Interpolated density functions of eval for each group
below   <- approxfun(density(evalbybelowavg))
problem <- approxfun(density(evalbyproblem))
bad     <- approxfun(density(evalbybad))
good    <- approxfun(density(evalbyaboveavg))
```
For example, for an eval score of 3: $BF=\frac{p(eval=3 \mid BelowAvgEffectiveness)}{p(eval=3 \mid AboveAvgEffectiveness)}$
```{r, echo=TRUE}
below(3)/good(3)
```
This shows that a score of 3 on the eval is twice as likely to come from a *below average* course (in terms of effectiveness) as from an *above average* course.
Below, I plot the BF as a function of the eval score for three comparisons: effectiveness below the mean $vs.$ above the mean (gray line), problematic $vs.$ above the mean (red line), and bad $vs.$ above the mean (orange line). Note that the red and orange lines are not quite appropriate, since the relevant ratios should be bad/not-bad and problematic/not-problematic (see the sketch after the plot).
```{r}
# BF as a function of the eval score, on a grid from 1 to 5 in steps of .25
BFbel  <- array(dim = 17)
BFprob <- array(dim = 17)
BFbad  <- array(dim = 17)
for(i in 1:17){
  score     <- seq(1, 5, .25)[i]
  BFbel[i]  <- below(score) / good(score)
  BFprob[i] <- problem(score) / good(score)
  BFbad[i]  <- bad(score) / good(score)
}
# Plot only the 2-5 range, where the density estimates are well supported
plot(seq(2, 5, .25), BFbel[5:17], type = "l", col = "gray",
     xlab = "Eval", ylab = "BF")
lines(seq(2, 5, .25), BFprob[5:17], type = "l", col = "red")
lines(seq(2, 5, .25), BFbad[5:17], type = "l", col = "orange")
```
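As noted above, the more appropriate ratios for the red and orange lines would compare each group against its complement. A minimal sketch of that version, reusing the objects defined above (the complement densities ```notprob``` and ```notbad``` are new names introduced here):
```{r}
# Density functions of eval for the complements of the two low-effectiveness groups
notprob <- approxfun(density(eval[z.effect >= -1]))
notbad  <- approxfun(density(eval[z.effect >= -2]))
scores  <- seq(2, 5, .25)
# Complement-based ratios: p(eval | group) / p(eval | not in group)
BFprob2 <- problem(scores) / notprob(scores)
BFbad2  <- bad(scores) / notbad(scores)
plot(scores, BFprob2, type = "l", col = "red",
     ylim = range(c(BFprob2, BFbad2), na.rm = TRUE),
     xlab = "Eval", ylab = "BF (group vs. complement)")
lines(scores, BFbad2, col = "orange")
```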
# Conclusion
Indeed, evaluations lower than the mean are more likely to come from courses that are below average in their effectiveness. But even under the most optimistic assumption about the strength of the relationship between teaching effectiveness and evaluations ($r=.4$), evaluation scores provide evidence barely worth mentioning for or against the hypothesis that a course is effective. According to Jeffreys' guidelines, Bayes factors below 3 are "barely worth mentioning".
The correlation coefficient between teaching effectiveness and evaluations used in this simulation is large. It is numerically similar to the correlation between the IQ of parents and the IQ of their children. Suppose that we want to make decisions about the children based on their IQ, but all we can measure is the IQ of the parents. We could spend a lifetime discussing the best possible set of tools to measure the IQ of the parents; but at the end of the day, even with the best instrument, the interpretability of any difference is questionable at best, and most likely just inappropriate.
Perhaps an anecdotal example of the disconnect between OTEs and quality of instruction can illustrate the problem we face when we use OTEs to evaluate teaching effectiveness. Let's suppose we have two instructors:
*Instructor A*: "The p value is the probability that the data are due to chance alone, so 1-p is the probability that our intervention is working as expected. Mathematically, it is $p(H_0)$."
*Instructor B*: "The p value is a conditional probability: it is the probability of obtaining our test statistic, or a more extreme one, if the null hypothesis were true. It answers the question: what is the probability of my findings (or more extreme ones), given that the null hypothesis is right? Mathematically, it is $p(T \geq t \mid H_0)$."
The explanation by Instructor A is simpler, more concise, more intuitive, and easier to understand, but it is also factually incorrect. Would Instructor B get better ratings? Even the most enthusiastic proponent of OTEs should doubt that Instructor B would be rated as explaining things better than Instructor A.
OTEs do not measure learning (or at least not very well). OTEs might be a great way to measure motivation and effort (which is important); indeed, we would rather have motivated and happy students. But when the sole method of evaluating teaching is tied to these hedonic outcomes, as important as they are, any rational faculty member will adjust her/his teaching to that incentive system.
# What about good looks?
Ratings on RateMyProfessors.com correlate with ratings on formal evals at $r = .68$, so about 46% of the variance is shared by the two measures. The hot chili rating (how attractive the instructor is) correlates at $r = .64$ with overall quality on RateMyProfessors.com (about 40% of the variance). While it is possible that the shared variance between hotness and formal evals is zero, most likely it is considerably higher than that, up to a whopping upper limit of about 40%!
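The shared-variance figures are just the squared correlations; the arithmetic, for the curious:
```{r}
# Shared variance = r^2
0.68^2  # RateMyProfessors vs. formal evals
0.64^2  # attractiveness ("hotness") vs. overall RateMyProfessors quality
```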
"The 102 professors ranked as least attractive in the sample had an average quality rating of 2.14, and an average easiness rating of 2.20. Meanwhile the 99 'hottest' profs had an average quality score of 4.43, and an easiness rating of 3.5."
https://www.insidehighered.com/news/2006/05/08/rateprof
http://pareonline.net/getvn.asp?v=12&n=6
Sadly, those of us not graced with the type of genes that lead to more chili peppers cannot do much about it.
Having said all of this, maybe the solution is to:
a) Stop pretending that we care about teaching effectiveness beyond the happiness of the paying public.
OR
b) Stop pretending that the evals measure anything other than motivation and pleasantness of the experience (as important as that might be).
I think option b is more desirable, which in my humble opinion leads to two corollaries:
1. The questions should address only those factors.
2. We need different systems to provide accountability and assessment of the other aspects of teaching effectiveness. At least in our department, those aspects are NOT taken into account at all. Instead, we have had serious discussions about "teaching problems" when someone is 0.2 below the mean for that quarter on an ordinal scale! (That does make my blood boil!)