Skip to content

Instantly share code, notes, and snippets.

@MartinMacharia
Last active September 20, 2017 06:36
Show Gist options
  • Save MartinMacharia/7683d3a39bbf311de1d92d53cd197069 to your computer and use it in GitHub Desktop.
Save MartinMacharia/7683d3a39bbf311de1d92d53cd197069 to your computer and use it in GitHub Desktop.
LME4 Plantwise
---
output: word_document
---
# Plantwise data analysis
Data from PlantWise Kenya
## Comments on data
+ ClinicFormID doesn't show in my file when opened in Excel (no text, just empty cells) - could there be more missing fields?
+ A few entries have no entry number. Makes no difference to my analysis, but it would be good to have a unique identifier (ClinicFormID would be ideal)
+ Area planted has too many levels. Few entries have a two number structure
+ Some columns include Unknown. We should treat it according to the meaning as NA
+ FarmerTelephone has several mistakes (different size,no numbers..). Anyway, I don't think we could make use it for analysis...
+ Standardise units planted <- all to Acres
+ Crops aren't standardized (some spelling).
+ Crops is a very extensive list. We should group them in interesting classification for analysis.
## Comments on data cleaning
```{r echo=FALSE, message=F}
##########
# Prepare the environment for the analyses
# install.packages("gdata")
# install.packages("nlme")
# install.packages("lme4")
require(gdata)
require(nlme)
require(lme4)
# Set working directory
#setwd("C:/Users/gonzalez-moreno/Documents/3_plantwise/")
#Changed to mine
setwd("C:/users/machariam/Desktop/PlantWise")
# Read data file
#data=read.xls("Sri Lanka 160322 RE.xls") # Read xls is problematic as you need to install and locate perl
# data=read.xls("Ghana160321.xls") #
data=read.csv("KE_allvalidationoutcomes_06_04_20162BF.csv")
names(data)#view variable headers
###############
# Clean data
# summary(data)
data=subset(data,ClinicCode!="NA")
data=subset(data,PlantDoctor!="NA")
data=subset(data,Crop!="NA")
summary(data$Gender) # noanswer and NA
data=subset(data,!is.na(data$Gender))
data=subset(data,Gender!="NOANSWER")
data$Year <- as.factor(data$Year) # there are only two years so better treat it is as a factor
data=subset(data,!is.na(data$Year)) # remove rows with uknown year
summary(data$Validity.of.recommendation)
data$validity <- data$Validity.of.recommendation
# reclassify in two levels
levels(data$validity) <- c(NA,"ACCEPT","ACCEPT","ACCEPT","REJECT","REJECT","REJECT")
levels(data$validity)
data=subset(data,!is.na(data$validity)) # remove rows without validation
summary(data$validity)
data$validity <- relevel(data$validity,"REJECT") # change order so 1 is accept and 0 is reject
data$validity
data$validity <- as.numeric(data$validity)-1
data$validity
# summary(data$validity)
```
+ Clean data to delete rows without Clinic,Doctor,Gender, Crop or validation
+ After cleaning we have a total of `r nrow(data)` rows in the dataset
## Validation changes depending on time and gender?
```{r message=F}
summary(data)
################
# Analyses
# A simple linear model, which test against too many units (all residuals), thus incorrectly inflating significance of treatments
model.linear <- glm(data$validity~data$Gender+data$Year,family=binomial(link="logit"))
summary(aov(model.linear))
summary(model.linear)
anova(model.linear, test="Chisq")
exp(coef(model.linear))
# What should the response variable be? First start with acceptance of recommendation, as measure of quality. Later also look at recommendation of Class III pesticides (first as one group, then individually).
# What should the fixed factors be? For example Gender (as this is a key interest of our donors), Year (as we may hypothesise that the qualty of the plant doctors changes over time)
# What should the error structure, or unit of replication be? Clinic (unit) would be the clearest structure to control similar effect within units. Plant doctor could be also interesting but it will require poolishing the data and it might be a extremely complex model.
# Generalized linear model with binomial distribution
model.mix=glmer(validity~Gender+Year+(1|ClinicCode),data=data,family=binomial(link="logit"))
summary(model.mix)
# coef(model.mix)
table(factor(data$ClinicCode) )
# wide variability in number of units across clinics.
```
+ Gender isn't significant in the model but year it is. With 2014 showing higher acceptance
+ We found similar coefficientes in the mixed and fixed model but CI have been reduced (avoid pseudoreplication)
# Plant clinics questions
## Validation per Clinic respond differently to time?
```{r}
######################
# Considering time changes per Clinic
require(arm) # standard errors of random effects
model.mix1=glmer(validity~Year+(1+Year|ClinicCode),data=data,family=binomial(link="logit"))
summary(model.mix1)
coef <- coef(model.mix1) # random effect + average effect coefficent (From fixef)
# fixef(model.mix1)
# ranef(model.mix1)
# Plot Change in validation 2013-2014 per clinic
se<- se.ranef(model.mix1)$ClinicCode[,2]
names <- row.names(coef$ClinicCode)
means <- coef$ClinicCode[,2]
lower <- means - se[]
upper <- means + se
ClinicTime <- data.frame(names,means,lower,upper)
ClinicTime <- ClinicTime[order(ClinicTime$means),]
plot(ClinicTime$means,ylim=range(ClinicTime$lower,ClinicTime$upper),xaxt = "n",xlab="Plant Clinics",ylab="Change in validation 2013/2014")
axis(1, at=1:length(names), labels=ClinicTime$names,las=3)
abline(a=0,b=0)
segments(1:length(names),ClinicTime$lower,1:length(names),ClinicTime$upper)
# More number of clinics that improved in 2014
interaction.plot(data$Year,data$ClinicCode,data$validity,bty="l",lwd=2,las=1)
# Similar plot to show variability across clinics
se
```
+ There is wide variability in the performance of Clinics between 2013 and 2014. However, the average trend is improvement.
+ Including crops in the model gives errors probably because of the large number of combinations.
## Validation per Clinic depends on diversity of crops within the Clinic?
```{r}
######################
# Diversity of crops
data_clinics <- unique(data[,c("ClinicCode","Crop")]) # identify unique combinations
div <- aggregate(data_clinics$Crop,by=list(data_clinics$ClinicCode),length)
names(div) <- c("ClinicCode","CropDiversity")
data <- merge(data,div,by="ClinicCode")
# Considering simple model
model.mix2 <- glm(validity~Year+CropDiversity+Gender,data=data,family=binomial(link="logit"))
stepAIC(model.mix2)
# Considering random effects
data$CropDiversity <- scale(data$CropDiversity) #to facilitate convergence
model.mix3 <- glmer(validity~Year+CropDiversity+(1|ClinicCode),data=data,family=binomial(link="logit"))
summary(model.mix3)
# coef(model.mix3)
```
+ Crop diversity calculated as richness of crops per Clinic has no significant effec once we control per clinic as random.
## Other research ideas
+ network aspect of knowledge dissemination through the Plantwise chatroom Telegram, which would visualise how this "social network" promotes and improves plant health knowledge?
+ clustering of plant doctors - do the ones that interact, work in the same region or learn together have similar recommendations, or is this due to the similarity of pests or crops?
+ does the arrival/detection of a new pest in a country or region increase of downloads of factsheets, suggesting that Plantwise is a successful model to avail required knowledge?
+ column Distribution gives an indication of severity of problem - when do farmers go to the clinics and is there a gender or national difference?
+ do the problem type and diagnosis change over time? Insects are most prevalent, but also easiest to see. If the frequency of the other types increases over time, this may be an indication that plant doctors become better
+ women (and men?) complained that the proposed solutions in Sri Lanka were too difficult or costly to implement - is there variation among countries with respect to the frequency of proposed measures, and can this lead to improved recommendations for plant doctors?
+ personal trajectories - does the quality of the recommendations increase with experience, and is there a significant effect of training events, indicating the value of plant doctor training?
```{r}
interaction.plot(data$Year,data$PlantDoctor,data$validity,bty="l",lwd=2,las=1) # Note that lines are drawn only for plantdoctors active in both years
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment