@nopps07

nopps07/blog.md Secret

Created March 12, 2021 09:10
Logistic Regression Tutorial - Research Guideline

Author: Gunho Lee
Date: 2021 3 10
Output: html_document

https://gist.github.com/504f0ef1f046c2e741046449e8a70b1f

https://gist.github.com/10f204a856cb5357aa5f6b022aa69197

Welcome to the tutorial!

This article is written for readers who have no experience with logistic regression in R. If you already know the theory and are looking for more advanced techniques, I recommend searching for the many well-explained articles on Medium!

The article may also help undergraduates who have never conducted research before. I have tried to keep it very simple so that even first-year students can grasp the basic idea.

I would like to walk briefly through a general research approach from beginning to end. I am going to explain each step as if I were telling my own story.

What is your hobby?

You might wonder why I ask this question. The reason is very simple: I normally pick my research topics based on what I enjoy. Anyway, let's get to the point. I am a huge fan of red wine. I can't even count how much wine I have drunk since the pandemic began.

Hence, I started to wonder what determines the taste of red wine. I am not an expert, and I have never been to a vineyard either. Let's say we are just curious, nothing more than that. Do you see why I add these words here? Because research does not need to be something massive that affects the entire world. If we want to paint a picture, the work starts with just a small dot. That's the main point of this article.

Before we dive into the real analysis (meaning we write some code and draw some graphs), it is very important to fully grasp what we want to achieve with the whole research project. Although I have used the word "achieve", it does not need to be anything huge, as I said. What I would like to learn is very simple and can be formulated as:

What is(are) the most significant factor(s) to decide the quality of Red Wine?

It is simple and concise. In fact, we have now finished the most important step of a quantitative research project. Let's move on to the next step.

Literature review

What does it mean? Well, let’s be honest, we are unlikely to be the first to put effort into this topic. In other words, there should already be a sufficient number of studies that have analyzed red wine and the elements that determine its quality. This information will definitely help us broaden our view and narrow our focus to specific elements.

To save time, I have collected some background information from published articles and journals.

Talking about the quality of wine is not an easy task, since everyone has a different standard for what quality is. Not only are the chemical substances important, but wine laws and rules also play a crucial role in quality testing. For instance, the grape-growing region is a considerable indicator of quality wine around the world. In French terms, the soil and climate conditions are essential for earning a certified quality wine label. In the context of chemistry, mouthfeel can be improved by a higher alcohol concentration, and quinine sulphate has also shown a substantial impact on taste and mouthfeel attributes.

Alright! Based on the literature review, we have found that alcohol and sulphates seem to affect the taste of red wine significantly! But why is this important? Because we need to know this research background before we conduct an analysis. Put differently, we can test whether these findings match the analysis of our dataset, or whether there is a different outcome.

Searching Data

Now it is time to find a dataset for the research.

If you have ever searched for free datasets on Kaggle or other websites, you might have come across the red and white variants of the Portuguese "Vinho Verde" wine dataset.

Exploratory Data Analysis (EDA)

Before getting an overview of the data, let's import it first!

https://gist.github.com/63fa1a5c8018060e388826cb9ebaee2e

As soon as I have imported the data, the next thing I do is check what it looks like. Normally I use these handy functions: "head", "str" and "summary".

https://gist.github.com/f285606b9df2473d4a2e5e60c6cbc52f

https://gist.github.com/884832aad851804c3e044c47a266536e

https://gist.github.com/11cfd22d870aca4755159b494e6fc98c
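To make the three functions concrete, here is a minimal sketch. It does not use the real wine file: `csv_text` is a hypothetical four-row stand-in that I made up in the same shape, and the column names are assumed to follow the dotted style used later in this tutorial.

```r
# Hypothetical stand-in for the wine file: a few made-up rows in CSV form.
csv_text <- "fixed.acidity,volatile.acidity,sulphates,alcohol,quality
7.4,0.70,0.56,9.4,5
7.8,0.88,0.68,9.8,5
11.2,0.28,0.58,9.8,6
7.3,0.65,0.47,10.0,7"

# read.csv() normally takes a file path; text = lets us read from a string.
wine <- read.csv(text = csv_text)

head(wine, 3)    # first three rows
str(wine)        # column types and a preview of the values
summary(wine)    # min, quartiles, mean and max for every column
```

With the real dataset you would pass the path of the downloaded file to `read.csv()` instead of `text = csv_text`.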

They give us a neat overview of the dataset. We are now interested in the quality variable, since it will be our dependent variable. In other words, we would like to see how to predict quality from the other variables!

https://gist.github.com/a62b91b4e8aa138a657b7036b08c1127

https://gist.github.com/c49a043d75c6be9c3402613d78800d2e

This code is useful for visualising the correlations between the variables, in case you want to check them.
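If you would like to try a correlation check without the real file, the sketch below may help. The data are simulated (the means, spreads and effect sizes are my own assumptions, not values from the wine dataset), and it uses only base R: `cor()` for the numbers and `pairs()` for a quick scatterplot matrix.

```r
set.seed(1)  # reproducible simulated data
n <- 200

# Simulated predictors and a quality-like score that depends on both of them.
alcohol   <- rnorm(n, mean = 10.4, sd = 1)
sulphates <- rnorm(n, mean = 0.65, sd = 0.15)
quality   <- round(5.6 + 0.5 * (alcohol - 10.4) +
                   1.5 * (sulphates - 0.65) + rnorm(n, sd = 0.7))

wine <- data.frame(alcohol, sulphates, quality)

# Pairwise Pearson correlations between all numeric columns.
cor_matrix <- cor(wine)
round(cor_matrix, 2)

# Scatterplot matrix: one panel per pair of variables.
pairs(wine)
```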

Now, let's decide which statistical method we will (and can) use for the rest of the analysis. What would you prefer? Which method is most applicable to this case?

Some of you might say multiple regression, because we want to predict the quality score.

Well, it is not actually a good idea. Let's check why.

https://gist.github.com/c7e06e66e69a2a785d7202272c6d1ade

Do you see the reason? There are only 6 distinct values in quality, so it behaves more like an ordered category than a continuous score, which makes it a poor dependent variable for multiple regression. As a rough rule of thumb, the dependent variable should take at least 20 or so distinct values. Then what should we do?
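A quick way to count the distinct values is sketched below. The `quality` vector here is a short, hypothetical sample that I typed in by hand, not the real column.

```r
# Hypothetical sample of quality scores (the real column is much longer).
quality <- c(5, 5, 6, 7, 5, 6, 4, 8, 3, 6, 5, 7)

# How often does each score occur?
table(quality)

# How many distinct scores are there?
n_distinct <- length(unique(quality))
n_distinct
```

When this number is small, treating the variable as a continuous outcome is hard to justify.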

Logistic Regression

Logistic regression is an alternative we can consider, and it is what we are going to use for this research. I am not going to explain logistic regression in detail, but the important thing to know is that it is used when the dependent variable is binary, which is what we will create through data manipulation.

Remember! Our dataset does not have any binary variables yet! How can we convert the current quality variable into a binary variable?

Data Manipulation

You may have realised that it has already taken a lot of time and effort to get here, even though we have not yet applied any statistical method to the dataset.

Let's be honest. This is why data science is sometimes called an unsexy job. Nevertheless, we have to enjoy this process: if the input is trash, the output will be trash as well, no matter how fancy the techniques we use.

https://gist.github.com/03f986be73a53c2dc4131d461eeed9e9

https://gist.github.com/add576a027708c1e0f77b525edfb8bb3

https://gist.github.com/f435007a04bd315abdf267e8ee98436e

I'd like to divide the values of quality into two groups:

  1. "BAD": quality from 3 to 5
  2. "GOOD": quality from 6 to 8

Remember that you can always set different criteria. It depends on your choice, as always.
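The grouping above can be sketched in a couple of lines. The data frame is a tiny hypothetical example, and `quality_label` is a column name I chose for illustration; the cut-off of 6 simply follows the two groups listed above.

```r
# Tiny hypothetical data frame covering every score from 3 to 8.
wine <- data.frame(quality = c(3, 4, 5, 6, 7, 8, 5, 6))

# Scores of 6 and above become "GOOD", everything else "BAD".
# Listing "BAD" first makes it the reference level of the factor.
wine$quality_label <- factor(
  ifelse(wine$quality >= 6, "GOOD", "BAD"),
  levels = c("BAD", "GOOD")
)

# Cross-check: every original score should fall in exactly one group.
table(wine$quality, wine$quality_label)
```

To use a different cut-off, change the `6` in the `ifelse()` condition.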

Excellent! Now we have the cleaned and correctly manipulated data for the analysis.

Logistic Regression

https://gist.github.com/a0214c37d654569d96c75ca8525d1662

The first thing we need to do is split the dataset into a train set and a test set. Let's check whether we have specified the train and the test set correctly.

https://gist.github.com/32601bbe4b52f77a4eaf1c4cb97dba67

https://gist.github.com/0ac3b71a8ec21accc9e7132fd06edd03

Great! The code seems to have done its job well.
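For readers who want to see what such a split can look like, here is a minimal base R sketch. The 75/25 ratio, the seed and the simulated data frame are all my own assumptions for illustration, so they may differ from the split used in this tutorial.

```r
set.seed(42)  # makes the random split reproducible

# Simulated stand-in data: an id column so we can verify the split afterwards.
wine <- data.frame(id = 1:100, alcohol = rnorm(100, mean = 10.4, sd = 1))

# Draw 75% of the row numbers at random for the train set.
train_idx <- sample(seq_len(nrow(wine)), size = floor(0.75 * nrow(wine)))

train <- wine[train_idx, ]   # rows that were drawn
test  <- wine[-train_idx, ]  # all remaining rows

# Sanity checks: sizes add up and no row appears in both sets.
dim(train)
dim(test)
overlap <- length(intersect(train$id, test$id))
overlap
```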

https://gist.github.com/abbd716ec0188f51936713b6d42179ac

We apply glm to the train set to check the relationships between the predictors and the dependent variable (quality). You may feel confused by all those weird numbers if you have no knowledge of statistics, but that is okay. We are going to focus on the p-values to check WHICH FACTORS ARE SIGNIFICANT.

WHAT IS THE MEANING OF STATISTICAL SIGNIFICANCE??

It means that if predictor A is significant (its p-value is below the chosen significance level, commonly 0.05), its association with the dependent variable is unlikely to be due to chance alone, so it should not be ignored.
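Here is a small sketch of how the p-values can be pulled out of a fitted model. Everything in it is simulated: I assume an outcome `good` that depends on `alcohol` and `sulphates`, plus a column `noise_var` that is unrelated to the outcome by construction. The 0.05 cut-off is the conventional choice, not a rule.

```r
set.seed(123)
n <- 500

# Simulated predictors; noise_var has no effect on the outcome.
alcohol   <- rnorm(n, mean = 10.4, sd = 1)
sulphates <- rnorm(n, mean = 0.65, sd = 0.15)
noise_var <- rnorm(n)

# Simulated binary outcome: 1 = GOOD, 0 = BAD.
p_good <- plogis(-0.2 + 1.0 * (alcohol - 10.4) + 3.0 * (sulphates - 0.65))
good   <- rbinom(n, size = 1, prob = p_good)

train <- data.frame(good, alcohol, sulphates, noise_var)

# family = binomial is what turns glm() into logistic regression.
model <- glm(good ~ alcohol + sulphates + noise_var,
             data = train, family = binomial)

# Coefficient table: estimate, standard error, z value and p-value.
coef_table <- summary(model)$coefficients
round(coef_table, 4)

# Names of the terms whose p-value is below 0.05.
p_values <- coef_table[, "Pr(>|z|)"]
names(p_values)[p_values < 0.05]
```

In a table like this, the last column is the one to look at when asking which factors are significant.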

As we expected from the literature review, sulphates and alcohol are indeed statistically significant for quality, and acidity is significant too. Great! Our findings match our background research.

Now it is time to build a new model with the significant factors only. (Note: chlorides and free.sulfur.dioxide are significant too, so it may be worth checking how the results change when they are included. Here, however, I am going to use only the more significant variables for simplicity.)

https://gist.github.com/f809de5a57cd0aea4255827ee613c4cb

https://gist.github.com/d71cf9267788a4968ab2cb2be0ad0311

https://gist.github.com/0b872fb8b0954acd699b3d2aab2151a3

https://gist.github.com/7cee00cc59df7e3528609faa6f7c59f2

The visualization helps us understand the confusion matrix a little better. We have to keep in mind that most of our readers will have no knowledge of statistics. Therefore, it is crucial to keep everything simple so that even children can grasp what we want to convey.
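In case the idea of a confusion matrix is new, here is a hand-sized sketch. The eight labels and probabilities are made up; in practice the probabilities would come from something like `predict(model, newdata = test, type = "response")`. The 0.5 threshold is an assumption you are free to change.

```r
# Hypothetical true labels for eight wines.
actual <- factor(c("GOOD", "BAD", "GOOD", "GOOD", "BAD", "BAD", "GOOD", "BAD"),
                 levels = c("BAD", "GOOD"))

# Hypothetical predicted probabilities of being GOOD.
prob_good <- c(0.81, 0.30, 0.45, 0.67, 0.52, 0.12, 0.90, 0.40)

# Turn probabilities into labels with a 0.5 threshold.
predicted <- factor(ifelse(prob_good > 0.5, "GOOD", "BAD"),
                    levels = c("BAD", "GOOD"))

# Rows are predictions, columns are the truth.
conf_mat <- table(Predicted = predicted, Actual = actual)
conf_mat
```

The diagonal cells count the wines that were classified correctly, and the off-diagonal cells count the two kinds of mistakes.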

https://gist.github.com/c099bc920238bb7ee34fd28315f2afd4

The accuracy test indicates that our model has approximately 74.687% accuracy, meaning that the model fitted on the train set predicts the correct class for 74.687% of the wines in the test set.
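To show where such an accuracy figure comes from, the sketch below runs the whole chain on simulated data: fit on the train set, predict on the test set, compare. The data, the 300/100 split and the 0.5 threshold are all assumptions of mine, so the number it prints will not be the one reported above.

```r
set.seed(2021)
n <- 400

# Simulated wines: 1 = GOOD, 0 = BAD.
alcohol   <- rnorm(n, mean = 10.4, sd = 1)
sulphates <- rnorm(n, mean = 0.65, sd = 0.15)
good <- rbinom(n, size = 1,
               prob = plogis(1.0 * (alcohol - 10.4) + 3.0 * (sulphates - 0.65)))
wine <- data.frame(good, alcohol, sulphates)

# 300 rows for training, the remaining 100 for testing.
train_idx <- sample(seq_len(n), size = 300)
train <- wine[train_idx, ]
test  <- wine[-train_idx, ]

# Fit on the train set only.
model <- glm(good ~ alcohol + sulphates, data = train, family = binomial)

# Predict probabilities for the unseen test wines, then apply the threshold.
prob_good <- predict(model, newdata = test, type = "response")
predicted <- as.integer(prob_good > 0.5)

# Accuracy = share of test wines whose predicted class equals the true class.
accuracy <- mean(predicted == test$good)
accuracy
```

The important detail is that the test rows play no part in fitting the model, so the accuracy reflects how well it handles wines it has not seen.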

As a rough rule of thumb, a result between 70% and 80% is often considered acceptable. Of course, there are plenty of ways to improve the model and get a much higher accuracy; however, this tutorial does not cover that deeper analysis, as mentioned at the beginning.

Conclusion

We have discovered that some factors in the dataset do indeed help determine the quality of wine. However, since our accuracy still leaves room for improvement, we can say that these variables alone are not enough to pinpoint exactly which wine is Good or Bad.

Based on our background research, we can guess that the region each wine comes from, the temperature of that region, and so on also make a difference. It would have been better if the dataset had contained this information too. Let's say this is a limitation of our research, which should be addressed in future research.

Summary

During the tutorial, we have briefly discussed how research can be conducted from the beginning to the end.

Let me summarise what we have discussed so far.

To do list

  1. Find a research topic within your own interests (music, sports, yoga, whatever)
  2. Decide what you would like to learn from the research and formulate a research question accordingly
  3. Conduct a literature review to collect background information
  4. Find relevant datasets for your research
  5. Check whether your dataset is clean enough for a proper analysis; if not, do data manipulation
  6. Apply a proper statistical method and interpret the result carefully
  7. Draw your conclusion and specify limitations and suggestions for future research

I do hope you have gained a small insight from this article. If so, I believe it will remind you of the basic steps for building a solid research approach around your own interests. Thanks for your time!
