
@amanjaiman
Created December 18, 2019 15:20

Data Science - A Walk-Through of a Tutorial

These days it seems like everyone is enthralled by data science. There is data all around us, as I'm sure you've heard, and people are taking advantage of public data sources to do analytics and make predictions. So how can someone new survive in this fast-growing (and competitive) field?


It turns out it's very easy to get started. If you are interested in learning about data science, there are plenty of resources online. Towards Data Science is a great Medium publication where you can read and listen to discussions of trending topics in the field. You can view other work being done and follow along with tutorials to get a better understanding of how to get your hands dirty with data.

Kaggle is a great resource for data enthusiasts as well! Kaggle provides users with free data sources (published by other users) and courses in everything you need to master working with data. They have a list of ongoing competitions that you can enter to test out your newly developed skills. Most importantly for new users, they have notebooks, built on the Kaggle platform. Here, you can look at other people's work and learn how to approach new tasks. Once you feel ready, start your own notebook on Kaggle and get going!

The best way to learn data science is through tutorials. These tutorials are put together by people who want to solve a particular task, and they go step by step through their entire process. Recently, my friends and I made a tutorial as a final project for our data science class. In the tutorial, we work with Airbnb data from New York City and look at different features as predictive variables for price. In this article, we'll walk through not only that tutorial, but what to expect when reading other tutorials.

Follow along with our full tutorial! Check it out on my website: amanjaiman.github.io/nyc-airbnb-data/


The first part of the tutorial deals with importing the right libraries, finding data to work with, and collecting that data. For us, that is fairly straightforward, as we work with data available on Kaggle. Because we are using a Kaggle notebook, we can simply get the data from their file system. If this isn't the case, you can download the data in many different formats and then import it using a library such as pandas.

| id | name | host_id | host_name | neighbourhood_group | ... |
|------|-------------------------------------|---------|-----------|---------------------|-----|
| 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | ... |
| 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | ... |
| 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | ... |
| ... | ... | ... | ... | ... | ... |

Here we had a CSV file that we load into a dataframe. A dataframe is a two-dimensional data structure: think of it as a table, where each row corresponds to a new entry and each column is a different variable. Most of the data you will work with will come in a similar format because it is easy to work with.
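Loading a CSV into a dataframe takes one line with pandas. Here's a minimal sketch; the inline sample mirrors the real dataset's column names, but the file path and rows are illustrative, not the actual Kaggle data:

```python
import io
import pandas as pd

# In a Kaggle notebook you would pass the path to the CSV on their file
# system, e.g. pd.read_csv("/kaggle/input/.../AB_NYC_2019.csv").
# Here we use a small inline sample so the snippet runs anywhere.
csv_data = io.StringIO(
    "id,name,host_id,host_name,neighbourhood_group,room_type,price\n"
    "2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Private room,149\n"
    "2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Entire home/apt,225\n"
)

df = pd.read_csv(csv_data)
print(df.shape)  # (number of rows, number of columns)
print(df.columns.tolist())
```

`read_csv` infers column types automatically, which is usually what you want when first exploring a dataset.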

After getting the data, we walk through what this data actually means and what it contains. In our case, we talk about each of the columns and its relevance. We also take a look at what specific data points we're working with. For example, we deal with five different neighbourhood groups (Brooklyn, Manhattan, Queens, Staten Island, Bronx) and three different room types (Shared, Private, Entire Home).

Next step, and my favorite: exploratory data analysis. This is the process of looking deeper at the data and visualizing it in easy-to-understand ways.

[Images: price distributions by neighbourhood group; larger map]

Here we can see the price distribution for each neighbourhood group. As we expected, Manhattan has a higher mean price, which makes sense because more people visit Manhattan than the other four boroughs.
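Comparing mean prices per borough is a one-liner with `groupby`. A sketch with illustrative numbers (not the real dataset):

```python
import pandas as pd

# Toy prices per borough, chosen to mimic the pattern we saw in the data.
df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Queens"],
    "price": [250, 300, 120, 140, 90],
})

# Mean price per neighbourhood group.
mean_price = df.groupby("neighbourhood_group")["price"].mean()
print(mean_price)
```

Swapping `.mean()` for `.describe()` gives the full distribution summary (quartiles, min, max) per group, which is what the box plots in the tutorial visualize.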

As an extra graphic, we decided to use Plotly to draw an actual map, with the data points colored by price. Manhattan is clearly redder than the other areas.

We decided to do a little bit more (this was our final project) and did some natural language processing. That's a term that's been gaining popularity. NLP is the act of looking at written or spoken language and figuring out the meaning behind the words and sentences we use. We looked at the top 25 words hosts used when naming their property, as well as the general sentiment behind the names, and how that correlates to price.
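Counting the most common words in listing names needs nothing beyond the standard library. A minimal sketch with a few made-up listing names (the tutorial runs this over the real `name` column):

```python
import re
from collections import Counter

# Hypothetical listing names; in the tutorial this is the dataset's `name` column.
names = [
    "Cozy room in sunny Brooklyn",
    "Sunny Manhattan loft near park",
    "Cozy sunny studio",
]

# Lowercase everything and count alphabetic tokens.
words = Counter()
for name in names:
    words.update(re.findall(r"[a-z]+", name.lower()))

print(words.most_common(3))  # the tutorial looks at the top 25
```

For the sentiment part you would reach for a sentiment-analysis library on top of this; the word counts alone already reveal what hosts think sells ("cozy", "sunny", and so on).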

If you've been following along with our tutorial, you'll see that we're now at the Predicting Price section. This leads into the third part of the tutorial: machine learning. This is the most complex section of the tutorial, and it deals with trying to predict one of the variables using the others. For our dataset, we thought it would be appropriate to predict price from the numerical variables (we also encode some of the categorical variables with numeric values).
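Encoding categoricals and fitting a linear model can be sketched in a few lines. This is a toy illustration with one-hot encoding via `pd.get_dummies` and scikit-learn's `LinearRegression`; the frame and feature choice are simplified assumptions, not the tutorial's exact setup:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny illustrative frame; the tutorial fits on the full dataset.
df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Queens", "Manhattan"],
    "minimum_nights": [1, 2, 3, 2],
    "price": [250, 120, 90, 300],
})

# One-hot encode the categorical column so the model sees only numbers.
X = pd.get_dummies(df[["neighbourhood_group", "minimum_nights"]],
                   columns=["neighbourhood_group"])
y = df["price"]

model = LinearRegression().fit(X, y)
preds = model.predict(X)
print(X.columns.tolist())
```

One-hot encoding avoids imposing a false ordering on the boroughs, which a plain integer encoding would do.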

[Image: first model's results]

Our first model didn't perform well, which we attributed to the outliers in the data and to some of the assumptions the model makes. Seeing this, and after some statistical analysis, we decided to perform a log transformation on price. That makes our model better: the residuals are now normally distributed. We then look at the specific impact of the different variables on price and remove the ones that we find are not helping our model, leaving us with the final model.
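The log transform works because prices have a long right tail: a few very expensive listings dominate the scale. A small numeric sketch (toy prices, chosen to mimic that skew):

```python
import numpy as np
import pandas as pd

# Skewed toy prices; real Airbnb prices have a similar long right tail.
prices = pd.Series([40, 60, 80, 100, 150, 300, 1200])

# The tutorial applies log(price) before refitting the model.
log_prices = np.log(prices)

# The transform pulls in the extreme values: on the raw scale the maximum
# is 12x the median, on the log scale the ratio is much smaller.
print(prices.max() / prices.median())
print(float(log_prices.max() / log_prices.median()))
```

With the tail compressed, the residuals of a linear fit stop being dominated by the outliers, which is why they end up much closer to normally distributed.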

[Image: final model]

Once we summarize our findings, the tutorial is finished! We've walked through the entire process step by step: finding data and figuring out what we're working with, exploratory data analysis, and machine learning. Hopefully the tutorial helps you understand the process of working with data, and this article should serve as a guide for what to expect when you come across a tutorial. Once you understand why the different steps are so important, it's much easier to approach a new problem yourself.
