@Ethan-Arrowood
Last active May 25, 2018

Data Analytics Assignment

Nana Tsujikawa & Ethan Arrowood

Nana's Source Code repository

Ethan's Source Code repository

Part 1: Data Preparation and Preprocessing

We selected the video game review set, which contains 231,780 reviews. The attributes available in the data set include the unique item identifier (asin), the ratings (helpful, overall), metadata about the review (reviewText, reviewTime, summary, unixReviewTime), and metadata about the reviewer (reviewerID, reviewerName). The helpful rating metric is stored as an array: the first element is the number of helpful up-votes and the second element is the number of not-helpful down-votes. We will use the ratings and review metadata to analyze the sentiment of the reviews, and the time metadata to study cyclical events as well as the influence of critical events.
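Following the format described above, a helpful value unpacks directly into its two vote counts (the value here is a hypothetical stand-in, not a record from the data set):

```python
# hypothetical 'helpful' value, following the format described above:
# first element = helpful up-votes, second = not-helpful down-votes
helpful = [3, 1]
up_votes, down_votes = helpful
print(up_votes, down_votes)  # 3 1
```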

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# import data
json_reader = pd.read_json('data/reviews_Video_Games_5.json', lines=True, chunksize=1000)

# instantiate data frame
df = pd.DataFrame(columns=['reviewerID', 'asin', 'reviewerName', 'helpful', 'reviewText', 'overall', 'summary', 'unixReviewTime', 'reviewTime'])

# process data
num_chunks = 0
for chunk in json_reader:
    df = pd.concat([df, chunk])
    num_chunks += 1

print('Number of reviews: %d' % len(df))
print('Number of chunks: %d' % num_chunks)
print(df.info())

Output:

Number of reviews: 231780
Number of chunks: 232
Int64Index: 231780 entries, 0 to 231779
Data columns (total 9 columns):
asin              231780 non-null object
helpful           231780 non-null object
overall           231780 non-null object
reviewText        231780 non-null object
reviewTime        231780 non-null object
reviewerID        231780 non-null object
reviewerName      228967 non-null object
summary           231780 non-null object
unixReviewTime    231780 non-null object
dtypes: object(9)
memory usage: 17.7+ MB

Data is loaded via the Pandas .read_json method using chunked loading. This returns an iterable JsonReader object that yields DataFrame chunks. We then concatenate all of these chunks into a single DataFrame. Using the provided .info() and .head() methods we are able to learn more about the data set and preview the first few reviews.

# remove reviewTime column
if 'reviewTime' in df.columns:
    df.drop(columns=['reviewTime'], inplace=True)

# show na/null values
null_data = df[df.isnull().any(axis=1)]
print(null_data[:5])

# the only column worth filling is the 'reviewerName'
df.fillna(value={'reviewerName': 'Missing Reviewer Name'}, inplace=True)

Because the unixReviewTime column already encodes the review date, we deemed the human-readable reviewTime column redundant and dropped it from the DataFrame df. The remaining data is cleaned by filling NA values with a default value. In our data set the only missing values were in reviewerName; they were replaced with the string "Missing Reviewer Name". We confirmed this using the # show na/null values block in the code snippet above.

Concluding our data preparation and preprocessing, we hypothesize a positive correlation between positive-sentiment reviews and high ratings. We also hypothesize a seasonal trend between the ratings and time of year: we expect more activity during the holiday season of each year, particularly from November to February, and on weekends rather than weekdays. We expect the overall average rating to be higher during these times, along with an increase in review word count.

Part 2: Data Analysis and Interpretation

The data analysis section is broken into two main parts: Sentiment Analysis and Cyclical Event Analysis. All graphs are presented as images and are produced by the provided code snippets.

Sentiment Analysis

By Ethan Arrowood

The sentiment analysis was produced using the NLTK and TextBlob libraries. All graphs were created using NumPy and Matplotlib. To begin the analysis I imported TextBlob and analyzed the sentiment using its standard sentiment analyzer.

Standard TextBlob Sentiment Analyzer

from textblob import Blobber
import matplotlib.pyplot as plt
tb = Blobber()
df['sentiment'] = df.apply(lambda row: tb(row['reviewText']).sentiment, axis=1)
data_list = [(review['overall'], review['sentiment']) for i, review in df.iterrows()]

avg_pol_vs_ovrl = dict()
for entry in data_list:
    if entry[0] in avg_pol_vs_ovrl:
        avg_pol_vs_ovrl[entry[0]] += [entry[1].polarity]
    else:
        avg_pol_vs_ovrl[entry[0]] = [entry[1].polarity]

overall_ratings = list(range(1, 6))
# index by rating so the averages line up with overall_ratings
# (dict iteration follows insertion order, not sorted rating order)
average_polarity = [np.average(avg_pol_vs_ovrl[r]) for r in overall_ratings]

plt.plot(overall_ratings, average_polarity)
plt.title('Average Polarity vs. Overall Rating')
plt.xlabel('Overall Rating')
plt.ylabel('Average Polarity')

Average Polarity vs. Overall Rating

My first analysis compares the average polarity and the overall rating of reviews. The average polarity comes from the .sentiment.polarity property from the sentiment column, and the overall rating comes from the overall column of the data frame.

data = [avg_pol_vs_ovrl[r] for r in overall_ratings]  # keep rating order 1-5
plt.boxplot(data)
plt.title('Polarity vs. Overall Rating')
plt.xlabel('Overall Rating')
plt.ylabel('Polarity')

I then plotted just the data using a box-and-whisker plot to produce the following graphic: Polarity vs. Overall Rating

Naive Bayes Analyzer

For my final analysis I used a Naive Bayes Analyzer from NLTK and produced a graph that shows the number of Positive/Negative reviews per overall rating. This analysis gave far better results than the previous analyzer; the results will be discussed further in Part 3.

import nltk
# nltk.download('punkt')
from textblob import Blobber
from textblob.sentiments import NaiveBayesAnalyzer
import matplotlib.pyplot as plt
from collections import Counter

t_df = df.copy()  # work on a copy so the original DataFrame is untouched

tb = Blobber(analyzer=NaiveBayesAnalyzer())

t_df.loc[:,'sentiment'] = t_df.apply(lambda row: tb(row['reviewText']).sentiment, axis=1)

data_list = [(review['overall'], review['sentiment']) for i, review in t_df.iterrows()]

class_vs_ovrl = dict()

for entry in data_list:
    if entry[0] in class_vs_ovrl:
        class_vs_ovrl[entry[0]] += [entry[1].classification]
    else:
        class_vs_ovrl[entry[0]] = [entry[1].classification]

count_dict = { 1: None, 2: None, 3: None, 4: None, 5: None }
for k, v in class_vs_ovrl.items():
    count_dict[k] = Counter(v)
    
neg_counts = [v['neg'] for k, v in count_dict.items()]
pos_counts = [v['pos'] for k, v in count_dict.items()]

print(count_dict)

N = 5
ind = np.arange(N)
width = 0.35

p1 = plt.bar(ind, neg_counts, width)
p2 = plt.bar(ind, pos_counts, width, bottom=neg_counts)

plt.ylabel('Classification (Pos | Neg)')
plt.title('Classification by Overall Rating')
plt.xlabel('Overall Rating')
plt.xticks(ind, ('1', '2', '3', '4', '5'))
plt.legend((p1[0], p2[0]), ('Neg', 'Pos'))

plt.show()

Classification vs Overall Rating

Cyclical Event Analysis

By Nana Tsujikawa

For this section, I put together a series of cyclical analyses based around months and days of the week. The main library used was datetime; with its help, I was able to retrieve the day of the week, month, and year from the Unix time column in the dataset. Additionally, Matplotlib and NumPy were used to create the graphs.

In order to categorise the information into specific time periods, I needed to sort the rows into months or days using the `unixReviewTime` column. For this I simply used a two-dimensional array: one dimension for the month or day of the week, the other for the rows that belong to it. Furthermore, in section C of Part 2, I used a dictionary to store the rows for each year.

def getDay(unix_time): #Monday is 0 and Sunday is 6
    d = date.fromtimestamp(unix_time)
    return d.weekday()

def getRowsforEachDay(imported_list):
    days = [ [] for i in range(7) ]
    for row in imported_list:
        day = getDay(df['unixReviewTime'].iloc[row])
        days[day].append(row)
    return days
    
def getMonth(unix_time): #January is 1 and December is 12
    d = date.fromtimestamp(unix_time)
    return d.month

def getRowsforEachMonth(imported_list):
    months = [ [] for i in range(12) ]
    for row in imported_list:
        month = getMonth(df['unixReviewTime'].iloc[row])
        months[month - 1].append(row)
    return months

Before starting the analysis, I created an overall graph of the number of reviews made over the years available in the data set, which span mid-1999 to mid-2014. To prevent the line graph from dipping sharply from 17,500 to 0 at the end, I excluded 2014 from this set. I used the unixReviewTime column for this section and simply counted the number of rows per year.

def getNumberofItemPerList(imported_list):
    count = []
    for i in range(len(imported_list)):
        count.append(len(imported_list[i]))
    return count

Number of reviews made per year

The spikes form a sequential pattern, mainly around the beginning and end of each year, showing an increase in the number of reviews made during those periods.
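The year bucketing behind this graph is implemented later in getRowsforEachYear(); a self-contained sketch of the per-year counting, using a few hypothetical mid-year timestamps (mid-year values avoid timezone edge cases), might look like:

```python
from collections import Counter
from datetime import date

# hypothetical unix timestamps standing in for df['unixReviewTime']
# (mid-2000, two in mid-2001, mid-2014)
timestamps = [959817600, 991353600, 991440000, 1401580800]

# count reviews per year
year_counts = Counter(date.fromtimestamp(ts).year for ts in timestamps)

# exclude the partial final year (2014), as described above
year_counts.pop(2014, None)
print(dict(year_counts))  # {2000: 1, 2001: 2}
```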

To further investigate these spikes, I examined the trend over the twelve-month period across all years from 1999 to 2014. I used the unixReviewTime and overall columns for this section. The average overall rating was calculated in a separate function that returns an array of the average rating for each month.

def getAverageOverallRating(imported_list):
    average = []
    for i in range(len(imported_list)):
        average.append(df[['overall']].iloc[imported_list[i]].mean(axis=0))
    return average

Average overall rating

For the next set of analyses, I studied user activity and involvement in reviewing products, particularly the effort put into reviews over the weekend. I used the unixReviewTime and reviewText columns for this section. The function getAverageReviewWordCount() returns an array of the average number of words in the review text (reviewText). I had to extract each review's word count into a separate array before calculating its mean. Looking back, I could have separated these steps into different functions.

def getAverageReviewWordCount(imported_list):
    average_word_count = []
    for i in range(len(imported_list)):
        review_text_count = []
        for j in imported_list[i]:
            temp = df['reviewText'].iloc[j].split()
            review_text_count.append(len(temp))
        average_word_count.append(np.mean(review_text_count))
    return average_word_count

average word count of reviews
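The word-count step above reduces to splitting each review on whitespace and averaging the lengths; a minimal self-contained illustration (the review strings are made up):

```python
# hypothetical review texts standing in for df['reviewText']
reviews = ["Great game, highly recommended", "Meh"]

# word count per review, then the mean across reviews
counts = [len(text.split()) for text in reviews]
average = sum(counts) / len(counts)
print(average)  # 2.5
```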

Influence of Important Events

By Nana Tsujikawa

In this section, I used a dictionary to store the rows for each year. I did this because the number of years in the data set was unknown, unlike days of the week (7) or months (12).

def getYear(unix_time):
    d = date.fromtimestamp(unix_time)
    return d.year

def getRowsforEachYear():
    year_dict = {}
    for i in range(len(df)):
        year = getYear(df['unixReviewTime'].iloc[i])
        if year in year_dict:
            year_dict[year].append(i)
        else:
            year_dict[year] = [i]
    return year_dict

Prior to beginning my analysis of the influence of events, I researched gaming history and trends over the years 1999 to 2014. The graph Number of Reviews Made Per Year shows little to no significant increase in the early 2000s, which drew my attention to the state of PC gaming during that decade.

Considering the data set consists of PC game reviews, I decided to examine the trend in the number of reviews over the 2000s.

number of review made per year

As there was no significant activity during the years 2002 to 2006, I graphed these years together to show the relationship among all five over the 12-month period. I used the function getValuesfromKeyInDict() shown below to convert the values (rows) of a key (year) to a list, then getRowsforEachMonth() to separate the rows into months, and finally getNumberofItemPerList() to get the number of reviews per month.

def getValuesfromKeyInDict(imported_dict, key):
    return list(imported_dict[key])

number of reviews made per month

Further research showed 2007 to be a major year in PC gaming history due to certain game releases. To show the slow increase over the years 2007–2009, I kept 2006 in the graph to highlight the difference. To produce this graph, I used the same method explained earlier for the Number of Reviews Made Per Month (2002–2006) graph.

number of reviews made per month

Part 3: Evaluation

The sentiment analysis revealed interesting qualities about the data. The polarity metric shows that 5-star reviews tend to be fairly neutral. This could be because someone giving an honest, positive review of a product will also mention some of its negative aspects. It is no surprise that 1-star reviews had a negative polarity, as most poor reviews tend to be overly critical. The box-and-whisker plot reveals the complexity of this data set: for all five ratings the median polarity was around 0.00 to 0.25. Once again, note that the 1-star reviews' median is negative while the other ratings' median polarities are all positive. Furthermore, each set of reviews (grouped by overall rating) contained both 1.00 and -1.00 polarity scores.

Switching to a Naive Bayes sentiment analyzer produced additional analytics not captured by the polarity score. It showed a dramatic increase in positive reviews across the range of overall ratings: 5-star reviews had approximately 100,000 positive reviews while the 1-star and 2-star reviews had approximately 10,000 each. The Naive Bayes analyzer produces a positive probability and a negative probability for each review, then classifies the review as positive or negative based on whichever is larger; the chart in the analysis uses this classification property. The graph also shows that users are more likely to give a product a 5-star rating even when the sentiment of their review is negative: there were approximately 20,000 negative 5-star reviews, more than in either the 1-star or 2-star review set.
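That classify-by-maximum rule can be sketched on its own (the probabilities below are hypothetical stand-ins for the analyzer's p_pos and p_neg fields, not values from the data set):

```python
def classify(p_pos, p_neg):
    """Label a review 'pos' or 'neg' by whichever probability is larger,
    mirroring how a classification is derived from p_pos and p_neg."""
    return 'pos' if p_pos >= p_neg else 'neg'

# a 5-star review can still come out negative if its text skews negative
print(classify(0.35, 0.65))  # neg
print(classify(0.80, 0.20))  # pos
```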

During the holiday season, around December to February, we expected to see an increase in reviews and perhaps a trend in the overall rating of games. The graph Number of Reviews Made Per Year shows a noticeable spike every year in the number of reviews made, predominantly at the bold vertical grid lines. The graph Average Overall Rating shows in more detail that around December to February the overall average rating is higher: comparing January at 4.14 to September at 4.04 gives a difference of 0.10.

We also looked at activity over the days of the week. Our hypothesis was that over the weekend users would have more time to write a thorough review of a product, resulting in a higher average word count for reviews.

However, as shown in the graph Average Word Count of Reviews, I was surprised to see the complete opposite: a marked decrease in review activity over the weekend. My assumption is that users are more likely to spend their weekend playing the game than reviewing the product. The graph shows the number of reviews made falls below 30,000 on weekends, compared with over 36,000 on a Tuesday. Additionally, the average number of words in a review is lower by about 14 on Friday compared to Tuesday. A common marketing tactic called the "Free Weekend", as the name suggests, lets users trial a game for free over the weekend. This tactic gives users a taste of the product, and lets them decide over the following weekdays whether it is something they wish to purchase.

A major event that influenced PC gaming was the release of consoles that became popular in the early 2000s. The PlayStation 2, released in 2000, sold 155,000,000 units; the GameCube, released in 2001, sold 21,740,000; and the Xbox, also released in 2001, sold 24,000,000. Following these releases over 2000 and 2001, consoles gained popularity, which meant an overall decline for PC gaming.

I found that between the years 2002 and 2006 there was no rise in the number of reviews, as seen in the graph Number of Reviews Made Per Month (2002–2006). However, in 2007 PC gaming accelerated with new games in the shooter and strategy genres. Key releases include The Orange Box, BioShock, Call of Duty 4, and World of Warcraft: The Burning Crusade, which drew more attention to PC gaming. The graph Number of Reviews Made Per Month (2006–2009) shows that 2007 broke the early-2000s pattern, with the number of reviews increasing slowly over the following years.

Based on these findings in Part 2 sections B and C, it is clear that user activity follows a pattern over certain periods. The number of reviews made in a given time period correlates with the events we compared against, and we also see more activity from users who, on average, spend time writing more descriptive reviews.

Reflection

The selected data set offered a lot of potential for data analytics. One negative aspect was simply the number of records: processing took a significant amount of time, and some statistics were difficult to generate with such a large data set. Random sampling could have been used, but that might have skewed the accuracy of the analysis. A positive aspect of the data was the use of Unix time, which made cyclical event analysis easy to run.

Furthermore, a more in-depth sentiment analysis could have been achieved with an analyzer trained on consumer product review data instead of movie review data. The graph produced from the Naive Bayes analyzer highlights the sheer volume of reviews processed and thus supports the above-mentioned issue.

The analysis has direct implications for consumers and users. When leaving a review, it is important to be as accurate as possible with the overall rating and to be thorough about both the positive and negative aspects of the product. It also implies that users considering a purchase should be cognizant of reviews at all score levels, since each carries both positive and negative sentiment.

Written with StackEdit.
