@mindyng
Created June 10, 2017 02:55
The first step after downloading and unzipping the dataset was to import all 8 .csv files, each as its own pandas
data frame. Each data frame holds one review per row and has 4 columns (from left to right): “Review Score”,
“Tail of Review URL”, “Review Title” and “Review Text”.
All reviews were then combined into one big data frame to make data wrangling easier, for example when applying a
function to every review at once.
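The loading and combining steps above can be sketched roughly as follows. This is a minimal sketch, not the original
code: the helper name load_reviews, the snake_case column names, the comma delimiter and the absence of a header row
are all assumptions, so adjust sep, header and names to match the actual files.

```python
from pathlib import Path

import pandas as pd

# Column order follows the text; the snake_case names themselves are an assumption.
COLUMNS = ["review_score", "url_tail", "review_title", "review_text"]


def load_reviews(folder, pattern="*.csv", sep=","):
    """Read each matching file into its own data frame, then stack them into one."""
    paths = sorted(Path(folder).glob(pattern))
    # Assumes the files have no header row; drop header=None and names= if they do.
    frames = [pd.read_csv(p, sep=sep, header=None, names=COLUMNS) for p in paths]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)
```

Calling load_reviews("path/to/unzipped/dataset") would return a single data frame with one review per row. Note that
pd.concat raises a ValueError if no files match the pattern.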
The “Review Score” and “Review Text” columns were then pulled out into their own variables, since these are the main
objects handled by the Machine Learning algorithm.
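Pulling the two columns out might look like the sketch below. The function name split_scores_and_text and the column
names review_score and review_text are hypothetical; use whatever names the combined data frame actually carries.

```python
import pandas as pd


def split_scores_and_text(reviews, score_col="review_score", text_col="review_text"):
    """Return the score column and the text column as two separate pandas Series."""
    scores = reviews[score_col]  # assumed to be the target for the model
    texts = reviews[text_col]    # raw review text, still containing HTML at this point
    return scores, texts
```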
Because each “Review Text” entry contained HTML tags, the text was pre-processed to prepare it for the Machine
Learning algorithm. Each entry in the “Review Text” column is a string object, so string methods were applied to
strip, replace and translate it into pure text form, with neither HTML tags nor punctuation.
The processed strings were then collected into one big list holding all reviews from all books.
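One way to sketch the cleaning step is shown below. It is not the original code: it swaps in a regular expression for
the tag removal (the text describes plain string methods such as replace), uses str.translate to drop ASCII
punctuation, and collapses leftover whitespace with split and join. The names clean_review and clean_reviews are
hypothetical, and the tag pattern is a rough heuristic rather than a real HTML parser.

```python
import re
import string

# Rough pattern for anything that looks like an HTML tag, e.g. <p>, </b>, <br/>.
TAG_RE = re.compile(r"<[^>]+>")
# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_review(text):
    """Return the review as plain text with no HTML tags and no punctuation."""
    no_tags = TAG_RE.sub(" ", text)            # replace each tag with a space
    no_punct = no_tags.translate(PUNCT_TABLE)  # remove punctuation characters
    return " ".join(no_punct.split())          # trim and collapse whitespace


def clean_reviews(texts):
    """Clean every review and collect the results in one list."""
    return [clean_review(t) for t in texts]
```

Because punctuation is deleted rather than replaced, contractions lose their apostrophe ("Don't" becomes "Dont"),
which may or may not be what a downstream model wants.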