nathanhinchey/imdb_data.md

# Data Exploration App for IMDb Metadata
## Overview
The basic concept of the app is to take IMDb metadata and allow visual data exploration with it. The data available includes movie titles, the year each movie was made, its IMDb rating, and the number of user votes behind that rating.
## Front End Appearance
Users will be presented with two main ways to use the app: ratings by year and A vs B comparison.
### Ratings by year

Users can select which genres they are interested in, combined by either conjunction (films matching all of the selected genres) or disjunction (films matching any of them), and what range of years they want to see. A scatterplot will then be generated (using D3.js) to show how the quality of movies in their genre selections, as measured by IMDb rating, has changed over time.
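The conjunction/disjunction choice can be expressed as a single query whose `HAVING` threshold changes. The following is only a minimal sketch, not the app's actual code: the table names `movies` and `movie_genres`, their columns, the function names, and the demo rows are all hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a few made-up films (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT, year INTEGER, rating REAL);
        CREATE TABLE movie_genres (movie_id INTEGER, genre TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO movies (id, title, year, rating) VALUES (?, ?, ?, ?)",
        [(1, "Alpha", 1984, 7.0), (2, "Beta", 1987, 5.5),
         (3, "Gamma", 1995, 6.2), (4, "Delta", 1999, 8.1),
         (5, "Epsilon", 2005, 6.0)],
    )
    conn.executemany(
        "INSERT INTO movie_genres (movie_id, genre) VALUES (?, ?)",
        [(1, "Sci-Fi"), (1, "Action"), (2, "Sci-Fi"), (3, "Action"),
         (4, "Sci-Fi"), (4, "Action"), (5, "Sci-Fi"), (5, "Action")],
    )
    return conn


def movies_by_genres(conn, genres, start_year, end_year, match_all=True):
    """Return (year, rating) points for films in the year range that match the genres."""
    placeholders = ",".join("?" for _ in genres)
    # Conjunction: a film must carry every selected genre.
    # Disjunction: a film needs at least one of them.
    required = len(genres) if match_all else 1
    query = f"""
        SELECT m.year, m.rating
        FROM movies AS m
        JOIN movie_genres AS g ON g.movie_id = m.id
        WHERE g.genre IN ({placeholders})
          AND m.year BETWEEN ? AND ?
        GROUP BY m.id, m.year, m.rating
        HAVING COUNT(DISTINCT g.genre) >= ?
        ORDER BY m.year, m.rating
    """
    return conn.execute(query, (*genres, start_year, end_year, required)).fetchall()
```

Calling `movies_by_genres(build_demo_db(), ["Sci-Fi", "Action"], 1980, 1999)` returns only the films tagged with both genres, while passing `match_all=False` returns every film in the range tagged with either one. The resulting `(year, rating)` pairs are the kind of points the scatterplot would draw.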
### A vs B Comparison

Users can select from a variety of categorizations: years or year ranges, genres, ratings, or number of user votes. Those categories will be used to define two sets of films to compare.

EXAMPLE1: science fiction/action films in the '80s VS science fiction/action films in the '90s
EXAMPLE2: romance films rated at least 6.0/10 VS action films rated below 4.0/10.

Next, users select which metrics they want to compare, such as ratings or how many films were made in a given set of years. The app should allow a wide variety of comparisons, and we're trusting users to recognize when a comparison is useless, so it will permit some strange or even pointless ones (e.g. comparing the ratings of films that were selected based on their ratings).

EXAMPLE1: See a color-coded bar comparing the percentage of high/medium/low ratings for those two decades of sci-fi.
EXAMPLE2: See what percentage of each of those sets was made before 1975.
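The two example metrics above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the cutoffs separating low, medium, and high ratings (5.0 and 7.0) are placeholders chosen here, not values from the design, and the function names are hypothetical.

```python
from collections import Counter


def rating_bucket(rating, low_cutoff=5.0, high_cutoff=7.0):
    """Classify a rating as low, medium, or high (cutoffs are assumed placeholders)."""
    if rating < low_cutoff:
        return "low"
    if rating < high_cutoff:
        return "medium"
    return "high"


def bucket_percentages(ratings):
    """Return the percentage of ratings falling in each bucket."""
    buckets = ("low", "medium", "high")
    if not ratings:
        return {name: 0.0 for name in buckets}
    counts = Counter(rating_bucket(r) for r in ratings)
    return {name: 100.0 * counts[name] / len(ratings) for name in buckets}


def percent_before(years, cutoff_year):
    """Return the percentage of films made strictly before cutoff_year."""
    if not years:
        return 0.0
    return 100.0 * sum(1 for y in years if y < cutoff_year) / len(years)
```

Running `bucket_percentages` once per set (A and B) gives the segment sizes for the two color-coded bars, and `percent_before(years, 1975)` gives the figure described in the second example.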

## Architecture

The back end of the app will be built in Flask using Flask-RESTful. It will provide an API consumed by a React.js app, which will use D3.js to handle charts and graphs. The app will be hosted on AWS Elastic Beanstalk (because I don't want to worry too much over the details) and will use Amazon RDS for the SQL database containing the collected metadata. Because this database won't need to be written to very often, the extra write cost of maintaining indexes is a minor concern, so all of the columns except for number of votes and rating will be indexed for faster lookups.
I've made a quick prototype of the parser that will populate the SQL database. I used an in-memory sqlite3 database as a stand-in for an actual Amazon database because that was the easiest to play with locally. Every SQL feature I've used is widely supported, so porting it to a different SQL solution should be trivial.
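For illustration, a loader along these lines might look like the sketch below. It is not the prototype itself: the tab-separated input format, the `movies` table layout, and the function names are assumptions made for this example. The indexes follow the plan of indexing every column except rating and number of votes.

```python
import sqlite3

# Hypothetical schema: every column except rating and votes gets an index.
SCHEMA = """
CREATE TABLE movies (
    id     INTEGER PRIMARY KEY,
    title  TEXT    NOT NULL,
    year   INTEGER NOT NULL,
    rating REAL    NOT NULL,
    votes  INTEGER NOT NULL
);
CREATE INDEX idx_movies_title ON movies (title);
CREATE INDEX idx_movies_year  ON movies (year);
"""


def parse_line(line):
    """Parse one assumed tab-separated record: title, year, rating, votes."""
    title, year, rating, votes = line.rstrip("\n").split("\t")
    return title, int(year), float(rating), int(votes)


def load_movies(lines):
    """Build an in-memory SQLite database from an iterable of record lines."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Skip blank lines; every other line is expected to be a complete record.
    rows = [parse_line(line) for line in lines if line.strip()]
    conn.executemany(
        "INSERT INTO movies (title, year, rating, votes) VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn
```

Because the sketch sticks to plain `CREATE TABLE`, `CREATE INDEX`, and parameterized `INSERT` statements, moving it to another database should mostly mean swapping the connection, though details such as the placeholder style (`?` here) vary between drivers.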