Kerollmops/export-imdb-sqlite.md

## export-imdb-sqlite.md

      
    Raw
  

              export-imdb-sqlite.md
            
          
    We extract the movie titles along with the crew names and much more.
You can also look at what BurntSushi have done for renaming series titles.
The original IMDb datasets (TSVs) can be found on there official documentation.
Download and Construct an SQLite from the IMDB dataset

We use the imdb-sqlite python script to help us do that. The command line generates an SQLite database in the imdb.db file.
pip install imdb-sqlite
imdb-sqlite
Export a CSV from the SQLite

You must first install the sqlite3 command line and then open it.
apt install sqlite
sqlite3 imdb.db
You will then be able to do classic SQL requests to the database like the following one.
But don't forget to setup the sqlite current session to export the lines to a CSV file.
.headers on
.mode csv
.output titles.csv
SELECT
	title AS 'localized-title',
	region AS 'region',
	is_original_title AS 'is-original-title',
	titles.primary_title AS 'primary-title',
	titles.title_id AS 'imdb-id',
	titles.type AS 'type',
	titles.premiered AS 'date',
	titles.is_adult AS 'is-adult',
	titles.runtime_minutes AS 'runtime-minutes',
	titles.genres AS 'genres',
	GROUP_CONCAT(DISTINCT people.name) AS 'crew-names',
	ratings.rating AS 'rating',
	ratings.votes AS 'votes-count',
	-- this is the unique document id
	hex(titles.primary_title || "-" || COALESCE(region, 'ALL')) AS 'uid'
FROM
	akas
	INNER JOIN titles ON akas.title_id = titles.title_id
	LEFT JOIN crew ON akas.title_id = crew.title_id
	LEFT JOIN people ON crew.person_id = people.person_id
	LEFT JOIN ratings ON akas.title_id = ratings.title_id
WHERE
	titles.type = "movie"
GROUP BY
	uid;
Transform the CSV file into a JSON stream

The CSV file contains some fields that are comma separated (i.e. crew-names, genres), it would be better to make it real arrays.
cargo install csv2ndjson-lite
cat titles.csv | csv2ndjson-lite --arrays 'genres' 'crew-names' --numbers 'rating' 'is-original-title' 'date' 'is-adult' 'runtime-minutes' 'votes-count' > out.ndjson