Skip to content

Instantly share code, notes, and snippets.

@Kerollmops
Last active January 23, 2023 16:15
Show Gist options
  • Save Kerollmops/75ffe0d3fd521deb00a42db48416b187 to your computer and use it in GitHub Desktop.
Save Kerollmops/75ffe0d3fd521deb00a42db48416b187 to your computer and use it in GitHub Desktop.
Export a nd-JSON/JSON stream from the multiple IMDB TSVs

We extract the movie titles along with the crew names and much more. You can also look at what BurntSushi have done for renaming series titles. The original IMDb datasets (TSVs) can be found on there official documentation.

Download and Construct an SQLite from the IMDB dataset

We use the imdb-sqlite python script to help us do that. The command line generates an SQLite database in the imdb.db file.

pip install imdb-sqlite
imdb-sqlite

Export a CSV from the SQLite

You must first install the sqlite3 command line and then open it.

apt install sqlite
sqlite3 imdb.db

You will then be able to do classic SQL requests to the database like the following one. But don't forget to setup the sqlite current session to export the lines to a CSV file.

.headers on
.mode csv
.output titles.csv
SELECT
	title AS 'localized-title',
	region AS 'region',
	is_original_title AS 'is-original-title',
	titles.primary_title AS 'primary-title',
	titles.title_id AS 'imdb-id',
	titles.type AS 'type',
	titles.premiered AS 'date',
	titles.is_adult AS 'is-adult',
	titles.runtime_minutes AS 'runtime-minutes',
	titles.genres AS 'genres',
	GROUP_CONCAT(DISTINCT people.name) AS 'crew-names',
	ratings.rating AS 'rating',
	ratings.votes AS 'votes-count',
	-- this is the unique document id
	hex(titles.primary_title || "-" || COALESCE(region, 'ALL')) AS 'uid'
FROM
	akas
	INNER JOIN titles ON akas.title_id = titles.title_id
	LEFT JOIN crew ON akas.title_id = crew.title_id
	LEFT JOIN people ON crew.person_id = people.person_id
	LEFT JOIN ratings ON akas.title_id = ratings.title_id
WHERE
	titles.type = "movie"
GROUP BY
	uid;

Transform the CSV file into a JSON stream

The CSV file contains some fields that are comma separated (i.e. crew-names, genres), it would be better to make it real arrays.

cargo install csv2ndjson-lite
cat titles.csv | csv2ndjson-lite --arrays 'genres' 'crew-names' --numbers 'rating' 'is-original-title' 'date' 'is-adult' 'runtime-minutes' 'votes-count' > out.ndjson
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment