We extract the movie titles along with the crew names and more. You can also look at what BurntSushi has done for renaming series titles. The original IMDb datasets (TSV files) can be found in their official documentation.
We use the `imdb-sqlite` Python script to help us do that. The command line generates an SQLite database in the `imdb.db` file.

```shell
pip install imdb-sqlite
imdb-sqlite
```
You must first install the `sqlite3` command line tool and then open the database with it.

```shell
apt install sqlite3
sqlite3 imdb.db
```
You will then be able to run classic SQL queries against the database, like the following one. But don't forget to set up the current `sqlite3` session to export the rows to a CSV file.

```sql
.headers on
.mode csv
.output titles.csv
SELECT
  title AS 'localized-title',
  region AS 'region',
  is_original_title AS 'is-original-title',
  titles.primary_title AS 'primary-title',
  titles.title_id AS 'imdb-id',
  titles.type AS 'type',
  titles.premiered AS 'date',
  titles.is_adult AS 'is-adult',
  titles.runtime_minutes AS 'runtime-minutes',
  titles.genres AS 'genres',
  GROUP_CONCAT(DISTINCT people.name) AS 'crew-names',
  ratings.rating AS 'rating',
  ratings.votes AS 'votes-count',
  -- this is the unique document id
  hex(titles.primary_title || '-' || COALESCE(region, 'ALL')) AS 'uid'
FROM
  akas
  INNER JOIN titles ON akas.title_id = titles.title_id
  LEFT JOIN crew ON akas.title_id = crew.title_id
  LEFT JOIN people ON crew.person_id = people.person_id
  LEFT JOIN ratings ON akas.title_id = ratings.title_id
WHERE
  titles.type = 'movie'
GROUP BY
  uid;
```
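The `uid` column deserves a word: SQLite's `hex()` returns the uppercase hexadecimal encoding of the UTF-8 bytes of `primary_title || '-' || region`, falling back to `ALL` when the region is NULL. A minimal Python sketch of the same computation, using made-up inputs for illustration:

```python
from typing import Optional

def make_uid(primary_title: str, region: Optional[str]) -> str:
    # Mirror the SQL expression: hex(primary_title || '-' || COALESCE(region, 'ALL')).
    # SQLite's hex() returns the uppercase hex of the value's UTF-8 bytes.
    return f"{primary_title}-{region or 'ALL'}".encode("utf-8").hex().upper()

# Hypothetical inputs, for illustration only.
print(make_uid("Carmencita", "US"))
print(make_uid("Carmencita", None))
```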
The CSV file contains some fields that are comma-separated (e.g. `crew-names`, `genres`); it would be better to turn them into real arrays.

```shell
cargo install csv2ndjson-lite
cat titles.csv | csv2ndjson-lite --arrays 'genres' 'crew-names' --numbers 'rating' 'is-original-title' 'date' 'is-adult' 'runtime-minutes' 'votes-count' > out.ndjson
```
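If you prefer not to install a Rust tool, the same reshaping can be sketched in a few lines of Python: split the comma-separated fields into arrays and cast the numeric fields. This is an illustrative equivalent under those assumptions, not the exact behavior of `csv2ndjson-lite`:

```python
import csv
import io
import json

# Field names taken from the export above; the sample row is made up.
ARRAYS = {"genres", "crew-names"}
NUMBERS = {"rating", "is-original-title", "date", "is-adult",
           "runtime-minutes", "votes-count"}

def csv_to_ndjson(csv_text: str) -> str:
    """Turn CSV rows into NDJSON lines, splitting array fields and casting numbers."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc = {}
        for key, value in row.items():
            if key in ARRAYS:
                doc[key] = value.split(",") if value else []
            elif key in NUMBERS:
                doc[key] = float(value) if value else None
            else:
                doc[key] = value
        lines.append(json.dumps(doc))
    return "\n".join(lines)

# Tiny hand-made sample, for illustration only.
sample = 'imdb-id,genres,rating\ntt0000001,"Documentary,Short",5.7\n'
print(csv_to_ndjson(sample))
```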