@FdelMazo
Forked from galli-leo/tmdbdump.py
Last active October 30, 2019 19:16
Converts the JSON files of an entire TMDB database dump into a single CSV. Usage: ./tmdbdump_to_csv.py && ./tmdbdump_to_csv_merge.sh
#!/usr/bin/env python3
# 1. Download TMDB's database with @galli-leo's script
# https://gist.github.com/galli-leo/6398f9128ffc20af70c6c7eedfeb0a65
# 2. Run python3 tmdbdump_to_csv.py
import pandas as pd
import json
import os
def jsonToDict(filename):
    # Load one dumped JSON file and tag it with the id taken from its filename
    with open(filename, encoding='utf-8') as f:
        dic = json.load(f)
    dic['tmdb_id'] = os.path.basename(filename).split('.')[0]
    return dic
files = os.listdir('TMDBDUMP')
files.sort(key=lambda x: int(x.split('.')[0]))
dics = []
for i, f in enumerate(files, 1):
    print("Dumping {} ({} of {})".format(f, i, len(files)))
    dics.append(jsonToDict('TMDBDUMP/' + f))
print("Creating dataset...")
df = pd.DataFrame(dics)
df['tmdb_id'] = pd.to_numeric(df['tmdb_id'])
df = df.set_index('tmdb_id')
print("Sorting...")
df = df.sort_index()
fname = "tmdbdump.csv"
print(f"Saving as {fname} ...")
df.to_csv(fname)
print("Done!")
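
A possible follow-up, not part of the original gist: nested JSON fields (lists or dicts) are written by to_csv as their Python string representation, so read_csv returns them as plain strings. The sketch below shows one way to load tmdbdump.csv again and turn such columns back into Python objects with ast.literal_eval. The helper names parse_nested and load_dump are hypothetical, and the genres column in the demo is an assumption about what the TMDB JSON contains.

```python
import ast
import io

import pandas as pd


def parse_nested(value):
    """Turn a stringified list/dict cell back into a Python object."""
    if not isinstance(value, str):
        return value  # NaN (missing value) or something already parsed
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value  # ordinary text that is not a Python literal


def load_dump(source, nested_columns=()):
    """Read the CSV written by the script above, indexed by tmdb_id."""
    df = pd.read_csv(source, index_col="tmdb_id")
    for col in nested_columns:
        if col in df.columns:  # silently skip columns the dump lacks
            df[col] = df[col].map(parse_nested)
    return df


# Tiny in-memory stand-in for tmdbdump.csv, so the sketch runs without the dump.
sample = pd.DataFrame(
    [{"tmdb_id": 2, "title": "Ariel", "genres": [{"id": 18, "name": "Drama"}]}]
).set_index("tmdb_id")
buffer = io.StringIO()
sample.to_csv(buffer)
buffer.seek(0)

restored = load_dump(buffer, nested_columns=["genres"])
print(restored.loc[2, "genres"][0]["name"])  # prints: Drama
```

With the real file, the call would be load_dump("tmdbdump.csv", nested_columns=[...]), listing whichever columns actually hold lists or dicts in the dump.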