Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save hrishioa/08786a8689593aa390b209083cfaaf44 to your computer and use it in GitHub Desktop.
Save hrishioa/08786a8689593aa390b209083cfaaf44 to your computer and use it in GitHub Desktop.
Claude 2 and GPT-4 Analysis of the Movies dataset from Kaggle (https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset/).

AI Object to DB

Claude-2

All the key information in this file was generated by an LLM. Treat it as a starting point, don't ever run auto-generated code without a sandbox unless you have checked it yourself.

The primary intent for this analysis is to go from an object to all the tools you need to use this library. The hardest part is the object-to-relational conversion, of converting a (potentially) nested json into flat tables that can be queried.

We will generate the following things:

  1. A typespec - We need good type definition for our incoming object. You might already have this, in which case make sure to provide a string version to the function to treat this as a starting point.
  2. DDL - The SQLite table descriptors that can hold the information in the object.
  3. Structured DDL - This is an internal type that is used by the library to add additional information - like example values, potential min and max ranges, which fields are visible to the LLM converting natural language to query, etc. Most of this is metadata intended for the library and the LLM, and are stripped out when creating the tables.
  4. ObjectToRow - This is ostensibly the hardest function out of the bunch, so treat it as a starting point. It should do the drudge work of writing a function to turn every field in the JSON into a flat string array that sqlite can ingest.

Typespec

interface Genre {
  id: number;
  name: string; 
}

interface ProductionCompany {
  name: string;
  id: number;
}

interface ProductionCountry {
  iso_3166_1: string;
  name: string;
}

interface SpokenLanguage {
  iso_639_1: string;
  name: string;
}

interface Movie {
  adult: string;
  budget: string;
  genres: Genre[];
  id: string;
  imdb_id: string;
  original_language: string;
  original_title: string;
  overview: string;
  popularity: string;
  production_companies: ProductionCompany[];
  production_countries: ProductionCountry[];
  release_date: string;
  revenue: string;
  runtime: string;
  spoken_languages: SpokenLanguage[];
  status: string;
  tagline: string;
  title: string;
  vote_average: string;
  vote_count: string;
}

SQL Tables

-- Main Movies Table
CREATE TABLE Movies (
  id TEXT, -- Movie ID
  adult TEXT, -- Is adult content
  budget_int INT, -- Budget as integer
  original_language TEXT, -- Original language
  original_title TEXT, -- Original title  
  overview TEXT, -- Overview description
  popularity_float FLOAT, -- Popularity as float
  release_date TEXT, -- Release date  
  revenue_int INT, -- Revenue as integer
  runtime_int INT, -- Runtime as integer 
  status TEXT, -- Status
  tagline TEXT, -- Tagline
  title TEXT, -- Title
  vote_average_float FLOAT, -- Vote average as float
  vote_count_int INT -- Vote count as integer  
);

-- Genres Table
CREATE TABLE Genres (
  movie_id TEXT, -- Corresponding movie ID
  genre_id INT, -- Genre ID
  genre_name TEXT -- Genre name
);

-- Production Companies Table
CREATE TABLE ProductionCompanies (
  movie_id TEXT, -- Corresponding movie ID 
  company_id INT, -- Company ID
  company_name TEXT -- Company name  
);

-- Production Countries Table 
CREATE TABLE ProductionCountries (
  movie_id TEXT, -- Corresponding movie ID
  country_code TEXT, -- Country code
  country_name TEXT -- Country name
);

-- Spoken Languages Table
CREATE TABLE SpokenLanguages (
  movie_id TEXT, -- Corresponding movie ID
  language_code TEXT, -- Language code
  language_name TEXT -- Language name
);

Structured DDL

const moviesTable: DDLTable = {
  name: 'Movies',
  columns: [
    {
      name: 'id',
      columnSpec: 'TEXT',
      description: 'Movie ID',
    },
    {
      name: 'adult',
      columnSpec: 'TEXT',
      description: 'Is adult content',
    },
    {
      name: 'budget_int',
      columnSpec: 'INT',
      description: 'Budget as integer',
    },
    {
      name: 'original_language',
      columnSpec: 'TEXT',
      description: 'Original language',
    },
    {
      name: 'original_title',
      columnSpec: 'TEXT', 
      description: 'Original title',
    },
    {
      name: 'overview',
      columnSpec: 'TEXT',
      description: 'Overview description',
    },
    {
      name: 'popularity_float',
      columnSpec: 'FLOAT',
      description: 'Popularity as float',
    },
    {
      name: 'release_date',
      columnSpec: 'TEXT',
      description: 'Release date',
    },
    {
      name: 'revenue_int',
      columnSpec: 'INT',
      description: 'Revenue as integer',
    },
    {
      name: 'runtime_int',
      columnSpec: 'INT',
      description: 'Runtime as integer',
    },
    {
      name: 'status',
      columnSpec: 'TEXT',
      description: 'Status',
    },
    {
      name: 'tagline',
      columnSpec: 'TEXT',
      description: 'Tagline',
    },
    {
      name: 'title',
      columnSpec: 'TEXT',
      description: 'Title',
    },
    {
      name: 'vote_average_float',
      columnSpec: 'FLOAT',
      description: 'Vote average as float',
    },
    {
      name: 'vote_count_int',
      columnSpec: 'INT',
      description: 'Vote count as integer',
    },
  ]
};

const genresTable: DDLTable = {
  name: 'Genres',
  columns: [
    {
      name: 'movie_id',
      columnSpec: 'TEXT',
      description: 'Corresponding movie ID',
      foreignKey: {
        table: 'Movies',
        column: 'id'
      }
    },
    {
      name: 'genre_id',
      columnSpec: 'INT',
      description: 'Genre ID',
    },
    {
      name: 'genre_name',
      columnSpec: 'TEXT',
      description: 'Genre name',
    }
  ]
};

const productionCompaniesTable: DDLTable = {
  name: 'ProductionCompanies',
  columns: [
    {
      name: 'movie_id',
      columnSpec: 'TEXT',
      description: 'Corresponding movie ID',
      foreignKey: {
        table: 'Movies',
        column: 'id'
      }
    },
    {
      name: 'company_id',
      columnSpec: 'INT',
      description: 'Company ID',
    },
    {
      name: 'company_name',
      columnSpec: 'TEXT',
      description: 'Company name',
    }
  ]
};

const productionCountriesTable: DDLTable = {
  name: 'ProductionCountries',
  columns: [
    {
      name: 'movie_id',
      columnSpec: 'TEXT',
      description: 'Corresponding movie ID',
      foreignKey: {
        table: 'Movies',
        column: 'id'
      }
    },
    {
      name: 'country_code',
      columnSpec: 'TEXT',
      description: 'Country code',
    },
    {
      name: 'country_name',
      columnSpec: 'TEXT',
      description: 'Country name',
    }
  ]
};

const spokenLanguagesTable: DDLTable = {
  name: 'SpokenLanguages',
  columns: [
    {
      name: 'movie_id',
      columnSpec: 'TEXT',
      description: 'Corresponding movie ID',
      foreignKey: {
        table: 'Movies',
        column: 'id'
      }
    },
    {
      name: 'language_code',
      columnSpec: 'TEXT',
      description: 'Language code',
    },
    {
      name: 'language_name',
      columnSpec: 'TEXT',
      description: 'Language name',
    }
  ]
};

Object to Rows function

import { Database, SqlValue } from 'sql.js';

function objectToRows(movies: Movie[]): string[][][] {

  const rows: string[][][] = [];

  movies.forEach(movie => {

    // Movies table
    const movieRow = [
      movie.id,
      movie.adult,
      movie.budget ? movie.budget.toString() : '0',
      movie.original_language,
      movie.original_title,
      movie.overview,
      movie.popularity ? movie.popularity.toString() : '0',
      movie.release_date,
      movie.revenue ? movie.revenue.toString() : '0',  
      movie.runtime ? movie.runtime.toString() : '0',
      movie.status,
      movie.tagline,
      movie.title,
      movie.vote_average ? movie.vote_average.toString() : '0',
      movie.vote_count ? movie.vote_count.toString() : '0'
    ];
    rows.push([movieRow]);

    // Genres table
    if (movie.genres) {
      movie.genres.forEach(genre => {
        const genreRow = [
          movie.id,
          genre.id,
          genre.name
        ];
        rows.push([genreRow]);  
      });
    }
    
    // Production companies table
    if (movie.production_companies) {
      movie.production_companies.forEach(company => {
        const companyRow = [
          movie.id,
          company.id,
          company.name  
        ];
        rows.push([companyRow]);
      });
    }

    // Production countries table
    if (movie.production_countries) {
      movie.production_countries.forEach(country => {
        const countryRow = [
          movie.id,
          country.iso_3166_1,
          country.name
        ];
        rows.push([countryRow]);
      });
    }

    // Spoken languages table
    if (movie.spoken_languages) {
      movie.spoken_languages.forEach(language => {
        const languageRow = [
          movie.id,
          language.iso_639_1,
          language.name
        ];
        rows.push([languageRow]);
      });
    }

  });

  return rows;
}

See [Examples] for an example of how to use this code to create an instance of the search engine.

Appendix

Input Object

{
  "adult": "False",
  "budget": "30000000",
  "genres": [
    {
      "id": 16,
      "name": "Animation"
    },
    {
      "id": 35,
      "name": "Comedy"
    },
    {
      "id": 10751,
      "name": "Family"
    }
  ],
  "id": "862",
  "imdb_id": "tt0114709",
  "original_language": "en",
  "original_title": "Toy Story",
  "overview": "Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.",
  "popularity": "21.946943",
  "production_companies": [
    {
      "name": "Pixar Animation Studios",
      "id": 3
    }
  ],
  "production_countries": [
    {
      "iso_3166_1": "US",
      "name": "United States of America"
    }
  ],
  "release_date": "1995-10-30",
  "revenue": "373554033",
  "runtime": "81.0",
  "spoken_languages": [
    {
      "iso_639_1": "en",
      "name": "English"
    }
  ],
  "status": "Released",
  "tagline": "",
  "title": "Toy Story",
  "vote_average": "7.7",
  "vote_count": "5415"
}

Reasoning for table structure

Here is how I would structure SQLite tables to hold the movie data following the guidelines:

Main Table: Movies

This will be the main table holding the core movie information.

CREATE TABLE Movies (
  id TEXT, -- Movie ID
  adult TEXT, -- Is adult content
  budget_int INT, -- Budget as integer 
  original_language TEXT, -- Original language 
  original_title TEXT, -- Original title
  overview TEXT, -- Overview description
  popularity_float FLOAT, -- Popularity as float
  release_date TEXT, -- Release date
  revenue_int INT, -- Revenue as integer
  runtime_int INT, -- Runtime as integer
  status TEXT, -- Status 
  tagline TEXT, -- Tagline
  title TEXT, -- Title
  vote_average_float FLOAT, -- Vote average as float
  vote_count_int INT -- Vote count as integer
);
  • I used descriptive column names with suffixes to indicate the datatype where it may be ambiguous like integers or floats.
  • The core movie info is in this single table.

Genres Table

The genres data can be flattened into a single table since it's a simple 1-to-many relationship.

CREATE TABLE Genres (
  movie_id TEXT, -- Corresponding movie ID
  genre_id INT, -- Genre ID
  genre_name TEXT -- Genre name
);
  • This contains the genre information for each movie in a flattened form, with movie ID copied for each genre to maintain the relationship.

Production Companies Table

Same flattening approach for production companies:

CREATE TABLE ProductionCompanies (
  movie_id TEXT, -- Corresponding movie ID
  company_id INT, -- Company ID 
  company_name TEXT -- Company name
);

Production Countries Table

And same for production countries:

CREATE TABLE ProductionCountries (
  movie_id TEXT, -- Corresponding movie ID
  country_code TEXT, -- Country code
  country_name TEXT -- Country name  
);

Spoken Languages Table

And finally, the spoken languages:

CREATE TABLE SpokenLanguages (
  movie_id TEXT, -- Corresponding movie ID
  language_code TEXT, -- Language code
  language_name TEXT -- Language name
);

So in summary, one main Movies table containing the core info, and additional flattened tables for the 1-to-many relationships to simplify the structure. The movie_id column is used to maintain the relationship between the main Movies table and the other tables.

Some columns may need comments for additional clarity on things like _int suffixes indicating data type. But overall this provides a reasonably simple normalized structure for the data.

AI Object to DB

GPT-4

All the key information in this file was generated by an LLM. Treat it as a starting point, don't ever run auto-generated code without a sandbox unless you have checked it yourself.

The primary intent for this analysis is to go from an object to all the tools you need to use this library. The hardest part is the object-to-relational conversion, of converting a (potentially) nested json into flat tables that can be queried.

We will generate the following things:

  1. A typespec - We need good type definition for our incoming object. You might already have this, in which case make sure to provide a string version to the function to treat this as a starting point.
  2. DDL - The SQLite table descriptors that can hold the information in the object.
  3. Structured DDL - This is an internal type that is used by the library to add additional information - like example values, potential min and max ranges, which fields are visible to the LLM converting natural language to query, etc. Most of this is metadata intended for the library and the LLM, and are stripped out when creating the tables.
  4. ObjectToRow - This is ostensibly the hardest function out of the bunch, so treat it as a starting point. It should do the drudge work of writing a function to turn every field in the JSON into a flat string array that sqlite can ingest.

Typespec

type Movie = {
  adult: string;
  budget: string;
  genres: Array<{
    id: number;
    name: string;
  }>;
  id: string;
  imdb_id: string;
  original_language: string;
  original_title: string;
  overview: string;
  popularity: string;
  production_companies: Array<{
    name: string;
    id: number;
  }>;
  production_countries: Array<{
    iso_3166_1: string;
    name: string;
  }>;
  release_date: string;
  revenue: string;
  runtime: string;
  spoken_languages: Array<{
    iso_639_1: string;
    name: string;
  }>;
  status: string;
  tagline: string;
  title: string;
  vote_average: string;
  vote_count: string;
}

SQL Tables

-- Movies table holds the main information about each movie
CREATE TABLE Movies (
    id TEXT, -- Unique identifier for each movie
    adult TEXT, -- Indicates whether the movie is for adults or not
    budget TEXT, -- Budget of the movie
    imdb_id TEXT, -- Unique identifier for the movie in the IMDB database
    original_language TEXT, -- Original language of the movie
    original_title TEXT, -- Original title of the movie
    overview TEXT, -- Brief description of the movie
    popularity TEXT, -- Popularity score of the movie
    release_date TEXT, -- Release date of the movie
    revenue TEXT, -- Revenue of the movie
    runtime TEXT, -- Runtime of the movie
    status TEXT, -- Status of the movie
    tagline TEXT, -- Tagline of the movie
    title TEXT, -- Title of the movie
    vote_average TEXT, -- Average vote of the movie
    vote_count TEXT -- Vote count of the movie
);

-- Genres table holds the genres of each movie
CREATE TABLE Genres (
    movie_id TEXT, -- Foreign key linking to the Movies table
    genre_id INTEGER, -- ID of the genre
    genre_name TEXT -- Name of the genre
);

-- ProductionCompanies table holds the production companies of each movie
CREATE TABLE ProductionCompanies (
    movie_id TEXT, -- Foreign key linking to the Movies table
    company_id INTEGER, -- ID of the production company
    company_name TEXT -- Name of the production company
);

Structured DDL

export const Movies: DDLTable = {
  name: "Movies",
  columns: [
    {
      name: "id",
      columnSpec: "TEXT",
      description: "Unique identifier for each movie",
      visibleToLLM: true
    },
    {
      name: "adult",
      columnSpec: "TEXT",
      description: "Indicates whether the movie is for adults or not",
      visibleToLLM: true
    },
    {
      name: "budget",
      columnSpec: "TEXT",
      description: "Budget of the movie",
      visibleToLLM: true
    },
    {
      name: "imdb_id",
      columnSpec: "TEXT",
      description: "Unique identifier for the movie in the IMDB database",
      visibleToLLM: true
    },
    {
      name: "original_language",
      columnSpec: "TEXT",
      description: "Original language of the movie",
      visibleToLLM: true
    },
    {
      name: "original_title",
      columnSpec: "TEXT",
      description: "Original title of the movie",
      visibleToLLM: true
    },
    {
      name: "overview",
      columnSpec: "TEXT",
      description: "Brief description of the movie",
      visibleToLLM: true
    },
    {
      name: "popularity",
      columnSpec: "TEXT",
      description: "Popularity score of the movie",
      visibleToLLM: true
    },
    {
      name: "release_date",
      columnSpec: "TEXT",
      description: "Release date of the movie",
      visibleToLLM: true
    },
    {
      name: "revenue",
      columnSpec: "TEXT",
      description: "Revenue of the movie",
      visibleToLLM: true
    },
    {
      name: "runtime",
      columnSpec: "TEXT",
      description: "Runtime of the movie",
      visibleToLLM: true
    },
    {
      name: "status",
      columnSpec: "TEXT",
      description: "Status of the movie",
      visibleToLLM: true
    },
    {
      name: "tagline",
      columnSpec: "TEXT",
      description: "Tagline of the movie",
      visibleToLLM: true
    },
    {
      name: "title",
      columnSpec: "TEXT",
      description: "Title of the movie",
      visibleToLLM: true
    },
    {
      name: "vote_average",
      columnSpec: "TEXT",
      description: "Average vote of the movie",
      visibleToLLM: true
    },
    {
      name: "vote_count",
      columnSpec: "TEXT",
      description: "Vote count of the movie",
      visibleToLLM: true
    }
  ]
};

export const Genres: DDLTable = {
  name: "Genres",
  columns: [
    {
      name: "movie_id",
      columnSpec: "TEXT",
      description: "Foreign key linking to the Movies table",
      foreignKey: {
        table: "Movies",
        column: "id"
      },
      visibleToLLM: true
    },
    {
      name: "genre_id",
      columnSpec: "INTEGER",
      description: "ID of the genre",
      visibleToLLM: true
    },
    {
      name: "genre_name",
      columnSpec: "TEXT",
      description: "Name of the genre",
      visibleToLLM: true
    }
  ]
};

export const ProductionCompanies: DDLTable = {
  name: "ProductionCompanies",
  columns: [
    {
      name: "movie_id",
      columnSpec: "TEXT",
      description: "Foreign key linking to the Movies table",
      foreignKey: {
        table: "Movies",
        column: "id"
      },
      visibleToLLM: true
    },
    {
      name: "company_id",
      columnSpec: "INTEGER",
      description: "ID of the production company",
      visibleToLLM: true
    },
    {
      name: "company_name",
      columnSpec: "TEXT",
      description: "Name of the production company",
      visibleToLLM: true
    }
  ]
};

Object to Rows function

import { Movie } from './types';

function objectToRows(movies: Movie[]): string[][][] {
  const movieRows: string[][] = [];
  const genreRows: string[][] = [];
  const productionCompanyRows: string[][] = [];

  movies.forEach(movie => {
    // Movies table
    movieRows.push([
      movie.id || '',
      movie.adult || '',
      movie.budget || '',
      movie.imdb_id || '',
      movie.original_language || '',
      movie.original_title || '',
      movie.overview || '',
      movie.popularity || '',
      movie.release_date ? new Date(movie.release_date).toISOString() : '',
      movie.revenue || '',
      movie.runtime || '',
      movie.status || '',
      movie.tagline || '',
      movie.title || '',
      movie.vote_average || '',
      movie.vote_count || ''
    ]);

    // Genres table
    if (movie.genres) {
      movie.genres.forEach(genre => {
        genreRows.push([
          movie.id || '',
          genre.id.toString() || '',
          genre.name || ''
        ]);
      });
    }

    // ProductionCompanies table
    if (movie.production_companies) {
      movie.production_companies.forEach(company => {
        productionCompanyRows.push([
          movie.id || '',
          company.id.toString() || '',
          company.name || ''
        ]);
      });
    }
  });

  return [movieRows, genreRows, productionCompanyRows];
}

See [Examples] for an example of how to use this code to create an instance of the search engine.

Appendix

Input Object

{
  "adult": "False",
  "budget": "30000000",
  "genres": [
    {
      "id": 16,
      "name": "Animation"
    },
    {
      "id": 35,
      "name": "Comedy"
    },
    {
      "id": 10751,
      "name": "Family"
    }
  ],
  "id": "862",
  "imdb_id": "tt0114709",
  "original_language": "en",
  "original_title": "Toy Story",
  "overview": "Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.",
  "popularity": "21.946943",
  "production_companies": [
    {
      "name": "Pixar Animation Studios",
      "id": 3
    }
  ],
  "production_countries": [
    {
      "iso_3166_1": "US",
      "name": "United States of America"
    }
  ],
  "release_date": "1995-10-30",
  "revenue": "373554033",
  "runtime": "81.0",
  "spoken_languages": [
    {
      "iso_639_1": "en",
      "name": "English"
    }
  ],
  "status": "Released",
  "tagline": "",
  "title": "Toy Story",
  "vote_average": "7.7",
  "vote_count": "5415"
}

Reasoning for table structure

Given the guidelines, we can structure the data into three tables: Movies, Genres, and ProductionCompanies.

  1. Movies Table: This table will hold the main information about the movie.
CREATE TABLE Movies (
    id TEXT,
    adult TEXT,
    budget TEXT,
    imdb_id TEXT,
    original_language TEXT,
    original_title TEXT,
    overview TEXT,
    popularity TEXT,
    release_date TEXT,
    revenue TEXT,
    runtime TEXT,
    status TEXT,
    tagline TEXT,
    title TEXT,
    vote_average TEXT,
    vote_count TEXT
);
  1. Genres Table: This table will hold the genres of each movie. We will include the movie id as a foreign key to link it back to the Movies table.
CREATE TABLE Genres (
    movie_id TEXT, -- Foreign key linking to the Movies table
    genre_id INTEGER,
    genre_name TEXT
);
  1. ProductionCompanies Table: This table will hold the production companies of each movie. We will include the movie id as a foreign key to link it back to the Movies table.
CREATE TABLE ProductionCompanies (
    movie_id TEXT, -- Foreign key linking to the Movies table
    company_id INTEGER,
    company_name TEXT
);

In the Movies table, the id column will be the unique identifier for each movie. The adult column will indicate whether the movie is for adults or not. The budget column will hold the budget of the movie. The imdb_id column will hold the unique identifier for the movie in the IMDB database. The original_language column will hold the original language of the movie. The original_title column will hold the original title of the movie. The overview column will hold a brief description of the movie. The popularity column will hold the popularity score of the movie. The release_date column will hold the release date of the movie. The revenue column will hold the revenue of the movie. The runtime column will hold the runtime of the movie. The status column will hold the status of the movie. The tagline column will hold the tagline of the movie. The title column will hold the title of the movie. The vote_average column will hold the average vote of the movie. The vote_count column will hold the vote count of the movie.

In the Genres table, the movie_id column will hold the id of the movie. The genre_id column will hold the id of the genre. The genre_name column will hold the name of the genre.

In the ProductionCompanies table, the movie_id column will hold the id of the movie. The company_id column will hold the id of the production company. The company_name column will hold the name of the production company.

The movie_id in the Genres and ProductionCompanies tables will need comments to indicate that they are foreign keys linking back to the Movies table.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment