@pathouse
Last active April 11, 2023 20:09
IMDbPro has a database of more than 22,000 production companies. Unfortunately, its search and sorting tools are poor. This is a very simple first draft of a script that scrapes those pages for info and writes the results to a CSV file. That is 4,401 pages of data (50 companies per page), so it takes a while. An IMDbPro subscription…
require 'mechanize'
require 'csv'

PAGES = (1..4401)
LOGIN_PAGE = "https://secure.imdb.com/signup/v4/login"
BASE_URL = 'http://pro.imdb.com/companies/type-production'

# Page 1 is the bare listing URL; later pages are offset in increments of 50.
def page_url(page)
  return "" if page == 1
  "?start=#{(page - 1) * 50}"
end

# Prompt for IMDbPro credentials and log in through the login form.
def pro_login(agent)
  puts "EMAIL: "
  uname = gets.chomp
  puts "\nPASSWORD: "
  password = gets.chomp
  page = agent.get(LOGIN_PAGE)
  form = page.form("f")
  form.login = uname
  form.password = password
  agent.submit(form)
end

agent = Mechanize.new
pro_login(agent)

# CSV.open in "w" mode creates the file if it does not already exist.
spreadsheet = CSV.open("production_companies.csv", "w")
spreadsheet << ["RANK", "COMPANY", "LOCATION", "CONTACT"]
added_count = 0

puts "\nLOCATION: "
location = gets.chomp

PAGES.each_with_index do |number, index|
  page = agent.get(BASE_URL + page_url(number))
  puts "-- PAGE #{index + 1} of 4401 --"
  # Result rows alternate between two styling classes.
  dark_rows = page.search("tr.chartdark")
  light_rows = page.search("tr.chartlight")
  all_rows = dark_rows + light_rows
  all_rows.each do |row|
    data = row.search("td")
    # Column 2 holds the company's location; keep only rows matching the filter.
    if data[2].text =~ /#{Regexp.escape(location)}/
      spreadsheet << data.map { |d| d.text }
      added_count += 1
      puts "#{data[1].text} added to list. (##{added_count})"
    end
  end
end

spreadsheet.close
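
The `page_url` helper above drives IMDb's offset-based pagination: 50 companies per page, with page N starting at offset (N − 1) × 50. A minimal standalone check of that arithmetic, reusing the helper exactly as defined in the script:

```ruby
# Offset-based pagination: page 1 is the bare listing URL,
# every later page appends ?start=(N - 1) * 50.
def page_url(page)
  return "" if page == 1
  "?start=#{(page - 1) * 50}"
end

puts page_url(1)    # => "" (first page, no query string)
puts page_url(2)    # => "?start=50"
puts page_url(4401) # => "?start=220000" (last page: 4400 * 50)
```

This is why the script iterates a fixed `(1..4401)` range rather than following "next" links: the total page count was known up front, so each page URL can be computed directly.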
dcts commented Apr 11, 2023

This is outdated. Does anyone have a script that works in 2023? :)
