@skyme5
Last active September 10, 2021 17:07
Extract ISBN from PDF files using pdftotext utility
#!/usr/bin/env ruby
# frozen_string_literal: true
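# Scan PDF files for ISBNs using pdftotext.
#
# This Ruby script scans the PDF files in the current directory, extracts
# the first valid ISBN found in each, and appends it to the filename in
# ISBN-13 form (`%filename%_[ISBN].pdf`).
#
# Run it from the folder that contains the PDFs. It needs the `pdfinfo` and
# `pdftotext` command-line tools on the PATH and the `lisbn` gem.
#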
require 'fileutils'
require 'logger'
require 'tmpdir'
require 'lisbn'
logger = Logger.new($stdout)
logger.level = Logger::DEBUG
# Collect PDFs that do not already carry an `_[ISBN]` suffix.
list = Dir.entries('.', encoding: 'UTF-8').select do |e|
  File.extname(e).downcase == '.pdf' && !e.match?(/_\[\d+\]\.pdf\z/i)
end
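Dir.mktmpdir('isbn') do |tmp|
  list.each do |pdf_file|
    # Pass arguments as arrays so filenames containing quotes or `$` never reach a shell.
    page_count = IO.popen(['pdfinfo', pdf_file], &:read)[/Pages:\s+(\d+)/, 1].to_i
    text_f = File.join(tmp, "#{pdf_file}_f.txt")
    text_l = File.join(tmp, "#{pdf_file}_l.txt")
    # Only the front and back of the book are converted to text, since that is
    # where an ISBN is typically printed.
    system('pdftotext', '-l', '20', '-enc', 'UTF-8', pdf_file, text_f)
    first_tail_page = [page_count - 20, 1].max # keep -f inside short documents
    system('pdftotext', '-f', first_tail_page.to_s, '-l', page_count.to_s,
           '-enc', 'UTF-8', pdf_file, text_l)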
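    # Skip any text file pdftotext failed to produce instead of aborting the run.
    text = [text_f, text_l].select { |f| File.exist?(f) }
                           .map { |f| File.read(f, encoding: 'UTF-8') }
                           .join("\n")
    # Candidates: an optional "ISBN", "-10"/"-13", and separator prefix,
    # followed by 10 to 19 digits, hyphens, or X.
    isbn = text.scan(/(?:ISBN)?(?:-)?(?:10|13)?(?:[:-]+|10|13)?([\d\-X]{10,19})/).flatten
    next if isbn.empty?

    # Strip separators, then keep only the candidates lisbn accepts as valid.
    isbn = isbn.map { |e| e.gsub(/[^\dX]+/, '') }
    isbn_check = isbn.select { |e| Lisbn.new(e).valid? }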
    if isbn_check.empty?
      logger.error(format('ISBN_INVALID [%s] -> %s', isbn.join(', '), pdf_file))
    else
      # Anchor on the real extension; an unescaped /.pdf/ would also match mid-name.
      new_name = pdf_file.sub(/\.pdf\z/i, "_[#{Lisbn.new(isbn_check.first).isbn13}].pdf")
      logger.info(format('ISBN Found -> "%s"', pdf_file))
      logger.info(format('RENAME_FILE "%s" -> "%s"', pdf_file, new_name))
      FileUtils.mv(pdf_file, new_name)
    end
  end
  puts 'Done with the task. Press Enter to exit ...'
  $stdin.gets
end
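
The script leaves validation and ISBN-13 conversion to the lisbn gem. As a rough illustration of what that step involves, here is a minimal sketch of the standard ISBN-10 and ISBN-13 check-digit rules, applied to the same candidate pattern the script uses. The helper names (`isbn10_valid?`, `isbn13_valid?`, `isbn10_to_13`, `first_valid_isbn13`) are hypothetical and are not part of lisbn or of the script; the gem's internals may well differ.

```ruby
# frozen_string_literal: true

# Hypothetical helpers for illustration only; the script itself relies on lisbn.

# Same candidate pattern as the script: optional label and separators, then
# 10 to 19 characters drawn from digits, hyphens, and X.
ISBN_CANDIDATE = /(?:ISBN)?(?:-)?(?:10|13)?(?:[:-]+|10|13)?([\d\-X]{10,19})/.freeze

# ISBN-10: weights run 10 down to 1 and the weighted sum must be divisible
# by 11. "X" stands for 10 and is accepted only in the last position.
def isbn10_valid?(digits)
  return false unless digits.match?(/\A\d{9}[\dX]\z/)

  sum = digits.chars.each_with_index.sum do |ch, i|
    (ch == 'X' ? 10 : ch.to_i) * (10 - i)
  end
  (sum % 11).zero?
end

# Weighted sum used by ISBN-13: weights alternate 1, 3, 1, 3, ...
def isbn13_sum(digits)
  digits.chars.each_with_index.sum { |ch, i| ch.to_i * (i.even? ? 1 : 3) }
end

# ISBN-13: the weighted sum over all 13 digits must be divisible by 10.
def isbn13_valid?(digits)
  digits.match?(/\A\d{13}\z/) && (isbn13_sum(digits) % 10).zero?
end

# Convert a valid ISBN-10 by prefixing 978, dropping the old check digit,
# and computing a new one.
def isbn10_to_13(digits)
  body = "978#{digits[0, 9]}"
  "#{body}#{(10 - isbn13_sum(body) % 10) % 10}"
end

# Return the first candidate in +text+ that passes a checksum, as ISBN-13,
# or nil when nothing validates.
def first_valid_isbn13(text)
  text.scan(ISBN_CANDIDATE).flatten.each do |raw|
    digits = raw.gsub(/[^\dX]+/, '')
    return digits if isbn13_valid?(digits)
    return isbn10_to_13(digits) if isbn10_valid?(digits)
  end
  nil
end
```

With these definitions, `first_valid_isbn13('ISBN-10: 0-306-40615-2')` and `first_valid_isbn13('ISBN: 978-0-306-40615-7')` both return `"9780306406157"`, while a digit run that fails the checksum is skipped. Checking the checksum before renaming is what keeps phone numbers and other long digit strings in the extracted text from ending up in a filename.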