HKGx/an-article-about-identifying-scenes-from-a-movie.md

## an-article-about-identifying-scenes-from-a-movie.md

      
Introduction

I competed in Google Code-in and spotted a neat task from CCExtractor.
The task's title was "How can we identify movies based on scenes in them?" and I'm going to answer that question.
First thoughts

The first thing that came to my mind was to split a video into frames using FFmpeg. After splitting, we could perhaps test our input against those frames.
But that's a disastrous idea!
Saving every single frame of a video is just a waste of our precious disk space.
A 24-minute video, when split into individual frames, ended up... jamming up my entire drive.
And then we have to deal with comparing the frames. How in the world are you gonna do that?!
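To get a feel for the numbers, here is a rough back-of-the-envelope sketch. The frame rate (24 fps) and the average size of a saved frame (150 KB) are assumptions picked for illustration, not measurements from my run, and both function names are made up.

```python
def frame_count(minutes: float, fps: float) -> int:
    """Number of frames in a video of the given length."""
    return round(minutes * 60 * fps)


def total_size_gb(frames: int, bytes_per_frame: int) -> float:
    """Disk space in gigabytes (10**9 bytes) if every frame is saved as its own file."""
    return frames * bytes_per_frame / 10**9


# Assumptions: 24 fps and roughly 150 KB per saved frame.
frames = frame_count(minutes=24, fps=24)
print(frames)                                    # 34560
print(round(total_size_gb(frames, 150_000), 2))  # 5.18
```

Larger or less compressed frames push that total up quickly, and every one of those files would still have to be compared against the input.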
The better idea

But maybe we can think of something better?
Maybe we can grab only one frame out of every n and make our algorithm more tolerant of small differences?
Sure thing! There are algorithms called image hashes that work that way.
I found a nicely done article about perceptual hashing. I read it a few times and then I found a Python library that does the work!
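The library call hides the idea, so here is a minimal pure-Python sketch of the simplest member of the family, an average hash. It is not the DCT-based pHash that the library computes, and the tiny 4x4 "frame" is a made-up grid of grayscale values. Real implementations first shrink the image to a small fixed size such as 8x8.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the image mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (p > mean)
    return bits


# Hypothetical 4x4 grayscale frame: dark on the left, bright on the right.
base = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [12, 22, 202, 212],
    [18, 28, 208, 218],
]
# The same frame, uniformly brightened by 20.
brighter = [[min(255, p + 20) for p in row] for row in base]
# A different frame: the bright half is on the other side.
mirrored = [row[::-1] for row in base]

print(format(average_hash(base), "016b"))      # 0011001100110011
print(format(average_hash(brighter), "016b"))  # 0011001100110011
print(format(average_hash(mirrored), "016b"))  # 1100110011001100
```

The brightened frame keeps the same hash because every pixel and the mean shift together, while the mirrored frame flips every bit.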
Now, how are we going to store those hashes?
For decent performance, I presume we can keep a database table that holds each hash as a 64-bit primary key together with a list of movie names.
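A minimal sketch of that storage idea with SQLite follows. It rests on a few assumptions: the table and helper names are hypothetical, the hashes and movie names are made up, and instead of a list column it stores one row per (hash, movie) pair. SQLite integers are signed 64-bit, so an unsigned 64-bit hash has to be mapped into that range first.

```python
import sqlite3


def hex_to_signed64(hex_hash: str) -> int:
    """Map a 16-digit hex hash onto SQLite's signed 64-bit INTEGER range."""
    value = int(hex_hash, 16)
    return value - (1 << 64) if value >= (1 << 63) else value


def signed64_to_hex(value: int) -> str:
    """Inverse of hex_to_signed64."""
    return format(value & ((1 << 64) - 1), "016x")


db = sqlite3.connect(":memory:")
# Hypothetical schema: one row per (hash, movie) pair instead of a list column.
db.execute(
    "CREATE TABLE hashes ("
    " hash INTEGER NOT NULL,"
    " movie TEXT NOT NULL,"
    " PRIMARY KEY (hash, movie))"
)

# Made-up hash values and movie names, for illustration only.
rows = [
    ("ffd8a1c3b2e49071", "Movie A"),
    ("ffd8a1c3b2e49071", "Movie B"),  # the same frame hash can occur in two movies
    ("0123456789abcdef", "Movie A"),
]
db.executemany(
    "INSERT OR IGNORE INTO hashes(hash, movie) VALUES (?, ?)",
    [(hex_to_signed64(h), m) for h, m in rows],
)

key = hex_to_signed64("ffd8a1c3b2e49071")
movies = [m for (m,) in db.execute(
    "SELECT movie FROM hashes WHERE hash = ? ORDER BY movie", (key,))]
print(movies)  # ['Movie A', 'Movie B']
```

An exact-key lookup like this only finds identical hashes; near matches still need a distance comparison.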
It kinda works

I was able to build a small prototype. Its performance isn't the best, because it's just a prototype, BUT IT'S WORKING!
It correctly identified that a slightly oversaturated image is not very different from the base one.
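"Not very different" here means a small Hamming distance between the two hashes, which is what subtracting two ImageHash objects returns. Below is a sketch of turning that into a lookup using only the standard library. The function names, the made-up hashes, and the threshold of 10 bits are all assumptions; a real threshold would have to be tuned on actual frames.

```python
def hamming_hex(a: str, b: str) -> int:
    """Number of differing bits between two equal-length hex hashes."""
    return bin(int(a, 16) ^ int(b, 16)).count("1")


def best_match(target: str, stored: dict[str, str], max_distance: int = 10):
    """Return (movie, distance) for the closest stored hash, or None if nothing is close enough."""
    if not stored:
        return None
    candidates = [(hamming_hex(target, h), movie) for h, movie in stored.items()]
    distance, movie = min(candidates)
    return (movie, distance) if distance <= max_distance else None


# Made-up 64-bit hashes mapped to made-up movie names.
stored = {
    "ffd8a1c3b2e49071": "Movie A",
    "0123456789abcdef": "Movie B",
}
print(best_match("ffd8a1c3b2e49073", stored))  # ('Movie A', 1)
print(best_match("aaaaaaaaaaaaaaaa", stored))  # None
```

This scans every stored hash, so it is only suitable for a prototype-sized table.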
Other methods

Do they exist? Of course. Developers regularly find new ways to do something!
A somewhat similar approach is used by https://trace.moe to find anime based on a scene. They're using a color layout descriptor algorithm for this.
And there is even an ML approach to this problem, and it works. During my further investigation, I stumbled upon a research paper that used deep learning to extract features from an image (link).
Stranger approach

Apropos machine learning, maybe we can think of another way to accomplish our task? Maybe a more theoretical one?
I'm not an expert on the matter (to be fair, I've never tried ML in my life), but what if we could train our algorithm to just recognize movies?
I'm probably totally overcomplicating the idea, but what if?
What if we could teach our program to recognize movies and say which movie a frame came from?
Yeah, what if...

  
## config.py
IMG_DIR = "/home/user/some-dir/frames/"
HASHES_DB = "/home/user/some-dir/some-file.db"
TARGET = "/home/user/some-dir/frames/certain-frame.jpg"

## main.py
import config
from PIL import Image
import imagehash
from os import listdir
from os.path import isfile, join
import sqlite3

db = sqlite3.connect(config.HASHES_DB)
c = db.cursor()

# The hash is stored as hex text; parsing it into a 64-bit integer is left for later.
c.execute("""CREATE TABLE IF NOT EXISTS "Hash" (
  "id" INTEGER PRIMARY KEY AUTOINCREMENT,
  "hash" TEXT NOT NULL,
  "movie" TEXT NOT NULL
)
""")

MOVIE = "DR STONE EPISODE 24"

imgs = [f for f in sorted(listdir(config.IMG_DIR)) if isfile(join(config.IMG_DIR, f)) and f.endswith(".jpg")]

hashes = []

# For testing, restrict this loop to only a few images, as hashing every frame takes a long time.
# Parallelizing it with multiprocessing might come in handy in the future.
for i in imgs:
    # Close each file as soon as it has been hashed.
    with Image.open(join(config.IMG_DIR, i)) as img:
        h = imagehash.phash(img)  # an imagehash.ImageHash; str(h) is its hex form
    print(f"name: {i}")
    hashes.append((str(h), MOVIE))

c.executemany("INSERT INTO Hash(hash, movie) VALUES(?, ?)", hashes)
# Without a commit, the inserted rows are discarded when the connection closes.
db.commit()

with Image.open(config.TARGET) as target_img:
    target = imagehash.phash(target_img)

# Print the Hamming distance between the target and every stored hash (0 means identical hashes).
for idx, row in enumerate(c.execute("SELECT * from Hash")):
    print(f"curr idx: {idx+1}")
    print(target - imagehash.hex_to_hash(row[1]))

db.close()

## requirements.txt
numpy
ImageHash
pillow