Pixel aspect ratios - a potential issue in object detection?


Updated: Jan 25, 2021

Quick write-up about pixel aspect ratios in video files, and how they can affect performance in an object detection pipeline. I found very little good information about this online, so I decided to write this to summarize my findings.

TL;DR: scroll down to the "final solution" code for how to deal with this issue using Python/OpenCV.


Topics:

  • Pixel aspect ratios
  • Why it is (possibly) a problem in object detection
  • How to deal with it when using OpenCV and Python
  • When even VLC gets it wrong... Special case.

Background - not all pixels are created equal

In short: for some videos, especially older ones, the pixels of the image are not square with a 1:1 ratio. Instead, the pixels are "stretched out" before they're shown on a display. This image from Wikipedia highlights the phenomenon.

In the above example both images have the storage aspect ratio (SAR) 1:2, since the image is 4×8 pixels. However, the right image has the display aspect ratio (DAR) 1:1 and pixel aspect ratio (PAR) 2:1. The formula here is:

SAR × PAR = DAR

For an image in a film, the conversion that needs to happen might look like this (SAR is exaggerated here to show the effect):

Note! Some sources (e.g. OpenCV) refer to the storage aspect ratio as picture aspect ratio and the pixel aspect ratio as sample aspect ratio. This is very confusing, since the acronyms PAR and SAR get flipped... but the computation remains the same.
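The formula can be verified with Python's `fractions` module, using the Wikipedia example above (SAR 1:2, PAR 2:1):

```python
from fractions import Fraction

sar = Fraction(1, 2)  # storage aspect ratio: the image is 4×8 pixels
par = Fraction(2, 1)  # pixel aspect ratio: each pixel is twice as wide as it is tall
dar = sar * par       # SAR × PAR = DAR
print(dar)            # 1 — i.e. the image is displayed square (DAR 1:1)
```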

Examples of PAR-values I got in my testing:

  • 1:1 (this is by far the most common)
  • 59:54
  • 901:768
  • 263:256
  • 415744:415125 (what???)
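Even the innocuous-looking ratios above are far enough from 1:1 to matter. A quick computation (using the PAR values from the list above):

```python
from fractions import Fraction

pars = [
    Fraction(1, 1),
    Fraction(59, 54),        # ≈ 9% horizontal stretch
    Fraction(901, 768),      # ≈ 17% horizontal stretch
    Fraction(263, 256),
    Fraction(415744, 415125),  # nearly 1:1, but not quite
]
for par in pars:
    print(f"{par} ≈ {float(par):.4f}")
```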

Problem for an object detector?

This might seem trivial, but let's quickly cover it regardless. If the aspect ratios are off and the detector wasn't trained on stretched images, detection performance might suffer. However:

  • Most (NN) object detectors use a static square input size like 512×512.

Thus, if the images are scaled to a square size in the end, does it even matter? Almost not, but: in preprocessing we mostly use aspect-ratio-preserving padding (letterboxing) instead of brute-force scaling. As such, in object detection we should deal with pixel aspect ratios like this:

In conclusion: we should properly deal with pixel aspect ratios before doing object detection on frames from a video file.
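To see why padding preserves (and thus exposes) any aspect-ratio error, here is a minimal letterbox-padding sketch. The 512×512 target size and the `pad_to_square` helper are illustrative, not from any particular detector; the resize is done with plain NumPy indexing to keep the sketch self-contained:

```python
import numpy as np

def pad_to_square(img: np.ndarray, size: int = 512) -> np.ndarray:
    """Scale the longer side to `size`, then zero-pad the shorter side.
    The input's aspect ratio is preserved, so a wrongly stretched frame
    stays wrongly stretched inside the padded square."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    # Nearest-neighbour resize via index arrays (NumPy-only sketch).
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = img[rows][:, cols]
    out = np.zeros((size, size, img.shape[2]), dtype=img.dtype)
    out[:new_h, :new_w] = resized
    return out

frame = np.ones((240, 320, 3), dtype=np.uint8)  # a 4:3 frame
padded = pad_to_square(frame)
print(padded.shape)  # (512, 512, 3)
```

A brute-force `cv2.resize` to 512×512 would instead distort every non-square input by the same amount regardless of its PAR, which is why it partially masks the issue.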


How to deal with it using Python and OpenCV.

Problem: OpenCV doesn't deal with the problem on its own.

Required pip packages: numpy, opencv-python

Example of how not to do it, the naive approach:
import cv2
import numpy as np

cap = cv2.VideoCapture("films/my-favorite-film.mp4")

# Compute storage aspect ratio (SAR)
storage_w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
storage_h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
sar = storage_w / storage_h
print(f"Storage aspect ratio: {sar:.2f} (resolution: {storage_w}×{storage_h})")

# Mostly we would just do this.
# `frame_img` will have the shape (storage_h, storage_w, 3)
_, frame = cap.read()
frame_img: np.ndarray = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

# WRONG! Here we pass the SAR image.
objects = my_object_detector.pad_and_predict(frame_img)

cap.release()
Example of how it can be done:
import cv2
import numpy as np

cap = cv2.VideoCapture("films/my-favorite-film.mp4")

# Compute storage aspect ratio (SAR)
storage_w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
storage_h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
sar = storage_w / storage_h
print(f"Storage aspect ratio: {sar:.2f} (resolution: {storage_w}×{storage_h})")

# Compute pixel aspect ratio (PAR)
# NOTE: OpenCV calls this "sample aspect ratio", hence why it says SAR...
# We're still gonna call it PAR though, like Wikipedia.
numerator = cap.get(cv2.CAP_PROP_SAR_NUM) or 1.0
denominator = cap.get(cv2.CAP_PROP_SAR_DEN) or 1.0
par = numerator / denominator

# Compute the required resolution with DAR aspect ratio.
dar = sar * par
dar_h = storage_h
dar_w = round(dar_h * dar)
print(f"Display aspect ratio: {dar:.2f} (resolution: {dar_w}×{dar_h})")

# Read image in storage resolution
_, frame = cap.read()

# If needed, scale it to DAR!
if dar_w != storage_w:
    frame = cv2.resize(frame, (dar_w, dar_h))
    assert frame.shape[0] == dar_h and frame.shape[1] == dar_w

frame_img: np.ndarray = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

# CORRECT! Now we can perform object detection.
# Here, `frame_img` will have the shape (dar_h, dar_w, 3)
objects = my_object_detector.pad_and_predict(frame_img)

cap.release()

Bonus: can we do better than VLC?

While investigating this, I used VLC media player as a reference. The best media player out there should always be able to show movies correctly, right? Perhaps not...

It turns out that:

  • The above approach is as good as VLC and works 99% of the time.
  • But both VLC and the above approach sometimes get it wrong.
  • Sometimes there is more metadata to be read, and we can do better if we use it.

Example of a special case like this:

In the image above, 3 examples are shown for a movie with extra metadata where weird stuff is going on. The left image is what we get if we never consider pixel aspect ratios (the naive approach). The middle image uses the example approach above (or VLC); this stretches the image too far. The last image uses pymediainfo, which handles the additional metadata in this video file better.

Final, best solution:

Required pip packages: numpy, opencv-python, pymediainfo

The code:
import json

import cv2
import numpy as np
from pymediainfo import MediaInfo

video_path = "films/my-favorite-film.mp4"
cap = cv2.VideoCapture(video_path)

storage_w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
storage_h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

try:
    # --------- Using MediaInfo -----------
    video_info_json_str = MediaInfo.parse(video_path, output="JSON")
    tracks = json.loads(video_info_json_str)["media"]["track"]
    video_info = next(track for track in tracks if track["@type"].lower() == "video")
    dar_string = video_info["DisplayAspectRatio_String"]

    if ":" in dar_string:
        num, den = [float(s) for s in dar_string.split(":")]
        dar = num / den
    else:
        dar = float(dar_string)
except Exception:
    # ------ "VLC" approach as fallback -------
    sar = storage_w / storage_h
    numerator = cap.get(cv2.CAP_PROP_SAR_NUM) or 1.0
    denominator = cap.get(cv2.CAP_PROP_SAR_DEN) or 1.0
    par = numerator / denominator
    dar = sar * par

# Compute the required resolution with DAR aspect ratio.
dar_h = storage_h
dar_w = round(dar_h * dar)
print(f"Display aspect ratio: {dar:.2f} (resolution: {dar_w}×{dar_h})")

# Read image in storage resolution
_, frame = cap.read()

# If needed, scale it to DAR!
if dar_w != storage_w:
    frame = cv2.resize(frame, (dar_w, dar_h))
    assert frame.shape[0] == dar_h and frame.shape[1] == dar_w

frame_img: np.ndarray = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

# CORRECT! This is even better than the last solution!
# Here, `frame_img` will have the shape (dar_h, dar_w, 3)
objects = my_object_detector.pad_and_predict(frame_img)

cap.release()

Tags: pixel aspect ratio, sample aspect ratio, picture aspect ratio, display aspect ratio, storage aspect ratio, object detection, face detection, video aspect ratio, opencv, python, compute display aspect ratio.
