@sazio

sazio/blog.md Secret

Created August 28, 2020 14:21
What follows is a brief introduction to the Deepfake Detection Challenge on Kaggle.

"We are already at the point where you can't tell the difference between deepfakes and the real thing," Professor Hao Li, University of Southern California

Facebook has announced it will remove videos modified by artificial intelligence, known as deepfakes, from its platform.

https://gist.github.com/5b15c65b19880011ee553f217a1e056d

https://gist.github.com/3abf66b830b960ab91d8b1230472488f

Kaggle is an AirBnB for Data Scientists – this is where they spend their nights and weekends. It’s a crowd-sourced platform to attract, nurture, train and challenge data scientists from all around the world to solve data science, machine learning and predictive analytics problems. It has over 536,000 active members from 194 countries and receives close to 150,000 submissions per month. Founded in Melbourne, Australia, Kaggle moved to Silicon Valley in 2011, raised some 11 million dollars from the likes of Hal Varian (Chief Economist at Google), Max Levchin (PayPal), Index and Khosla Ventures, and was ultimately acquired by Google in March 2017. Kaggle is the number one stop for data science enthusiasts all around the world who compete for prizes and boost their Kaggle rankings. To date, there are only 94 Kaggle Grandmasters in the world.

Did you know that many data scientists start out as theorists and rarely get a chance to practice before being employed in the real world? Kaggle solves this problem by giving data science enthusiasts a platform to interact and compete in solving real-life problems. The experience you get on Kaggle is invaluable in preparing you to understand what goes into finding feasible solutions for big data.

Fine, fine, fine, but what do we do on Kaggle? We learn.

Deepfakes are fakes generated by deep learning. So far so easy.

This usually means someone used a generative model such as an autoencoder or, most likely, a Generative Adversarial Network (GAN for short). A GAN is technically two networks that work against each other, illustrated below. The artist (generator) draws its inspiration from a noise sample and creates a rendering of the data you are trying to generate with said GAN. The private investigator (discriminator) is randomly handed real and fake data to investigate.

The learning process is adversarial. The generator gets better at fooling the discriminator, and the discriminator gets better at figuring out which data is real and which isn't. In mathematical terms, they keep learning until a Nash equilibrium is reached, which means neither can improve by changing its strategy on its own. It's a really cool concept, and GANs are even used in scientific simulation at CERN.
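
For readers who prefer a formal statement, the game between generator and discriminator described above is usually written as the minimax objective from the original GAN formulation (Goodfellow et al., 2014). The symbols are not taken from this post: $G$ is the generator, $D$ the discriminator, $p_{\text{data}}$ the distribution of real data and $p_z$ the noise distribution the generator samples from.

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]$$

The discriminator tries to push $V$ up by scoring real samples near 1 and generated samples near 0, while the generator tries to push it down by making $D(G(z))$ close to 1.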

You can probably guess that they can be tricky to train, due to so many moving parts. This has become a very popular area of research, warranting a GAN Zoo of all named GANs. Some important keywords to check out if you're interested are Wasserstein GANs, gradient penalty, attention and, in this context, style transfer (namely face2face).

GAN from PhD thesis.

It sounds absurd, I know. Here you can find some more practical examples, why don't you play with them for a while?

Official Challenge on Kaggle

Official Website

  • I strongly encourage you to start with the official Getting Started guide.

  • What is the goal of the Deepfake Detection Challenge? According to the FAQ "The AI technologies that power deepfakes and other tampered media are rapidly evolving, making deepfakes so hard to detect that, at times, even human evaluators can’t reliably tell the difference. The Deepfake Detection Challenge is designed to incentivize rapid progress in this area by inviting participants to compete to create new ways of detecting and preventing manipulated media."

  • In this Code Competition:

    • CPU Notebook <= 9 hours run-time, GPU Notebook <= 9 hours run-time on Kaggle's P100 GPUs, No internet access enabled
    • External data is allowed up to 1 GB in size. External data must be freely & publicly available, including pre-trained models
  • This code competition's training set is not available directly on Kaggle, as its size is prohibitively large to train in Kaggle. Instead, it's strongly recommended that you train offline and load the externally trained model as an external dataset into Kaggle Notebooks to perform inference on the Test Set. Review Getting Started for more detailed information.

Scoring

Submissions are scored on log loss:

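$$\text{LogLoss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

where:

  • $n$ is the number of videos being predicted
  • $\hat{y}_i$ is the predicted probability of the video being FAKE
  • $y_i$ is 1 if the video is FAKE, 0 if REAL
  • $\log()$ is the natural (base e) logarithm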
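
To get a feel for the metric, the definition above can be sketched in a few lines of plain Python. This is only an illustration, not the official scorer: the function name `log_loss` and the clipping constant `eps` are assumptions added here so that `log(0)` never occurs.

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary log loss; labels are 1 for FAKE and 0 for REAL."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip the probability so log(0) never occurs (eps is an assumed safeguard).
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

labels = [1, 0, 1, 0]
always_half = log_loss(labels, [0.5] * 4)   # equals ln(2), about 0.693
confident_wrong = log_loss([1], [0.01])     # equals -ln(0.01), about 4.605
```

Predicting 0.5 for everything gives ln(2), which is the baseline to beat. A single confident mistake costs far more than that, which is why overconfident predictions are risky under this metric.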

https://gist.github.com/5f0f9e78681b172b388f537fd994fdaf

Data

  • We have a bunch of .mp4 files, split into compressed sets of ~10 GB apiece. A metadata.json accompanies each set of .mp4 files and contains the filename, label (REAL/FAKE), original and split columns, listed below under Metadata Columns.
  • The full training set is just over 470 GB (yes, it's huge!).

References: https://deepfakedetectionchallenge.ai/faqs

Dataset Description

There are 4 groups of datasets associated with this competition.

Training Set: This dataset, containing labels for the target, is available for download for competitors to build their models. It is broken up into 50 files, for ease of access and download. Due to its large size, it must be accessed through a GCS bucket which is only made available to participants after accepting the competition’s rules. Please read the rules fully before accessing the dataset, as they contain important details about the dataset’s permitted use. It is expected and encouraged that you train your models outside of Kaggle’s notebooks environment and submit to Kaggle by uploading the trained model as an external data source.

Public Validation Set: When you commit your Kaggle notebook, the submission file output that is generated will be based on the small set of 400 videos/ids contained within this Public Validation Set. This is available on the Kaggle Data page as test_videos.zip

Public Test Set: This dataset is completely withheld and is what Kaggle’s platform computes the public leaderboard against. When you “Submit to Competition” from the “Output” file of a committed notebook that contains the competition’s dataset, your code will be re-run in the background against this Public Test Set. When the re-run is complete, the score will be posted to the public leaderboard. If the re-run fails, you will see an error reflected in your “My Submissions” page. Unfortunately, we are unable to surface any details about your error, so as to prevent error-probing. You are limited to 2 submissions per day, including submissions with errors.

Private Test Set: This dataset is privately held outside of Kaggle’s platform, and is used to compute the private leaderboard. It contains videos similar in format and nature to those in the Training and Public Validation/Test Sets, but they are real, organic videos with and without deepfakes. After the competition deadline, Kaggle transfers your 2 final selected submissions’ code to the host. They will re-run your code against this private dataset and return prediction submissions back to Kaggle for computing your final private leaderboard scores.

https://gist.github.com/7c0d10ca66b424152857752a6045583e

https://gist.github.com/449cb61a4e9ded56aed2417469731259

Review of Data Files Accessible within kernel

Files

  • train_sample_videos.zip - a ZIP file containing a sample set of training videos and a metadata.json with labels. The full set of training videos is available through the links provided above.
  • sample_submission.csv - a sample submission file in the correct format.
  • test_videos.zip - a zip file containing a small set of videos to be used as a public validation set. To understand the datasets available for this competition, review the Getting Started information.
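
As a rough sketch of how a submission file could be produced, the snippet below writes one row per video. It assumes that sample_submission.csv has two columns, `filename` and `label`, with `label` holding the predicted probability of FAKE; check the real sample file before relying on this. The helper name `write_submission` and the clipping bounds 0.01 and 0.99 are hypothetical choices, not part of the competition.

```python
import csv
import io

def write_submission(predictions, out, low=0.01, high=0.99):
    """Write filename,label rows; label is the predicted probability of FAKE."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["filename", "label"])  # assumed header, verify against the sample file
    for filename in sorted(predictions):
        # Clipping limits the log loss penalty of a confident mistake.
        prob = min(max(predictions[filename], low), high)
        writer.writerow([filename, f"{prob:.4f}"])

buffer = io.StringIO()
write_submission({"b.mp4": 1.0, "a.mp4": 0.25}, buffer)
submission_text = buffer.getvalue()
```

In a notebook you would pass an open file such as `open("submission.csv", "w", newline="")` instead of the in-memory buffer used here.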

Metadata Columns

  • filename - the filename of the video
  • label - whether the video is REAL or FAKE
  • original - in the case that a train set video is FAKE, the original video is listed here
  • split - this is always equal to "train".
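
The columns above can be explored with the standard library alone. The sketch below assumes metadata.json is a JSON object keyed by video filename, with `label`, `split` and `original` fields per entry; the two filenames in the sample are made up for illustration, and `summarize_metadata` is a hypothetical helper.

```python
import json
from collections import Counter

# Hypothetical sample in the assumed layout: one entry per video filename.
SAMPLE_METADATA = """
{
  "fake_example.mp4": {"label": "FAKE", "split": "train", "original": "real_example.mp4"},
  "real_example.mp4": {"label": "REAL", "split": "train", "original": null}
}
"""

def summarize_metadata(metadata):
    """Count labels and group FAKE videos by the original they derive from."""
    label_counts = Counter(entry["label"] for entry in metadata.values())
    fakes_by_original = {}
    for filename, entry in metadata.items():
        if entry["label"] == "FAKE" and entry.get("original"):
            fakes_by_original.setdefault(entry["original"], []).append(filename)
    return label_counts, fakes_by_original

metadata = json.loads(SAMPLE_METADATA)
label_counts, fakes_by_original = summarize_metadata(metadata)
```

Grouping fakes by their original may be useful when building a validation split, so that a fake and the video it was derived from do not end up on opposite sides of the split.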

https://gist.github.com/fe800c25ff6f47a8d773ab60b0f97dcc

https://gist.github.com/6faf6c8b8502f4a8bce43d4da78684d7

Detection Starter Kit

A quickstart guide on DeepFakes: "DeepFakes and Beyond: A Survey of Face Manipulation and Fake Detection".

This CPU-only kernel is a Deep Fakes video EDA. It relies on a static FFmpeg build to read and extract data from videos.

  • It extracts metadata, which tells us the frame rate, dimensions and audio format (we can ignore the "display_ratio" leak, as it will be fixed).
  • It extracts video frames as PNG.
  • It extracts the audio track as AAC (disabled).
  • It compares a few face detectors (OpenCV HaarCascade, MTCNN). More to come (Yolo, BlazeFace, DLib, Faced, ...).
  • It provides basic statistics on faces per video, face width/height and face detection confidence. It computes an average face width/height.

We notice that face detection (with OpenCV currently) is far from perfect. An additional stage to clean up detected faces is required before training a model! Some kind of voting or ensembling across different detectors might help.

In this kernel you will also see some interesting edge cases of face detection:

  • Face detected on a t-shirt.
  • Face detected on a background board.
  • Face detected inside a face.
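
One way the voting idea could look is sketched below: a box is kept only if enough detectors report an overlapping box, which would drop isolated false positives like the ones listed above. This is not what the kernel does. The functions `iou` and `vote_boxes`, the thresholds, and the sample boxes are all assumptions made for illustration, and boxes are taken to be `(x1, y1, x2, y2)` tuples.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def vote_boxes(detections, min_votes=2, iou_threshold=0.5):
    """Keep boxes that at least `min_votes` detectors agree on.

    `detections` holds one list of boxes per detector.
    """
    kept = []
    for i, boxes in enumerate(detections):
        for box in boxes:
            # One vote for this detector, plus one per other detector with an overlapping box.
            votes = 1 + sum(
                any(iou(box, other) >= iou_threshold for other in other_boxes)
                for j, other_boxes in enumerate(detections) if j != i
            )
            # Skip boxes already represented in the output by an earlier detector.
            duplicate = any(iou(box, k) >= iou_threshold for k in kept)
            if votes >= min_votes and not duplicate:
                kept.append(box)
    return kept

# Made-up detections: the second Haar box plays the role of a false positive.
haar_boxes = [(100, 100, 200, 200), (400, 50, 440, 90)]
mtcnn_boxes = [(105, 98, 205, 198)]
agreed = vote_boxes([haar_boxes, mtcnn_boxes])
```

With these sample boxes only the face that both detectors found survives. A stricter `iou_threshold` or a confidence-weighted vote are obvious variations to try.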

FFMPEG and FFPROBE

https://gist.github.com/4f25e8348502fc53109d0bc7012abe7e

What is ffprobe? Basically, ffprobe gathers information from multimedia streams and prints it in human- and machine-readable fashion.
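
For the machine-readable side, a minimal sketch of calling ffprobe from Python and reading its JSON output might look like this. It assumes an `ffprobe` binary is on the PATH (the kernel uses a static build, so the path would differ there), and the helper names and the sample values are invented for illustration.

```python
import json
import subprocess
from fractions import Fraction

def ffprobe_command(path):
    """Build an ffprobe call that prints container and stream info as JSON."""
    return ["ffprobe", "-v", "error", "-print_format", "json",
            "-show_format", "-show_streams", path]

def summarize_probe(probe):
    """Pull frame rate, dimensions and audio codec out of parsed ffprobe JSON."""
    summary = {"bit_rate": int(probe.get("format", {}).get("bit_rate", 0))}
    for stream in probe.get("streams", []):
        if stream.get("codec_type") == "video":
            summary["width"] = stream["width"]
            summary["height"] = stream["height"]
            # r_frame_rate is a fraction string such as "30000/1001".
            summary["fps"] = float(Fraction(stream["r_frame_rate"]))
        elif stream.get("codec_type") == "audio":
            summary["audio_codec"] = stream.get("codec_name")
    return summary

def probe_video(path):
    """Run ffprobe (assumed to be on PATH) and summarize its output."""
    result = subprocess.run(ffprobe_command(path), capture_output=True,
                            text=True, check=True)
    return summarize_probe(json.loads(result.stdout))

# Hand-written sample in the shape ffprobe emits; the values are made up.
sample = json.loads("""{"streams": [
  {"codec_type": "video", "width": 1920, "height": 1080, "r_frame_rate": "30000/1001"},
  {"codec_type": "audio", "codec_name": "aac"}],
  "format": {"bit_rate": "5000000"}}""")
sample_summary = summarize_probe(sample)
```

Keeping the parsing in `summarize_probe` separate from the subprocess call makes it easy to try the logic on saved ffprobe output without having the binary installed.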

https://gist.github.com/572046a2ea22f7b7fde808c9f01b27cb

https://gist.github.com/abbda0c77dfe217109fd7f0d611929b4

https://gist.github.com/e4afeb14469491c9148bc349c701675f

https://gist.github.com/bd952b6a348c79659f4b4bbea14c91ae

A few notes on bitrate

https://gist.github.com/24b1b181512e0f3741ae2fc56cc6d188

https://gist.github.com/dfbd0a7576502edcd9b90c4eb91aafc7

Frames Extraction
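
As a hedged illustration of extracting frames as PNG with FFmpeg, the sketch below builds and runs a command that samples a fixed number of frames per second. It assumes an `ffmpeg` binary on the PATH; the function names, the `frame_%04d.png` naming pattern and the default of one frame per second are choices made here, not necessarily what the kernel uses.

```python
import subprocess
from pathlib import Path

def frame_extraction_command(video_path, out_dir, fps=1):
    """Build an ffmpeg call that writes `fps` frames per second as PNG files."""
    pattern = str(Path(out_dir) / "frame_%04d.png")
    return ["ffmpeg", "-loglevel", "error", "-i", str(video_path),
            "-vf", f"fps={fps}", pattern]

def extract_frames(video_path, out_dir, fps=1):
    """Create `out_dir` if needed, then run ffmpeg (assumed to be on PATH)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(frame_extraction_command(video_path, out_dir, fps), check=True)

command = frame_extraction_command("video.mp4", "frames", fps=2)
```

Sampling a few frames per second instead of every frame keeps disk usage manageable, which matters with a training set of this size.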

https://gist.github.com/e06e8ffa94a5637745c643d6bffef877

https://gist.github.com/49388416824b10714d7713ccaf6e1197

https://gist.github.com/8bc2b9483e6f5a0b30a1d07fd593a317

https://gist.github.com/e64cd4cb433c5a4dc1cb6912577fb568

https://gist.github.com/a7e28981b5ea21edc215675b78de77be

https://gist.github.com/dcc32bc68d098af30001bf8f8ebf2bb7

https://gist.github.com/2ebe5f735d0a4025408f7f5d4741c57e

https://gist.github.com/e7df386adc08551c4913f44ab8989c6f
