Deepfake Detection of Media using Deep Neural Networks
Three different neural networks are used to detect deformities/irregularities in media based on the person's face, audio, and body language.
Face Deepfake Detection
The face deepfake model uses a Maximum Margin Object Detector (MMOD) to extract the face, followed by a temporal neural network for classification.
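Below is a minimal sketch of the face-extraction stage, assuming dlib's MMOD CNN face detector (the "mmod_human_face_detector.dat" model) and OpenCV for reading frames; the crop size and file paths are illustrative, and the downstream temporal classifier is not shown.

```python
# Sketch: extract face crops per frame with dlib's MMOD (CNN) face detector.
# Paths, crop size, and the downstream temporal model are assumptions.
import cv2
import dlib

detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")

def extract_faces(video_path, crop_size=224):
    faces = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        for det in detector(rgb, 1):              # MMOD detections expose a .rect
            r = det.rect
            crop = rgb[max(r.top(), 0):r.bottom(), max(r.left(), 0):r.right()]
            if crop.size:
                faces.append(cv2.resize(crop, (crop_size, crop_size)))
    cap.release()
    return faces  # sequence of face crops fed to the temporal classifier
```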
Voice Deepfake Detection
Input audio from the media is converted into a spectrogram using the librosa library and then fed to the model, which consists of ResNet50V2 followed by a Temporal Convolutional Network and predicts whether the given audio is a deepfake or not.
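The sketch below shows how this preprocessing could look, assuming a log-mel spectrogram from librosa and ResNet50V2 from tf.keras.applications as the feature extractor; the file name "clip.wav", the input shape, and the omitted TCN head are placeholders, not the project's actual values.

```python
# Sketch: audio -> log-mel spectrogram -> ResNet50V2 features (TCN head omitted).
# Shapes and file names are illustrative assumptions.
import numpy as np
import librosa
import tensorflow as tf

def audio_to_spectrogram(path, sr=16000, n_mels=128):
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)       # shape: (n_mels, time)

def build_feature_extractor(input_shape=(128, 128, 3)):
    # ResNet50V2 backbone without its classification head.
    return tf.keras.applications.ResNet50V2(
        include_top=False, weights="imagenet",
        input_shape=input_shape, pooling="avg",
    )

spec = audio_to_spectrogram("clip.wav")
# Tile the single-channel spectrogram to 3 channels and resize for the backbone.
img = tf.image.resize(np.repeat(spec[..., None], 3, axis=-1), (128, 128))
features = build_feature_extractor()(tf.expand_dims(img, 0))  # passed on to the TCN
```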
Body Language Deepfake Detection
Frames are extracted from the input video at a rate of 5 fps and passed to YOLOv3 to detect full-body persons. Each detected person is cropped out of the frame and resized to 300x300 pixels. These crops serve as input to the TCN model, which predicts whether the frame is a deepfake or not.
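A minimal sketch of this preprocessing is shown below, using OpenCV for the 5 fps sampling and the 300x300 cropping; `detect_persons` is a hypothetical stand-in for the YOLOv3 detector, not the project's actual interface.

```python
# Sketch: sample the video at ~5 fps, crop detected persons to 300x300 for the TCN.
# `detect_persons` is a hypothetical YOLOv3 wrapper returning (x, y, w, h) boxes.
import cv2

def sample_and_crop(video_path, detect_persons, target_fps=5, crop_size=300):
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(int(round(native_fps / target_fps)), 1)  # keep every `step`-th frame
    crops, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            for (x, y, w, h) in detect_persons(frame):   # YOLOv3 "person" boxes
                person = frame[y:y + h, x:x + w]
                if person.size:
                    crops.append(cv2.resize(person, (crop_size, crop_size)))
        idx += 1
    cap.release()
    return crops  # 300x300 person crops fed to the TCN classifier
```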