Fine-Tuned VGG-CNN-Based Deepfake Detection

GitHub: link


Who:

Kamyar Mirfakhraie - kmirfakh

Jason Pien - jpien

Joseph Ricciardulli - jriccia2

Alex Wick - awick2

Introduction:

In recent years, strikingly realistic photos of celebrities and other public figures have been surfacing on the internet. While these deepfakes can be harmless fun, they can also contribute to the spread of misinformation. Some deepfakes are unrecognizable to the untrained eye, and in an effort to combat this, the paper Deepfake Video Detection Using Convolutional Neural Network by Karandikar et al. (2020) outlines a deepfake detection model based on VGG16 with pre-trained ImageNet weights followed by a fine-tuned custom head. By detecting faces and analyzing the detected features with a deep learning approach, the proposed method can identify deepfakes with a high success rate.

Related Work:

Are you aware of any, or is there any prior work that you drew on to do your project?

The main architecture of our model relies on VGG16, a deep CNN introduced in 2014 that won the localization task and placed second in classification at that year's ImageNet challenge (ILSVRC). VGG16 originally took two to three weeks to train, with the full set of ImageNet weights coming in above 500 MB. This architecture was groundbreaking for object recognition and serves as the basis for the proposed architecture.

Initially, we planned on implementing DefakeHop++, another approach to deepfake detection; however, it is not entirely deep-learning-based.

DefakeHop++ builds upon DefakeHop, which was created in 2021 with the goal of reducing the model size and training time while keeping high performance in deepfake detection. It extracts a large number of features from three facial regions using PixelHop++ (https://arxiv.org/abs/2002.03141), a successive subspace learning (SSL) method for object classification created in 2020, refines them using feature distillation modules, and feeds the distilled features to a tree boosting system called XGBoost (https://arxiv.org/abs/1603.02754).

Additionally, DefakeHop++ draws upon and learns from previous models in deepfake detection. Most state-of-the-art deepfake detectors use deep neural networks, and as a result have millions of parameters. An example of this is the winning model of the DFDC Kaggle challenge in 2020, which has 432M parameters (https://github.com/selimsef/dfdc_deepfake_challenge). This model was improved on by Heo et al. in 2021 by concatenating it with a Vision Transformer (ViT), which added 86M parameters (https://arxiv.org/abs/2104.01353). Sun et al. proposed a new method that same year to reduce model size and improve efficiency by training a two-stream RNN to learn the temporal dynamics of facial landmark positions in videos, calibrated by neighboring frames. Because it considers only position information and not image information, the model has just 180K parameters (https://arxiv.org/abs/2104.04480). In another model, Tran et al. applied MobileNet to different facial regions and InceptionV3 to the entire face, resulting in 26M parameters (https://www.mdpi.com/2076-3417/11/16/7678).

Please read and briefly summarize (no more than one paragraph) at least one paper/article/blog relevant to your topic beyond the paper you are re-implementing/novel idea you are researching.

The paper "Unmasking DeepFakes with Simple Features" by Ricard Durall et al. investigates a method for detecting deepfakes using classical frequency-domain analysis followed by a basic classifier, as opposed to training a deep CNN. This requires significantly less labeled data than other methods. Results showed a high success rate, achieving perfect classification accuracy under certain conditions with minimal training samples. This study effectively demonstrates a scalable and efficient approach for detecting deepfakes. It gave us the idea to see how effective we could make a deep-learning model with as few parameters as possible. We believe that small yet effective methods for deepfake detection have more practical uses, such as being able to run deepfake detection on a phone.
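The core of Durall et al.'s feature extraction can be sketched in a few lines. This is our own reconstruction (not the authors' code): the 2D power spectrum of a grayscale image is reduced to a 1D profile by azimuthal averaging, and that profile would then be fed to a simple classifier.

```python
# Reconstruction sketch of frequency-domain features in the style of
# Durall et al.; all implementation details here are our assumptions.
import numpy as np

def azimuthal_average(image: np.ndarray) -> np.ndarray:
    """Radially average the log power spectrum of a 2D grayscale image."""
    f = np.fft.fftshift(np.fft.fft2(image))
    power = np.log1p(np.abs(f) ** 2)
    h, w = power.shape
    cy, cx = h // 2, w // 2
    y, x = np.indices((h, w))
    r = np.hypot(y - cy, x - cx).astype(int)  # integer radius of each pixel
    # Mean power at each radius: summed power / pixel count per radius.
    sums = np.bincount(r.ravel(), weights=power.ravel())
    counts = np.bincount(r.ravel())
    return sums / np.maximum(counts, 1)

profile = azimuthal_average(np.random.rand(64, 64))
```

The resulting 1D profile is small enough that even a logistic regression or SVM can separate real from generated images under the conditions the paper describes.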

In this section, also include URLs to any public implementations you find of the paper you’re trying to implement. Please keep this as a “living list”–if you stumble across a new implementation later down the line, add it to this list.

We have not found any public implementations of the paper.

Data:

The paper uses the CelebDF dataset. CelebDF v1 and v2 contain 408 real and 795 fake videos, and 890 real and 5,638 fake videos, respectively. The real videos come from YouTube, while the fakes are created with an improved (second-generation) DeepFake synthesis method. The average length of these videos is around 13 seconds.

The paper does not go into much detail about the preprocessing stage, merely noting that preprocessing "involves face alignment and extraction." What this may look like is using some sort of face detector to find faces in the images and then using landmarks on those faces to align them.
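Since the paper gives no details, here is a sketch of the alignment step as we imagine it: given left- and right-eye landmarks from any face detector (dlib or MTCNN are our assumptions, not the paper's), compute the rotation that brings the eye line level. The rotation matrix follows the same convention as OpenCV's getRotationMatrix2D.

```python
# Hypothetical alignment-step sketch; landmark source and conventions
# are our assumptions, as the paper does not specify them.
import numpy as np

def eye_alignment_angle(left_eye, right_eye):
    """Angle in degrees to rotate the image so the eye line is horizontal."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return np.degrees(np.arctan2(dy, dx))

def rotation_matrix(center, angle_deg):
    """2x3 affine matrix rotating by angle_deg about center (cv2 convention)."""
    a = np.radians(angle_deg)
    cos_a, sin_a = np.cos(a), np.sin(a)
    cx, cy = center
    return np.array([
        [cos_a,  sin_a, (1 - cos_a) * cx - sin_a * cy],
        [-sin_a, cos_a, sin_a * cx + (1 - cos_a) * cy],
    ])

angle = eye_alignment_angle((30, 40), (70, 40))  # eyes already level -> 0.0
```

The aligned face crop would then be resized to the network's input resolution before being fed to the model.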

Because we are re-implementing a paper, we need to test our model on a dataset that was not used in the original paper. For this we can use the FaceForensics++ dataset; we also tested our model on the combined FaceForensics++ and CelebDFv2 datasets.

Methodology:

What is the architecture of your model?

The model proposed in the paper has two parts. The first part is VGG16 with the pre-trained ImageNet weights. VGG16 is a CNN architecture with thirteen convolutional layers, five max pooling layers, and three dense layers. The second part is a custom head with batch normalization, dropout, and a two-node dense layer. The paper proposes the Adam optimizer, and binary cross-entropy for the loss.
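The two-part architecture can be sketched in Keras as follows. The batch-norm placement, dropout rate, and output activation are our assumptions; the paper specifies only batch normalization, dropout, a two-node dense output, Adam, and binary cross-entropy.

```python
# Architecture sketch, assuming a Keras implementation; head details
# beyond what the paper states are our guesses.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def build_model(weights="imagenet"):
    base = VGG16(weights=weights, include_top=False, input_shape=(224, 224, 3))
    base.trainable = False  # freeze the pre-trained convolutional base
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.BatchNormalization(),
        layers.Dropout(0.5),                    # rate is an assumption
        layers.Dense(2, activation="softmax"),  # two nodes: real vs. fake
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

In practice build_model("imagenet") downloads the pre-trained weights; passing weights=None gives the same architecture with random initialization.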

How are you training the model?

The proposed method uses transfer learning: we use the VGG16 base with pre-loaded ImageNet weights and freeze that part of the model. We then train only the weights of the custom head on the CelebDF dataset, saving time and energy.
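The point of freezing is that only the small custom head contributes trainable parameters. The check below (a sketch, assuming Keras; weights=None and a 32x32 input only to keep it lightweight, since the real model uses ImageNet weights and 224x224 inputs) makes that concrete by comparing parameter counts.

```python
# Demonstration of the frozen-base / trainable-head split; the tiny head
# here is illustrative, not the paper's exact head.
import numpy as np
import tensorflow as tf

base = tf.keras.applications.VGG16(weights=None, include_top=False,
                                   input_shape=(32, 32, 3))
base.trainable = False  # the ~14.7M base parameters will not be updated

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.build((None, 32, 32, 3))

frozen = sum(int(np.prod(w.shape)) for w in base.weights)
trainable = sum(int(np.prod(w.shape)) for w in model.trainable_weights)
```

Only a few thousand head parameters are updated per step, which is why training the head alone is so much cheaper than training VGG16 from scratch.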

If you are implementing an existing paper, detail what you think will be the hardest part about implementing the model here.

The most difficult part will probably be tuning the model's hyperparameters, as we do not have a baseline for what they should be, and because the model will most likely take hours to train, results will come in very slowly.

If you are doing something new, justify your design. Also note some backup ideas you may have to experiment with if you run into issues.

We are not doing something new; however, the paper that we originally chose is an extension of DefakeHop that examines more facial features, adds supervision to discriminant feature selection, and has only 238K parameters. This extension has yet to be fully implemented, which may prove more challenging than anticipated. In the event that we are unable to fully implement DefakeHop++, it would be perfectly feasible to fall back to implementing the original DefakeHop.

EDIT: We scrapped the DefakeHop implementation and chose to implement a CNN-based deepfake detector instead.

Metrics:

What experiments do you plan to run?

We will run our model on the CelebDF dataset and the FaceForensics++ dataset to assess its performance on each generation. We will also look into generating our own deepfake videos and supplying our own real videos for more independent data.

For most of our assignments, we have looked at the accuracy of the model. Does the notion of “accuracy” apply for your project, or is some other metric more appropriate?

We will define success by whether we can achieve a validation accuracy similar to what was detailed in the paper. Success will also include learning the limitations of VGG- and transfer-learning-based deepfake detection. Ultimately, we are trying to learn about implementing research papers and navigating new ways of implementing DL architectures.

If you are implementing an existing project, detail what the authors of that paper were hoping to find and how they quantified the results of their model.

The success metric used by the paper was accuracy on the validation set: specifically, achieving greater than 70% test accuracy. They qualified this benchmark by noting that the test set comprised 30% of the original data.
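The evaluation protocol, as we read it, amounts to a 70/30 train/test split with accuracy measured on the held-out 30%. The sketch below uses synthetic stand-in features and labels, since the real pipeline would substitute per-frame face crops and model predictions.

```python
# Evaluation-protocol sketch with synthetic data; the features, labels,
# and predictions here are placeholders, not real experimental results.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.random((100, 8))        # placeholder frame features
labels = rng.integers(0, 2, size=100)  # 0 = real, 1 = fake (synthetic)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=0, stratify=labels)

preds = rng.integers(0, 2, size=len(y_test))  # stand-in for model output
accuracy = float(np.mean(preds == y_test))
```

Stratifying the split keeps the real/fake ratio the same in both partitions, which matters because CelebDF is heavily imbalanced toward fakes.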

What are your base, target, and stretch goals?

Our base goal is to reimplement the model. Our target goal is to improve the model and achieve a higher validation accuracy. Our stretch goal is to generalize the architecture so that it can accurately detect deepfakes in everyday life. Eventually, we would like to look into ways to make our model lighter, perhaps straying away from the VGG16 base.

Ethics:

What broader societal issues are relevant to your chosen problem space?

Super realistic fake videos are increasingly appearing on social media platforms as the deepfake techniques used to create them become more powerful and readily available for use by anyone without any special editing skills. Deepfakes can impersonate anyone, including extremely powerful figures such as world leaders, government officials, corporate executives, and celebrities, convincing online users that they are making statements they never actually made or doing things they never actually did.

The spread of disinformation online for the purposes of political manipulation and propaganda is a pressing societal issue that is being exacerbated by the emergence of deepfakes. For example, in 2022, a deepfake of Ukrainian President Volodymyr Zelenskyy calling on his soldiers to lay down their weapons and surrender in the fight against Russia circulated on social media and was uploaded to a Ukrainian news site by hackers (https://www.npr.org/2022/03/16/1087062648/deepfake-video-zelenskyy-experts-war-manipulation-ukraine-russia). This is just one example of how deepfakes can be used to spread powerful disinformation about major world events, misleading the public and eroding trust in the legitimacy of real statements made by figures in the geopolitical public sphere.

Another societal issue relevant to deepfakes is an individual’s right to publicity (i.e. their name, image, and likeness). The ability of deepfakes to impersonate someone without their permission creates issues surrounding the lack of consent by the subject and the potential for defamation. An example of this is the spread of pornographic deepfakes involving Taylor Swift on social media in early 2024 (https://www.cbsnews.com/news/taylor-swift-deepfakes-online-outrage-artificial-intelligence/). This controversy created more public awareness about the questions surrounding the morality and legality of deepfake pornography and impersonation in general, which has affected everyone from celebrities to ordinary people.

Why is Deep Learning a good approach to this problem?

Deep learning will play an essential role in mitigating the consequences of deepfakes. The issues arising from deepfakes will be best addressed by a multifaceted approach, including legal frameworks, public awareness campaigns, ethical guidelines for the use of deepfake technologies, and capable deepfake detection technology, which is where deep learning comes in. Deep learning algorithms are great at analyzing visual content and can be trained to detect discrepancies that distinguish deepfakes from authentic videos. These models, like the CNN model we’re recreating, can pick up on subtle cues in images and videos, such as unnatural blinking patterns, facial expressions, or inconsistencies in lighting and background, which might not be noticeable to the human eye. Social media platforms, news agencies, and other organizations receiving and publishing content can use these deepfake detection models to detect and label content accordingly. It could be up to the platform whether they publicly label the content as a deepfake or prevent the content from being published altogether.

Division of labor:

TBD - we’ll decide as we get started with the implementation, but it will be evenly distributed.
