Introduction
In this project, we tackle the growing challenge of distinguishing code generated by artificial intelligence (AI) systems from code written by humans. This distinction matters for software safety, academic integrity, and understanding how AI assists in programming. Given AI's rapid progress in code generation, our goal is to accurately identify the origin of a code snippet. Building on prior studies, we will investigate and, where possible, refine techniques for separating AI and human coding patterns. We frame the task as binary classification: labeling each code snippet as either AI-generated or human-written.
Related Work and Tools for Spotting AI-Created Code
Our research builds on earlier efforts to detect AI-generated text. We survey studies and articles to understand current techniques, with particular interest in tools developed to identify AI-generated code (AIGC detectors), which play a key role in upholding academic standards, preventing code plagiarism, and assessing the output quality of AI coding assistants.
- GPTZero: Originally designed to spot text from GPT-3, this tool can also be tweaked to check code by looking at textual features that might apply to code's syntax and structure.
- Sapling: A machine learning-based tool that could theoretically tell AI-made code from human-written code by examining patterns and structures.
- GPT-2 Detector: Targets content made by GPT-2 by looking at statistical features of text, a method that could also be used for code.
- DetectGPT: Specifically finds content made by GPT models, including code, by focusing on unique patterns of creation.
- GLTR (Giant Language Model Test Room): Uses statistical methods to find AI-made text, which can help in spotting code written by AI through the detection of unusual patterns.
These tools represent the forefront of distinguishing AI-created code from human-written code, and our project aims to build on this foundation.
Data
We will use two datasets:
Human-written code:
- We will use the "github-code" dataset from HuggingFace, which contains code scraped from GitHub.
- The dataset has a total size of 1TB, but we will only use a subset of it.
- We will specify the language as C.
- The dataset supports streaming API with size labels, which should make preprocessing easier.
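Taking a subset of the 1 TB stream can be done lazily with `itertools.islice`. A minimal sketch, using a stand-in generator so it is self-contained; the `load_dataset` call shown in the comment assumes the `codeparrot/github-code` dataset on Hugging Face:

```python
from itertools import islice

# Real usage would stream from Hugging Face (assumes the
# "codeparrot/github-code" dataset; not executed here):
#   from datasets import load_dataset
#   stream = load_dataset("codeparrot/github-code", streaming=True,
#                         split="train", languages=["C"])
# A stand-in generator keeps this sketch self-contained.
def mock_stream():
    for i in range(1000):
        yield {"code": f"int f{i}(void) {{ return {i}; }}", "language": "C"}

# islice takes only the subset we need without materializing the full stream.
subset = [ex["code"] for ex in islice(mock_stream(), 100)]
print(len(subset))  # 100
```

Because streaming never downloads the full dataset, the subset size can be tuned freely during preprocessing.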
AI-generated code:
- We will use the "ForumAI" dataset, which contains over 109,757 C files generated using GPT-3.5-Turbo.
- We will label this dataset as AI-generated.
- We will use a subset of the dataset, ensuring that the AI-generated code samples are in C to match the human-written code.
- The combined dataset will be preprocessed by randomly sampling an equal number of code snippets from both datasets to create a balanced dataset.
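The balanced-sampling step above can be sketched as follows; the snippet pools and the label convention (0 = human, 1 = AI) are our own illustration:

```python
import random

random.seed(0)  # reproducible sampling

# Hypothetical snippet pools standing in for the two corpora.
human = [f"human_snippet_{i}" for i in range(500)]
ai = [f"ai_snippet_{i}" for i in range(800)]

# Balance the classes: draw an equal number from each pool, then label
# (0 = human-written, 1 = AI-generated) and shuffle.
n = min(len(human), len(ai))
dataset = ([(s, 0) for s in random.sample(human, n)]
           + [(s, 1) for s in random.sample(ai, n)])
random.shuffle(dataset)
print(len(dataset))  # 1000
```

Sampling down to the smaller class keeps the label distribution exactly 50/50, so chance accuracy is a clean 50% baseline.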
Methodology
- Architecture: We will build a transformer-based model from scratch, specifically designed for understanding C code.
- Training: We will train the model on a dataset of human-written and AI-generated C code snippets sourced from Hugging Face. The model will learn to classify code as either human-written or AI-generated.
- Hardest part: Ensuring the dataset is balanced and representative of various C coding styles, libraries, and domains. We may need to preprocess the code snippets to normalize formatting and remove any identifying information.
- Backup ideas: If the transformer model struggles to achieve high accuracy, we can experiment with other architectures like RNNs or CNNs, or incorporate additional features such as code complexity metrics or abstract syntax tree (AST) representations.
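If the transformer underperforms, the code-complexity-metric fallback could start from a few cheap features. A rough sketch; the specific features are illustrative choices of our own, not a fixed design:

```python
def complexity_features(code: str) -> dict:
    """Cheap, illustrative features; a real pipeline would tokenize or
    parse to an AST rather than count raw substrings."""
    lines = [l for l in code.splitlines() if l.strip()]
    n = max(len(lines), 1)
    return {
        "n_lines": len(lines),
        "avg_line_len": sum(map(len, lines)) / n,
        # naive: also matches "if" inside identifiers such as "endif"
        "branch_density": sum(code.count(k) for k in ("if", "for", "while")) / n,
    }

sample = "int main(void) {\n    if (x > 0) {\n        return 1;\n    }\n    return 0;\n}"
print(complexity_features(sample)["n_lines"])  # 6
```

Features like these could be fed to a simple classifier on their own, or concatenated with learned embeddings.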
Metrics
- Success: High accuracy in classifying C code snippets as human-written or AI-generated.
- Experiments:
- Evaluate the model's performance on a held-out test set.
- Analyze the model's performance across different C libraries and coding styles.
- Investigate the model's ability to generalize to unseen AI-generated C code.
- Because the combined dataset is balanced, accuracy is an appropriate metric for this binary classification task.
- Base goal: 75% accuracy; Target goal: 85% accuracy; Stretch goal: 90% accuracy
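Accuracy on the held-out test set reduces to a simple count of matching labels. A minimal sketch with made-up predictions:

```python
def accuracy(y_true, y_pred):
    """Fraction of snippets whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels for illustration (0 = human-written, 1 = AI-generated).
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]
print(accuracy(y_true, y_pred))  # 0.75
```

For the per-library and per-style analyses, the same function can be applied to each subgroup's labels separately.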
Ethics
Dataset concerns:
- The Hugging Face dataset may have biases towards certain coding styles or libraries, which could affect the model's generalizability.
- We need to ensure that the AI-generated code snippets in the dataset are created using diverse methods and not just from a single source, to avoid biasing the model.
Consequences of mistakes:
- False positives (human-written code classified as AI-generated) could lead to unfair scrutiny or mistrust of developers' work.
- False negatives (AI-generated code classified as human-written) could allow AI-generated code to pass undetected, potentially introducing vulnerabilities or biases into software systems.
- It's important to communicate the limitations and potential risks of the tool to users and emphasize that it should be used as a supportive tool rather than a sole determinant of code authenticity.
Division of Labor
Teammate 1: Data Collection and Preprocessing
- Collect human-written and AI-generated C code snippets from Hugging Face
- Preprocess the code snippets (normalize formatting, remove identifying information)
- Split the dataset into train, validation, and test sets
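The normalization step could be approximated with regular expressions; a minimal sketch (note these naive patterns do not handle comment markers inside string literals):

```python
import re

def normalize_c(code: str) -> str:
    """Strip comments and collapse whitespace so the classifier cannot
    key on formatting quirks or identifying header comments."""
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", " ", code)                    # line comments
    code = re.sub(r"\s+", " ", code)                         # whitespace runs
    return code.strip()

print(normalize_c("/* by J. Doe */\nint main(void) {\n  return 0; // ok\n}"))
# int main(void) { return 0; }
```

A stricter pipeline might instead run the code through a formatter or a lexer, which avoids regex edge cases entirely.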
Teammate 2: Model Development
- Design and implement the transformer-based model architecture
- Work with Teammate 1 to iterate on the model based on insights from the data
Teammate 3: Model Development
- Train the model on the preprocessed dataset
- Experiment with different hyperparameters and optimization techniques
- Collaborate with Teammate 2 on model design and implementation
Teammate 4: Evaluation and Analysis
- Analyze model performance on the test set
- Test the model's ability to generalize to unseen AI-generated C code
- Document the evaluation results and prepare visualizations for the final report
All teammates will contribute to:
- Regular meetings to discuss progress, challenges, and next steps
- Documenting the project, including the methodology, experiments, and results
- Preparing the final presentation and report