Fake Essay Detection

Preprocessing

Inspiration

This project contains a machine learning model designed to detect whether an essay was written by a student or by a large language model (LLM). I was inspired to build a supportive tool for the generative AI era.

What it does

This project aims to provide a reliable tool that helps educators and content creators identify potentially machine-generated essays.

How I built it

Methodology

1. Preprocessing:
Text Cleaning: Removal of noise, normalization (e.g., lowercasing).
Tokenization: Using Byte-Pair Encoding (BPE) for efficient handling of out-of-vocabulary words.

2. Feature Engineering:
TF-IDF Vectorization: Converting text into numerical representations, emphasizing the importance of words within a document.

3. Modeling:
Ensemble Classifier: Combining multiple models (Multinomial Naive Bayes, SGDClassifier, LightGBM, CatBoost) with weighted voting for robust predictions.

4. Evaluation:
Metric: ROC-AUC score to assess performance on imbalanced datasets.
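The feature engineering and modeling steps above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code. The toy essays, the ensemble weights, and the hyperparameters are all assumptions made up for the example. To keep it dependency-light, it uses only the two scikit-learn members of the ensemble (LightGBM and CatBoost classifiers could be added as further estimators), and it relies on TfidfVectorizer's default word tokenizer instead of BPE.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data invented for illustration only (0 = student, 1 = LLM).
texts = [
    "i think school lunch shud be longer because we never get time to eat",
    "my summer was fun, we went to the lake and my brother fell in lol",
    "honestly homework is ok but too much of it makes me tired",
    "In conclusion, it is evident that education plays a pivotal role in society.",
    "Furthermore, it is important to note that technology offers numerous benefits.",
    "Overall, this essay has explored the multifaceted nature of the topic.",
]
labels = [0, 0, 0, 1, 1, 1]


def build_detector():
    """Build a TF-IDF + weighted soft-voting pipeline (hypothetical settings)."""
    # lowercase=True covers the normalization step; word 1-2 grams stand in for BPE tokens.
    vectorizer = TfidfVectorizer(lowercase=True, ngram_range=(1, 2), sublinear_tf=True)
    ensemble = VotingClassifier(
        estimators=[
            ("nb", MultinomialNB(alpha=0.1)),
            # modified_huber is one of the SGD losses that supports predict_proba,
            # which soft voting requires.
            ("sgd", SGDClassifier(loss="modified_huber", max_iter=1000, random_state=0)),
        ],
        voting="soft",
        weights=[0.4, 0.6],  # assumed weights, not the project's values
    )
    return make_pipeline(vectorizer, ensemble)


detector = build_detector()
detector.fit(texts, labels)

# Probability that a new essay is LLM-written (column 1 = class 1).
probs = detector.predict_proba(["It is important to note that essays are evident."])[:, 1]
print(probs)
```

Soft voting averages the per-model class probabilities using the given weights, so every estimator in the ensemble must expose predict_proba.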

Accomplishments that I'm proud of

The model achieved an accuracy of 0.95 (95%) and a ROC-AUC score of 0.986 on the test set.
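For readers unfamiliar with the two metrics, here is a small, dependency-free sketch of how each can be computed. The labels and scores are made-up toy values, not the project's test data, and the 0.5 decision threshold is an assumption. Accuracy depends on the chosen threshold, while ROC-AUC only measures how well the scores rank LLM essays above student essays, which is why it tends to be more informative on imbalanced data.

```python
def accuracy(y_true, y_score, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in y_score]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)


def roc_auc(y_true, y_score):
    """ROC-AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win (equivalent to the Mann-Whitney U statistic).
    """
    pos = [s for s, t in zip(y_score, y_true) if t == 1]
    neg = [s for s, t in zip(y_score, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, imbalanced example: 4 student essays (0) and 2 LLM essays (1).
y_true = [0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]

print(round(accuracy(y_true, y_score), 3))  # 0.667
print(roc_auc(y_true, y_score))             # 0.875
```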

What's next for Fake Essay Detection

Contributions to improve this project are welcome. Please follow this guideline:

Open an issue: Discuss new features or potential improvements. Let's build something responsible in this era of rapid technological change.

Built With

catboost
datasets
kaggle
lightgbm
numpy
pandas
python 3.x
scikit-learn
transformers

Updates

Tafar Mab started this project — Apr 21, 2024 10:40 AM EDT
