Inspiration
This project contains a machine learning model designed to detect whether an essay was written by a student or by a large language model (LLM). I was inspired to build a supportive tool for the generative AI era.
What it does
This project aims to provide a reliable tool that helps educators and content creators identify potentially machine-generated essays.
How I built it
Methodology
1. Preprocessing:
- Text Cleaning: Removal of noise, normalization (e.g., lowercasing).
- Tokenization: Using Byte-Pair Encoding (BPE) for efficient handling of out-of-vocabulary words.
2. Feature Engineering:
- TF-IDF Vectorization: Converting text into numerical representations, emphasizing the importance of words within a document.
3. Modeling:
- Ensemble Classifier: Combining multiple models (Multinomial Naive Bayes, SGDClassifier, LightGBM, CatBoost) with weighted voting for robust predictions.
4. Evaluation:
- Metric: ROC-AUC score to assess performance on imbalanced datasets.
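The feature and voting steps above can be illustrated with a minimal, dependency-free sketch. This is not the project's actual code: the toy essays, the whitespace tokenizer (standing in for BPE), the per-model probabilities, and the weights are all made-up assumptions, and the helper names `tfidf` and `weighted_soft_vote` are hypothetical. The IDF formula is the smoothed variant that scikit-learn documents as its default, without the L2 normalization step.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Weight = raw term count * smoothed IDF, where
    IDF = ln((1 + n_docs) / (1 + doc_freq)) + 1.
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def weighted_soft_vote(probas, weights):
    """Weighted average of per-model P(LLM-written) for each essay."""
    total = sum(weights)
    n_essays = len(probas[0])
    return [
        sum(w * p[i] for w, p in zip(weights, probas)) / total
        for i in range(n_essays)
    ]


# Toy essays, already lowercased; whitespace split stands in for BPE (assumption).
docs = [
    "the essay was written by a student".split(),
    "the essay was generated by a model".split(),
]
vectors = tfidf(docs)

# Hypothetical P(LLM-written) from four models for two essays.
probas = [
    [0.2, 0.9],  # e.g. Multinomial Naive Bayes
    [0.4, 0.7],  # e.g. SGDClassifier
    [0.1, 0.8],  # e.g. LightGBM
    [0.3, 0.6],  # e.g. CatBoost
]
weights = [1, 1, 2, 2]  # made-up weights, not the project's values
ensemble = weighted_soft_vote(probas, weights)
```

Terms shared by every essay (such as "the") receive the minimum IDF of 1, while terms unique to one essay are weighted higher. In scikit-learn, `VotingClassifier` with `voting="soft"` and a `weights` list performs the same kind of weighted probability averaging; note that `SGDClassifier` only exposes `predict_proba` for certain losses such as `modified_huber`.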
Accomplishments that I'm proud of
The model achieved an accuracy of 95% and a ROC-AUC score of 0.986 on the test set.
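For readers unfamiliar with the two metrics, here is a small sketch of how they differ. The labels and scores are invented for illustration and are not the project's test data; `roc_auc` and `accuracy` are hypothetical helper names. ROC-AUC is computed here as the fraction of (positive, negative) pairs in which the positive essay gets the higher score, with ties counted as half.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(y_true, scores, threshold=0.5):
    """Share of essays classified correctly at a fixed threshold."""
    preds = [int(s >= threshold) for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


# Made-up imbalanced sample: 8 student essays (0) and 2 LLM essays (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.15, 0.3, 0.05, 0.25, 0.2, 0.45, 0.4, 0.35]

acc = accuracy(y_true, scores)
auc = roc_auc(y_true, scores)
```

In this toy sample no score reaches the 0.5 threshold, so accuracy only reflects the majority class, while ROC-AUC still measures how well the scores rank LLM essays above student essays. That is why a threshold-free metric is useful on imbalanced data.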
What's next for Fake Essay Detection
Contributions to improve this project are welcome. Please follow this guideline:
Open an issue: Discuss new features or potential improvements. Let's build something responsible in this era of rapid technological change.
Built With
- 3.x
- catboost
- datasets
- kaggle
- lightgbm
- numpy
- pandas
- python
- scikit-learn
- transformers