The universities providing the datasets inspired us to train ML models that automatically rate the quality of unseen emails. The rating used in the data consists of 5 classes (steps). An email with a rating of 1 is poorly written and does not meet any of the quality criteria. An email with a rating of 5 must meet all criteria required for step 5 (formally written, appropriate subject, etc.). Steps 2, 3 and 4 each require a subset of the criteria of step 5. We targeted a prediction accuracy above 70% for both steps and criteria, because a model with more than 70% prediction accuracy is of practical use in educational studies.
What it does
We tried several approaches:
- Build a GPT-3 prompt to determine the score and the met and unmet criteria of unseen emails
- Use SageMaker Autopilot to find well-performing models
- "Manually" train a model on TF-IDF text features (model: GradientBoosting)
- Fine-tune a pretrained transformer to predict scores
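The "manual" approach in the list above can be sketched roughly as follows. This is a minimal illustration, assuming scikit-learn was the library used (the write-up does not say so); the toy emails, labels, and hyperparameters are invented for the example and are not the project's data or settings.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy stand-in data, invented for illustration (not from the project's datasets).
emails = [
    "Dear Professor Smith, I would like to request an appointment. Kind regards, Anna",
    "Dear Ms. Jones, thank you for your feedback on my thesis. Best regards, Tom",
    "Dear Dr. Lee, please find my assignment attached. Sincerely, Maria",
    "hey whats up need the slides asap",
    "yo send me the notes thx",
    "hi can u tell me the grade lol",
]
steps = [5, 5, 5, 1, 1, 1]  # quality step per email (only two of the 5 classes shown)

# TF-IDF turns each email into a sparse vector of weighted word counts;
# gradient boosting then learns to map those vectors to a step.
model = make_pipeline(
    TfidfVectorizer(lowercase=True),
    GradientBoostingClassifier(n_estimators=50, random_state=0),  # assumed settings
)
model.fit(emails, steps)

# Predict the step of an unseen email.
predicted = model.predict(
    ["Dear Professor Miller, I kindly ask for an extension. Best regards, Sam"]
)
print(predicted)
```

On real data, the pipeline would be fitted on a training split and scored on held-out emails, so that the reported accuracy reflects unseen emails.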
How we built it
- Experimented in the GPT-3 Playground to find an appropriate prompt
- Adjusted the dataset(s) and started Autopilot
- Developed the manual approach based on the best-performing Autopilot model
- Did not complete the transformer fine-tuning due to technical and permission problems with AWS
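The first step above, finding a prompt that returns a score plus the met and unmet criteria, could look something like the sketch below. The prompt wording, the `build_prompt` helper, and the example emails are all hypothetical; the only criteria named are the two mentioned in this write-up, and the prompt the team actually developed is not shown here.

```python
# Only these two criteria are named in the write-up; the full list is not given.
CRITERIA = ["formally written", "appropriate subject"]


def build_prompt(examples, new_email):
    """Build a few-shot prompt from (email, step, met_criteria) examples.

    Hypothetical helper: the prompt ends with "Score:" so that the model
    continues with the score of the new email.
    """
    lines = [
        "Rate the quality of each email on a scale from 1 (poor) to 5 (meets all criteria).",
        "Also list which criteria are met and which are unmet.",
        "Criteria: " + ", ".join(CRITERIA),
        "",
    ]
    for email, step, met in examples:
        unmet = [c for c in CRITERIA if c not in met]
        lines += [
            "Email: " + email,
            "Score: " + str(step),
            "Met: " + (", ".join(met) or "none"),
            "Unmet: " + (", ".join(unmet) or "none"),
            "",
        ]
    lines += ["Email: " + new_email, "Score:"]
    return "\n".join(lines)


# Invented example emails for illustration.
examples = [
    ("Dear Professor Smith, may I request an appointment? Kind regards, Anna", 5, CRITERIA),
    ("hey send slides", 1, []),
]
prompt = build_prompt(examples, "Hello, when is the exam?")
print(prompt)
```

The resulting string can be pasted into the Playground or sent through an API client; the model's completion then has to be parsed back into a step and a list of criteria.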
Challenges we ran into
- Loading and cleaning the data
- Learning how to use the tools (especially AWS SageMaker), as no one on the team had previous experience with AWS SageMaker
- No major problems here, as most problems (mainly dataset-related) had already been solved in the first step
- AWS IAM permission problems prevented the training from starting
Accomplishments that we're proud of
- Working together as a team
- Everyone on the team had different competencies
- Everyone was able to contribute from their own area of knowledge
- Through united efforts we were able to train a model with an accuracy of 86.42% (Autopilot: 71% accuracy). A model with this accuracy can be used by the providers of the datasets in actual studies.
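For reference, the accuracy figures can be read as the fraction of emails whose predicted step equals the labeled step. That reading is an assumption about the metric, and the labels below are made up for illustration; a minimal sketch of the check against the 70% target:

```python
TARGET_ACCURACY = 0.70  # target stated in this write-up


def accuracy(true_steps, predicted_steps):
    """Return the fraction of predictions that match the labeled step exactly."""
    if len(true_steps) != len(predicted_steps):
        raise ValueError("both lists must have the same length")
    if not true_steps:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(t == p for t, p in zip(true_steps, predicted_steps))
    return correct / len(true_steps)


# Made-up labels and predictions for ten emails.
true_steps = [1, 2, 3, 4, 5, 5, 3, 2, 1, 4]
predicted_steps = [1, 2, 3, 4, 5, 4, 3, 2, 2, 4]

score = accuracy(true_steps, predicted_steps)
meets_target = score > TARGET_ACCURACY
print(f"accuracy = {score:.2%}, meets target: {meets_target}")
```

Exact-match accuracy treats a prediction of step 4 for a step 5 email the same as a prediction of step 1; a metric that accounts for the distance between steps may be worth reporting alongside it.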
What we learned
- How to use AWS and AWS SageMaker
- That data preparation, setting up dev environments, and loading and processing data take a lot of time. These things could be done upfront, which would allow everyone to concentrate on solving the provided challenge right from the start.
What's next for Essay-Scoring
- Evaluate GPT-3 performance (with the developed prompt)
- Fine-tune the pretrained transformer and compare its performance with the "classic approach"
- Find models that are able to score essays from different tasks with an appropriate (>70%) accuracy