The universities providing the datasets inspired us to train ML models that automatically rate the quality of unseen emails. The rating used in the data consists of 5 classes (steps). An email with a rating of 1 is poorly written and does not meet any of the quality criteria. An email with a rating of 5 must meet all criteria required for step 5 (formally written, appropriate subject, etc.). Steps 2, 3, and 4 each require a subset of the step 5 criteria. We targeted a prediction accuracy above 70% for both steps and criteria, because a model above that threshold is of practical use in educational studies.

What it does

We tried several approaches:

  1. Build a GPT-3 prompt to determine the score and the met and unmet criteria of unseen emails
  2. Use SageMaker Autopilot to find well-performing models
  3. "Manually" train a model on TF-IDF text features (model: GradientBoosting)
  4. Fine-tune a pretrained transformer to predict scores
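
Approach 3 can be sketched roughly as follows. This is a minimal illustration assuming scikit-learn, not the code from the hackathon: the toy emails, their ratings, the `build_model` helper, and all hyperparameters are made up for the example.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy, invented emails and ratings (1 = poor, 5 = meets all criteria).
# These stand in for the real datasets, which are not reproduced here.
emails = [
    "hey send me the file",
    "yo need notes asap",
    "Dear Professor Smith, could you please share the lecture slides? Kind regards, Anna",
    "Dear Ms. Jones, thank you for your help with the assignment. Best regards, Tom",
]
ratings = [1, 1, 5, 5]


def build_model(random_state=0):
    """Hypothetical helper: TF-IDF features fed into a gradient boosting classifier."""
    return make_pipeline(
        # Word unigrams and bigrams; an assumed setting, not a tuned one.
        TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
        GradientBoostingClassifier(n_estimators=50, random_state=random_state),
    )


model = build_model()
model.fit(emails, ratings)

# Predict the step of an unseen email.
predicted = model.predict(
    ["Dear Dr. Lee, please find my report attached. Kind regards, Sam"]
)
print(predicted)
```

On real data, the model would be evaluated on a held-out split rather than on the training emails.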

How we built it

  1. Experimented in the GPT-3 Playground to find an appropriate prompt
  2. Adjusted the dataset(s) and started Autopilot
  3. Developed the manual approach based on the best-performing Autopilot model
  4. Did not complete this part because of technical and permission problems with AWS
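
For step 1, a prompt of this kind might be assembled along the following lines. This is only a sketch: the wording, the `build_prompt` helper, and the few-shot layout are hypothetical and are not the prompt we developed. The two criteria listed are the ones named above; the full criteria list is not reproduced here.

```python
# Criteria named in the writeup; the complete list is an open assumption.
CRITERIA = ["formally written", "appropriate subject"]


def build_prompt(email_text, examples=()):
    """Hypothetical helper: build a few-shot scoring prompt for a text-completion model.

    `examples` is an iterable of (email_text, score) pairs used as demonstrations.
    """
    lines = [
        "Rate the quality of the email on a scale from 1 (poor) to 5 (meets all criteria).",
        "Also list which criteria are met and which are unmet.",
        "Criteria: " + ", ".join(CRITERIA),
        "",
    ]
    # Add each scored example email as a demonstration.
    for example_text, example_score in examples:
        lines += ["Email:", example_text, f"Score: {example_score}", ""]
    # End with the unseen email and an open "Score:" slot for the model to fill in.
    lines += ["Email:", email_text, "Score:"]
    return "\n".join(lines)


prompt = build_prompt(
    "Dear Dr. Lee, please find my report attached. Kind regards, Sam",
    examples=[("hey send me the file", 1)],
)
print(prompt)
```

The resulting string can be pasted into the Playground, and the examples varied to see how they affect the predicted score.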

Challenges we ran into

  1. Loading and cleaning the data
  2. Learning how to use the tools (especially AWS SageMaker), as no one on the team had previous experience with AWS SageMaker
  3. No major problems here, as most problems (mainly dataset-related) had been solved in 1.
  4. AWS IAM permission problems prevented us from starting the training

Accomplishments that we're proud of

  1. Working together as a team
    • Everyone on the team had different competencies
    • Everyone was able to contribute from their area of knowledge
  2. Through our united efforts we were able to train a model with an accuracy of 86.42% (Autopilot: 71% accuracy). A model with this accuracy can be used by the providers of the datasets in actual studies.

What we learned

  1. How to use AWS and AWS SageMaker
  2. That data preparation, setup of dev environments, and data loading and processing take a lot of time. These things could be done upfront, which would allow everyone to concentrate on solving the provided challenge right from the start.

What's next for Essay-Scoring

  • Evaluate GPT-3 performance (with the developed prompt)
  • Fine-tune a pretrained transformer and compare its performance with the "classic approach"
  • Find models that are able to score essays from different tasks with an appropriate (>70%) accuracy