The COVID-19 pandemic made us realize that seamless virtual learning is as important as face-to-face teaching. Transitioning from traditional means of education, pen and paper, to entirely online teaching in such a short span was challenging. As a group of four students, we experienced that turbulence firsthand. With the lack of personal interaction, it is harder for teachers to monitor students' performance; it is taxing, if not infeasible, to check progress, provide feedback, and resolve every doubt for every student. Teachers are also not always well trained or well equipped to use online tools. For students, there is the absence of meeting friends, networking with professionals and like-minded people, and the monotony of learning in front of a screen without eye contact.

Inspired by our circumstances, we asked: how can we make students' lives easier? How can we use machine learning to predict what is stopping students from completing their courses virtually? The dataset we are working with is from pre-COVID times, but the data still reveals a pattern in why students struggle while learning virtually; COVID simply made the issue personal for everyone.

Predicting a student's grade at an early stage helps a mentor monitor learning progress. Mentors can give special attention to students forecast to fail the course and, since prevention is better than cure, consistently encourage and alert them to keep them progressing. The goal is not just a higher score, but developing the student's flair: marks testify to one's understanding of a course, but they are not the whole of skill development. A deeper understanding and practical application of the course material is an essential tool for solving real-world problems.
The mentor can customize their teaching, focus, and attention on particular areas to aid students. The prediction also helps mentors redevelop and incorporate new teaching standards by taking feedback from students; at the end of the day, mentors share in the success or failure of their students. Without all of the non-verbal cues a teacher receives in a classroom, it is important for teachers to use every tool available to provide early intervention for struggling students, and those who struggle early might well end up top performers in the long run.

What it does

We have built a model that predicts students' grades, giving teachers, who lack the non-verbal cues of a classroom, a tool for early intervention with struggling students. This model can help change students' lives for the better. It also supports tracking a student's progress at regular intervals to ensure positive results and recommending courses to students based on their academic history.

How we built it

Step 1: Collect the data

Solving this problem required finding an inclusive dataset that covered students' experience of virtual learning. We used the Open University Learning Analytics Dataset, which consists of seven tables containing data on student behaviors, demographics, and the modules in which students are enrolled. It covers 32,593 Open University (OU) students enrolled across seven modules. This information is used to identify the characteristics that influence students' success.
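A minimal loading sketch with pandas, assuming the seven CSV file names as they appear in the OULAD download (the `data_dir` path is hypothetical):

```python
import os
import pandas as pd

# The seven OULAD tables (file names as in the dataset download).
OULAD_TABLES = [
    "courses", "assessments", "vle",
    "studentInfo", "studentRegistration",
    "studentAssessment", "studentVle",
]

def load_oulad(data_dir):
    """Load each OULAD CSV into a DataFrame keyed by table name."""
    return {name: pd.read_csv(os.path.join(data_dir, f"{name}.csv"))
            for name in OULAD_TABLES}
```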

Step 2: Data Preparation

Since the data is scattered over several tables, it is essential to combine the datasets and pre-process them. Null/NaN values are dropped and remaining missing values are imputed. After processing, all of the tables are merged into a primary dataset, on which we built our data model. Categorical attributes such as 'Higher education' are converted to numeric form using a label encoder; for instance, the 'grades' column is replaced with numbers from 0 to n-1.
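The merging and encoding steps can be sketched as follows. The column names follow the OULAD schema, but the tiny tables and values here are made up purely for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for two OULAD tables (values are illustrative only).
student_info = pd.DataFrame({
    "id_student": [101, 102, 103],
    "highest_education": ["HE Qualification", "A Level or Equivalent", None],
    "final_result": ["Pass", "Fail", "Distinction"],
})
student_assessment = pd.DataFrame({
    "id_student": [101, 102, 103],
    "score": [78.0, None, 91.0],
})

# Combine the scattered tables into one primary dataset.
data = student_info.merge(student_assessment, on="id_student", how="left")

# Handle missing values: fill categorical gaps, impute numeric ones.
data["highest_education"] = data["highest_education"].fillna("Unknown")
data["score"] = data["score"].fillna(data["score"].mean())

# Label-encode categorical columns (each class maps to 0..n-1).
encoders = {}
for col in ["highest_education", "final_result"]:
    encoders[col] = LabelEncoder()
    data[col] = encoders[col].fit_transform(data[col])
```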

Step 3: Choose the training model

In this step, we decided to use tree-based models because they can capture complex non-linear relationships. The models used are Decision Tree, Gradient Boosting, XGBoost, Random Forest, and CatBoost. We used them to predict students' scores from their learning behavior during online classes, and from the predicted score, whether they pass or fail the course.
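A sketch of the candidate set: three of the tree-based models ship with scikit-learn, while XGBoost and CatBoost come from separate packages (shown as comments here since they may not be installed):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Candidate tree-based classifiers, keyed by name for easy comparison.
models = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
    # "xgboost": xgboost.XGBClassifier(...),     # requires the xgboost package
    # "catboost": catboost.CatBoostClassifier(...),  # requires catboost
}
```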

Step 4: Train the model

Step 5: Evaluate the accuracy/precision of the already trained model

The CatBoost classifier performed best. Its mean cross-validation accuracy (0.795, SD = 0.005) is similar to that of the Gradient Boosting model (0.790, SD = 0.004), though the GB model shows slightly less variance between folds, as indicated by its lower standard deviation.
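The mean/SD comparison above comes from cross-validation; a minimal sketch of that evaluation, using a synthetic stand-in for the prepared OULAD features rather than the real data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared features/labels (illustrative only).
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)

# 5-fold cross-validation: report mean accuracy and its spread across folds.
scores = cross_val_score(GradientBoostingClassifier(random_state=42),
                         X, y, cv=5, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} (SD = {scores.std():.3f})")
```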

Step 6: Prediction/Inference
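At inference time, the trained model's integer predictions are decoded back to grade labels with the fitted encoder. A self-contained sketch on synthetic data (the feature values and the two-class Fail/Pass setup are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in; real features would come from the merged OULAD tables.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
grade_encoder = LabelEncoder().fit(["Fail", "Pass"])  # 0 -> Fail, 1 -> Pass

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Predict for unseen students and decode back to human-readable grades.
pred = grade_encoder.inverse_transform(model.predict(X_test))
```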

Challenges we ran into

Choosing the problem was demanding, given the requirement to keep the topic related to university life. Finding a dataset to work with took much of our initial time: we wanted a large, inclusive dataset so the model could be trained without bias, and open data that suited our requirements and problem was hard to come by. Understanding the dataset, familiarizing ourselves with the data, and preprocessing it before training was challenging and required in-depth knowledge of the schema and structure of the data. After cleaning the data, finding the correlated attributes and a suitable training model took significant time, and evaluating accuracy and tuning hyperparameters after training was laborious.

Accomplishments that we're proud of

  1. Dealing with the essential issue of virtual learning for students within the given time frame of three days, and trying to create a means to solve it, required deep analysis, effort, and a will to create a positive impact in society. We can't solve every problem students face during virtual learning by building this model, but setting our foot in the right direction and working towards it is what we are proud of.
  2. Our model gives teachers, who lack the non-verbal cues of a classroom, a tool for early intervention with struggling students; those who struggle early might well end up top performers in the long run, and this model can help change their lives for the better.
  3. Our model predicts students' grades with an accuracy of 79.5%.
  4. Finding an inclusive dataset that comprised factors such as age, disability, gender, region, and highest education achieved helped reduce bias while analyzing students' grades across all spheres. We are proud that we considered all these factors while choosing the dataset, because anyone can be a student. Everyone has the right to learn and to be educated, be they young or old, abled or differently-abled, male or female, beginner or advanced. Training the model as inclusively as possible is what we aimed for and accomplished.

What we learned

We analyzed the distribution of the quasi-identifying attributes to understand the final result. The six main attributes identified are gender, disability, age, highest education, region, and IMD band. Based on our visualizations, we reached the following conclusions about these attributes.

  1. Gender does not play a significant role in the outcome: its distribution is almost identical across all four outcomes of Pass, Fail, Withdrawn, and Distinction.
  2. Students with disabilities have a higher withdrawal rate and a lower pass rate.
  3. Students under the age of 35 have a higher risk of withdrawal. Pass and distinction rates rise with age.
  4. As the IMD band rises, the pass rate rises while the fail rate falls.
  5. If students have no previous formal education, the withdrawal risk is exceptionally high. For students with a post-graduate background, the distinction rate is slightly higher.
  6. There is no discernible difference between regions, so region is not a defining factor.
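Distributions like those above can be computed with a normalized crosstab of each attribute against the final result. A sketch on toy rows mimicking the OULAD studentInfo table (the real analysis uses the full table):

```python
import pandas as pd

# Toy rows with OULAD-style columns; values are illustrative only.
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M"],
    "final_result": ["Pass", "Pass", "Fail", "Withdrawn",
                     "Distinction", "Fail"],
})

# Share of each outcome within each gender (each row sums to 1).
outcome_by_gender = pd.crosstab(df["gender"], df["final_result"],
                                normalize="index")
```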

The best model for the classification task was the CatBoost classifier, with a 0.795 accuracy score on cross-validation.

What's next for Predict Student Grade in Online Class for Early Intervention

  1. We will ensure that the prediction stays between the teacher, the student, and the university. It can be a bitter experience if other students in the class are aware of the predictions, which might lead to insecurities, depression, or anxiety. The prediction model is for aiding struggling students to pass the course.
  2. In the future, we will try to create open data with more inclusive options for gender, extending the choices to include transgender, not sure, non-binary/non-conforming, agender, etc.
  3. After our prediction model is successfully integrated, we will make sure to take feedback from teachers as well as students. We plan to put together the feedback to model the significance of different variables on student performance from different perspectives.