Inspiration

This project was inspired by the Kaggle competition "Linking Writing Processes to Writing Quality". The goal was to predict the quality score of essays based on the logs of user inputs like keystrokes and mouse clicks during the writing process. We were motivated to explore the relationship between writing behavior and the resulting quality of the written piece. Understanding this link could provide valuable insights for improving writing skills and educational strategies.

What it does

Our project takes logs of user input events recorded during essay writing, then processes and analyzes them to extract features that capture different aspects of the writing process. Using these engineered features, we train machine learning models to predict the quality score assigned to each essay. The models learn patterns and correlations between writing behavior and essay scores, allowing them to make predictions on new, unseen essay logs.

How we built it

We approached the project in several stages:

  1. Data Exploration and Feature Engineering: We began by thoroughly exploring the provided dataset. We generated 27 new feature columns to capture relevant information such as word count, writing time, events per minute, text change count, and most frequent activity. This step allowed us to gain insights into the data and create meaningful representations for the models.

  2. Data Aggregation and Reduction: To streamline the data and make it more manageable, we reduced the dataset from over 8 million rows to 2,470 rows, where each row corresponded to an individual essay. This aggregation step preserved the essential information while reducing computational complexity.

  3. Correlation Analysis: We performed a correlation analysis to identify highly correlated pairs of features (r > 0.9). To avoid multicollinearity, we randomly dropped one feature from each strongly correlated pair, resulting in the omission of 8 out of the 27 features.

  4. Data Preprocessing: We applied several key preprocessing steps to prepare the data for model training. This included handling missing values through imputation, encoding categorical features using one-hot encoding, normalizing the features using MinMaxScaler, and addressing class imbalance using the SMOTETomek technique.

  5. Model Training and Evaluation: We experimented with machine learning models for both classification and regression framings of the task. For classification, we tried Random Forest, LGBM, and XGBoost, but accuracy was low, so we shifted our focus to regression models. A Linear Regression model with default parameters achieved an RMSE of 0.992 on the Kaggle test set, while LightGBM models with and without adjusted parameters achieved RMSEs of 0.785 and 0.793, respectively.
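Steps 1 and 2 above can be sketched with pandas named aggregation. The column names below (`down_time`, `activity`, `word_count`) are illustrative assumptions, not the exact schema of the competition logs:

```python
import pandas as pd

# Toy stand-in for the raw keystroke log: one row per input event.
log = pd.DataFrame({
    "id":         ["e1", "e1", "e1", "e2", "e2"],
    "down_time":  [0, 500, 1200, 0, 900],  # ms since session start
    "activity":   ["Input", "Input", "Remove/Cut", "Input", "Input"],
    "word_count": [1, 2, 2, 1, 2],
})

# Collapse event rows into one row per essay, as in step 2.
per_essay = log.groupby("id").agg(
    final_word_count=("word_count", "last"),
    writing_time_ms=("down_time", lambda s: s.max() - s.min()),
    event_count=("down_time", "size"),
    most_frequent_activity=("activity", lambda s: s.mode().iloc[0]),
).reset_index()

# Derived rate feature: events per minute of writing time.
per_essay["events_per_minute"] = (
    per_essay["event_count"]
    / (per_essay["writing_time_ms"] / 60_000).clip(lower=1e-9)
)
print(per_essay)
```

The same groupby pattern scales to the full 8-million-row log, producing one feature row per essay.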
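The correlation-based pruning in step 3 can be sketched as follows; this minimal example uses toy features and, for reproducibility, drops the later column of each pair above the threshold rather than a randomly chosen one:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)
# Toy feature matrix: f2 is a near-duplicate of f1, f3 is independent.
X = pd.DataFrame({
    "f1": base,
    "f2": base + rng.normal(scale=0.01, size=n),
    "f3": rng.normal(size=n),
})

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from every pair whose correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

X_reduced = drop_correlated(X, threshold=0.9)
print(list(X_reduced.columns))
```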
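A minimal sketch of the preprocessing and regression pipeline from steps 4 and 5, using scikit-learn on synthetic data. For simplicity it swaps LightGBM for sklearn's `HistGradientBoostingRegressor` (a close stand-in) and omits the `SMOTETomek` resampling step, which lives in the separate `imbalanced-learn` package:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic per-essay features standing in for the engineered dataset.
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    "word_count": rng.integers(100, 600, size=n).astype(float),
    "writing_time": rng.uniform(5, 30, size=n),
    "most_frequent_activity": rng.choice(["Input", "Remove/Cut"], size=n),
})
X.loc[::25, "writing_time"] = np.nan          # inject missing values
y = 0.005 * X["word_count"] + rng.normal(scale=0.3, size=n)  # toy score

# Imputation + scaling for numeric columns, one-hot for categoricals (step 4).
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", MinMaxScaler())]),
     ["word_count", "writing_time"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     ["most_frequent_activity"]),
])

model = Pipeline([
    ("prep", preprocess),
    ("reg", HistGradientBoostingRegressor(max_iter=100, random_state=0)),
])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model.fit(X_tr, y_tr)
rmse = float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))
print(f"RMSE: {rmse:.3f}")
```

Evaluating with RMSE mirrors the Kaggle leaderboard metric, though the numbers here come from toy data and are not comparable to the competition scores.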

Challenges we ran into

During the project, we encountered a few challenges and bottlenecks:

  1. Model Selection: Initially, we were unsure about the appropriate models to use for this task. We experimented with both classification and regression approaches before realizing that regression models yielded better results.

  2. Input Feature Completeness: We recognized that our engineered features might not capture all relevant aspects of the writing process. For example, pause-related features, such as the number of pauses longer than a given duration or the average pause duration, could potentially improve the model's performance.

  3. Feature Importance Analysis: We planned to conduct a thorough feature importance analysis to gain insights into the most influential factors affecting essay quality. However, due to time constraints, we couldn't delve deep into this aspect.
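The pause features mentioned in point 2 could be derived from inter-keystroke gaps, roughly as below (column names are illustrative assumptions about the log schema):

```python
import pandas as pd

# Toy event log: one row per keystroke, timestamps in milliseconds.
log = pd.DataFrame({
    "id":        ["e1"] * 5,
    "down_time": [0, 300, 2800, 3000, 8000],
})

# Gap between consecutive keystrokes within each essay.
log["pause_ms"] = log.groupby("id")["down_time"].diff()

pause_features = log.groupby("id")["pause_ms"].agg(
    mean_pause_ms="mean",
    pauses_over_2s=lambda s: (s > 2000).sum(),
).reset_index()
print(pause_features)
```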

Accomplishments that we're proud of

Despite the challenges, we achieved several accomplishments that we are proud of:

  1. Comprehensive Data Exploration and Feature Engineering: We invested significant effort in exploring the dataset and engineering meaningful features. Creating 27 new feature columns allowed us to capture valuable information about the writing process.

  2. Effective Data Preprocessing: We successfully applied various preprocessing techniques to handle missing values, encode categorical features, normalize the data, and address class imbalance. These steps contributed to improving the quality of the input data for the models.

  3. Promising Model Performance: Although we faced challenges, we managed to train models that achieved competitive RMSE scores on the Kaggle test set. Both the Linear Regression baseline and the LightGBM models showed promising results, indicating the potential for further improvement.

What we learned

Throughout this project, we gained valuable insights and learned several lessons:

  1. Importance of Feature Engineering: We learned the significance of thorough data exploration and feature engineering. Creating relevant and informative features can greatly impact the performance of machine learning models.

  2. Iterative Model Development: We realized the importance of iterating through different modeling approaches and adjusting parameters to find the best-performing model. Experimenting with both classification and regression models helped us identify the most suitable approach for this task.

  3. Collaboration and Communication: Working in a team of two highlighted the importance of effective collaboration and communication. We learned to divide tasks, share insights, and support each other throughout the project.

What's next for Linking Writing Processes to Writing Quality

Moving forward, there are potential directions for enhancing and expanding this project:

Exploring Advanced Models: Experimenting with more advanced machine learning models, such as deep learning architectures or ensemble methods, could potentially improve the prediction performance. These models might be able to capture more complex patterns and relationships in the data.
