Inspiration
To analyze each student's ability on each topic and provide reports or suggestions, we must first determine which questions belong to which topic. However, we found it difficult to categorize the questions, because classifying them precisely requires familiarity with every topic. Furthermore, because we need to add a large number of questions to the platform, the process is complicated and time-consuming. To reduce the time and complexity involved, we decided to create an AI program to classify the questions.
What it does
A program that takes each input question, then analyzes it and categorizes it into its specific subject and/or topic.
How we built it
We used natural language processing together with Python, Google Sheets, Google Colab, and libraries including Pandas, Scikit-learn, and PyThaiNLP.
Here is a step-by-step explanation of how we built the model; minimal code sketches for the main steps follow the list.
- Data preparation - Before we begin coding, we must first prepare the data. We chose to prepare our data in Google Sheets.
- Import data and dependencies - To get started with the coding part, we need to import pandas, PyThaiNLP, and Scikit-learn. We also use the gspread API in this step to import and read data from the Google Sheet we created previously (sketched below).
- Create dataframe - Generate a dataframe from the imported data using pandas' DataFrame function.
- Data checking - Next, check the number of records and columns. There are three columns in this case, each with 320 records, and all data types are stored as objects. After that, check for missing data so it can be eliminated before proceeding (sketched below). As illustrated in the slide, the check returns False everywhere, indicating that none of our data is missing. If any data were missing from a row, we would drop it and count the remaining records.
- Combine data - The question and choice columns are combined and stored in a new column named text. We believe the question and the choices can be analyzed together, since both contain keywords that assist the analysis.
- Encode topics into numerical values - Encode each topic as a number from 0-7 to make the labels easier to use later.
- Data counting and visualization - Use the countplot function to count and visualize the records in each topic (sketched below). In our data, the number of records per topic is the same; this is to avoid biased predictions.
- Data cleaning - Clean the data using the PyThaiNLP library. In this step, remove unnecessary tokens such as whitespace, punctuation, and numerical values, along with stopwords (sketched below). A new column named text tokens then holds the cleaned data, which is left with only keywords and is ready for analysis. Additionally, do not forget to re-check the data to ensure no records went missing as a result of the cleaning.
- Create wordcloud - While entirely optional, create a wordcloud in order to visualize the density of words in each topic.
- Split data into train and test sets - Divide the cleaned data into a train set (80%) and a test set (20%). The train set is used to build and train the model, while the test set is used to evaluate the model's performance.
- Word vectorization - The training set's keywords are then transformed into vectors and plotted in a graph.
- Model building - Finally, use the Scikit-learn library to load the logistic regression model (sketched below). At this point, we can create a model that categorizes each question by topic. However, as the precision score, f1 score, and confusion matrix indicate, the model is not yet very accurate. The topics most likely to be predicted incorrectly are topic 1 and topic 5, which are แรงและการเคลื่อนที่ (force and motion) and อาหารและการดำรงชีวิต (food and life processes).
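
Below are minimal, hedged sketches of the steps above, not our exact code. This first one covers importing the data and creating the dataframe; it assumes a Colab notebook, and the sheet title "question-bank" (like the column names used later) is a placeholder.

```python
# Authenticate inside Colab and read the prepared sheet with gspread.
import gspread
import pandas as pd
from google.colab import auth
from google.auth import default

auth.authenticate_user()
creds, _ = default()
gc = gspread.authorize(creds)

worksheet = gc.open("question-bank").sheet1   # hypothetical sheet name
records = worksheet.get_all_records()          # list of dicts, one per row

# Create a dataframe from the imported data.
df = pd.DataFrame(records)
```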
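A sketch of the data-checking step, using standard pandas calls:

```python
# Inspect record counts, columns, and dtypes
# (in our case: three object columns, 320 rows each).
df.info()

# Check for missing values; False for every column means nothing is missing.
print(df.isnull().any())

# If anything were missing, drop those rows and count what remains.
df = df.dropna()
print(len(df), "records remain")
```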
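A sketch of combining, encoding, and counting the data. countplot comes from the seaborn library (an assumption, since our tool list does not name it), and the question/choice/topic column names are placeholders:

```python
import seaborn as sns

# Combine question and choices into one text column, since both carry keywords.
df["text"] = df["question"] + " " + df["choice"]

# Encode the eight topic names as integers 0-7.
df["label"] = df["topic"].astype("category").cat.codes

# Visualize how many records each topic has (equal counts avoid biased predictions).
sns.countplot(x="topic", data=df)
```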
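A sketch of the cleaning step with PyThaiNLP; newmm is PyThaiNLP's default tokenizer engine, and the exact filtering rules here approximate ours:

```python
from pythainlp.tokenize import word_tokenize
from pythainlp.corpus import thai_stopwords

stopwords = thai_stopwords()

def clean(text):
    # Tokenize the Thai text, then drop whitespace, stopwords, and pure numbers.
    tokens = word_tokenize(str(text), engine="newmm")
    return [t for t in tokens if t.strip() and t not in stopwords and not t.isdigit()]

# Keep the cleaned keywords in a new column, then re-check for emptied rows.
df["text_tokens"] = df["text"].apply(clean)
print((df["text_tokens"].str.len() == 0).sum(), "rows lost all tokens")
```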
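A sketch of the split, vectorization, and model-building steps. TF-IDF is one common vectorizer choice, not necessarily the exact one we used, and random_state is arbitrary:

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Join the token lists back into space-separated strings for the vectorizer.
texts = df["text_tokens"].apply(" ".join)

# 80/20 train/test split, stratified so every topic appears in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, df["label"], test_size=0.2, random_state=42, stratify=df["label"])

# Turn keywords into vectors; splitting on spaces preserves the PyThaiNLP tokens.
vectorizer = TfidfVectorizer(tokenizer=str.split, token_pattern=None)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Fit a logistic regression classifier, then report precision, f1, and the
# confusion matrix (where topics 1 and 5 showed the most mix-ups for us).
model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)
pred = model.predict(X_test_vec)
print(classification_report(y_test, pred))
print(confusion_matrix(y_test, pred))
```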
Challenges we ran into
Because we had only a limited amount of time, and none of us is very familiar with coding, we had to start from scratch and work our way through the process step by step. Additionally, because we were only able to collect a limited amount of data, the model is not very accurate at this time. For example, it may classify a topic 1 question into topic 5. More specifically, the classifier can be confused by similar words appearing in both topics, as shown in the example wordcloud (in the slides), where some words, such as พลังงาน (energy) and กรัม (gram), appear in both topics. As a result, the prediction may be inaccurate.
Ways to improve the model accuracy
- Increase the size of the training data - One of the primary determinants of a model's predictive power is the size of its training data. The more training data there is, the more accurate the model is likely to be.
- Mark words that may confuse the model - If similar words appear in many topics, we can add them as extra features and let the logistic regression model combine these new features with the original ones to produce scores and better predictions (just like in aj.attapol's logistic regression tutorial; a sketch follows this list). Also, developing or improving our own stopword list may lead to better predictions, because we can customize everything to best fit our data; someone else's ready-to-use stopword pack may not perfectly fit our use, resulting in inaccurate output.
- Discover other algorithms that better suit the data - Explore more models and algorithms, because there are many more out there. Understanding and trying new approaches may result in a better model and solution!
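
As a sketch of the "mark confusing words" idea, assuming the X_train/X_test variables from the model-building sketch above: append one indicator feature per ambiguous word to the text vectors, so logistic regression can weigh those words explicitly. The helper and variable names are hypothetical.

```python
import numpy as np
from scipy.sparse import hstack

# Words that our wordclouds show in both topic 1 and topic 5.
AMBIGUOUS_WORDS = ["พลังงาน", "กรัม"]

def ambiguity_flags(texts):
    # One 0/1 column per ambiguous word: does the text contain it?
    return np.array([[int(word in text) for word in AMBIGUOUS_WORDS]
                     for text in texts])

# Stack the flags next to the original text features and retrain on these.
X_train_aug = hstack([X_train_vec, ambiguity_flags(X_train)])
X_test_aug = hstack([X_test_vec, ambiguity_flags(X_test)])
```

Retraining the same logistic regression on X_train_aug then gives each flagged word its own learned weight alongside the original features.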
Accomplishments that we're proud of
We actually built a working question classification program in a limited period of time (considering we are not very familiar with coding). We quite appreciate that it is incredibly beneficial and increases the efficiency of our project's work. Additionally, we are pleased because we believe it has the potential for further development and improvement to become even more efficient.
What we learned
Firstly, we learned how to apply artificial intelligence effectively to our project and achieve significantly beneficial results. Along the way, we learned more about machine learning and how to use libraries we had never used before. We also discovered that implementing artificial intelligence to boost work efficiency is not that difficult. However, we noticed that developing an efficient model takes considerable time, data, and effort.
What's next for Questions Classification
We will spend more time on it during the semester break to make it more efficient and accurate. We will collect additional data and investigate further options for advancing our program. In addition, we would like the program to adapt to a broader range of fields, not just question classification.
We also expect the program to become the leading back-end system for advancing ScoreLar’s key features. We believe that using this program to categorize questions will result in more precise analysis and suggestions for students.
Built With
- google-colab
- google-spreadsheets
- pandas
- pythainlp
- python
- scikit-learn