Inspiration
I was inspired by UCSB professor Stefan Th. Gries, his lectures on linguistics, and several cases he worked on in which he identified the writer of a text from its linguistic features. I wanted to apply that idea to a more everyday field, especially education. I realized that school systems often use AI detectors, but some people naturally write in a tone closer to AI than others. I wanted to build a system that compares a student's earlier work with newer work and estimates the probability that the newer work is AI-generated. Using writing that could not have been AI-generated gives an accurate baseline for finding what could be.
What it does
Linguine is built for teachers: they can create classes, add students, and put in writing samples for each student. After three initial essays, it builds a baseline analysis of the linguistic "footprint" of the student's writing from 28 parameters such as sentence length, grammar errors, transition words, and vocabulary complexity and diversity. Linguine can then analyze newer essays and produce a combined percentage chance that an essay is AI-generated, using both a traditional AI detector and this method. It also shows how much the writing deviates from the student's usual style.
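The baseline step can be sketched roughly as follows. This is a minimal illustration, not the project's code: it uses regular expressions from the standard library in place of spaCy, covers only three of the 28 parameters, and the names `TRANSITIONS`, `extract_features`, and `build_baseline` are hypothetical.

```python
import re
from statistics import mean

# Hypothetical, deliberately tiny transition-word list (an assumption for this sketch).
TRANSITIONS = {"however", "therefore", "moreover", "additionally", "furthermore"}


def extract_features(text: str) -> dict[str, float]:
    """Compute three illustrative style features for one writing sample."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence")
    return {
        # Average number of words per sentence.
        "avg_sentence_length": len(words) / len(sentences),
        # Vocabulary diversity: unique words divided by total words.
        "type_token_ratio": len(set(words)) / len(words),
        # Share of words that are transition words.
        "transition_rate": sum(w in TRANSITIONS for w in words) / len(words),
    }


def build_baseline(samples: list[str], minimum: int = 3) -> dict[str, float]:
    """Average each feature over the student's known-human samples."""
    if len(samples) < minimum:
        raise ValueError(f"need at least {minimum} samples to build a baseline")
    per_sample = [extract_features(s) for s in samples]
    return {name: mean(f[name] for f in per_sample) for name in per_sample[0]}
```

Averaging per feature keeps the baseline easy to inspect: a teacher can see, for example, that a student usually writes about 14 words per sentence before any new essay is compared against it.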
How we built it
The backend is written in Python using FastAPI, and all data is stored in JSON files for simpler storage. The linguistic part uses spaCy, which helps me process each text sample and extract the 28 parameters. I also made a deviation scorer that penalizes large differences between the baseline and the submitted essay far more heavily than small ones. Each linguistic feature also carries a different weight, based on how important I judged it to be and how well it distinguishes one writer from another. For example, em dashes are an important indicator and get more weight than whether the writer uses Oxford commas. The Sapling AI API adds general AI detection to complement the linguistic score. The frontend is React with React Router and provides a dashboard to manage classes and students, plus a page for each student to add samples, build the baseline, submit essays, and view history.
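One way the weighted deviation scorer could look is sketched below. Everything specific here is an assumption for illustration: the function name `deviation_score`, the feature names, the weights, the exponent `power=2.0`, and the cap on relative deviation are not taken from the project.

```python
def deviation_score(
    baseline: dict[str, float],
    sample: dict[str, float],
    weights: dict[str, float],
    power: float = 2.0,   # assumed exponent; values > 1 punish large gaps more
    cap: float = 1.0,     # assumed ceiling on each feature's relative deviation
) -> float:
    """Return a 0-100 score for how far a sample drifts from the baseline."""
    weighted_sum = 0.0
    total_weight = 0.0
    for name, weight in weights.items():
        base = baseline[name]
        # Relative deviation, guarded against a zero baseline value.
        relative = abs(sample[name] - base) / max(abs(base), 1e-9)
        relative = min(relative, cap)
        # Power curve: a deviation twice as large counts four times as much
        # when power is 2.
        weighted_sum += weight * relative ** power
        total_weight += weight
    return 100.0 * weighted_sum / total_weight


# Hypothetical weights: em dash usage counts three times as much as Oxford commas.
WEIGHTS = {"em_dash_rate": 3.0, "oxford_comma_rate": 1.0}
BASELINE = {"em_dash_rate": 1.0, "oxford_comma_rate": 1.0}

em_dash_shift = deviation_score(
    BASELINE, {"em_dash_rate": 1.5, "oxford_comma_rate": 1.0}, WEIGHTS
)
comma_shift = deviation_score(
    BASELINE, {"em_dash_rate": 1.0, "oxford_comma_rate": 1.5}, WEIGHTS
)
```

With these made-up numbers, the same 50% shift scores three times higher when it happens in the heavily weighted em dash feature than when it happens in Oxford comma usage, which is the behavior the weighting is meant to produce.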
Challenges we ran into
The biggest challenge was getting the linguistic analysis to work properly. Maybe I just sound a lot like AI, but at first, AI-generated essays checked against my own baseline came back with only single-digit percent chances of being AI. The first version used cosine similarity, which averaged everything out, so I switched to a power curve that raises the percentage steeply as the deviation grows. I also had to adjust the sensitivity and tweak the feature weights to make the results as accurate as possible. Another minor issue was getting bcrypt and passlib to be compatible and work properly.
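A small experiment suggests one plausible reason cosine similarity washed out the differences: when one feature has a much larger magnitude than the others, it dominates the angle between the vectors. The feature vectors below are invented for illustration (average sentence length, vocabulary diversity, transition-word rate), and the helper names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def relative_changes(baseline: list[float], sample: list[float]) -> list[float]:
    """Per-feature change as a fraction of the baseline value."""
    return [abs(s - b) / abs(b) for b, s in zip(baseline, sample)]


# Made-up values: [avg sentence length, vocabulary diversity, transition rate]
baseline = [20.0, 0.50, 0.02]
suspect = [20.5, 0.66, 0.08]

similarity = cosine_similarity(baseline, suspect)   # stays above 0.999
changes = relative_changes(baseline, suspect)       # roughly 2.5%, 32%, 300%
```

The two essays look nearly identical by cosine similarity even though two of the three features moved a lot. Scoring each feature's relative change separately, and then applying a power curve and weights, keeps those large shifts visible.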
Accomplishments that we're proud of
I finished the project despite the challenges I faced, and it performs genuinely personalized detection for each student. The 28-feature linguistic fingerprint actually works, and the interface is reasonably usable for the average teacher.
What we learned
I learned that sensitivity calibration is everything in a scoring system. Math that looks correct on paper, such as cosine similarity, can produce completely useless results in practice. I also learned that linguistic fingerprinting is genuinely powerful: the deviations the system flags (vocabulary diversity jumping 32%, transition words jumping 292%) are exactly the kinds of changes a human teacher would notice when reading carefully.
What's next for Linguine
The next step would probably be adding OCR so that teachers can scan documents instead of typing in every essay. It would also be nice to add more features to the app, such as tracking how a student's writing improves and how their style changes over time. I also want to build a custom model that uses the linguistic features instead of cosine deviation, so that Linguine doesn't need the Sapling AI API.
Built With
- fastapi
- github
- javascript
- numpy
- passlib+bcrypt
- pydantic
- python
- react
- react-router
- saplingai
- scikit-learn
- spacy
- textstat
- uvicorn
- vite
- vscode