We came in wanting to do something none of us had worked on before, and each of us has an interest in machine learning. Initially we planned on building a simple plagiarism checker, but as we researched the available literature we found an author attribution system more interesting to make.
What it does
The program takes a text document (currently PDF or DOCX) and uses over 50 metrics to characterize the style of a user's writing. These metrics are then fed into a support vector machine, which compares them against unknown test data. The machine then predicts which of the available authors is the best fit for the document. This is useful for measuring the progress and similarity of student writing, as well as for identifying unknown sources of writing from different time periods.
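To make the idea of style metrics concrete, here is a minimal sketch of how a handful of them might be computed from plain text. The function names `style_features` and `to_vector`, the `FUNCTION_WORDS` list, and the specific metrics are assumptions chosen for illustration; they are not the 50+ metrics the project actually uses, and the PDF/DOCX text extraction step is left out.

```python
import re
import string
from collections import Counter

# Hypothetical function-word list, for illustration only.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "but", "which")


def style_features(text):
    """Return a dict of simple stylometric measurements for one document."""
    # Split into sentences on ., ! and ?, dropping empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased word tokens (letters and apostrophes only).
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words or not sentences:
        raise ValueError("text must contain at least one word")

    counts = Counter(words)
    n_words = len(words)
    features = {
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_sentence_length": n_words / len(sentences),
        # Vocabulary richness: distinct words divided by total words.
        "type_token_ratio": len(counts) / n_words,
        "punctuation_per_word": sum(ch in string.punctuation for ch in text) / n_words,
    }
    # Relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        features["freq_" + fw] = counts[fw] / n_words
    return features


def to_vector(features):
    """Order the features by name so every document yields the same layout."""
    return [features[name] for name in sorted(features)]
```

Calling `to_vector(style_features(text))` on each document gives a fixed-length list of numbers, which is the kind of input a classifier such as a support vector machine expects, with one labeled vector per known-author sample for training.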
How we built it
PyQt5 was used for the GUI, NLTK was used for the support vector machine, and Python was used as the underlying language.
Challenges we ran into
Determining which characteristics are useful for attribution is not easy, and there are many conflicting ideas about which aspects of writing yield the most accurate predictions. This is also our first larger Python project, and none of us were very familiar with some aspects of the language that came up in the project's libraries.
Accomplishments that we're proud of
The support vector machine using our metrics is able to correctly identify our samples 90% of the time.
What we learned
What's next for Checkr
If we continue working on this project, we would build a web portal for users and begin building a much larger database of samples.