Plagiarism Detector

Inspiration

We were first inspired by the Word2Vec model developed by Google. With that in mind, and given the recent public conversation about ChatGPT and plagiarism, we thought it would be interesting to develop a tool that can determine the similarity of two documents.

What it does

The backend uses a homegrown language processing library built around a trained word2vec model. Using this, we break down text documents, tokenize strings, remove stop words, and use word embeddings to calculate a vector value for each string. Then we use cosine similarity (the dot product of the two vectors divided by the product of their lengths) to compare the two vector values. Because cosine similarity depends only on the angle between the vectors and not on their magnitudes, the number of words in each string matters much less, which makes this a fairly robust method.
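
As a rough illustration of the pipeline described above, here is a minimal Go sketch. It is not our actual library: the tiny stop word list, the three-dimensional toy embeddings, and the function names (Tokenize, DocVector, Cosine) are all made up for the example, and averaging the word vectors is an assumption about how a string's vector is formed.

```go
package main

import (
	"fmt"
	"math"
	"strings"
	"unicode"
)

// stopWords is a tiny illustrative list; a real list would be much longer.
var stopWords = map[string]bool{"the": true, "a": true, "is": true, "of": true, "and": true}

// embeddings is a toy stand-in for a trained word2vec model (values are invented).
var embeddings = map[string][]float64{
	"cat":    {0.9, 0.1, 0.0},
	"kitten": {0.8, 0.2, 0.1},
	"sleeps": {0.1, 0.9, 0.2},
	"naps":   {0.2, 0.8, 0.3},
	"stock":  {0.0, 0.1, 0.9},
}

// Tokenize lowercases the text, splits on non-letter characters, and drops stop words.
func Tokenize(text string) []string {
	fields := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	tokens := make([]string, 0, len(fields))
	for _, f := range fields {
		if !stopWords[f] {
			tokens = append(tokens, f)
		}
	}
	return tokens
}

// DocVector averages the embeddings of the known tokens (assumed aggregation).
// Tokens missing from the embedding table are skipped.
func DocVector(tokens []string, dim int) []float64 {
	vec := make([]float64, dim)
	n := 0
	for _, t := range tokens {
		e, ok := embeddings[t]
		if !ok {
			continue
		}
		for i := range vec {
			vec[i] += e[i]
		}
		n++
	}
	if n > 0 {
		for i := range vec {
			vec[i] /= float64(n)
		}
	}
	return vec
}

// Cosine returns dot(a, b) / (|a| * |b|), or 0 if either vector has zero length.
func Cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	a := DocVector(Tokenize("The cat sleeps."), 3)
	b := DocVector(Tokenize("A kitten naps"), 3)
	c := DocVector(Tokenize("stock"), 3)
	// The related pair should score noticeably higher than the unrelated pair.
	fmt.Printf("related: %.3f, unrelated: %.3f\n", Cosine(a, b), Cosine(a, c))
}
```

Because the score is divided by both vector lengths, summing the word vectors instead of averaging them would give the same result.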

First, a user finds two documents they think are similar. Then, the user pastes the contents of the documents into the two textboxes on the home page of the website. After pressing the button, the user is met with a similarity score between the two documents.

If a sentence in either document contains a parenthetical citation, it is not considered when calculating a similarity score. For example, a sentence like the following would not be included: "Some scholars note that '[t]he Martin XB-68 was a supersonic medium tactical bomber with a crew of two that was proposed in 1954 to the United States Air Force' (Wikipedia)".
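
One way such a filter could look is sketched below in Go. The two regular expressions are assumed heuristics, not our exact rules: a sentence is treated as any run of text ending in '.', '!' or '?', and a citation as any parenthetical that starts with a capital letter. The name DropCitedSentences is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// citationRe is an assumed heuristic: a parenthetical that begins with a
// capital letter, such as "(Wikipedia)" or "(Smith 2020)".
var citationRe = regexp.MustCompile(`\([A-Z][^()]*\)`)

// sentenceRe naively treats a run of text ending in '.', '!' or '?' as a sentence.
var sentenceRe = regexp.MustCompile(`[^.!?]+[.!?]*`)

// DropCitedSentences returns the sentences of text that do not contain
// something that looks like a parenthetical citation.
func DropCitedSentences(text string) []string {
	var kept []string
	for _, s := range sentenceRe.FindAllString(text, -1) {
		s = strings.TrimSpace(s)
		if s == "" || citationRe.MatchString(s) {
			continue
		}
		kept = append(kept, s)
	}
	return kept
}

func main() {
	text := "Bombers are large aircraft. Some scholars note that the XB-68 was proposed in 1954 (Wikipedia). It was never built."
	for _, s := range DropCitedSentences(text) {
		fmt.Println(s)
	}
}
```

A heuristic this simple will misfire on citations that contain periods, and on ordinary parentheticals that happen to start with a capital letter.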

How we built it

The frontend was built with React. The backend was written in Go and we used a pretrained Word2Vec model.
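
To give a sense of how a React frontend can talk to a Go backend, here is a minimal sketch of a JSON endpoint using only the Go standard library. Everything specific in it is hypothetical: the /compare path, the "first"/"second"/"score" field names, and the placeholder similarity function standing in for the real word2vec scorer.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// compareRequest and compareResponse are hypothetical JSON shapes.
type compareRequest struct {
	First  string `json:"first"`
	Second string `json:"second"`
}

type compareResponse struct {
	Score float64 `json:"score"`
}

// similarity is a placeholder for the real scorer:
// it returns 1 for identical strings and 0 otherwise.
func similarity(a, b string) float64 {
	if a == b {
		return 1
	}
	return 0
}

// compareHandler decodes two documents from a POST body and replies with a score.
func compareHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "POST required", http.StatusMethodNotAllowed)
		return
	}
	var req compareRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "invalid JSON", http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(compareResponse{Score: similarity(req.First, req.Second)})
}

// post sends body to the handler in-process (no network port is opened)
// and returns the status code and the trimmed response body.
func post(body string) (int, string) {
	req := httptest.NewRequest(http.MethodPost, "/compare", strings.NewReader(body))
	rec := httptest.NewRecorder()
	compareHandler(rec, req)
	return rec.Code, strings.TrimSpace(rec.Body.String())
}

func main() {
	status, out := post(`{"first":"same text","second":"same text"}`)
	fmt.Println(status, out)
}
```

In a real server the handler would be registered with http.HandleFunc and served with http.ListenAndServe. A frontend served from a different origin would also need CORS headers, which this sketch leaves out.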

Challenges we ran into

Towards the end of the hackathon, we had a hard time integrating the frontend with the backend, in particular sending requests from the frontend to the backend.

Accomplishments that we're proud of

We were able to write a library that handles a Natural Language Processing model. The backend works, which is cool!

What we learned

This was our first time competing in a hackathon! We learned a lot, including how to use React. We also sharpened our skills in Go and in collaborating using Git.

What's next for Plagiarism Detector

If we had more time, we would add the ability to upload documents directly instead of copying and pasting text into text boxes. On top of this, we would add the ability to search the internet for sources that a document may have been plagiarized from.