Our main idea was to create a web service that lets users store their favorite links in a fast, well-structured way. Once the user pastes a new link, our algorithm classifies it into one of 6 predefined categories (Sports, Technology, Education, Finance, Entertainment, Politics). We also created a Chrome extension that exports links from the user's bookmarks and adds them directly to the app.
Our algorithm has several steps:
- Preprocessing: Web Scraping, Tokenization, Stemming, Stopword Removal, and filtering out short words and words with very low or very high term frequencies.
- Feature Extraction: Bag of Words representation, weighted with TF-IDF (Term Frequency-Inverse Document Frequency).
- Classification: We trained our model on the sparse feature vectors of approximately 220 labeled links, preprocessed as described above. Once trained, the model predicts which of the 6 predefined categories a newly inserted link belongs to and assigns it accordingly. We tried two algorithms: a Naive Bayes classifier and a multiclass linear Support Vector Machine. From our testing, we concluded that the Support Vector Machine approach is more suitable for our case.
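The steps above can be sketched in a few lines of Python. This is a minimal illustration and not our actual implementation: it uses only the standard library, skips web scraping and stemming, uses a tiny made-up stopword list, and trains a small multinomial Naive Bayes classifier (one of the two algorithms we tried) on four invented example texts. All function names and the smoothed IDF formula are assumptions chosen for the sketch.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption; a real list is much longer).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with"}

def tokenize(text, min_len=3):
    """Lowercase, split on non-letters, drop stopwords and short words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) >= min_len and w not in STOPWORDS]

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized document, plus the IDF table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # log(n / df) + 1 is one common IDF variant; other variants exist.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in counts.items()})
    return vectors, idf

def train_nb(vectors, labels):
    """Multinomial Naive Bayes over TF-IDF weights with add-one smoothing."""
    vocab = {t for vec in vectors for t in vec}
    weights = defaultdict(Counter)
    for vec, label in zip(vectors, labels):
        weights[label].update(vec)  # accumulate per-class term weights
    model = {}
    for label, n_docs in Counter(labels).items():
        total = sum(weights[label].values()) + len(vocab)
        log_prior = math.log(n_docs / len(labels))
        log_like = {t: math.log((weights[label][t] + 1.0) / total) for t in vocab}
        model[label] = (log_prior, log_like)
    return model

def predict(model, idf, text):
    """Return the most likely category for a new piece of text."""
    tokens = tokenize(text)
    counts = Counter(t for t in tokens if t in idf)  # ignore unseen terms
    vec = {t: (c / len(tokens)) * idf[t] for t, c in counts.items()}
    scores = {
        label: log_prior + sum(w * log_like[t] for t, w in vec.items())
        for label, (log_prior, log_like) in model.items()
    }
    return max(scores, key=scores.get)

# Invented toy training data standing in for the scraped, labeled links.
TRAIN = [
    ("The team won the football match with a late goal", "Sports"),
    ("The striker scored twice in the league match", "Sports"),
    ("The bank raised interest rates as markets fell", "Finance"),
    ("Investors sold stocks after the bank report", "Finance"),
]
docs = [tokenize(text) for text, _ in TRAIN]
labels = [label for _, label in TRAIN]
vectors, idf = tfidf_vectors(docs)
model = train_nb(vectors, labels)

print(predict(model, idf, "A late goal decided the football match"))
```

In practice, a library such as scikit-learn provides tested TF-IDF vectorizers and linear SVM classifiers, which would be a more natural fit for the Support Vector Machine variant than hand-written code like this.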
Next steps: a bigger training dataset, more categories, and other text mining approaches (Matrix Factorisation, Latent Semantic Analysis). For text preprocessing, we could also try a different implementation using n-grams or lemmatisation.
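As a rough idea of what the n-gram variant could look like, the sketch below turns a list of tokens into word n-grams that could be fed to the same Bag of Words step. The function name `ngrams` is hypothetical, and this is only one simple way to do it.

```python
def ngrams(tokens, n=2):
    """Return the word n-grams of a token list, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Bigrams keep short phrases together that single words would split apart.
bigrams = ngrams(["support", "vector", "machine"], n=2)
print(bigrams)
```

With bigrams, a phrase such as "interest rates" becomes a single feature instead of two unrelated words, at the cost of a larger and sparser vocabulary.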