arXiv is the repository to tens of thousands of openly accessible papers in the fields of physics, mathematics, computer science and so on. For machine learning practitioners, researchers, data scientists (and all the other positions pertaining to the fields of machine learning and data science) arXiv is the dosage of research, application, and theoretical know-how.

The pace at which the field of machine learning is evolving is just unreal. There are new papers getting released every day. It really gets challenging to keep up with this pace. Moreover, there are so many sub-categories in each of the primary arXiv categories. For example, in the Computer Science category (cs) there are sub-categories like cs.LG (Machine Learning), cs.CV (Computer Vision and Pattern Recognition), cs.AI (Artificial Intelligence), cs.CL (Computation and Language) and so on. These sub-categories are also called tags.

As mentioned above, a huge number of papers get submitted to arXiv on a daily basis. To maintain smooth and consistent user experience, it is important to accurately categorize the papers. Because of the vastness of the field (machine learning), it can get confusing for an author to decide these categories in the event of submission of a paper. In situations like this, an automatic tag identification can help which would be able to correctly identify the right tags from the title of a paper.

Being a new entrant to the field of Natural Language Processing (NLP), I took this idea to develop on a cool project around it. In this article, I will be walking you through some of the critical aspects of the projects and how Weights and Biases helped me to keep track of my experiments. Let’s get started!

The source code of the project is available here: https://github.com/sayakpaul/Generating-categories-from-arXiv-paper-titles. I have discussed the project in great detail in this article of mine: https://www.wandb.com/articles/generating-tags-from-arvix.

Built With

Share this project:

Updates