Why the project
According to Verisign, there were over 360 million registered domains in 2021. With each domain typically having many subpages, sometimes millions of them, there are billions of URLs on the internet.
If you want to explore the largest data sets of URLs, we invite you to check Common Crawl: https://commoncrawl.org/the-data/get-started/
Its latest data set had the following stats: 2.55 billion web pages, or 295 TiB of uncompressed content. Page captures come from 46 million hosts, or 37 million registered domains, and include 1.3 billion URLs not visited in any prior crawl.
With billions of URLs, we often want to know what they are about, i.e. which category a page belongs to. Is it about tech, news, or hobbies? We may also want more detailed categories, e.g. is it about Audio Accessories or Pet Supplies?
This is where website categorization methods come into play. They are developed to take a given website's content and assign one or more categories to it. If we are dealing with just a few URLs, we can categorize them ourselves. But usually we want to categorize many websites, even millions, so manual classification is not a practical solution.
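To make the idea concrete, here is a toy rule-based categorizer. The category names and keyword lists below are illustrative assumptions, not the project's actual taxonomy; the machine learning approach described next replaces hand-written rules like these.

```python
# Toy keyword-based website categorizer -- illustrative only.
# Category names and keyword lists are made up for this sketch;
# a real system would use a full taxonomy and a trained ML model.
KEYWORDS = {
    "Tech": {"software", "gadget", "programming", "hardware"},
    "News": {"breaking", "headline", "report", "politics"},
    "Pet Supplies": {"dog", "cat", "leash", "pet"},
}

def categorize(text: str) -> list[str]:
    """Return every category whose keywords appear in the page text."""
    words = set(text.lower().split())
    return [cat for cat, kws in KEYWORDS.items() if words & kws]

print(categorize("breaking report on a new programming gadget"))
# -> ['Tech', 'News']
```

Hand-written rules like these break down quickly at scale, which is exactly why a learned model is preferable.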
I developed a solution that categorizes websites with a machine learning model, based on the IAB taxonomy.
Importance of machine learning
What we need in this case is an automated, machine learning approach. Within the machine learning field, website categorization can be considered a text classification problem, which is well known in the ML community.
The typical data set used for training is the so-called newsgroups data set, which consists of around 18,000 newsgroup posts on 20 topics, split into two subsets.
You can learn more about it here: https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html.
This data set is useful when building larger, more complex ML models, especially neural nets or ensembles of them, as it gives a baseline of performance.
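As a minimal illustration of what text classification involves, here is a from-scratch toy naive Bayes classifier in plain Python. This is not the 20-newsgroups pipeline or the project's actual model; the tiny training set and label names are made up for the sketch.

```python
import math
from collections import Counter, defaultdict

# Toy multinomial naive Bayes text classifier -- illustrative only.
class NaiveBayes:
    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # label -> word counts
        self.label_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}

    def predict(self, doc):
        best, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            # log prior + log likelihood with add-one (Laplace) smoothing
            score = math.log(n_docs / sum(self.label_counts.values()))
            total = sum(self.word_counts[label].values())
            for word in doc.lower().split():
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best, best_score = label, score
        return best

clf = NaiveBayes()
clf.fit(
    ["the gpu renders frames", "compile the kernel source",
     "the striker scored a goal", "the match went to penalties"],
    ["comp.graphics", "comp.graphics", "rec.sport", "rec.sport"],
)
print(clf.predict("kernel compile error"))  # -> comp.graphics
```

Real models trained on the newsgroups data work the same way in principle, just with far more data and better features.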
Deployment as API
When deploying the website categorization model as an API, one can use one of the many Python web frameworks (or code it in another language).
A useful tool when building an API is Postman, where you can test your API requests. Here is a tutorial on Postman: https://www.javatpoint.com/postman
Of course, after building the API, good documentation is needed. There are several good tools for this; we had a good experience using Slate.
When building an API, it is often useful to write a Python wrapper and make it available as a pip package. I did this for the website categorization API as well; you can find the respective PyPI package here: https://pypi.org/project/websiteclassificationapi/.
Python wrappers are useful for users because they abstract away the details of the actual API implementation.
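The idea can be sketched as follows. The class name, endpoint, and parameter names here are hypothetical, invented for illustration; see the PyPI package above for the real wrapper.

```python
import json
import urllib.parse
import urllib.request

class WebsiteCategorizationClient:
    """Hypothetical wrapper: hides URL building, auth, and JSON parsing."""

    def __init__(self, api_key, base_url="https://api.example.com/v1"):
        self.api_key = api_key
        self.base_url = base_url

    def _build_url(self, website):
        query = urllib.parse.urlencode({"url": website, "key": self.api_key})
        return f"{self.base_url}/categorize?{query}"

    def categorize(self, website):
        # Users call client.categorize("example.com") and get a dict back,
        # without ever seeing the HTTP details.
        with urllib.request.urlopen(self._build_url(website)) as resp:
            return json.loads(resp.read())
```

From the user's point of view, the whole API shrinks to `client = WebsiteCategorizationClient("MY_KEY")` followed by `client.categorize("example.com")`.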
Once the API or dashboard is coded, it is useful to test how well it handles incoming traffic and requests from clients. We like the following framework for this purpose: https://locust.io
It lets us simulate thousands of users per second, and there are nice charts for throughput and other metrics that help evaluate how the website will cope with traffic in a real-world scenario.
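A Locust test is defined in a locustfile; a minimal sketch might look like the following (the /categorize endpoint is an assumption for illustration).

```python
# locustfile.py -- run with: locust -f locustfile.py --host http://localhost:8000
from locust import HttpUser, task, between

class ApiUser(HttpUser):
    # Each simulated user waits 1-3 seconds between requests
    wait_time = between(1, 3)

    @task
    def categorize(self):
        self.client.get("/categorize?url=example.com")
```

Locust then ramps up the desired number of these simulated users and charts throughput and response times in its web UI.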
An interesting article on the benefits of URL categorization covers an ecommerce use case. Another one covers what kinds of categories we assign to URLs.
My project is part of the field called natural language processing.
There are numerous uses for natural language processing in the real world:
The use of machine translation in everything from digital assistants to video conferencing is a big success story for artificial intelligence.
Using topic modeling, we can find undiscovered subjects in collections of documents. It's an inventive method for learning about people's interests, and how they differ from one another, from massive data sets.
Text summarization is a great way to make lengthy documents more readable. It removes superfluous information while keeping the essential ideas from each paragraph or section, which aids comprehension. There are a variety of Python modules available for text summarization.
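As a toy illustration of extractive summarization (a frequency-based sketch in plain Python, not one of the dedicated modules mentioned above):

```python
import re
from collections import Counter

def summarize(text, n_sentences=1):
    """Score sentences by total word frequency and keep the top n, in order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))
    scored = sorted(
        range(len(sentences)),
        key=lambda i: -sum(freqs[w] for w in re.findall(r"[a-z']+", sentences[i].lower())),
    )
    keep = sorted(scored[:n_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)

text = "Cats sleep. Cats and dogs play with cats and dogs. Birds fly."
print(summarize(text, 1))  # -> Cats and dogs play with cats and dogs.
```

Dedicated libraries use much better sentence scoring (or abstractive models), but the extractive principle is the same: rank sentences and keep the most informative ones.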
Web content filtering (restricting access to websites on internal networks by blocking the IPs of domains such as shopping stores, social media networks, gaming sites, etc.) and cybersecurity (identifying problematic websites) are just a few of the fields where classification of websites and domains is important.
Online optical character recognition services: digitizing documents into a digital format with an OCR service is often the first step companies take when starting their digital initiatives.
Sentiment analysis makes it possible to understand the feelings and opinions hidden in your clients' writing. It can be utilized in a variety of fields, such as marketing or advertising, to help you learn what consumers truly believe about the products they have seen marketed.
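A toy lexicon-based sentiment scorer shows the basic idea (the word lists below are illustrative assumptions; real systems use large lexicons or trained models):

```python
# Toy lexicon-based sentiment analysis -- illustrative only.
POSITIVE = {"great", "love", "excellent", "good", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "awful"}

def sentiment(text: str) -> str:
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("i love this excellent product"))  # -> positive
```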
In the realm of machine learning, natural language processing is a challenging task, since language contains lexical and syntactic ambiguity.