Website categorization API

Why the project

According to Verisign, there were over 360 million domains registered in 2021. With each domain usually having many subpages, sometimes millions, there are billions of URLs on the internet.

If you want to see the largest data sets on URLs, we invite you to check Common Crawl: https://commoncrawl.org/the-data/get-started/

Its latest data set had the following stats: 2.55 billion web pages, or 295 TiB of uncompressed content. Page captures are from 46 million hosts or 37 million registered domains and include 1.3 billion new URLs not visited in any of its prior crawls.

With billions of URLs, we sometimes want to know what a page is about, i.e. which category it belongs to. Is it about tech, news or hobbies? We may also want more detailed categories, e.g. whether it is about Audio accessories or Pet Supplies.

This is where website categorization methods come into play. They take the content of a given website and assign one or more categories to it.

If we are dealing with just a few URLs, we can do that ourselves. But usually we want to categorize many websites, even millions, so manual classification is not a practical solution.

I developed a solution that categorizes websites with a machine learning model, based on the IAB taxonomy.

Importance of machine learning

What we need in this case is an automated, machine learning approach.

Within the machine learning field, website categorization can be considered a text classification problem, which is well known in the ML community.

A typical data set used for training and benchmarking is the so-called 20 newsgroups data set, which consists of around 18,000 newsgroup posts on 20 topics, split into two subsets (one for training and one for testing).

You can learn more about it here: https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html.

This data set is useful when building larger, more complex ML models, especially neural nets or ensembles of them, as it gives a baseline of performance.
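
To make the text classification framing concrete, here is a minimal sketch of a multinomial naive Bayes classifier written with the Python standard library only. It is not the model used in this project. The toy training texts and the two category names are assumptions chosen for illustration, and a real system would train on a much larger labeled corpus.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter(labels)        # label -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocabulary:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy training data; the category names are illustrative only.
train_texts = [
    "wireless headphones and bluetooth speakers with deep bass",
    "studio microphone and headphone amplifier review",
    "dry dog food and chew toys for puppies",
    "cat litter, scratching posts and pet beds",
]
train_labels = ["Audio", "Audio", "Pet Supplies", "Pet Supplies"]

model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
print(model.predict("best bluetooth headphones for travel"))
print(model.predict("grain-free food for small dogs"))
```

For a website, the input text would be the content extracted from the page. A simple model like this is mainly useful as a baseline against which larger models can be compared.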

Deployment as API

When deploying website categorization as an API, one can choose from many Python web frameworks (or implement it in another language).
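
As a rough sketch of what such a service might look like, the example below exposes a single JSON endpoint using wsgiref from the Python standard library, so it has no third-party dependencies. The /categorize path, the url query parameter, the response fields and the keyword-based categorize function are all assumptions made for illustration; they do not describe the actual API of this project, where the function would call the trained model.

```python
import json
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server


def categorize(url):
    """Hypothetical stand-in for the trained model: a simple keyword lookup."""
    return ["Technology"] if "tech" in url else ["Uncategorized"]


def app(environ, start_response):
    """WSGI application answering GET /categorize?url=<address> with JSON."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    url = params.get("url", [""])[0]
    if environ.get("PATH_INFO") != "/categorize":
        status, payload = "404 Not Found", {"error": "unknown path"}
    elif not url:
        status, payload = "400 Bad Request", {"error": "missing 'url' parameter"}
    else:
        status, payload = "200 OK", {"url": url, "categories": categorize(url)}
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]


def serve(port=8000):
    """Run the app locally with the standard library's reference server."""
    with make_server("", port, app) as server:
        server.serve_forever()
```

Calling serve() starts a local server that can then be queried from a browser or an API testing tool. The wsgiref server is meant for development only; a production deployment would normally use a dedicated web framework and server.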

A useful tool when building an API is Postman, where you can test your API requests. Here is a tutorial on Postman: https://www.javatpoint.com/postman

Of course, after building the API, good documentation is needed. There are several good tools for this; we had a good experience with Slate.

When building an API, it is often useful to write a Python wrapper and publish it as a pip-installable Python package. I did this for the website categorization API as well; you can find the respective PyPI package here: https://pypi.org/project/websiteclassificationapi/

Python wrappers are useful for users because they abstract away the details of the actual API implementation.
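
The following sketch illustrates the idea of such a wrapper. It is not the code of the websiteclassificationapi package: the class name, the base URL, the /categorize path, the api_key parameter and the categories response field are all hypothetical. The point is only that the caller works with a plain Python method while URL building, the HTTP call and JSON decoding stay hidden.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


class CategorizationClient:
    """Thin wrapper that hides request building and JSON decoding."""

    def __init__(self, base_url, api_key, fetch=None):
        self.base_url = base_url.rstrip("/")
        self.api_key = api_key
        # `fetch` maps a full request URL to the raw response body (bytes).
        # It defaults to a real HTTP GET and can be swapped out in tests.
        self._fetch = fetch or self._http_get

    @staticmethod
    def _http_get(request_url):
        with urlopen(request_url, timeout=10) as response:
            return response.read()

    def categorize(self, url):
        """Return the list of categories the API reports for `url`."""
        query = urlencode({"url": url, "api_key": self.api_key})
        raw = self._fetch(f"{self.base_url}/categorize?{query}")
        return json.loads(raw)["categories"]


# Offline demonstration with a fake transport instead of a real HTTP call.
def fake_fetch(request_url):
    return b'{"categories": ["Technology"]}'


client = CategorizationClient("https://api.example.com/", "KEY", fetch=fake_fetch)
print(client.categorize("https://example.com"))
```

Because the transport is injectable, the wrapper can be tested without network access, and the endpoint details can change later without affecting code that only calls categorize.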

Testing implementation

Once the API or dashboard is coded, it is useful to test how well it handles incoming traffic and requests from clients. We like the following framework for this purpose: https://locust.io

It lets us simulate thousands of concurrent users, and it provides nice charts of throughput and other metrics that help evaluate how the website will cope with traffic in a real-world scenario.
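
To show what metrics such as throughput and latency mean in practice, here is a small standard-library sketch that fires many concurrent calls and summarizes the timings. It is not Locust and does not use its API; it only illustrates the kind of numbers a load testing tool reports. The load_test and fake_request names are hypothetical, and fake_request stands in for one real HTTP request.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def load_test(call, n_requests=200, concurrency=20):
    """Invoke `call` n_requests times from a thread pool and summarize timing."""
    def timed(_):
        start = time.perf_counter()
        call()
        return time.perf_counter() - start

    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed, range(n_requests)))
    elapsed = time.perf_counter() - started
    return {
        "requests": n_requests,
        "throughput_rps": n_requests / elapsed,
        "median_latency_s": median(latencies),
        "p95_latency_s": latencies[int(0.95 * (n_requests - 1))],
    }


# Stand-in for one API request; replace with a real HTTP call to test a service.
def fake_request():
    time.sleep(0.01)


print(load_test(fake_request, n_requests=100, concurrency=10))
```

This sketch does not count failed requests or ramp users up gradually, which a dedicated load testing framework handles for you.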

Useful articles

Interesting article about Benefits of URL categorization — ecommerce use case.

Another one on what kind of categories do we assign to URLs?

NLP

My project is part of the field called natural language processing (NLP).

There are numerous uses for natural language processing in the real world:

Machine translation, used in anything from digital assistants to video conferencing, is a big success story for artificial intelligence.
Using topic modeling, we can find previously unknown topics in collections of documents. It is a useful method for discovering, from massive data sets, what people are interested in or worried about and how groups differ from one another.
Text summarization is a good way to make lengthy documents more readable. It removes superfluous information while keeping the essential ideas from each paragraph or section, which aids comprehension. There are a variety of Python modules available for text summarization.
Classification of websites and domains is important in fields such as web content filtering (restricting access to websites on internal networks by blocking the IPs of domains such as shopping stores, social media networks and gaming sites) and cybersecurity (identifying problematic websites).

Redaction is the process of editing documents to remove or obscure sensitive information before publication or sharing. Organizations like the National Archives offer guidance and standards for effective redaction practices. The International Association of Privacy Professionals also provides resources and training for professionals handling data redaction. For automated solutions, the document redaction tool helps securely redact confidential information from digital files.

Anonymization transforms personal data in such a way that individuals cannot be identified, even indirectly. The European Commission outlines strict requirements for anonymization under regulations like the GDPR. Research and frameworks from the Future of Privacy Forum support best practices for balancing privacy and data utility. Companies can utilize the personal data anonymization tool to systematically anonymize sensitive information in datasets.

Content moderation is vital for maintaining safe and welcoming digital communities by detecting and removing harmful or inappropriate content. Non-profits such as the Center for Democracy & Technology advocate for responsible, transparent moderation policies. Organizations like the Internet Watch Foundation work to identify and remove illegal content online, protecting users worldwide. Modern platforms can rely on the AI content moderation service to automate the review and filtering of user-generated content.

Also useful is the URL Categorization Dataset.

Sentiment analysis makes it possible to understand the feelings and opinions expressed in your clients' speech. It can be used in a variety of fields, such as marketing or advertising, to help you learn what consumers truly think about the goods they have seen marketed.

Within machine learning, natural language processing is a challenging task because natural language contains lexical and syntactic ambiguity.

Built With

css
html
javascript
mysql
python
pytorch
sklearn
tensorflow

Updates

ailstm started this project — Oct 11, 2022 05:43 AM EDT
