Inspiration: I was inspired because this was a new project based on deep learning: searching links within a limited time and providing you with relevant information about the web-page objects you searched for. A similar concept is used in ChatGPT, where a crawling algorithm crawls the databases of search engines.

Abstract: Search engine databases store a huge amount of information, so searching the internet is like dragging a net across the surface of the ocean: it is not always possible to get relevant information for the query entered into the search engine. Much of this information is hidden, buried far down on dynamically generated sites, and standard search engines fail to find it. Traditional search engines create indices by crawling, which requires pages to be static. Only such static pages are discovered by search engines; dynamically generated pages cannot be discovered, which increases the amount of hidden data. It is therefore necessary to use a two-stage framework with a multithreaded crawler for efficiently harvesting deep web sites. The multithreaded crawler reaches a maximum number of searchable forms while avoiding unproductive pages.

Introduction: A web crawler is a program that goes around the internet collecting and storing data in a database for further analysis and arrangement. Web crawling involves gathering pages from the web and arranging them in such a way that the search engine can retrieve them efficiently. The deep web consists of data that exists on the web but is inaccessible to text search engines. Locating deep web databases is a big challenge because they are not registered with any search engine, are usually sparsely distributed, and keep constantly changing. To address this problem, previous work has proposed two types of crawlers: generic crawlers and focused crawlers. Generic crawlers fetch all searchable forms but cannot focus on a specific topic. Focused crawlers such as the Form-Focused Crawler (FFC) and the Adaptive Crawler for Hidden-web Entries (ACHE) can automatically search online databases on a given topic. FFC is designed with link, page, and form classifiers for focused crawling of web forms, and is extended by ACHE with additional components for form filtering and an adaptive link learner. The link classifiers in these crawlers play a vital role in achieving higher crawling efficiency than a best-first crawler: they are used to predict the distance to a page containing searchable forms. When these distance predictions are inaccurate, the crawler fails to reach targeted forms. SmartCrawler is a focused crawler consisting of two stages: efficient site locating and balanced in-site exploring. It performs site-based locating by reversely searching the known deep web sites for center pages, which can effectively find many data sources for sparse domains, and it ranks the collected sites to focus the crawl on a topic. To further improve accuracy, an Adaptive Smart Crawler is proposed: a two-stage framework for efficiently harvesting deep web interfaces.
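The site-ranking step mentioned above can be illustrated with a minimal sketch: score each candidate site by how often topic keywords appear in its homepage text, then visit the highest scorers first. The keyword list, the scoring function, and the example sites below are illustrative assumptions, not the project's actual ranking model.

```python
# Sketch of topic-based site ranking: count topic-keyword occurrences in
# each site's homepage text and sort sites by that score, highest first.
def rank_sites(sites, topic_keywords):
    """Return sites sorted by a simple topic-relevance score (highest first)."""
    def score(site):
        text = site["homepage_text"].lower()
        return sum(text.count(kw.lower()) for kw in topic_keywords)
    return sorted(sites, key=score, reverse=True)

# Hypothetical candidate sites for a "books" topic.
candidates = [
    {"url": "http://books.example.com",
     "homepage_text": "Search our book catalog by author or title"},
    {"url": "http://cars.example.com",
     "homepage_text": "Used car listings and auto dealers"},
    {"url": "http://library.example.org",
     "homepage_text": "Library book search: find any book by title, author, or ISBN"},
]
ranked = rank_sites(candidates, ["book", "author", "title"])
```

A real system would use a learned relevance model rather than raw keyword counts, but the prioritization idea (crawl the most promising sites first) is the same.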
In the first stage, the crawler performs site-based searching for center pages with the help of search engines, avoiding visits to a large number of pages. To achieve accurate results for a focused crawl, the crawler ranks websites and prioritizes the ones most relevant to a given topic. In the second stage, the crawler searches for searchable forms within the given set of seed sites: it achieves fast in-site searching by excavating the most relevant links with adaptive link-ranking and by classifying topic-relevant links and domain-specific searchable forms. To eliminate bias toward visiting a few highly relevant links in hidden web directories, a link tree data structure is used to achieve wider coverage of a website. For better results we used a multithreaded crawler to crawl various domains simultaneously, and for classification we used a Naive Bayes classifier as well as a multiclass classifier. The result is a real-time application that can crawl several domains at once.
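The project names a Naive Bayes classifier for labeling links as topic-relevant. A hand-rolled multinomial Naive Bayes over anchor-text words, with Laplace smoothing, might look like the sketch below; the training examples and the "books" topic are illustrative assumptions, not the project's actual training data.

```python
import math
from collections import Counter

class NaiveBayesLinkClassifier:
    """Multinomial Naive Bayes over anchor-text tokens, Laplace-smoothed."""

    def fit(self, samples, labels):
        self.classes = set(labels)
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for text, label in zip(samples, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total = sum(self.class_counts.values())
        best, best_lp = None, float("-inf")
        for c in self.classes:
            lp = math.log(self.class_counts[c] / total)  # class prior
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for t in tokens:  # Laplace-smoothed token likelihoods
                lp += math.log((self.word_counts[c][t] + 1) / denom)
            if lp > best_lp:
                best, best_lp = c, lp
        return best

# Illustrative anchor texts labeled relevant/irrelevant for a "books" topic.
train = ["advanced book search", "search books by author", "browse book catalog",
         "contact us", "privacy policy", "about the company"]
labels = ["relevant", "relevant", "relevant",
          "irrelevant", "irrelevant", "irrelevant"]
clf = NaiveBayesLinkClassifier().fit(train, labels)
```

Anchor texts such as "book search by title" would be classified as relevant and followed first, while boilerplate links like "privacy policy page" would be deprioritized.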
Motivation: An effective deep web harvesting framework, i.e. a two-stage framework that achieves both wide coverage and high efficiency for a focused crawler, is proposed. Based on the observation that deep websites usually contain only a few searchable forms, the crawler is divided into two stages: a site locating stage and a searchable-form locating stage. The site locating stage helps achieve wide coverage of sites for a focused crawler, and the second stage efficiently searches for web forms within a site. This two-stage framework, named Adaptive Smart Crawler, addresses the problem of searching for hidden-web resources.
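The two-stage flow described above, with the multithreading mentioned earlier, can be sketched as a small skeleton. The in-memory "web", the `<form>` string check, and the breadth-first in-site search are all illustrative stand-ins for the real fetcher, form classifier, and adaptive link-ranking.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative in-memory "web": url -> (page_text, outgoing_links).
FAKE_WEB = {
    "site-a":         ("welcome page", ["site-a/search", "site-a/about"]),
    "site-a/search":  ("<form> search books </form>", []),
    "site-a/about":   ("about us", []),
    "site-b":         ("news portal", ["site-b/contact"]),
    "site-b/contact": ("contact page", []),
}

def fetch(url):
    return FAKE_WEB.get(url, ("", []))

def has_searchable_form(text):
    # Stand-in for the form classifier; a real one inspects form fields.
    return "<form>" in text

def crawl_site(seed):
    """Stage two: in-site breadth-first search for searchable forms."""
    found, frontier, seen = [], [seed], set()
    while frontier:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        text, links = fetch(url)
        if has_searchable_form(text):
            found.append(url)
        frontier.extend(links)
    return found

def harvest(seed_sites, workers=4):
    """Stage one: explore the ranked seed sites in parallel threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(crawl_site, seed_sites)
    return [url for site_forms in results for url in site_forms]

forms = harvest(["site-a", "site-b"])
```

In the real system, stage one would feed ranked sites into the pool as they are discovered, and stage two would order the frontier with the link classifier instead of plain breadth-first order.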

Future scope: A supervised learning method is currently used for training. In future, work will be carried out using unsupervised learning methods to improve classification accuracy. We can also work on a post-query approach for the form identifier, which relates to the form submission process (form filling and processing of form submission results) and makes use of the data generated by that process.

Challenges: I didn't face many challenges, because I was learning many new things in this project, which itself gave me scope to learn and understand the recent techniques used in AI chat systems such as ChatGPT; this project can be used for accuracy and pattern matching.
