Improved Web Page Categorization with Semantic-Aware Focused Crawling Using GloVe and TF-IDF

Li Zhang¹
¹School of Artificial Intelligence, Zhejiang College of Security Technology, Wenzhou, Zhejiang, 325016, China

Abstract

Focused crawlers search the internet for web pages on specific topics. Their main task is to collect preprocessed, topic-related web pages and to ignore irrelevant ones. Traditional focused crawlers have had limited success in achieving multi-text categorization of web pages. Because web pages contain large amounts of unstructured data, correctly classifying them with respect to a given topic is the main practical challenge for focused crawlers. The main objective of this work is to design an improved focused crawling approach based on web page classification. In this paper, a text classification model that combines the GloVe word vector model with the TF-IDF weighting technique is proposed to improve the accuracy of web page classification. The GloVe-based text classification model is then used to guide the focused crawler in classifying web pages. The proposed GloVe and TF-IDF text categorization model is validated on 10 different datasets, and the results are compared with traditional machine learning algorithms as well as with methods based on Naive Bayes, Bag-of-Words, and Word2Vec. According to the experimental results, the proposed text classification model outperforms traditional machine learning algorithms by 7-12%.
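
To make the combination of word vectors and TF-IDF weights concrete, the following is a minimal sketch of one common way to combine them: each page is represented by the TF-IDF-weighted average of its word vectors, and a candidate page is scored by its cosine similarity to a topic description. This is an illustration under stated assumptions, not necessarily the exact formulation used in this work. The vectors in `TOY_VECTORS` are made-up three-dimensional stand-ins for pretrained GloVe embeddings, and the names `idf_table`, `doc_vector`, and `cosine` are hypothetical helpers introduced only for this example.

```python
import math
from collections import Counter

import numpy as np

# Made-up 3-dimensional vectors standing in for pretrained GloVe embeddings.
TOY_VECTORS = {
    "crawler": np.array([0.9, 0.1, 0.0]),
    "web": np.array([0.8, 0.2, 0.1]),
    "page": np.array([0.7, 0.3, 0.0]),
    "football": np.array([0.0, 0.9, 0.2]),
    "match": np.array([0.1, 0.8, 0.3]),
}


def idf_table(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def doc_vector(doc, idf, vectors):
    """TF-IDF-weighted average of the word vectors of in-vocabulary terms."""
    tf = Counter(t for t in doc if t in vectors and t in idf)
    if not tf:
        # No known term: fall back to a zero vector of the embedding dimension.
        return np.zeros(len(next(iter(vectors.values()))))
    weights = np.array([count / len(doc) * idf[t] for t, count in tf.items()])
    stacked = np.stack([vectors[t] for t in tf])
    return weights @ stacked / weights.sum()


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


# Tiny tokenized corpus used only to estimate IDF values.
corpus = [
    ["web", "crawler", "page"],
    ["football", "match"],
    ["web", "page", "football"],
]
idf = idf_table(corpus)

topic = doc_vector(["web", "crawler"], idf, TOY_VECTORS)
candidate_on = doc_vector(["crawler", "page", "web"], idf, TOY_VECTORS)
candidate_off = doc_vector(["football", "match", "match"], idf, TOY_VECTORS)

score_on = cosine(topic, candidate_on)
score_off = cosine(topic, candidate_off)
```

In this toy setting the on-topic page receives a higher similarity score than the off-topic one, which is the signal a focused crawler could use to decide which pages to keep. In practice the page vectors would more likely be passed to a trained classifier than compared against a single topic vector.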

Keywords: focused crawler, GloVe, TF-IDF, text classification