A Web Page Segmentation Method by using Headlines to Web Contents as Separators and its Evaluations
佐野博之, Robin M. E. Swezey, 白松俊, 大囿忠親, 新谷虎松

Abstract

In this paper, we describe a Web page segmentation method based on title blocks and report its evaluation. Title blocks are minimum blocks that function as headlines for specific Web content. A typical Web page consists of multiple elements with different types of features, such as main content, navigation panels, copyright and privacy notices, and advertisements. Web page segmentation is the division of a page into visually and semantically cohesive pieces. Our segmentation method consists of three steps. First, it divides the page into minimum blocks. Second, it classifies each block as either a title block or a non-title block. Third, it assembles groups of these blocks into Web content blocks. While minimum blocks can play many roles, this study focuses on blocks that serve as the titles of individual pieces of Web content. Decision tree learning with nine features per minimum block is used to extract title blocks from Web pages. Experimental results showed that our segmentation method could divide Web pages collected from a news site with 96.1 percent accuracy, independently of the amount of content. The results also show that the method can divide every Web page used in the experiment in less than 1000 milliseconds.
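The second and third steps can be illustrated with a minimal sketch. It assumes the page has already been divided into minimum blocks, and it replaces the trained decision tree with a hand-written rule. The names `MinimumBlock`, `is_title_block`, and `segment`, the `font_size` feature, and the thresholds are all hypothetical; the nine features actually used are not listed in this abstract. The grouping rule (each title block opens a new Web content block, and following non-title blocks are attached to it) is one plausible reading of "title blocks as separators", not necessarily the exact assembly procedure of the method.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MinimumBlock:
    text: str
    font_size: float  # hypothetical feature, chosen only for illustration


def is_title_block(block: MinimumBlock) -> bool:
    # Stand-in for the trained decision tree: a hand-written rule with
    # assumed thresholds, not the model described in the paper.
    return block.font_size >= 16 and len(block.text) <= 60


def segment(
    blocks: List[MinimumBlock],
    classify: Callable[[MinimumBlock], bool] = is_title_block,
) -> List[List[MinimumBlock]]:
    """Group minimum blocks into content blocks, using title blocks as separators."""
    content_blocks: List[List[MinimumBlock]] = []
    for block in blocks:
        # A title block starts a new content block. Blocks that appear
        # before any title block also get a group of their own.
        if classify(block) or not content_blocks:
            content_blocks.append([block])
        else:
            content_blocks[-1].append(block)
    return content_blocks
```

For example, the block sequence title, body, title, body, body yields two content blocks, each beginning with its title block. In practice the `classify` argument would be backed by a trained classifier rather than a fixed rule.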