Classifying functions of web blocks based on linguistic features

US 7,895,148 B2

A classification system trains a classifier to classify blocks of the web page into various classifications of the function of the block. The classification system trains a classifier using training web pages. To train a classifier, the classification system identifies the blocks of the training web pages, generates feature vectors for the blocks that include a linguistic feature, and inputs classification labels for each block. The classification system learns the coefficients of the classifier using any of a variety of machine learning techniques. The classification system can then use the classifier to classify blocks of web pages.

Claims

1. A method performed by one or more computing devices for classifying a block of a document based on its function, the method comprising:
identifying blocks of training documents, a block of a document containing words that are displayed when the document is displayed;
for each identified block,
receiving a classification label for the identified block indicating its function; and
generating a feature vector for the identified block, the feature vector including a linguistic feature of a word of the block;
training a classifier using the feature vectors and classification labels to classify blocks of documents based on the feature vectors of the blocks;
classifying a block of a document based on its function by applying the trained classifier to a feature vector for the block; and
when a document will not fit on a display of a device, displaying blocks of the document giving preference to blocks with a certain classification.

11. A computing device generating a classifier for classifying blocks of web pages into functional classifications, comprising:
a training data store that includes training web pages, the web pages having blocks, a block of a web page containing text that is displayed when the web page is displayed;
a block identification component that identifies blocks within a web page;
a feature generation component that generates a feature vector for a block of a web page, the feature vector including layout features and linguistic features, the layout features including size of text within a block when the block is displayed;
a labeler component that inputs a classification label for each block of each training web page;
a component that learns coefficients of a classifier using the feature vectors of the training web pages and the label classifications and stores the coefficients in a classifier coefficients store; and
a component that, when a web page will not fit on a display of a device, provides that the blocks of the web page are displayed giving preference to blocks with a certain classification as determined by applying the classifier with the learned coefficients to the blocks of the web page.

18. A computer-readable storage medium encoded with instructions for controlling a computing device to classify blocks of web pages based on their function, by a method comprising:
identifying blocks of training web pages, each block of a web page containing text that is displayed when the web page is displayed;
for each identified block,
receiving a classification label for the identified block, the classifications including information and non-information; and
generating a feature vector for the identified block, the feature vector including a linguistic feature and a layout feature, the linguistic feature based on parts of speech of words within the text of the block, the parts of speech of words within the text of the block identified by submitting the text of the block to a natural language processor;
training a classifier using the feature vectors and classification labels; and
classifying a block of a web page as information or non-information by applying the trained classifier to a feature vector for the block
so that when the web page will not fit on the display of a device, blocks of the web page are displayed giving preference to blocks with a certain classification.

Description

BACKGROUND

Many search engine services, such as Google and Yahoo, provide for searching for information that is accessible via the Internet. These search engine services allow users to search for display pages, such as web pages, that may be of interest to users. After a user submits a search request (i.e., a query) that includes search terms, the search engine service identifies web pages that may be related to those search terms. To quickly identify related web pages, the search engine services may maintain a mapping of keywords to web pages. This mapping may be generated by crawling the web (i.e., the World Wide Web) to identify the keywords of each web page. To crawl the web, a search engine service may use a list of root web pages to identify all web pages that are accessible through those root web pages. The keywords of any particular web page can be identified using various well-known information retrieval techniques, such as identifying the words of a headline, the words supplied in the metadata of the web page, the words that are highlighted, and so on. The search engine service identifies web pages that may be related to the search request based on how well the keywords of a web page match the words of the query. The search engine service then displays to the user links to the identified web pages in an order that is based on a ranking that may be determined by their relevance to the query, popularity, importance, and/or some other measure.

Whether the web pages of a search result are of interest to a user depends, in large part, on how well the keywords identified by the search engine service represent the primary topic of a web page. Because a web page may contain many different types of information, it may be difficult to discern the primary topic of a web page. For example, many web pages contain advertisements that are unrelated to the primary topic of the web page. A web page from a news web site may contain an article relating to an international political event and may contain noise information such as an advertisement for a popular diet, an area related to legal notices, and a navigation bar. It has been traditionally very difficult for a search engine service to identify what information on a web page is noise information and what information relates to the primary topic of the web page. As a result, a search engine service may select keywords based on noise information, rather than the primary topic of the web page. For example, a search engine service may map a web page that contains a diet advertisement to the keyword diet, even though the primary topic of the web page relates to an international political event. When a user then submits a search request that includes the search term diet, the search engine service may return the web page that contains the diet advertisement, which is unlikely to be of interest to the user.

Many information retrieval and mining applications, such as search engine services as described above, depend in part on the ability to divide a web page into blocks and classify the functions of the blocks. These applications include classification, clustering, topic extraction, content summarization, and ranking of web pages. The classification of the function of a block can also be used in fragment-based caching in which caching policies are based on individual fragments. The classification of the function of blocks can also be used to highlight blocks that may be of interest to users. The classification of the function of blocks is particularly useful when a web page is displayed on a screen with a small size, such as that of a mobile device.

SUMMARY

A method and system for classifying blocks of a web document based on linguistic features is provided. A classification system trains a classifier to classify blocks of the web page into various classifications of the function of the block. These classifications may include an information classification and a non-information classification. The classification system trains a classifier using training web pages. To train a classifier, the classification system identifies the blocks of the training web pages, generates feature vectors for the blocks that include a linguistic feature, and inputs classification labels for each block. The classification system learns the coefficients of the classifier using any of a variety of machine learning techniques. The classification system can then use the classifier to classify blocks of web pages. To classify the blocks of a web page, the classification system identifies the blocks of the web page and generates a feature vector for each block. The classification system then applies the classifier to the feature vector of a block to generate a classification of the block.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating components of the classification system in one embodiment.

FIG. 2 is a flow diagram that illustrates the processing of the generate block classifier component of the classification system in one embodiment.

FIG. 3 is a flow diagram that illustrates the processing of the generate feature vector component of the classification system in one embodiment.

FIG. 4 is a flow diagram that illustrates the processing of the extract linguistic features component of the classification system in one embodiment.

FIG. 5 is a flow diagram that illustrates the processing of the classify block component of the classification system in one embodiment.

DETAILED DESCRIPTION

A method and system for classifying blocks of a web document based on linguistic features is provided. In some embodiments, a classification system trains a classifier to classify blocks of the web page into various classifications of the function of the block. These classifications may include an information classification and a non-information classification. An information classification may indicate that a block relates to the primary topic of the web page, and a non-information classification may indicate that the block contains noise information such as an advertisement. The classification system trains a classifier using training web pages. To train a classifier, the classification system identifies the blocks of the training web pages, generates feature vectors for the blocks, and inputs classification labels for each block. For example, the feature vector of a block may include linguistic features based on the parts of speech or the capitalization of the words within the text of the block. The feature vector of a block may also include layout features such as size of the block, position of the block within the web page, and so on. The classification system learns the coefficients of the classifier using any of a variety of machine learning techniques such as a support vector machine or linear regression. The classification system can then use the classifier to classify blocks of web pages. To classify the blocks of a web page, the classification system identifies the blocks of the web page and generates a feature vector for each block. The classification system then applies the classifier to the feature vector of a block to generate a classification of the block.
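As a rough illustration of the train-then-classify flow described above, the following Python sketch learns the coefficients of a linear classifier by least-squares regression (fitted with plain gradient descent) on labeled feature vectors and then applies them to a block. The three-feature layout, the toy values, the +1/-1 label encoding, and every function name are assumptions made for illustration; the description does not prescribe them.

```python
from typing import List, Sequence

# Toy training set (assumed values). Each hypothetical feature vector is
# [fraction of words inside full sentences, capitalized-word ratio,
#  block height relative to the page].
TRAIN_VECTORS = [
    [0.9, 0.10, 0.60],
    [0.8, 0.20, 0.50],
    [1.0, 0.15, 0.70],
    [0.0, 0.80, 0.10],
    [0.1, 0.90, 0.05],
    [0.0, 0.60, 0.20],
]
# Assumed encoding: +1 = information, -1 = non-information.
TRAIN_LABELS = [1, 1, 1, -1, -1, -1]


def score(coeffs: Sequence[float], x: Sequence[float]) -> float:
    """Linear score: bias term plus the weighted sum of the features."""
    return coeffs[0] + sum(c * xj for c, xj in zip(coeffs[1:], x))


def train_linear_classifier(
    vectors: Sequence[Sequence[float]],
    labels: Sequence[int],
    learning_rate: float = 0.1,
    epochs: int = 2000,
) -> List[float]:
    """Learn coefficients (bias first) by gradient descent on squared error."""
    coeffs = [0.0] * (len(vectors[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(coeffs)
        for x, y in zip(vectors, labels):
            error = score(coeffs, x) - y
            grad[0] += error
            for j, xj in enumerate(x):
                grad[j + 1] += error * xj
        coeffs = [c - learning_rate * g / len(vectors)
                  for c, g in zip(coeffs, grad)]
    return coeffs


def classify(coeffs: Sequence[float], x: Sequence[float]) -> str:
    """Threshold the linear score at zero to obtain a classification."""
    return "information" if score(coeffs, x) >= 0 else "non-information"


coefficients = train_linear_classifier(TRAIN_VECTORS, TRAIN_LABELS)
new_block = [0.85, 0.10, 0.55]  # feature vector of a hypothetical unseen block
print(classify(coefficients, new_block))
```

Regressing on +1/-1 targets and thresholding at zero is only one simple way to obtain a linear classifier. A support vector machine, which the description also mentions, could be trained on the same feature vectors and labels.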

The classification system uses linguistic features to help classify the function of a block because developers of web pages tend to use different linguistic features within blocks having different functions. For example, a block with a navigation function will likely have very short phrases with no sentences. In contrast, a block with a function of providing the text of the primary topic of a web page will likely have complex sentences. Also, a block that is directed to the primary topic of a web page may have named entities, such as persons, locations, and organizations. In addition to linguistic features, the classification system may use term features in recognition that certain terms occur in blocks with certain functions. For example, a block with the terms copyright, privacy, rights, reserved, and so on may be a block with a copyright notice function. A block with the terms sponsored link, ad, or advertisement may have an advertisement function. The linguistic features and term features may be considered to be non-layout features. One skilled in the art will appreciate that the classification system can use any combination of layout and non-layout features depending on the classification objectives of the system.
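The non-layout features discussed above might be computed along the following lines. This is a minimal sketch, not the patented method: part-of-speech tagging and named-entity recognition would need a natural language processor, so the code substitutes crude proxies (sentence count and capitalization). The term lists are loosely based on the examples in the text, and all names are hypothetical.

```python
import re
from typing import Dict

# Hypothetical term groups, loosely based on the examples in the text.
TERM_GROUPS = {
    "copyright_terms": {"copyright", "privacy", "rights", "reserved"},
    "ad_terms": {"sponsored", "ad", "advertisement"},
}


def non_layout_features(text: str) -> Dict[str, float]:
    """Return simple linguistic-proxy and term features for a block's text."""
    words = re.findall(r"[A-Za-z']+", text)
    n_words = len(words)
    # Crude sentence detector: a run of text terminated by '.', '!' or '?'.
    sentences = re.findall(r"[^.!?]+[.!?]", text)
    capitalized = [w for w in words if w[0].isupper()]
    features = {
        "word_count": float(n_words),
        "sentence_count": float(len(sentences)),
        "capitalized_ratio": len(capitalized) / n_words if n_words else 0.0,
    }
    # Term features: how many distinct terms of each group occur in the block.
    lowered = {w.lower() for w in words}
    for name, terms in TERM_GROUPS.items():
        features[name] = float(len(lowered & terms))
    return features


navigation_text = "Home News Sports Weather Contact"
footer_text = "Copyright 2008 Example Corp. All rights reserved."
print(non_layout_features(navigation_text))
print(non_layout_features(footer_text))
```

In this sketch a navigation-style block yields no detected sentences, while the footer text triggers the copyright term group, which mirrors the distinctions drawn in the paragraph above.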

The classification system may be used to classify web pages based on any hierarchical or non-hierarchical set of classifications with any number of classifications. For example, the classification system may use two classifications to classify web pages as having a certain function (e.g., information) or not (e.g., noise). As another example, the classification system may use five classifications: information, interaction, navigation, advertisement, and other. The information classification indicates that the content of the block is related to the primary topic of the web page. The interaction classification indicates that the block is an area of the web page for a user to interact, such as an input field for a user query or for information submission. The navigation classification indicates that the content of the block provides a navigation guide to the user, such as a navigation bar or a content index. The advertisement classification indicates that the content of the block is an advertisement. The other classification indicates that the block has none of the other functions. The other classification may include copyright blocks, contact blocks, decoration blocks, and so on.
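The two example schemes can be related to each other in a simple two-level hierarchy. The sketch below is one possible encoding; the enum, the function name, and the rule that every non-information function counts as noise are assumptions made for illustration.

```python
from enum import Enum


class BlockFunction(Enum):
    """The five example classifications named in the text."""
    INFORMATION = "information"
    INTERACTION = "interaction"
    NAVIGATION = "navigation"
    ADVERTISEMENT = "advertisement"
    OTHER = "other"


def to_two_class(label: BlockFunction) -> str:
    """Collapse the five-way scheme into information vs. noise (assumed mapping)."""
    return "information" if label is BlockFunction.INFORMATION else "noise"


for function in BlockFunction:
    print(function.value, "->", to_two_class(function))
```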

In some embodiments, a classification system may use layout features that can be categorized as spatial features, presentation features, tag features, and hyperlink features. The spatial features relate to the size and location of a block within the web page. For example, a copyright block may typically be located at the lower portion of a web page. The presentation features relate to how the content of the block is presented. For example, presentation features may include font size, number of images in a block, number of words within a block, and so on. The tag features indicate the types of tags used in the markup language describing the block. For example, the form and input tags may indicate that the function of the block is interaction. The hyperlink features may indicate that the block is a navigation block. Various layout features used by the classification system are described in Table 1.

TABLE 1
Layout Features

Category: spatial features
  1. x and y coordinates of the center point of a block/page
  2. width and height of a block/page

Category: presentation features
  1. maximum font size of the inner text in a block/page
  2. maximum font weight of the inner text in a block/page
  3. number of words in the inner text in a block/page
  4. number of words in the anchor text in a block/page
  5. number of images in a block/page
  6. total size of images in pixels in a block/page
  7. total size of form fields in pixels in a block/page

Category: tag features
  1. number of form and input tags in a block/page: <form>, <input>, <option>, <selection>, etc.
  2. number of table tags in a block/page: <table>, <tr>, <td>
  3. number of paragraph tags in a block/page: <p>
  4. number of list tags in a block/page: <li>, <dd>, <dt>
  5. number of heading tags in a block/page: <h1>, <h2>, <h3>

Category: hyperlink features
  1. total number of hyperlinks in a block/page
  2. number of intrasite hyperlinks in a block/page
  3. number of inter-site hyperlinks in a block/page
  4. number of hyperlinks on anchor text in a block/page
  5. number of hyperlinks on images in a block/page
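Some of the tag and hyperlink features in Table 1 can be counted directly from a block's markup. The following sketch uses Python's standard `html.parser` module; the class and function names, the `site_host` parameter, and the rule that relative links count as intrasite are assumptions. Spatial and presentation features such as coordinates or font size would require a rendering engine and are not covered here.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Tag groups following the tag-feature rows of Table 1. The table lists
# <selection>; the standard HTML <select> element is added as an assumption.
# The heading group is assumed to be h1 through h3.
TAG_GROUPS = {
    "form_tags": {"form", "input", "option", "select", "selection"},
    "table_tags": {"table", "tr", "td"},
    "paragraph_tags": {"p"},
    "list_tags": {"li", "dd", "dt"},
    "heading_tags": {"h1", "h2", "h3"},
}


class BlockTagCounter(HTMLParser):
    """Counts tag and hyperlink features while parsing one block's HTML."""

    def __init__(self, site_host: str):
        super().__init__()
        self.site_host = site_host
        self.counts = {name: 0 for name in TAG_GROUPS}
        self.counts.update(images=0, hyperlinks=0,
                           intrasite_links=0, intersite_links=0)

    def handle_starttag(self, tag, attrs):
        for name, tags in TAG_GROUPS.items():
            if tag in tags:
                self.counts[name] += 1
        if tag == "img":
            self.counts["images"] += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href is None:
                return  # an anchor without a target is not counted as a link
            self.counts["hyperlinks"] += 1
            host = urlparse(href).netloc
            # Relative links have no host and are treated as intrasite.
            if host == "" or host == self.site_host:
                self.counts["intrasite_links"] += 1
            else:
                self.counts["intersite_links"] += 1


def tag_and_link_features(block_html: str, site_host: str) -> dict:
    """Return tag and hyperlink counts for a block of HTML."""
    counter = BlockTagCounter(site_host)
    counter.feed(block_html)
    counter.close()
    return counter.counts


sample_block = (
    '<ul><li><a href="/news">News</a></li>'
    '<li><a href="https://example.com/sports">Sports</a></li>'
    '<li><a href="https://other.example.org/">Partner</a></li></ul>'
    '<form><input type="text" name="q"><input type="submit"></form>'
)
print(tag_and_link_features(sample_block, "example.com"))
```

Counts like these would become individual entries of a block's feature vector alongside the spatial and presentation features.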

