Wrapper (data mining)
A wrapper in data mining is a program that extracts the content of a particular information source and translates it into a relational form. Many web pages present structured data, such as telephone directories and product catalogs, formatted for human browsing using HTML. Structured data are typically descriptions of objects retrieved from underlying databases and displayed in Web pages following fixed templates. Software systems that use such resources must translate the HTML content into a relational form, and wrappers are commonly used as such translators. Formally, a wrapper is a function from a page to the set of tuples it contains.
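
The formal view of a wrapper as a function from a page to a set of tuples can be illustrated with a small hand-written example. The sketch below assumes a hypothetical telephone-directory page whose records are rows of an HTML table; the names RowCollector, wrap and PAGE are invented for illustration, and only Python's standard html.parser module is used.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collects the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:               # header rows (<th> only) stay empty
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def wrap(page):
    """A wrapper: maps an HTML page to the set of tuples it contains."""
    collector = RowCollector()
    collector.feed(page)
    collector.close()
    return {tuple(row) for row in collector.rows}


# Hypothetical page, invented for this example.
PAGE = """
<html><body><table>
  <tr><th>Name</th><th>Phone</th></tr>
  <tr><td>Alice Example</td><td>555-0100</td></tr>
  <tr><td>Bob Example</td><td>555-0101</td></tr>
</table></body></html>
"""

directory = wrap(PAGE)
```

Here directory holds one (name, phone) tuple per table row. The extraction rules are hard-coded to this one template, so a different site, or a redesign of this one, would need a different wrapper.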

Wrapper generation

There are two main approaches to wrapper generation: wrapper induction and automated data extraction.
Wrapper induction uses supervised learning to learn data extraction rules from manually labeled training examples. The disadvantages of wrapper induction are
  • the time-consuming manual labeling process and
  • the difficulty of wrapper maintenance.
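
As a rough illustration of learning extraction rules from labeled examples, the sketch below induces a pair of left and right delimiter strings from pages in which the target value has been marked by hand. It is a deliberately simplified scheme, not a specific published algorithm; the function names and the sample pages are assumptions made for this example.

```python
import os


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def induce_rule(examples):
    """Learn (left, right) delimiters from labeled (page, value) pairs."""
    befores, afters = [], []
    for page, value in examples:
        start = page.index(value)               # position of the labeled value
        befores.append(page[:start])
        afters.append(page[start + len(value):])
    left = common_suffix(befores)               # text that always precedes a value
    right = os.path.commonprefix(afters)        # text that always follows a value
    if not left or not right:
        raise ValueError("examples share no usable delimiters")
    return left, right


def extract(page, rule):
    """Apply a learned rule: return every string between left and right."""
    left, right = rule
    values, pos = [], 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        values.append(page[start:end])
        pos = end + len(right)
    return values


# Manually labeled training examples (hypothetical pages).
labeled = [
    ("Staff list\n<li>Name: <b>Alice</b></li>\nUpdated daily", "Alice"),
    ("Team:<li>Name: <b>Bob</b></li><hr>", "Bob"),
]
rule = induce_rule(labeled)

new_page = "<ul><li>Name: <b>Carol</b></li><li>Name: <b>Dave</b></li></ul>"
names = extract(new_page, rule)
```

The learned rule is only as good as the labeled examples, and it stops matching as soon as the site changes the markup around the value. These are the labeling and maintenance costs listed above.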

Because of the manual labeling effort, it is hard to extract data from a large number of sites: each site has its own templates and requires separate manual labeling for wrapper learning. Wrapper maintenance is also a major issue because, whenever a site changes, the wrappers built for it become obsolete. Due to these shortcomings, researchers have studied automated wrapper generation using unsupervised pattern mining. Automated extraction is possible because most Web data objects follow fixed templates. Discovering such templates or patterns enables the system to perform extraction automatically.
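
One simple way to picture template discovery without labels is to reduce a page to a sequence of tags, replace every text node with a placeholder, and look for the run of tokens that repeats back to back most often. The sketch below does this with a brute-force search; it is a toy stand-in for the pattern mining used in real systems, and all names and the sample page are hypothetical.

```python
import re

TEXT = "#"  # placeholder standing in for any text node


def tokenize(page):
    """Split a page into (signature, original) pairs, dropping blank text."""
    tokens = []
    for raw in re.findall(r"<[^>]+>|[^<]+", page):
        if raw.startswith("<"):
            # Keep only the tag name so attribute values do not matter.
            tokens.append((raw.split()[0].rstrip(">") + ">", raw))
        elif raw.strip():
            tokens.append((TEXT, raw.strip()))
    return tokens


def find_template(signature):
    """Find the consecutively repeated run covering the most tokens.

    Returns (start, unit_length, repetitions); repetitions == 0 means
    that no repeated run was found.
    """
    best = (0, 0, 0)
    n_tokens = len(signature)
    for unit in range(2, n_tokens // 2 + 1):
        for start in range(n_tokens - 2 * unit + 1):
            pattern = signature[start:start + unit]
            if TEXT not in pattern:
                continue  # a record template needs at least one data slot
            reps = 1
            while signature[start + reps * unit:start + (reps + 1) * unit] == pattern:
                reps += 1
            if reps >= 2 and reps * unit > best[1] * best[2]:
                best = (start, unit, reps)
    return best


def extract_records(page):
    """Unsupervised extraction: one tuple of text values per repetition."""
    tokens = tokenize(page)
    start, unit, reps = find_template([sig for sig, _ in tokens])
    records = []
    for r in range(reps):
        chunk = tokens[start + r * unit:start + (r + 1) * unit]
        records.append(tuple(text for sig, text in chunk if sig == TEXT))
    return records


# Hypothetical catalog page, invented for this example.
PAGE = (
    "<html><body><h1>Catalog</h1>"
    "<div><b>Widget</b><i>$5</i></div>"
    "<div><b>Gadget</b><i>$7</i></div>"
    "<div><b>Doohickey</b><i>$9</i></div>"
    "<p>footer</p></body></html>"
)

records = extract_records(PAGE)
```

No training examples are needed: the repeated div block is found from the page alone, and the heading and footer are ignored because they do not repeat. The search assumes that records are adjacent and share exactly the same tag structure, which real pages often violate.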

Wrapper generation on the Web is an important problem with a wide range of applications. Extracting such data makes it possible to integrate information from multiple Web sites and provide value-added services such as comparative shopping, object search, and information integration.

See also

  • Business intelligence (section Semi-structured or unstructured data)
  • Web scraping

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 