XML-Retrieval
XML Retrieval, or XML Information Retrieval, is the content-based retrieval of documents structured with XML (eXtensible Markup Language). As such, it is used for computing the relevance of XML documents.

Queries

Most XML retrieval approaches compute relevance with techniques from the information retrieval (IR) area, e.g. by computing the similarity between a query consisting of keywords (query terms) and the document. However, in XML-Retrieval the query can also contain structural hints. So-called "content and structure" (CAS) queries enable users to specify what structure the requested content can or must have.
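A CAS query can be pictured as a set of keywords plus a structural constraint. The following is a minimal sketch, not a standard API: the function name `cas_search`, the sample document, and the strict reading of the constraint (the target element "must" have the given tag) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative document; its element names are assumptions.
DOC = """
<article>
  <title>XML retrieval</title>
  <section>
    <title>Ranking</title>
    <p>Ranking combines content relevance and structural similarity.</p>
  </section>
  <section>
    <title>Databases</title>
    <p>XML databases store semi-structured data.</p>
  </section>
</article>
"""

def element_text(elem):
    """Return all text inside an element, lowercased and whitespace-normalized."""
    return " ".join(" ".join(elem.itertext()).lower().split())

def cas_search(root, target_tag, keywords):
    """Hypothetical CAS query: elements named target_tag (structure)
    whose text contains every keyword as a substring (content)."""
    hits = []
    for elem in root.iter(target_tag):
        text = element_text(elem)
        if all(k.lower() in text for k in keywords):
            hits.append(elem)
    return hits

root = ET.fromstring(DOC)
sections = cas_search(root, "section", ["ranking"])
titles = [s.findtext("title") for s in sections]
print(titles)
```

A content-only query would drop the `target_tag` argument and match any element; the structural part is what restricts the answer to sections here.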

Exploiting XML structure

Taking advantage of the self-describing structure of XML documents can improve the search for XML documents significantly. This includes the use of CAS queries, the assignment of different weights to different XML elements, and the focused retrieval of subdocuments.
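Element weighting can be sketched as follows. The weights, the default value, and the function name `weighted_score` are hypothetical; the idea shown is only that a term match counts more in some elements (here a title) than in others.

```python
import xml.etree.ElementTree as ET

# Hypothetical weights: a match in a title counts three times as much
# as a match in a paragraph.
TAG_WEIGHTS = {"title": 3.0, "p": 1.0}

def weighted_score(root, keywords, weights=TAG_WEIGHTS, default=0.5):
    """Sum term frequencies per element, scaled by the element's weight.

    Only the text directly before an element's first child is counted
    (elem.text); tail text is ignored in this sketch.
    """
    terms = [k.lower() for k in keywords]
    score = 0.0
    for elem in root.iter():
        own_tokens = (elem.text or "").lower().split()
        weight = weights.get(elem.tag, default)
        score += weight * sum(own_tokens.count(t) for t in terms)
    return score

doc_a = ET.fromstring("<doc><title>ranking</title><p>databases</p></doc>")
doc_b = ET.fromstring("<doc><title>databases</title><p>ranking</p></doc>")

score_a = weighted_score(doc_a, ["ranking"])
score_b = weighted_score(doc_b, ["ranking"])
print(score_a, score_b)
```

Both documents contain the query term once, but the document that mentions it in the title receives the higher score.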

Ranking

Ranking in XML-Retrieval can incorporate both content relevance and structural similarity, which is the resemblance between the structure given in the query and the structure of the document. Also, the retrieval units resulting from an XML query may not always be entire documents, but can be any deeply nested XML elements, i.e. dynamic documents. The aim is to find the smallest retrieval unit that is highly relevant. Relevance can be defined according to the notion of specificity, which is the extent to which a retrieval unit focuses on the topic of request.
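The preference for small, focused retrieval units can be illustrated with a rough specificity measure. In this sketch, specificity is approximated as the share of an element's tokens that are query terms; this formula, the function names, and the sample document are assumptions, not a definition taken from the literature.

```python
import xml.etree.ElementTree as ET

def tokens(elem):
    """All whitespace-separated tokens inside an element, lowercased."""
    return " ".join(elem.itertext()).lower().split()

def specificity(elem, terms):
    """Assumed measure: fraction of the element's tokens that are query terms."""
    toks = tokens(elem)
    if not toks:
        return 0.0
    return sum(t in terms for t in toks) / len(toks)

def rank_elements(root, keywords):
    """Score every element (not just the whole document) and sort by specificity."""
    terms = {k.lower() for k in keywords}
    scored = [(specificity(e, terms), e) for e in root.iter()]
    scored = [(s, e) for s, e in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

root = ET.fromstring(
    "<article>"
    "<sec><p>xml ranking</p><p>other words here now</p></sec>"
    "<sec><p>more unrelated text</p></sec>"
    "</article>"
)
ranked = rank_elements(root, ["ranking"])
best_score, best_elem = ranked[0]
print(best_elem.tag, best_score)
```

The nested paragraph outranks its enclosing section and the whole article because less of its content is off-topic, which is the behaviour the text describes for focused retrieval.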

Existing XML search engines

An overview of two potential approaches is available. The INitiative for the Evaluation of XML-Retrieval (INEX) was founded in 2002 and provides a platform for evaluating such algorithms. Three different areas influence XML-Retrieval:

Traditional XML query languages

Query languages such as the W3C standard XQuery support complex queries, but only look for exact matches. Therefore, they need to be extended to allow for vague search with relevance computation. Most XML-centered approaches require fairly exact knowledge of the documents' schemas.
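The difference between exact matching and vague search can be sketched as follows. This example does not use XQuery; it substitutes the limited XPath subset of Python's `xml.etree.ElementTree` for the exact query, and a string-similarity score from `difflib` for the vague one. The function name `vague_title_search` and the threshold of 0.8 are assumptions.

```python
import difflib
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<article>"
    "<section><title>Ranking</title></section>"
    "<section><title>Databases</title></section>"
    "</article>"
)

# Exact match: the predicate succeeds only if the title equals the literal.
exact = root.findall(".//section[title='Ranking']")
missed = root.findall(".//section[title='Rankings']")

def vague_title_search(root, wanted, threshold=0.8):
    """Hypothetical vague search: score each section title by string
    similarity and return the (score, title) pairs above the threshold."""
    results = []
    for section in root.iter("section"):
        title = section.findtext("title", default="")
        score = difflib.SequenceMatcher(None, wanted.lower(), title.lower()).ratio()
        if score >= threshold:
            results.append((score, title))
    return sorted(results, reverse=True)

vague = vague_title_search(root, "Rankings")
print(len(exact), len(missed), [title for _, title in vague])
```

The exact query finds nothing for the slightly different spelling, while the vague variant still returns the intended section together with a score that could feed into a ranking.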

Databases

Classic database systems have adopted the ability to store semi-structured data, which has resulted in the development of XML databases. Often, they are very formal, concentrate more on searching than on ranking, and are used by experienced users able to formulate complex queries.

Information retrieval

Classic information retrieval models such as the vector space model provide relevance ranking, but do not include document structure; only flat queries are supported. Also, they apply a static document concept, so retrieval units are usually entire documents. They can be extended to consider structural information and dynamic document retrieval. Examples of approaches extending the vector space model are available: they use document subtrees (index terms plus structure) as dimensions of the vector space.
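A structure-aware vector space can be sketched by using structural terms as dimensions. As a simplification of the subtree idea, this sketch assumes each dimension is a (tag, term) pair rather than a full subtree; the function names and sample documents are illustrative.

```python
import math
from collections import Counter
import xml.etree.ElementTree as ET

def structural_terms(root):
    """Count (tag, term) pairs; each pair is one dimension of the vector space."""
    vec = Counter()
    for elem in root.iter():
        for term in (elem.text or "").lower().split():
            vec[(elem.tag, term)] += 1
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Query: the term "ranking" occurring inside a title element.
query = Counter({("title", "ranking"): 1})
doc_a = structural_terms(ET.fromstring("<doc><title>ranking</title><p>xml</p></doc>"))
doc_b = structural_terms(ET.fromstring("<doc><title>xml</title><p>ranking</p></doc>"))

sim_a = cosine(query, doc_a)
sim_b = cosine(query, doc_b)
print(sim_a, sim_b)
```

A flat vector space model would score both documents identically, since each contains the same two terms; with structural dimensions only the document whose title contains the term matches the query.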
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 