Controlled vocabulary



 
 


Controlled vocabularies provide a way to organize knowledge for subsequent retrieval. They are used in subject indexing schemes, subject headings, thesauri and taxonomies. Controlled vocabulary schemes mandate the use of predefined, authorised terms that have been preselected by the designer of the vocabulary, in contrast to natural language vocabularies, where there is no restriction on the vocabulary.

In library and information science


In library and information science, a controlled vocabulary is a carefully selected list of words and phrases, which are used to tag units of information (document or work) so that they may be more easily retrieved by a search. Controlled vocabularies solve the problems of homographs, synonyms and polysemes by ensuring that each concept is described using only one authorized term and each authorized term in the controlled vocabulary describes only one concept. In short, controlled vocabularies ensure consistency and reduce the ambiguity inherent in normal human languages, where the same concept can be given different names.

For example, in the Library of Congress Subject Headings (a subject heading system that uses controlled vocabulary), authorised terms (subject headings in this case) have to be chosen to handle choices between variant spellings of the same concept (American versus British), choices between scientific and popular terms (Cockroaches versus Periplaneta americana), and choices between synonyms (automobiles versus cars), among other difficult issues.

Choices of authorised terms are based on the principles of user warrant (what terms users are likely to use), literary warrant (what terms are generally used in the literature and documents), and structural warrant (terms chosen by considering the structure and scope of the controlled vocabulary).

Controlled vocabularies also typically handle the problem of homographs with qualifiers. For example, the term "pool" has to be qualified to refer to either a swimming pool or the game of pool, to ensure that each authorised term or heading refers to only one concept.
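The control of synonyms and homographs described above can be sketched as a small lookup table. This is only an illustration: the entry terms, the choice of which form is authorised, and the qualified headings such as "Pool (Game)" are invented for the example and are not taken from any real subject heading list.

```python
# Hypothetical miniature authority file: every entry term (variant form,
# synonym, scientific or popular name) points to exactly one authorised heading.
AUTHORITY = {
    "automobile": "Automobiles",
    "automobiles": "Automobiles",
    "car": "Automobiles",
    "cars": "Automobiles",
    "cockroach": "Cockroaches",
    "cockroaches": "Cockroaches",
    "periplaneta americana": "Cockroaches",
}

# A homograph leads to several qualified headings, each naming one concept,
# so the searcher or indexer has to pick the intended one.
HOMOGRAPHS = {
    "pool": ["Pool (Game)", "Pool (Swimming)"],
}


def resolve(term):
    """Return the authorised headings that an entry term could stand for."""
    key = term.strip().lower()
    if key in HOMOGRAPHS:
        return list(HOMOGRAPHS[key])
    if key in AUTHORITY:
        return [AUTHORITY[key]]
    return []  # the term is not part of this vocabulary


print(resolve("Cars"))
print(resolve("pool"))
```

A term that resolves to more than one heading signals a homograph; a term that resolves to none is outside the vocabulary and would have to be added by its maintainers.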

There are two main kinds of controlled vocabulary tools used in libraries: subject headings and thesauri. While the differences between the two are diminishing, there are still some minor differences.

Historically, subject headings were designed by catalogers to describe books in library catalogs, while thesauri were used by indexers to apply index terms to documents and articles. Subject headings tend to be broader in scope, describing whole books, while thesauri tend to be more specialised, covering very specific disciplines. Also, because of the card catalog system, subject headings tend to have terms that are in indirect order (though with the rise of automated systems this is being removed), while thesaurus terms are always in direct order. Subject headings also tend to use more pre-co-ordination of terms, such that the designer of the controlled vocabulary will combine various concepts to form one authorised subject heading (e.g., children and terrorism), while thesauri tend to use singular direct terms. Lastly, thesauri list not only equivalent terms but also narrower, broader and related terms among various authorised and non-authorised terms, while historically most subject headings did not.

For example, the Library of Congress Subject Headings itself did not have much syndetic structure until 1943, and it was not until 1985 that it began to adopt the thesaurus-type terms "Broader term" and "Narrower term".
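The thesaurus relationships mentioned above (equivalent, broader, narrower and related terms) can be modelled as a small data structure. The sketch below is a minimal illustration under stated assumptions: the `Term` class, the field names and the sample sports hierarchy are all hypothetical and do not reproduce any published thesaurus.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One authorised term with its cross-references (hypothetical layout)."""
    label: str
    broader: list = field(default_factory=list)   # broader terms
    narrower: list = field(default_factory=list)  # narrower terms
    related: list = field(default_factory=list)   # related terms
    used_for: list = field(default_factory=list)  # non-authorised equivalents


# Invented sample hierarchy, keyed by authorised label.
THESAURUS = {
    "Sports": Term("Sports", narrower=["Football"]),
    "Football": Term("Football", broader=["Sports"],
                     narrower=["Association football", "Rugby football"]),
    "Association football": Term("Association football", broader=["Football"],
                                 used_for=["Soccer"]),
    "Rugby football": Term("Rugby football", broader=["Football"],
                           narrower=["Rugby union", "Rugby league"]),
    "Rugby union": Term("Rugby union", broader=["Rugby football"],
                        related=["Rugby league"]),
    "Rugby league": Term("Rugby league", broader=["Rugby football"],
                         related=["Rugby union"]),
}


def expand(label):
    """Return the term followed by all of its narrower terms, depth-first."""
    result = [label]
    for child in THESAURUS[label].narrower:
        result.extend(expand(child))
    return result


print(expand("Football"))
```

Walking the narrower-term links in this way lets a search on a broad heading also pick up documents tagged with its more specific headings.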

The terms are chosen and organized by trained professionals (including librarians and information scientists) who possess expertise in the subject area. Controlled vocabulary terms can accurately describe what a given document is actually about, even if the terms themselves do not occur within the document's text. Well known subject heading systems include the Library of Congress system, MeSH, and Sears. Well known thesauri include the Art and Architecture Thesaurus and the ERIC Thesaurus.

Choosing authorized terms is a tricky business. Besides the areas already considered above, the designer has to consider the specificity of the term chosen, whether to use direct entry, and the inter consistency and stability of the language. Lastly, the amount of pre-co-ordination (in which case the degree of enumeration versus synthesis becomes an issue) and post-co-ordination in the system is another important issue.

Controlled vocabularies tagged to documents are metadata.

Indexing languages


There are three main types of indexing languages.

  • Controlled indexing language - Only approved terms can be used by the indexer to describe the document.


  • Natural language indexing language - Any term from the document in question can be used to describe the document.


  • Free indexing language - Any term (not only from the document) can be used to describe the document.
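The difference between the three types can be expressed as a check on whether a proposed index term is permitted. This is a rough sketch only: the function name `allowed`, the approved term list and the sample document are assumptions made for the illustration.

```python
import re

# Hypothetical list of approved terms for the controlled case.
CONTROLLED_TERMS = {"Association football", "Rugby union"}


def allowed(term, document_text, language):
    """Say whether an index term is permitted under the given indexing language."""
    if language == "controlled":
        # Only terms from the approved list may be used.
        return term in CONTROLLED_TERMS
    if language == "natural":
        # The term must occur in the document itself (whole-word match).
        pattern = r"\b" + re.escape(term) + r"\b"
        return re.search(pattern, document_text, re.IGNORECASE) is not None
    if language == "free":
        # Any term at all may be used.
        return True
    raise ValueError(f"unknown indexing language: {language}")


doc = "The club won the cup after a tense soccer match."
print(allowed("soccer", doc, "natural"))
print(allowed("soccer", doc, "controlled"))
print(allowed("Association football", doc, "controlled"))
```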


When indexing a document, the indexer also has to choose the level of indexing exhaustivity, that is, the level of detail in which the document is described. For example, with low indexing exhaustivity, minor aspects of the work will not be described with index terms. In general, the higher the indexing exhaustivity, the more terms are indexed for each document.

In recent years free text search as a means of access to documents has become popular. This involves using natural language indexing with indexing exhaustivity set to maximum (every word in the text is indexed). Many studies have been done to compare the efficiency and effectiveness of free text searches against searches of documents that have been indexed by experts using a few well chosen controlled vocabulary descriptors.

Controlled vocabularies are often claimed to improve the accuracy of free text searching, for instance by reducing irrelevant items in the retrieval list. These irrelevant items (false positives) are often caused by the inherent ambiguity of natural language. Take the English word football for example. Football is the name given to a number of different team sports. Worldwide the most popular of these team sports is Association football, which also happens to be called soccer in several countries. The English language word football is also applied to Rugby football (Rugby union and rugby league), American football, Australian rules football, Gaelic football, and Canadian football. A search for football therefore will retrieve documents that are about several completely different sports. Controlled vocabulary solves this problem by tagging the documents in such a way that the ambiguities are eliminated.

Compared to free text searching, the use of a controlled vocabulary can dramatically increase the performance of an information retrieval system, if performance is measured by precision (the percentage of documents in the retrieval list that are actually relevant to the search topic).

In some cases controlled vocabulary can enhance recall as well, because unlike natural language schemes, once the correct authorised term is searched, you don't need to worry about searching for other terms that might be synonyms of that term.

However, a controlled vocabulary search may also lead to unsatisfactory recall, in that it will fail to retrieve some documents that are actually relevant to the search question.

This is particularly problematic when the search question involves terms that are sufficiently tangential to the subject area that the indexer might have decided to tag the document using a different term (one that the searcher would nevertheless consider equivalent). Essentially, this can be avoided only by an experienced user of controlled vocabulary whose understanding of the vocabulary coincides with the way it is used by the indexer.

Another possibility is that the article is just not tagged by the indexer because indexing exhaustivity is low. For example an article might mention football as a secondary focus, and the indexer might decide not to tag it with "football" because it is not important enough compared to the main focus. But it turns out that for the searcher that article is relevant and hence recall fails. A free text search would automatically pick up that article regardless.

On the other hand, free text searches have high exhaustivity (you search on every word), so they have potential for high recall (assuming you solve the problem of synonyms by entering every combination) but will have much lower precision.
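The trade-off between precision and recall can be made concrete with a toy collection. Everything in the sketch below is invented for illustration: the five documents, their tags, and the set of documents that a hypothetical searcher interested in soccer would judge relevant. Document 4 plays the role of the article that mentions football only as a secondary focus and was therefore not tagged with it.

```python
# Toy collection: full text plus the controlled-vocabulary tags an indexer assigned.
DOCS = {
    1: {"text": "football world cup final report", "tags": {"Association football"}},
    2: {"text": "soccer transfer news", "tags": {"Association football"}},
    3: {"text": "american football draft picks", "tags": {"American football"}},
    4: {"text": "city budget mentions a new football stadium", "tags": {"Municipal finance"}},
    5: {"text": "rugby football league results", "tags": {"Rugby league"}},
}

# Documents the hypothetical searcher considers relevant to soccer.
RELEVANT = {1, 2, 4}


def free_text_search(words):
    """Retrieve every document whose text contains any of the given words."""
    return {doc_id for doc_id, doc in DOCS.items()
            if any(word in doc["text"].split() for word in words)}


def tag_search(tag):
    """Retrieve every document tagged with the given authorised term."""
    return {doc_id for doc_id, doc in DOCS.items() if tag in doc["tags"]}


def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


free = free_text_search(["football", "soccer"])
controlled = tag_search("Association football")
for name, hits in [("free text", free), ("controlled", controlled)]:
    print(name, sorted(hits), precision(hits, RELEVANT), recall(hits, RELEVANT))
```

In this toy data the free text search retrieves every relevant document but also the American football and rugby items, while the tag search retrieves only relevant documents but misses the untagged one.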

Controlled vocabularies also become outdated quickly, and in fast-developing fields of knowledge the authorised terms needed might not be available if the vocabulary is not updated regularly. Even in the best case scenario, controlled language is often not as specific as the words of the text itself. Indexers trying to choose the appropriate index terms might misinterpret the author, while a free text search is in no danger of doing so, because it uses the author's own words.

The use of controlled vocabularies can be costly compared to free text searches because human experts or expensive automated systems are necessary to index each entry. Furthermore, the user has to be familiar with the controlled vocabulary scheme to make best use of the system. But as already mentioned, the control of synonyms and homographs can help increase precision.

Numerous methodologies have been developed to assist in the creation of controlled vocabularies, including faceted classification, which enables a given data record or document to be described in multiple ways.
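As a rough illustration of describing one record in multiple ways, the sketch below assigns each record a value in several independent facets and then filters or groups by any of them. The facet names (`sport`, `format`, `region`), the records and the helper functions are all hypothetical.

```python
# Each record carries one value per facet; the facets are independent of each other.
RECORDS = [
    {"title": "Rules of the scrum",
     "facets": {"sport": "Rugby union", "format": "Article", "region": "England"}},
    {"title": "Grand final review",
     "facets": {"sport": "Rugby league", "format": "Article", "region": "Australia"}},
    {"title": "Laws of the game",
     "facets": {"sport": "Association football", "format": "Book", "region": "England"}},
]


def filter_by_facets(records, **wanted):
    """Return titles of records matching every requested facet value."""
    return [record["title"] for record in records
            if all(record["facets"].get(name) == value
                   for name, value in wanted.items())]


def group_by(records, facet):
    """Arrange the same records under the values of one chosen facet."""
    groups = {}
    for record in records:
        groups.setdefault(record["facets"][facet], []).append(record["title"])
    return groups


print(filter_by_facets(RECORDS, region="England", format="Article"))
print(group_by(RECORDS, "format"))
```

The same three records can be ordered by sport, by format or by region without committing to a single fixed hierarchy.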

Applications

Controlled vocabularies, such as the Library of Congress Subject Headings, are an essential component of bibliography, the study and classification of books. They were initially developed in library and information science. In the 1950s, government agencies began to develop controlled vocabularies for the burgeoning journal literature in specialized fields; an example is the Medical Subject Headings (MeSH) developed by the U.S. National Library of Medicine. Subsequently, for-profit firms (called abstracting and indexing services) emerged to index the fast-growing literature in every field of knowledge. In the 1960s, an online bibliographic database industry developed based on dialup X.25 networking. These services were seldom made available to the public because they were difficult to use; specialist librarians called search intermediaries handled the searching job. In the 1980s, the first full text databases appeared; these databases contain the full text of the indexed articles as well as the bibliographic information. Online bibliographic databases have migrated to the Internet and are now publicly available; however, most are proprietary and can be expensive to use. Students enrolled in colleges and universities may be able to access some of these services without charge; some of these services may be accessible without charge at a public library.

In large organizations, controlled vocabularies may be introduced to improve technical communication. The use of controlled vocabulary ensures that everyone is using the same word to mean the same thing. This consistency of terms is one of the most important concepts in technical writing and knowledge management, where effort is expended to use the same word throughout a document or organization instead of slightly different ones to refer to the same thing.

Web searching could be dramatically improved by the development of a controlled vocabulary for describing Web pages; the use of such a vocabulary could culminate in a Semantic Web, in which the content of Web pages is described using a machine-readable metadata scheme. One of the first proposals for such a scheme is the Dublin Core Initiative.

It is unlikely that a single metadata scheme will ever succeed in describing the content of the entire Web. To create a Semantic Web, it may be necessary to draw from two or more metadata systems to describe a Web page's contents. The eXchangeable Faceted Metadata Language (XFML) is designed to enable controlled vocabulary creators to publish and share metadata systems. XFML is designed on faceted classification principles.

See also

  • Authority control
  • Controlled natural language
  • Faceted classification
  • Full text search
  • Information retrieval
  • Metadata
    • Metadata registry
  • Nomenclature
  • Ontology (computer science)
  • Semantic spectrum
  • Subject indexing
  • Terminology
    • Technical terminology
  • Text retrieval
  • Thesaurus
  • Universal Data Element Framework
  • Vocabulary-based transformation

